ICML2026

Abstract:
Maximum likelihood is fundamental to supervised learning but it cannot be directly applied in correctnessbased problems with non-differentiable sampling. In these settings, reinforcement learning (RL) is typically used to maximize expected reward. We show that for binary correctness tasks, expected-reward RL is a first-order approximation of the maximum likelihood objective, yielding vanishing learning signal on low-success inputs. We introduce Maximum Likelihood Reinforcement Learning (MaxRL), a compute-indexed family of sampling-based objectives derived from a pass@k expansion of the likelihood, which interpolates between standard RL and exact maximum likelihood as compute increases. MaxRL admits a simple unbiased policy-gradient estimator whose optimized objective improves with additional compute. Across multiple domains, MaxRL consistently outperforms standard RL and GRPO, achieving higher pass@1 and substantially improved pass@k.

Abstract:
Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines highquality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME^2 principle to characterize reasoning quality along macro- and micro-level concerning efficiency and effectiveness. (2) Built on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method, capturing complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks. Code is available in the supplementary material.

Abstract:
Simultaneous speech translation is the task of translating source speech into a target language in realtime. Given that the dependencies between source and target words are non-monotonic (e.g. the word order can change between German and English), this means learning to jointly align and translate. This task has been traditionally tackled through supervised training on aligned data, and as collecting such data is challenging, this relies on synthetic data with automatic alignment. The latter relies on heuristics that are language-specific and suboptimal. We instead propose Hibiki-Zero, a model for simultaneous speech translation trained without word-level alignments between source and target speech. To do so, we train on sentence-level aligned data so that the model learns to perform speech translation but with high latency. We then introduce a novel reinforcement learning strategy relying on GRPO to optimize the translation latency of the model while retaining its translation capabilities. After supervised and post-training, Hibiki-Zero performs multilingual simultaneous translation with state-of-the-art translation accuracy, latency, voice transfer and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be easily finetuned to support another language as input with less than 1000h of speech data. We provide examples (hibiki-zero-s2st.github.io) as well as models and release a benchmark containing 15h of multilingual data for speech translation evaluation.

Abstract:
Concept erasure has emerged as a promising approach to mitigate undesired or unsafe content in diffusion models, yet existing methods still face significant limitations. While trainingbased methods are effective, their high computational cost limits scalability. Editing-based methods are more efficient and deployment-friendly, yet they struggle to simultaneously achieve precise concept erasure and preserve overall generative capacity. We identify this core limitation of the editing-based methods as reliance on additive parameter updates. Our empirical analysis reveals that concept semantics primarily depend onneuron directionrather thanneuron magnitude, while overall generative capacity relies on theangular geometryof neurons. As additive updates inherently entangle direction, magnitude, and angular geometry, they inevitably introduce unintended interference between concept erasure and overall generation performance. To address this, we proposeOrthogonal Concept Erasure (OCE), which reformulates editing-based erasure as multiplicative parameter updates from a geometric perspective. Specifically, OCE applies layer-wise orthogonal transformations derived from a closed-form solution to the parameters, enabling precise concept erasure while preserving the neuron magnitude and angular geometry. Furthermore, to address conflicting constraints in multi-concept erasure, OCE introduces a subspace-level objective with structured subspace manipulation, yielding a more effective and scalable erasure. Extensive experiments on single- and multi-concept erasure demonstrate that OCE outperforms existing methods in concept erasure and non-target preservation, erasing up to 100 concepts in 4.3 s.

Abstract:
Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense n× m interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled mapreduce reduction kernels with limited fusion. We present FlashSinkhorn, an IO-aware EOT solver for squared Euclidean cost that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores, the same normalization as transformer attention. This enables FlashAttention-style fusion and tiling: fused Triton kernels stream tiles through on-chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear-memory operations. We further provide streaming kernels for transport application, enabling scalable first- and second-order optimization. On A100 GPUs, FlashSinkhorn achieves up to 32× forward-pass and 161× end-to-end speedups over state-of-the-art online baselines on point-cloud OT, improves scalability on OT-based downstream tasks.

Abstract:
Event cameras are biologically inspired sensors garnering significant attention from both industry and academia. Mainstream methods favor frame and voxel representations, which reach a satisfactory performance while introducing timeconsuming transformations, bulky models, and sacrificing fine-grained temporal information. Alternatively, Point Cloud representation demonstrates promise in addressing the mentioned weaknesses, but it has limited scalability in abstracting features of higher spatial resolution and longer temporal sequence events. In this paper, we propose a Scalable Network named SECNet to leverage Event Cloud representation. SECNet integrates polarity at the structural level by innovating the Event-based Group and Sampling module rather than only at the input level. To accommodate the surge in the number of events, SECNet embraces feature extraction in the frequency domain via the Fourier transform. This approach not only substantially extinguishes the explosion of Multiply Accumulate Operations but also effectively abstracts spatio-temporal features. We conducted extensive experiments on ten event-based datasets, and substantiate the scalability, effectiveness, and efficiency of SECNet.

Abstract:
Large VisionLanguage Models (LVLMs) have achieved remarkable progress in multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on chart reasoning, geometric problem solving, and visual scientific analysis show that Agent0-VL achieves an 12.5% improvement over the Qwen-VL base model.

Abstract:
The pursuit of spatial intelligence fundamentally relies on access to largescale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question–answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset, constructed from raw video inputs without human intervention, using the proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps to object-level and relational semantic annotations, together with corresponding spatial Question–Answer (QA) pairs. Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks. Holi-Spatial demonstrates exceptional performance in data curation quality, significantly outperforming existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks using this dataset has also led to substantial improvements in model performance.

Abstract:
Evaluating and improving the security capabilities of code agents requires highquality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions. To address these, we present CVE-Factory, the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross-validation against human expert reproductions shows that CVE-Factory achieves 95% solution correctness and 96% environment fidelity, confirming its expert-level quality. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2% verified success. This automation enables two downstream contributions. First, we construct LiveCVEBench, a continuously updated benchmark of 190 tasks spanning 14 languages and 153 repositories that captures emerging threats including AI-tooling vulnerabilities. Second, we synthesize over 1,000 executable training environments, the first large-scale scaling of agentic tasks in code security. Fine-tuned Qwen3-32B improves from 5.3% to 35.8% on LiveCVEBench, surpassing Claude 4.5 Sonnet, with gains generalizing to Terminal Bench (12.5% to 31.3%). We open-source all code, data, and models.

Abstract:
Efficiently scaling Large Language Models (LLMs) necessitates exploring alternatives to dominant autoregressive (AR) methods, with Masked Diffusion Models (MDMs) emerging as candidates. However, comparing AR (typically decoderonly) and MDM (often encoder-only) paradigms is confounded by differing architectures, obscuring true algorithmic and efficiency trade-offs. This research decouples these factors by evaluating MDMs within a decoder-only framework to: (1) Equitably compare MDM (as Any-Order AR) and standard AR paradigms through discrepancies on orders. (2) Investigate MDM architectural impacts on computational efficiency. We show decoder-only MDMs, despite a larger modeling space, can achieve significant inference speedups (～25×) and comparable perplexity with techniques like temperature annealing, offering a path to reduced inference compute. This work provides insights for developing more computationally efficient foundation models by disentangling core modeling choices from architectural influences.

Abstract:
Dataset bias often leads deep learning models to exploit spurious correlations instead of taskrelevant signals. We introduce the Standard Anti-Causal Model (SAM), a unifying causal framework that characterizes bias mechanisms and yields a conditional independence criterion for causal stability. Building on this theory, we propose DISCO_m and sDISCO, efficient and scalable estimators of conditional distance correlation that enable independence regularization in black-box models. Across six diverse datasets, our methods consistently outperform or are competitive in existing bias mitigation approaches, while requiring fewer hyperparameters and scaling seamlessly to multi-bias scenarios. This work bridges causal theory and practical deep learning, providing both a principled foundation and effective tools for robust prediction.

Abstract:
Recent progress in largescale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present XR-1, a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. At its core, XR-1 introduces the \emphUnified Vision-Motion Codes (UVMC), a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between the observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a \emphthree-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 12,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as \pi_0 and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. Our project is at \hrefhttps://xr-1-vla.github.io/https://xr-1-vla.github.io/, and our code will be open-sourced.

Abstract:
Continual learning is a longstanding challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, its behavior in modern large-scale pretrained Vision-Language-Action (VLA) models remains underexplored. In this work, we find that pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. Simple Experience Replay (ER) works surprisingly well on VLAs, sometimes achieving zero forgetting even with a small replay data size. Our analysis reveals that pretraining plays a critical role in downstream continual learning performance: large pretrained models mitigate forgetting with a small replay buffer size while maintaining strong forward learning capabilities. Furthermore, we find that VLAs can retain relevant knowledge from prior tasks despite performance degradation during learning new tasks. This knowledge retention enables rapid recovery of seemingly forgotten skills through finetuning. Together, these insights imply that large-scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay.

Abstract:
Token attribution methods provide intuitive explanations for language model outputs by identifying causally important input tokens. However, as modern LLMs increasingly rely on extended reasoning chains, existing schemes face two critical challenges: (1) efficiency bottleneck, where attributing a sequence of |\mathbfS| tokens requires \mathcalO(|\mathbfS|^2) operations, making longcontext attribution prohibitively slow; and (2) faithfulness drop, where intermediate reasoning tokens absorb attribution mass, preventing importance from propagating back to the original input. To address these, we introduce FlashTrace, an efficient multi-token attribution method that employs span-wise aggregation to compute attribution over multi-token targets in a single pass, reducing complexity to \mathcalO(|\mathbfS|). Moreover, we design a recursive attribution mechanism that traces importance through intermediate reasoning chains back to source inputs. Extensive experiments on long-context retrieval (RULER) and multi-step reasoning (MATH, MorehopQA) tasks demonstrate that FlashTrace achieves over 130× speedup over existing baselines while maintaining superior faithfulness. We further analyze the dynamics of recursive attribution, showing that even a single recursive hop substantially improves faithfulness by tracing importance through the reasoning chain.

Abstract:
Recent advances in large image editing models have shifted the paradigm from textdriven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual–text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities, provide both a benchmark and practical defense to advance safe and trustworthy modern image editing systems.

Abstract:
Existing online benchmarks for mobile GUI agents remain largely appcentric and task-homogeneous, failing to reflect the diversity and instability of real-world mobile usage. To this end, we introduce VenusBench-Mobile, a challenging online benchmark for evaluating general-purpose mobile GUI agents under realistic, user-centric conditions. VenusBench-Mobile builds two core evaluation pillars: defining what to evaluate via user-intent-driven task design that reflects real mobile usage, and how to evaluate through a capability-oriented annotation scheme for fine-grained agent behavior analysis. Extensive evaluation of state-of-the-art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real-world deployment. Diagnostic analysis further shows that failures are dominated by deficiencies in perception and memory, which are largely obscured by coarse-grained evaluations. Moreover, even the strongest agents exhibit near-zero success under environment variations, highlighting their brittleness in realistic settings. Based on these insights, we believe VenusBench-Mobile provides an important stepping stone toward robust real-world deployment of mobile GUI agents.

Abstract:
Latent actions serve as an intermediate representation that enables consistent modeling of visionlanguage-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions. Under a unified VLA baseline, we instantiate and compare four representative integration strategies. Our results reveal a formulation-task correspondence: image-based latent actions benefit long-horizon reasoning, whereas action-based latent actions excel at complex motor coordination. Furthermore, we find that directly supervising the VLM with discrete latent action tokens yields the most effective performance. Finally, our experiments offer initial insights into the benefits of latent action supervision in mixed-data, suggesting a promising direction for VLA training.

Abstract:
Standard Transformers excel at semantic modeling but struggle with rigid sequential logic and state tracking. Theoretical work establishes that selfattention is limited to \AC^0 (under hard attention) or \TC^0 (under soft attention), complexity classes that often fail to support robust length generalization on sequential problems without intermediate chain-of-thought \citephahn2020theoretical, merrill2022saturated. In this work, we introduce \emphRational Transductors, a dual-stream architecture that augments the Transformer with a matrix-valued recurrence derived from Weighted Finite Automata (WFA). By injecting rational state information into the attention mechanism via a \emphDeep Rational Injection scheme, our framework strictly generalizes Transformers to capture all Regular Languages, \NC^1-complete problems (such as Boolean Formula Evaluation), and fundamental separations like Parity and Modular Counting, while preserving O(\log T) parallel training efficiency. Theoretical analysis and empirical results demonstrate that Rational Transductors solve the "Regular Gap," enabling robust length generalization on algorithmic tasks where standard Transformers fail, without the sequential computational bottlenecks of traditional RNNs.

Abstract:
Safe deployment of increasingly capable AI agents may require visibility into how they make decisions. Chainof-thought (CoT) monitoring can detect misbehavior in today’s reasoning models, but this “monitorability” may be fragile under different training procedures, data sources, or continued system scaling. We propose three evaluation archetypes (intervention, process, and outcome-property), a new monitorability metric, and a broad evaluation suite. We show CoT monitoring outperforms action-only monitoring in practical settings, and that frontier models are generally—but not perfectly—monitorable. We study scaling trends with pre-training model size and inference-time compute, finding longer CoTs are typically more monitorable. We find that, for a fixed capability level, using a smaller model at higher reasoning effort can yield higher monitorability, at greater inference compute cost. We further find that increasing a weak monitor’s test-time compute when monitoring a strong agent improves monitorability, and giving the monitor access to the CoT both boosts monitorability and steepens the compute–to-monitorability scaling trend. Finally, we show monitorability can be improved by asking follow-up questions and giving the follow-up CoT to the monitor.

Abstract:
We propose a distributional framework for offline Inverse Reinforcement Learning (IRL) that jointly models uncertainty over reward functions and full distributions of returns. Unlike conventional IRL approaches that recover a deterministic reward estimate or match only expected returns, our method captures richer structure in expert behavior, particularly in learning the reward distribution, by minimizing firstorder stochastic dominance (FSD) violations and thus integrating distortion risk measures (DRMs) into policy learning, enabling the recovery of both reward distributions and distribution-aware policies. This formulation is well-suited for behavior analysis and risk-aware imitation learning. Theoretical analysis show that the algorithm converge with \mathcalO(\varepsilon^-2) iteration complexity. Empirical results on synthetic benchmarks, real-world neurobehavioral data, and MuJoCo control tasks demonstrate that our method recovers expressive reward representations and achieves state-of-the-art imitation performance.

Abstract:
Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motioncentric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, we improve both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.

Abstract:
Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many realworld alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose aReward-freeAlignment framework forConflictedObjectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.

Abstract:
How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed amount of parameters, still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute bounded policies and prove that policies which use more compute can solve problems and generalize to longerhorizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set 31 different tasks spanning online and offline RL, we show that (1) this architecture achieves stronger performance simply by using more compute, and (2) stronger generalization on longer-horizon test tasks compared to standard feedforward networks or deep residual network using upto 5 times more parameters.

Abstract:
Recently, the frontier of Large Language Model (LLM) capabilities has shifted from singleturn code generation to agentic software engineering—a paradigm where models autonomously navigate, edit, and test complex repositories. While post-training methods have become the de facto approach for code agents,agentic mid-training—mid-training (MT) on large-scale data that mirrors authentic agentic workflows—remains critically underexplored due to substantial resource requirements, despite offering a more scalable path to instilling foundational agentic behaviors than relying solely on expensive reinforcement learning. A central challenge in realizing effective agentic mid-training is the distribution mismatch between static training data and the dynamic, feedback-rich environment of real development. To address this, we present a systematic study of agentic mid-training, establishing both the data synthesis principles and training methodology for effective agent development at scale. Central to our approach isagent-native data—supervision comprising two complementary types of trajectories:contextually-native trajectoriesthat preserve the complete information flow an agent experiences, offering broad coverage and diversity; andenvironmentally-native trajectoriescollected from executable repositories where observations stem from actual tool invocations and test executions, providing depth and interaction authenticity. We verify the model’s agentic capabilities onSWE-Bench Verified. We demonstrate our superiority over the previous open software engineering mid-training recipeKimi-Devunder two post-training settings with an aligned base model and agentic scaffold, while using less than half mid-training tokens (73.1B). Besides relative advantage, our best performing 32B and 72B models achieve56.1%and58.5%resolution rates, respectively, which are state-of-the-art among open training recipes using agentic scaffolds under their model sizes, despite starting from non-coderQwen2.5-Basebase models. Beyond these agentic capabilities, we also observe performance gains on general code generation and scientific benchmarks. We plan to open-source a significant portion of our datasets, recipes, and model checkpoints—resources representing substantial computational investment typically unavailable to the broader community—to facilitate further research in this underexplored paradigm.

Abstract:
Diffusion models have shown remarkable empirical success in sampling from rich multimodal distributions. Their inference relies on numerically solving a certain differential equation. This differential equation cannot be solved in closed form, and its resolution via discretization typically requires many small iterations to produce \emphhigh-quality samples. More precisely, prior works have shown that the iteration complexity of discretization methods for diffusion models scales polynomially in the ambient dimension and the inverse accuracy 1/\varepsilon. In this work, we propose a new solver for diffusion models relying on a subtle interplay between low-degree approximation and the collocation method, and we prove that its iteration complexity scales polylogarithmically in 1/\varepsilon, yielding the first "high-accuracy" guarantee for a diffusion-based sampler that only uses (approximate) access to the scores of the data distribution. In addition, our bound does not depend explicitly on the ambient dimension; more precisely, the dimension affects the complexity of our solver only through the effective radius of the support of the target distribution.

Abstract:
Prediction sets can wrap around any ML model to cover unknown test outcomes with a guaranteed probability. Yet, it remains unclear how to use them optimally for downstream decisionmaking. Here, we propose a decision-theoretic framework that seeks to minimize the expected loss (risk) against a worst-case distribution consistent with the prediction set's coverage guarantee. We first characterize the minimax optimal policy for a fixed prediction set, showing that it balances the worst-case loss inside the set with a penalty for potential losses outside the set. Building on this, we derive the optimal prediction set construction that minimizes the resulting robust risk subject to a coverage constraint. Finally, we introduce Risk-Optimal Conformal Prediction (ROCP), a practical algorithm that targets these risk-minimizing sets while maintaining finite-sample distribution-free marginal coverage. Empirical evaluations on medical diagnosis and safety-critical decision-making tasks demonstrate that ROCP reduces critical mistakes compared to baselines, particularly when out-of-set errors are costly.

Abstract:
Compositional explanations are a family of methods that aim to describe the spatial alignment between neurons' receptive field activations and concepts through logical rules, typically computed via a search over all possible concept combinations. Since computing the spatial alignment over the entire state space is computationally infeasible, the literature commonly adopts beam search to restrict the space. However, beam search cannot provide any theoretical guarantees of optimality, and it remains unclear how close current explanations are to the true optimum. In this theoretical paper, we address this gap by introducing the first framework for computing guaranteed optimal compositional explanations. Specifically, we propose: (i) a decomposition that identifies the factors influencing the spatial alignment, (ii) a heuristic to estimate the alignment at any stage of the search, and (iii) the first algorithm that can compute optimal compositional explanations within a feasible time. Using this framework, we demonstrate that 1040% of explanations previously obtained with beam search are suboptimal when overlapping concepts are involved. Finally, we evaluate a beam-search variant guided by our proposed decomposition and heuristic, showing that it matches or improves runtime over prior methods while offering greater flexibility in hyperparameters and computational resources.

Abstract:
As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate groundtruth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as thepost-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion ofcritique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which preserves evaluation integrity beyond full comprehension of the task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve challenging tasks and to generate difficult yet solvable questions. We showcase the effectiveness of our method in the mathematical domain across eight frontier LLMs, showing that the resulting scores are stable and correlate with external capability measures. Our framework reformulates benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.

Abstract:
Hybrid sequence models—combining Transformer and statespace model layers—seek to gain the expressive versatility of attention as well as the computational efficiency of state-space model layers. Despite burgeoning interest in hybrid models, we lack a basic understanding of the settings where—and underlying mechanisms through which—they offer benefits over their constituent models. In this paper, we study this question, focusing on a broad family of core synthetic tasks. For this family of tasks, we prove the existence of fundamental limitations for non-hybrid models. Specifically, any Transformer or state-space model that solves the underlying task requires either a large number of parameters or a large working memory. On the other hand, for two prototypical tasks within this family—namely selective copying and associative recall—we construct hybrid models of small size and working memory that provably solve these tasks, thus achieving the best of both worlds. Our experimental evaluation empirically validates our theoretical findings. Importantly, going beyond the settings in our theoretical analysis, we empirically show that learned—rather than constructed—hybrids outperform non-hybrid models with up to 6 × as many parameters. We additionally demonstrate that hybrid models exhibit stronger length generalization and out-of-distribution robustness than non-hybrids.

Abstract:
Midtraining, the practice of mixing specialized data with more general pretraining data in an intermediate training phase, has become widespread in language model development, yet there is little understanding of what makes it effective. We propose that midtraining functions as distributional bridging by providing better initialization for posttraining. We conduct controlled pretraining experiments, and find that midtraining benefits are largest for domains distant from general pretraining data, such as code and math, and scale with the proximity advantage the midtraining data provides toward the target distribution. In these domains, midtraining consistently outperforms continued pretraining on specialized data alone both indomain and in terms of mitigating forgetting. We further conduct an investigation on the starting time and mixture weight of midtraining data, using code as a case study, and find that time of introduction and mixture weight interact strongly such that early introduction of specialized data is amenable to high mixture weights, while late introduction requires lower ones. This suggests that late introduction of specialized data outside a plasticity window cannot be compensated for by increasing data mixtures later in training. Beyond midtraining itself, this suggests that distributional transitions between any training phases may benefit from similar bridging strategies.

Abstract:
Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization (\boldsymbol\muP) provides a theoretical safeguard for widthinvariant \Theta(1) activation control, whereas emerging optimizers like Muon are only "half-aligned" with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the Spectral Sphere Optimizer (SSO), which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully \boldsymbol\muP-aligned optimization process. To enable large‑scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.

Abstract:
Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speedup training by using gradient corrections. In their decade of existence, these methods have never been connected to any Bayesian methods, at least not at a fundamental level. Here, we fill this gap and show surprising new connections of SVRG to a recently proposed Bayesian method called ‘posterior correction’. Our main contribution is to show that SVRG can be recovered as a special case of posterior correction when applied over isotropic-Gaussian posteriors. Novel extensions of SVRG are automatically obtained by using more flexible exponential-family posteriors. We derive two new such extensions by using Gaussian families: a Newton-like variant with novel Hessian corrections, and an Adam-like extension that scales to large problems. Our work is the first to connect SVRG to Bayes and use it to boost training.

Abstract:
Bayesian filtering is a wellknown problem that aims to estimate plausible states of a dynamical system from observations. Among existing approaches to solve this problem, particles filters are theoretically exact for non-linear dynamics and observations, but suffer from poor scalability in high dimensions. In this work, we show that diffusion-based emulators of dynamical systems can be used to implement, without additional training, an optimal variant of particle filters that has remained largely unexplored due to implementation challenges with classical numerical solvers. Experiments on nonlinear chaotic systems, including atmospheric dynamics, demonstrate that the proposed approach successfully scales particle filtering to high-dimensional settings.

Abstract:
Applying differential privacy (DP) via DPSGD to Low-Rank Adaptation (LoRA) is a natural approach for privacy-preserving fine-tuning. However, applying DP-SGD to LoRA poses a fundamental challenge due to its low-rank parameterization. In LoRA, each trainable update is represented as a low-rank matrix Z=AB^\top, but this factorization is non-identifiable. As a result, applying DP-SGD directly to factors (A,B) induces gauge-dependent perturbations on Z, leading to uncontrolled noise amplification. We propose PRISM, an intrinsic DP mechanism for LoRA that is gauge invariant by construction, avoids bilinear noise amplification, and admits an efficient low-dimensional noise sampler. Moreover, PRISM yields a closed-form characterization for the effective intrinsic noise on Z, and enables stable privacy–utility trade-offs by being gauge invariant and keeping noise amplification bounded. We further characterize the noise amplification incurred by naive DP-LoRA and show that it can be unbounded, establish standard (\varepsilon,\delta)-DP guarantees for PRISM, and introduce a DP-aware, gauge-invariant adaptive update that avoids amplifying injected privacy noise under adaptive optimization, improving numerical stability in practice.

Abstract:
What does it mean to understand the world? Is it simply to predict future video frames? Developmental cognitive science suggests that understanding the world is fundamentally the process of constructing internal theories of how it works rather than mere prediction, even before language is acquired. However, in machine learning, it remains unclear how to endow AI systems with such theorybuilding capability from raw, non-textual observation alone. In this paper, we introduce Learning-to-Theorize (L2T), a learning paradigm in which an AI system acquires the ability to construct theories represented as executable programs directly from observation alone. To instantiate this paradigm, we propose the Neural Language-of-Thought Programmer, a neural model that induces and executes latent programs as explanations rather than task-specific predictors or policies. In experiments, we show that this formulation enables explanation-driven generalization, allowing observations to be understood in terms of the programs that generate them.

Abstract:
Functional ANOVA offers a principled framework for interpretability by decomposing a model’s prediction into main effects and higherorder interactions. For independent features, this decomposition is well-defined, strongly linked with SHAP values, and serves as a cornerstone of additive explainability. However, the lack of an explicit closed-form expression for general dependent distributions has forced practitioners to rely on costly sampling-based approximations. We completely resolve this limitation for categorical inputs. By bridging functional analysis with the extension of discrete Fourier analysis, we derive a closed-form decomposition without any assumption. Our formulation is computationally very efficient. It seamlessly recovers the classical independent case and extends to arbitrary dependence structures, including distributions with non-rectangular support. Furthermore, leveraging the intrinsic link between SHAP and ANOVA under independence, our framework yields a natural generalization of SHAP values for the general categorical setting.

Abstract:
Large Reasoning Models (LRMs) solve complex tasks by generating long Chainof-Thought (CoT) sequences; however, the emergent dynamics governing reasoning trajectories are not well understood and can lead to inconsistencies and reasoning pathologies. In this work, we propose to approximate LRM's emerging hierarchical reasoning dynamics as a trajectory within a Finite State Machine (FSM) transitioning among six abstract cognitive states. We demonstrate that these states and transitions can be captured in the latent state of the model. We believe that this representation can have different applications in the interpretability and optimization of LRM models. For example, by analyzing the topology of these transitions, we identify statistical shifts in reasoning strategies that help identify effective reasoning chains from those that fail. To illustrate these potential advantages, we propose Q-Value guided steering, a training-free inference-time control method that treats reasoning as a planning problem. We estimate the long-horizon utility of state transitions and apply sparse, orthogonal activation steering at sentence boundaries to align the CoT generation with optimal reasoning policies. Experiments across four benchmarks (AIME25, MATH-500, GSM8k, and GPQA Diamond) using three state-of-the-art open reasoning models demonstrate that Q-Value steering policy achieves significant performance gains with "surgical'' efficiency, often requiring 25× fewer interventions than greedy and weighted baselines, which suggests that reasoning can be effectively controlled by guiding high-level cognitive dynamics rather than micro-managing token generation.

Abstract:
In recent years, there has been growing interest in understanding neural architectures' ability to learn to execute discrete algorithms, a line of work often referred to as neural algorithmic reasoning. The goal is to integrate algorithmic reasoning capabilities into larger neural pipelines. Many such architectures are based on (messagepassing) graph neural networks (MPNNs), owing to their permutation equivariance and ability to deal with sparsity and variable-sized inputs. However, existing work is either largely empirical and lacks formal guarantees or it focuses solely on expressivity, leaving open the question of when and how such architectures generalize beyond a finite training set. In this work, we propose a general theoretical framework that characterizes the necessary conditions under which MPNNs can learn an algorithm from a training set of small instances and provably approximate its behavior on inputs of arbitrary size. Our framework applies to a broad class of algorithms, including single-source shortest paths, minimum spanning trees, and general dynamic programming problems, such as the 0-1 knapsack problem. In addition, we establish impossibility results for a wide range of algorithmic tasks, showing that standard MPNN cannot compute them. We derive more expressive MPNN-like architectures that overcome these limitations. Finally, we refine our analysis for the Bellman–Ford algorithm, yielding substantially smaller required training sets and significantly extending the recent work of Nerem et al., 2025 by allowing for a differentiable regularization loss. Empirical results largely support our theoretical findings.

Abstract:
Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One critical design aspect of dLLMs is the sampling procedure that selects which tokens to unmask at each diffusion step. Indeed, recent work has found that heuristic strategies such as confidence thresholding improve both sample quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger block sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy based on a singlelayer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive (block) generation, while outperforming them in the full-diffusion setting.

Abstract:
Being probabilistic models, during inference large language models (LLMs) displayrare events: behaviour that is far from typical but highly significant. By definition all rare events are hard to see, but the enormous scale of LLM usage means that events completely unobserved during development are likely to become prominent in deployment. Here we present an endto-end framework for the systematic analysis of rare events in LLMs. We provide a practical implementation spanning theory, efficient generation strategies, probability estimation and error analysis, which we illustrate with concrete examples. We outline extensions and applications to other models and contexts, highlighting the generality of the concepts and techniques presented here.

Abstract:
We introduce Unsupervised Partner Design (UPD), a populationfree multi-agent reinforcement learning method for robust ad-hoc teamwork. UPD generates training partners on-the-fly and selects them adaptively based on a learnability criterion, removing the need for pre-trained partner populations or manual parameter tuning. We show that this simple mechanism enables effective partner diversity and can be extended to joint partner-environment selection when a procedural level generator is available. Across Level-Based Foraging, Overcooked-AI, and the Overcooked Generalisation Challenge, UPD consistently outperforms both population-based and population-free baselines. In a human-AI user study, agents trained with UPD achieve higher returns and are rated as more adaptive, more human-like, and less frustrating than existing approaches.

Abstract:
We present protein autoregressive modeling (PAR), the first multiscale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Using the hierarchical nature of proteins, PAR generates structures that mimic sculpting a statue, forming a coarse topology and refining structural details over scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Moreover, autoregressive models suffer from exposure bias, caused by the training and the generation procedure mismatch, and substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without requiring fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions and produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.

Abstract:
Current computeraided synthesis planning (CASP) methods often treat retrosynthesis as solved once a single feasible route is identified, focusing primarily on convergence or shortest-path metrics. This view is misaligned with real-world practice, where chemists must balance competing objectives such as cost, sustainability, toxicity, and overall yield. To address this, we formulate synthesis planning as a multi-objective search problem and introduce MORetro^\ast, an algorithm that generates a Pareto front of synthesis routes to explicitly capture trade-offs between user-defined criteria. MORetro^\ast uses weighted scalarization and solution-informed sampling to efficiently navigate the combinatorial search space and prioritize promising trade-offs. Building on multi-objective A^\ast-search, we provide optimality guarantees showing that, for a fixed single-step model, MORetro^\ast recovers the true Pareto front. Across multiple retrosynthesis benchmarks, MORetro^\ast produces diverse, high-quality Pareto fronts, uncovering solutions overlooked by single-objective approaches and better aligning CASP outputs with industrial decision-making.

Abstract:
We consider the problem of jointly learning a onedimensional projection and a univariate function in high-dimensional Gaussian models. Specifically, we study predictors of the form f(x)=\varphi^\star(\langle w^\star, x \rangle), where both the direction w^\star \in \mathcalS_d-1, the sphere of \mathbbR^d, and the function \varphi^\star: \mathbbR \to \mathbbR are learned from Gaussian data. This setting captures a fundamental non-convex problem at the intersection of representation learning and nonlinear regression. We analyze the gradient flow dynamics of a natural alternating scheme and prove convergence, with a rate controlled by the information exponent reflecting the Gaussian regularity of the function \varphi^\star. Strikingly, our analysis shows that convergence still occurs even when the initial direction is negatively correlated with the target. On the practical side, we demonstrate that such joint learning can be effectively implemented using a Reproducing Kernel Hilbert Space (RKHS) adapted to the structure of the problem, enabling efficient and flexible estimation of the univariate function. Our results offer both theoretical insight and practical methodology for learning low-dimensional structure in high-dimensional settings.

Abstract:
Highfidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This enables a reinforcement learning policy to simulate ''what-if'' sensing scenarios within a latent imagination space. By conditioning sensor movements on predicted latent states, LASER navigates toward potentially high-information regions beyond current observations. Our experiments demonstrate that LASER consistently outperforms static and offline-optimized strategies, achieving high-fidelity reconstruction under sparsity across diverse continuum fields.

Abstract:
Estimating probability density and its score from samples remains a core problem in generative modeling, Bayesian inference, and kinetic theory. Existing methods are bifurcated: classical kernel density estimators (KDE) generalize across distributions but suffer from the curse of dimensionality, while modern neural score models achieve high precision but require retraining for every target distribution. We introduce DiScoFormer (Density and Score Transformer), a ``trainonce, infer-anywhere" equivariant Transformer that maps i.i.d. samples to both density values and score vectors, generalizing across distributions and sample sizes. Analytically, we prove that self-attention can recover normalized KDE, establishing it as a functional generalization of kernel methods; empirically, individual attention heads learn multi-scale, kernel-like behaviors. The model converges faster and achieves higher precision than KDE for density estimation, and provides a high-fidelity plug-in score oracle for score-debiased KDE, Fisher information computation, and Fokker-Planck-type PDEs.

Abstract:
Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. We tackle this problem with Reparameterized Orthogonal Equivalence Training (POET), a spectrumpreserving framework that optimizes each weight matrix through orthogonal equivalence transformation. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, and in contrast, standard optimizers such as AdamW run out of memory under the same settings.

Abstract:
Existing benchmarks for conversational AI agents simulate singlecontrol environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce \tau^2-bench, with four key contributions: 1. A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2. A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3. A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4. fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, \tau^2-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.

Abstract:
The increasing impact of climate change and extreme weather events has spurred growing interest in deep learning for weather research. However, existing studies often rely on weather data in pixel space, which presents several challenges such as smooth outputs in model outputs, limited applicability to a single pressurevariable subset (PVS), and high data storage and computational costs. To address these challenges, we propose a novel Weather Latent Autoencoder (WLA) that transforms weather data from pixel space to latent space, enabling efficient data representation. By decoupling weather reconstruction from downstream tasks, WLA improves the accuracy and sharpness of weather task model results. The incorporated Pressure-Variable Unified Module transforms multiple PVS into a unified representation, enhancing the adaptability of the model in multiple weather scenarios. Furthermore, weather tasks can be performed in a low-storage latent space of WLA rather than a high-storage pixel space, thus significantly reducing data storage and computational costs. Through extensive experimentation, we demonstrate its superior compression and reconstruction performance, enabling the creation of the ERA5-Latent dataset with unified representations of multiple PVS from ERA5 data. The compressed full PVS in the ERA5-Latent dataset reduces the original 244.34 TB of data to 0.43 TB. The downstream task further demonstrates that task models can apply to multiple PVS with low data costs in latent space and achieve superior performance compared to models in pixel space.

Abstract:
We present algorithms for diffusion model sampling which obtain \deltaerror in \mathrmpolylog(1/\delta) steps, given access to \widetilde O(\delta)-accurate score estimates in L^2. This is an exponential improvement over all previous results. Specifically, under minimal data assumptions, the complexity is \widetilde O(d\mathrmpolylog(1/\delta)) where d is the dimension of the data; under a non-uniform L-Lipschitz condition, the complexity is \widetilde O(\sqrtdL\mathrmpolylog(1/\delta)); and if the data distribution has intrinsic dimension d_\star, then the complexity reduces to \widetilde O(d_\star\mathrmpolylog(1/\delta)). Our approach also yields the first \mathrmpolylog(1/\delta) complexity sampler for general log-concave distributions using only gradient evaluations.

Abstract:
Deep research agents perform multistep research to produce long-form, well-attributed answers. However, most open deep research agents are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards, which does not extend to realistic long-form tasks. We address this withReinforcement Learning with Evolving Rubrics (RLER), where rubrics are constructed and maintained toco-evolvewith the policy model during training. This allows the rubrics to incorporate newly explored information from search and contrasting model responses, enabling better fact checking and more discriminative on-policy feedback. Using RLER, we developDeep Research Tulu (DR Tulu-8B), the first fully open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare, and general domains, DR Tulu-8B substantially outperforms existing open deep research agents (by 15.6% over Tongyi DR on average) and matches or exceeds proprietary deep research agents (by 0.7% over OpenAI DR on average), while being significantly smaller and cheaper per query (1000x cheaper than OpenAI DR per query).

Abstract:
The lowrank adaptation (LoRA) algorithm for fine-tuning large models has grown popular in recent years due to its remarkable performance and low computational requirements. LoRA trains two "adapter" matrices that form a low-rank representation of the model parameters, thereby massively reducing the number of parameters that need to be updated at every step. Although LoRA is simple, its convergence is poorly understood due to the lack of Lipschitz smoothness, a key condition for classic convergence analyses. As a result, current theoretical results only consider asymptotic behavior or assume strong boundedness conditions which artificially enforce Lipschitz smoothness. In this work, we provide for the first time a non-asymptotic convergence analysis of the original LoRA gradient descent algorithm, which reflects widespread practice, without such assumptions. Our work relies on three key steps: i) reformulating the problem in terms of the outer product of the stacked adapter matrices, ii) a modified descent lemma for the "Lipschitz-like" reparametrized function, and iii) controlling the step size. With this approach, we prove that LoRA gradient descent converges to a stationary point at rate O(\frac1\log T), where T is the number of iterations. We conduct numerical experiments to validate our theoretical findings.

Abstract:
We studygrokkingthe onset of generalization long after overfitting - in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the "grokking time") in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.

Abstract:
Large Language Models (LLMs) increasingly act as autonomous agents with tool use, ability to execute code, file I/O, and network access. These capabilities create novel security risks. To mitigate these risks, agents are often deployed and evaluated in isolated environments commonly referred to as sandboxes, with Docker or OCI as one of the most popular container runtimes for sandbox implementations. We introduce SandboxEscapeBench, an open benchmark that safely measures an LLM's capacity to break out of these sandboxes. The benchmark is implemented as an \textttInspect AI Capture the Flag (CTF) evaluation utilising a nested sandbox architecture with the outer layer containing the flag and no known vulnerabilities. Following a threat model of a motivated adversarial agent with shell access inside a container, \bench covers a spectrum of sandboxescape mechanisms spanning misconfiguration, privilege allocation mistakes, kernel flaws, and runtime/orchestration weaknesses. We find that, when vulnerabilities are added, LLMs are able to identify and exploit them, showing that use of evaluation like \bench is needed to ensure sandboxing continues to provide the encapsulation needed for highly-capable models.

Abstract:
Collaborative machine learning involves training highquality models using datasets from a number of sources. To incentivize sources to share data, existing data valuation methods fairly reward each source based on its data submitted as is. However, as these methods do not verify nor incentivize data truthfulness, the sources can manipulate their data (e.g., by submitting duplicated or noisy data) to artificially increase their valuations and rewards or prevent others from benefiting. This paper presents the first mechanism that provably ensures (F) collaborative fairness and incentivizes (T) truthfulness at equilibrium for Bayesian models. Our mechanism combines semivalues (e.g., Shapley value), which ensure fairness, and a truthful data valuation function (DVF) based on a validation set that is unknown to the sources. As semivalues are influenced by others' data, we introduce an additional condition to prove that a source can maximize its expected data values in coalitions and semivalues by submitting a dataset that captures its true knowledge. Additionally, we discuss the implications and suitable relaxations of (F) and (T) when the mediator has a limited budget for rewards or lacks a validation set. Our theoretical findings are validated on synthetic and real-world datasets.

Abstract:
The Edge of Stability (EoS) is a phenomenon where the sharpness (largest eigenvalue) of the Hessian converges to 2/\eta during training with gradient descent (GD) with a stepsize \eta. Despite violating classical smoothness assumptions, EoS has been widely observed in deep learning, but its theoretical foundations remain incomplete. We propose a framework for analyzing EoS of non-Euclidean GD using directional smoothness (Mishkin et al., 2024), which naturally extends to non-Euclidean norms. This approach allows us to characterize EoS beyond the standard Euclidean setting, encompassing methods such as \ell_\infty-descent, Block CD, Spectral GD, and Muon without momentum. We derive the appropriate measure of the generalized sharpness under an arbitrary norm. Our generalized sharpness measure includes previously studied vanilla GD and preconditioned GD as special cases. Through analytical results and experiments on neural networks, we show that non-Euclidean GD also exhibits progressive sharpening followed by oscillations around the threshold 2/\eta. Practically, our framework provides a single, geometry-aware spectral measure that works across optimizers, bridging the gap between empirical observations and deep learning theory.

Abstract:
Logic provides a controlled testbed for evaluating LLMbased reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for \emph2-SAT built from parameterized families of structured 2--CNF formulas, where satisfiability is characterized by the implication graph and can be tuned along interpretable axes. Our generators isolate distinct competencies and failure modes: (i) contradiction-cycle UNSAT cores with controllable size and imbalance, (ii) SAT instances with a prescribed fraction of free variables to control solution multiplicity, (iii) planted backbones that modulate propagation, (iv) late bridge clauses that couple otherwise monotone regions to probe sensitivity to ordering and revision, and (v) symmetry/duplication variants that test abstraction under renaming and redundant structure. We evaluate LLM-based reasoners on decision accuracy and assignment validity, and quantify robustness under semantics-preserving perturbations such as clause reordering, filler clauses, and variable renaming. Across models, we observe sharp performance transitions under targeted structural interventions even when surface statistics are held fixed, revealing brittleness regimes that are invisible to aggregate SAT accuracy.

Abstract:
Sparse sketches such as the sparse Johnson–Lindenstrauss transform are a core primitive in randomized numerical linear algebra because they leverage random sparsity to reduce the arithmetic cost of sketching, while still offering strong approximation guarantees. Their random sparsity, however, is at odds with efficient implementations on modern GPUs, since it leads to irregular memory access patterns that degrade memory bandwidth utilization. Motivated by this tension, we pursue a sketch–kernel codesign approach: we design a new family of sparse sketches, BlockPerm-SJLT, whose sparsity structure is chosen to enable FlashSketch, a corresponding optimized CUDA kernel that implements these sketches efficiently. The design of BlockPerm-SJLT introduces a tunable parameter that explicitly trades off the tension between GPU-efficiency and sketching robustness. We provide theoretical guarantees for BlockPerm-SJLT under the oblivious subspace embedding (OSE) framework, and also analyze the effect of the tunable parameter on sketching quality. We empirically evaluate FlashSketch on standard RandNLA benchmarks, as well as an end-to-end ML data attribution pipeline called GraSS. FlashSketch pushes the Pareto frontier of sketching quality versus speed, across a range of regimes and tasks, and achieves a global geomean speedup of roughly 1.7 × over the prior state-of-the-art GPU sketches.

Abstract:
Stateof-the-art large language models require specialized hardware and substantial energy to operate. Consequently, cloud-based services that provide access to these models have become very popular. In these services, the price users pay depends on the number of tokens a model uses to generate an output–they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we develop an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion. Crucially, the cost of running the algorithm is lower than the additional revenue from overcharging users, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, we show that, to eliminate the financial incentive to strategize, a pricing mechanism must price tokens linearly on their character count. While this makes a provider's profit margin vary across tokens, we introduce a simple prescription that allows a provider to maintain their average profit margin when transitioning to an incentive-compatible pricing mechanism. To complement our theoretical results, we conduct experiments with large language models from the \textttLlama, \textttGemma and \textttMinistral families, and prompts from a popular benchmarking platform.

Abstract:
Pretraining directly on webscale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a means to ease the subsequent acquisition of rich semantic knowledge, much like humans learn simple logic and mathematics before higher reasoning. We specifically focus onprocedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, on context recall (Needle-in-a-haystack), the accuracy jumps from 10 to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, thisprocedural pretrainingenables the models to reach the same loss value with only 55, 67, 86% of the original data. Third, we explore the mechanisms behind and find that procedural pretraining instils non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means to improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.

Abstract:
Transformerbased models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus–dilution cycle in attention learning and provide a rigorous explanation in a one-layer Transformer setting for Markovian data via gradient-flow analysis. Using stage-wise linearization around critical points, we show that a single focus–dilution cycle can be decomposed into a sequence of distinct stages. First, embedding and projection rapidly condense to a rank-one structure, while attention parameters remain effectively frozen. Then, the attention parameters begin to increase, inducing a frequency-driven focus toward high-frequency tokens. As attention continues to evolve, it generates next-order perturbations in embeddings, leading to a mass-redistribution mechanism that progressively dilutes this focus. Finally, small asymmetries among low-frequency tokens lift a degenerate critical point, opening new embedding directions and initiating the next cycle. Experiments on synthetic Markovian data as well as WikiText and TinyStories corroborate the predicted stages and cyclical dynamics.

Abstract:
For deploying foundation models, practitioners increasingly need prescriptive scaling laws: given a pretraining compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations with 5k observational and 2k newly sampled data on model performance, we estimate capability boundaries—high conditional quantiles of benchmark scores as a function of log pre-training FLOPs, via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate the temporal reliability by fitting on earlier model generations and evaluating on later releases. Across various tasks, the estimated boundaries are mostly stable, with the exception of math reasoning that exhibits a consistently advancing boundary over time. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce an efficient algorithm that recovers near-full-data frontiers using roughly 20% of evaluation budget. Together, our work releases the Proteus-2k, the latest model performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries move.

Abstract:
Memory is critical for longhorizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits their systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the \pi_0.5 backbone to systematically explore different memory representations across multiple integration strategies. We show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found in https://anonymtest1.github.io

Abstract:
Deep generative models based on neural differential equations have quickly become the stateof-the-art for numerous generation tasks across many different applications. These models rely on ODE/SDE solvers which integrate from a prior distribution to the data distribution. In many applications it is highly desirable to then integrate in the other direction. The standard solvers, however, accumulate discretization errors which don’t align with the forward trajectory, thereby prohibiting an exact inversion. In applications where the precision of the generative model is paramount this inaccuracy in inversion is often unacceptable. Current approaches to solving the inversion of these models results in significant downstream issues with poor stability and low-order of convergence; moreover, they are strictly limited to the ODE domain. In this work, we propose a new family of reversible exponential (stochastic) Runge-Kutta solvers which we refer to as Rex developed by an application of Lawson methods to convert any explicit (stochastic) Runge-Kutta scheme into a reversible one. In addition to a rigorous theoretical analysis of the proposed solvers, we also empirically demonstrate the utility of Rex on improving the sample of Boltzmann distributions with flow models, and improving image generation and editing capabilities with diffusion models.

Abstract:
Agents powered by advanced large language models (LLMs) have demonstrated impressive capabilities across diverse complex applications. Recently, MultiAgent Systems (MAS), wherein multiple agents collaborate and communicate with each other, have exhibited enhanced capabilities in complex tasks, such as high-quality code generation and arithmetic reasoning. However, the development of such systems often relies on handcrafted methods, and the literature on systematic design and optimization of LLM-based MAS remains limited. In this work, we introduce OMAC, a general framework designed for holistic optimization of LLM-based MAS. Specifically, we identify five key optimization dimensions for MAS, encompassing both agent functionality and collaboration structure. Building upon these dimensions, we first propose a general algorithm, utilizing two actors termed the Semantic Initializer and the Contrastive Comparator, to optimize any single dimension. Then, we present an algorithm for joint optimization across multiple dimensions. Extensive experiments demonstrate the superior performance of OMAC on diverse tasks against recent approaches. Codes are available at: https://anonymous.4open.science/r/OMAC-Sub-3FF8.

Abstract:
As large language models (LLMs) are adopted into frameworks that grant them the capacity to make real decisions, it is increasingly important to ensure that they are unbiased. In this paper, we argue that the predominant approach of simply removing existing biases from models is not enough. Using a paradigm from the psychology literature, we demonstrate that LLMs can spontaneously develop novel social biases about artificial demographic groups even when no inherent differences exist. These biases result in highly stratified task allocations, which are less fair than assignments by human participants and are exacerbated by newer and larger models. In social science, emergent biases like these have been shown to result from explorationexploitation trade-offs, where the decision-maker explores too little, allowing early observations to strongly influence impressions about entire demographic groups. To alleviate this effect, we examine a series of interventions targeting model inputs, problem structure, and explicit steering. We find that explicitly incentivizing exploration most robustly reduces stratification, highlighting the need for better multifaceted objectives to mitigate bias. These results reveal that LLMs are not merely passive mirrors of human social biases, but can actively create new ones from experience, raising urgent questions about how these systems will shape societies over time.

Abstract:
Diffusion models trained on different, nonoverlapping subsets of a dataset often produce strikingly similar outputs when given the same noise seed. We trace this consistency to a simple linear effect: the shared Gaussian statistics across splits already predict much of the generated images. To formalize this, we develop a random matrix theory (RMT) framework that quantifies how finite datasets shape the expectation and variance of the learned denoiser and sampling map in the linear setting. For expectations, sampling variability acts as a renormalization of the noise level through a self-consistent relation \sigma^2\to\kappa(\sigma^2), explaining why limited data overshrink low-variance directions and pull samples toward the dataset mean. For fluctuations, our variance formulas reveal three key factors behind cross-split disagreement: anisotropy across eigenmodes, inhomogeneity across inputs, and overall scaling with dataset size. Extending deterministic-equivalence tools to fractional matrix powers further allows us to analyze entire sampling trajectories. The theory sharply predicts the behavior of linear diffusion models, and we validate its predictions on UNet and DiT architectures in their non-memorization regime, identifying where and how samples deviates across training data split. This provides a principled baseline for reproducibility in diffusion training, linking spectral properties of data to the stability of generative outputs.

Abstract:
We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach. For tabular Markov decision processes (MDPs), we derive a regret bound of \tilde\mathcalO(H \sqrtD_\max SAK), where S and A are the cardinalities of the state and action spaces, H is the time horizon, K is the number of episodes, and D_\max is the maximum length of the delay. We also provide a matching lower bound up to logarithmic factors, showing the optimality of our approach. Our analytical framework formulates this problem as a special case of a broader class of MDPs, where their transition dynamics decompose into a known component and an unknown but structured component. We establish general results for this abstract setting, which may be of independent interest.

Abstract:
Recent research has established that the impact of context in a vanilla transformer can be represented implicitly by forming a tokendependent, rank-1 patch to its MLP weights. This work extends that foundational theory to the diverse architectures of modern Large Language Models. We first demonstrate a precise, analytical solution for a Gemma-style transformer block, proving that the entire effect of a context can be perfectly mapped to rank-1 patches on its MLP weight matrices and a patch to the RMSNorm scale. We then generalize this result, providing a constructive proof and algorithm for multi-layer models. To unify these findings, we introduce a general framework centered on two core properties: input controllability and output controllability. We prove that a perfect implicit weight patch is possible for any MLP block where the inner function is input-controllable and the outer function is output-controllable. This provides a simpler and more powerful lens for understanding how transformer models transmute prompts into effective weights. This setup generalizes to a wide range of modern LLM architectures including gating, pre-/post-norm, mixture of experts and sequential/parallel transformer blocks.

Abstract:
Policies learned via continuous actorcritic methods often exhibit erratic, high-frequency oscillations, making them unsuitable for physical deployment. Current approaches attempt to enforce smoothness by directly regularizing the policy's output. We argue that this approach treats the symptom rather than the cause. In this work, we theoretically establish that policy non-smoothness is fundamentally governed by the differential geometry of the critic. By applying implicit differentiation to the actor-critic objective, we prove that the sensitivity of the optimal policy is bounded by the ratio of the Q-function's mixed-partial derivative (noise sensitivity) to its action-space curvature (signal distinctness). To empirically validate this theoretical insight, we introduce PAVE (Policy-Aware Value-field Equalization), a critic-centric regularization framework that treats the critic as a scalar field and stabilizes its induced action-gradient field. PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature. Experimental results demonstrate that PAVE achieves smoothness and robustness comparable to policy-side smoothness regularization methods, while maintaining competitive task performance, without modifying the actor.

Abstract:
Automatic differentiation (AD) frameworks such as JAX and PyTorch have enabled gradientbased optimization for a wide range of scientific fields. Yet, many ''hard'' primitives in these libraries such as thresholding, Boolean logic, discrete indexing, and sorting operations yield zero or undefined gradients that are not useful for optimization. While numerous ''soft'' relaxations have been proposed that provide informative gradients, the respective implementations are fragmented across projects, making them difficult to combine and compare. This work introducesSoftJAXandSoftTorch, open-source, feature-complete libraries forsoft differentiable programming. These libraries provide a variety of soft functions as drop-in replacements for their hard JAX and PyTorch counterparts. This includes (i) elementwise operators such ascliporabs, (ii) utility methods for manipulating Booleans and indices via fuzzy logic, (iii) axiswise operators such assortorrank-- based on optimal transport or permutahedron projections, and (iv) offer full support for straight-through gradient estimation. Overall, SoftJAX and SoftTorch make the toolbox of soft relaxations easily accessible to differentiable programming, as demonstrated through benchmarking and a practical case study.

Abstract:
Speculative Decodin promises to accelerate Large Language Model inference, yet its efficacy often degrades in productiongrade scenarios. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step efficiency. Extensive evaluations across diverse model scales—particularly the industrial-grade Qwen3-235B—demonstrate that ECHO consistently outperforms state-of-the-art baselines in both low-load and high-load scenarios, achieving up to 5.35× walltime speedup and delivering over 20% relative speedup gain against the strongest baselines.

Abstract:
Scaling inferencetime computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but their inherently sequential decoding incurs substantial latency, motivating parallelization of the generation process. However, existing parallel reasoning approaches suffer from performance degradation compared to their sequential counterparts, and often rely on specialized inference engines. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that matches the accuracy of comparably sized sequential reasoning models while significantly reducing inference latency via three key innovations: 1) a two-stage parallel trajectory generator that produces high-quality parallel chain-of-thought data for supervised fine-tuning; 2) a trie-based rollout design that enables parallel reasoning on any off-the-shelf autoregressive inference engine; and 3) a parallelization-aware reinforcement learning framework that trains the model to balance reasoning accuracy with effective parallelization. Across six challenging math reasoning benchmarks, ThreadWeaver trained on top of Qwen3-8B achieves performance on par with cutting-edge sequential reasoning models (79.9% on AIME24 and 71.9% on average) while delivering up to 1.53x speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.

Abstract:
Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose Bayesian NonNegative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into Bradley–Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.

Abstract:
Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQAtrained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Our best AOs match or exceed white-box baselines on all four tasks and the best overall baseline on 3 of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.

Abstract:
Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap via a multiagent workflow to translate jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF features three core components: (i) JBF-LIB for shared contracts and reusable utilities; (ii) JBF-FORGE for the multi-agent paper-to-module translation; and (iii) JBF-EVAL for standardizing evaluations. Across 30 reproduced attacks, JBF achieves high fidelity with a mean (reproduced-reported) attack success rate (ASR) deviation of +0.26 percentage points. By leveraging shared infrastructure, JBF reduces attack-specific implementation code by nearly half relative to original repositories and achieves an 82.5% mean reused-code ratio. This system enables a standardized AdvBench evaluation of all 30 attacks across 10 victim models using a consistent GPT-4o judge. By automating both attack integration and standardized evaluation, JBF offers a scalable solution for creating living benchmarks that keep pace with the rapidly shifting security landscape.

Abstract:
Diffusion models have transformed image synthesis by establishing unprecedented quality and creativity benchmarks. Nevertheless, their largescale deployment faces challenges due to computationally intensive iterative denoising processes. Although post-training quantization (PTQ) provides an effective pathway for accelerating sampling, the iterative nature of diffusion models causes stepwise quantization errors to accumulate progressively during generation, inevitably compromising output fidelity. To address this challenge, we develop a theoretical framework that mathematically formulates error propagation in Diffusion Models (DMs), deriving per-step quantization error propagation equations and establishing the first closed-form solution for cumulative error. Building on this theoretical foundation, we propose a timestep-aware cumulative error compensation scheme. Extensive experiments on multiple image datasets demonstrate that our compensation strategy effectively mitigates error propagation, significantly enhancing existing PTQ methods. Specifically, it achieves a 1.2 PSNR improvement over SVDQuant on SDXL W4A4, while incurring only an additional < 0.5% time overhead.

Abstract:
Mixture models, such as Gaussian mixture models (GMMs), are widely used in machine learning to represent complex data distributions. A key challenge, especially in highdimensional settings, is to determine the mixture order and estimate the mixture parameters. We study the problem of approximating a target measure, available only through finitely many of its moments, by a mixture of distributions from a parametric family (e.g., Gaussian, exponential, Poisson), with approximation quality measured by the 2-Wasserstein (\operatornameW_2) or the total variation (\operatornameTV) distance. Unlike many existing approaches, the parameter set is not assumed to be finite; it is modeled as a compact basic semi-algebraic set. We introduce a hierarchy of semidefinite relaxations with asymptotic convergence to the desired optimal value. In addition, when a certain rank condition is satisfied, the convergence is even finite and recovery of an optimal mixing measure is obtained. We also present an application to clustering, where our framework serves either as a stand-alone method or as a preprocessing step that yields both the number of clusters and strong initial parameter estimates, thereby accelerating convergence of standard (local) clustering algorithms

Abstract:
Achieving robust perceptionreasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.

Abstract:
VisionLanguage-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt to construct behavior representations through action-centric latent variables, they are often limited by short-horizon temporal fragmentation and static execution-alignment, leading to inconsistent behaviors in complex scenarios. To address these limitations, we propose BehaviorVLA, a framework that facilitates robust manipulation through the learning of a temporally coherent behavioral representations. Our approach features two symmetric components: (1) the Visuomotor Behavior Encoder (VBE), which utilizes a causal Mamba-based architecture to aggregate long-horizon trajectory information into a unified behavior representation; and (2) the Phase-conditioned Behavior Decoder (PBD), which decodes this representation into precise actions by dynamically aligning task-level priors with real-time execution progress. Experiments on RoboTwin 2.0, LIBERO, and CALVIN demonstrate state-of-the-art success rates of 58%, 98%, and 4.36 (Avg. Len), respectively. Notably, in real-world sim-to-real transfer, BehaviorVLA matches the performance of OpenVLA-OFT using only 50% of the demonstration data, showcasing its superior data efficiency and generalization.

Abstract:
Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we release fourteen pretrained models that use different tokenizers but are otherwise identical, using the same architecture, dataset, training budget, and initialization. We also release a multilingual robustness benchmark that measures model performance under real-world perturbations in English, Chinese, Farsi, Italian, and Turkish, curated by native annotators. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.

Abstract:
Mechanistic Interpretability has successfully identified functional circuits in Large Language Models (LLMs), yet their causal origins in the training data remain poorly understood. We bridge this gap by introducingMechanistic Data Attribution (MDA), a scalable framework that traces the formation of specific interpretable units back to training samples using Influence Functions. Through extensive pretraining experiments on the Pythia family, we causally validate that removing a small fraction of high-influence samples significantly hinders the emergence of targeted heads, whereas augmenting them accelerates formation—effects that random interventions fail to replicate. Leveraging MDA, we reveal that highly repetitive structural data—such as LaTeX and HTML—acts as a "catalyst" that significantly accelerates the emergence of induction heads. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model’s in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that builds upon these insights to consistently accelerate mechanistic convergence across diverse model scales, offering a principled methodology for understanding and steering the fine-grained development of LLM behaviors.

Abstract:
Video subtitle removal is essential for content localization and media reediting, yet existing mask-guided diffusion methods face critical limitations: training inefficiency requiring extensive annotations and full model fine-tuning, inference complexity demanding explicit mask sequences, and static prior utilization unable to adapt to quality variations. We present CLEAR (Context-aware Learning for End-to-end Adaptive subtitle Removal), a lightweight adapter-based framework addressing these challenges through three technical innovations. First, self-supervised prior learning (Stage I) extracts occlusion guidance from video pairs using pixel differences as weak supervision, eliminating annotation dependency while learning generalizable subtitle features across languages. Second, LoRA-based adaptive refinement (Stage II) enables parameter-efficient training that preserves pre-trained visual priors while achieving true mask-free end-to-end inference without external detection modules. Third, adaptive focal weighting dynamically adjusts prior influence based on local quality assessment, effectively handling diverse subtitle styles and noisy guidance signals. Extensive experiments demonstrate CLEAR's superior performance in multilingual subtitle removal while requiring only 0.77% trainable parameters, establishing a new paradigm for efficient video text removal without inference-time mask dependencies.

Abstract:
The diversity of posttraining data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduceFeature Activation Coverage(FAC)which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, namedFAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.

Abstract:
Aligning Large Language Models (LLMs) with the diverse spectrum of human values remains a central challenge: preferencebased methods often fail to capture deeper motivational principles. Value-based approaches offer a more principled path, yet three gaps persist– extraction often ignores hierarchical structure, evaluation detects presence but not calibrated intensity, and therefore, the steerability of LLMs at controlled intensities remains insufficiently understood. To address these limitations, we introduce VALUEFLOW, the first unified framework that spans extraction, evaluation, and steering with calibrated intensity control. The framework integrates three components: (i) HIVES, a hierarchical value embedding space that captures intra- and crosstheory value structure; (ii) the Value Intensity DataBase (VIDB), a large-scale resource of valuelabeled texts with intensity estimates derived from ranking-based aggregation; and (iii) an anchorbased evaluator that produces consistent intensity scores for model outputs by ranking them against VIDB panels. Using VALUEFLOW, we conduct a comprehensive large-scale study across ten models and four value theories, identifying asymmetries in steerability and composition laws for multi-value control. This paper establishes a scalable infrastructure for evaluating and controlling value intensity, advancing pluralistic alignment of LLMs.

Abstract:
Genomic foundation models have the potential to decode DNA syntax, yet face a fundamental tradeoff. Standard subword tokenizers fragment biologically meaningful motifs such as codons and regulatory elements, while nucleotidelevel models preserve biological coherence but incur prohibitive computational costs for long contexts. We introduce dnaHNet, a state-of-the-art tokenizer-free autoregressive model that segments and models genomic sequences end to end. Using a differentiable dynamic chunking mechanism, dnaHNet compresses raw nucleotides into latent tokens adaptively, balancing compression with predictive accuracy. Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures including StripedHyena2 in scaling and efficiency. This recursive chunking yields quadratic FLOP reductions, enabling >3 × inference speedup over Transformers. On zero-shot tasks, dnaHNet achieves superior performance in predicting protein variant fitness and gene essentiality, while automatically discovering hierarchical biological structures without supervision. These results establish dnaHNet as a scalable, interpretable framework for next-generation genomic modeling.

Abstract:
Dataset distillation (DD) compresses a large training set into a small synthetic set for efficient training, but most DD methods optimize only clean accuracy and leave robustness uncontrolled. Recent robust DD methods improve robustness, yet they often suffer from a poor accuracy–robustness tradeoff because they (i) treat all adversarially perturbed examples uniformly, despite robust risk being dominated by near-zero robust margins, and (ii) do not explicitly increase inter-class separation in the decision boundary where attacks concentrate. We present Contrastive Curriculum for Robust Dataset Distillation (C^2R), a margin-centric framework that couples an attack-aware curriculum with a contrastive robustness objective. From a robust-margin perspective, we derive a perturbation score that approximates each sample’s robust hinge, enabling a curriculum that prioritizes the smallest-margin adversaries that most directly drive robust error. In parallel, a class-balanced contrastive robustness loss enforces adversarial invariance while explicitly widening boundary separation across classes. Experiments on CIFAR-10/100, Tiny-ImageNet, and multiple ImageNet-1K subsets under six attacks show that C^2R achieves the best robust accuracy, outperforming prior robust DD methods by 2.8% on average. Under PGD, C^2R also reduces the average drop rate (DR) below 66.8% across datasets, indicating a stronger accuracy–robustness balance.

Abstract:
Synthetic data are increasingly used to train image classifiers, yet distributional mismatch with real data limits their effectiveness when used indiscriminately. A common strategy is to learn data weights via bilevel optimization, which we refer to as Meta-learning for Training-data Selection (MTS). Interestingly, in practice, MTS often performs below expectation. We identify two obstacles in properly training MTS: a poor gradient signal-to-noise ratio (GSNR), which causes optimization difficulties, and lack of informative features that correlates with data quality. We present a thorough mathematical analysis of MTS, which reveals the dynamics of normalized data weights and the relation between disparate data quality and poor GSNR. The analysis suggests a a simple yet effective solution: increasing the batch size. Further, we propose a set of informative features that capture the positions of training data in their distributions and training dynamics. Experiments across four benchmarks show consistent improvements, achieving average gains of 5.49% over training without selection and 2.89% over the strongest baseline.

Abstract:
Training against whitebox deception detectors has been proposed as a way to make AI systems honest. However, such training risks models learning to obfuscate their deception to evade the detector. Prior work has studied obfuscation only in artificial settings where models were directly rewarded for harmful output. We construct a realistic coding environment where reward hacking via hardcoding test cases naturally occurs, and show that obfuscation emerges in this setting. We introduce a taxonomy of possible outcomes when training against a deception detector. The model either remains honest, or becomes deceptive via two possible obfuscation strategies. (i)Obfuscated activations: the model outputs deceptive text while its activations change to no longer trigger the detector. (ii)Obfuscated policy: the model produces detector-evading deceptive text, typically by including a justification for the reward hack. Empirically, obfuscated activations arise from representation drift during RL, with or without a detector penalty. The penalty only incentivizes obfuscated policies: we theoretically show this is expected for policy gradient methods. Sufficiently high KL regularization and detector penalty reliably yield honest policies, establishing white-box deception detectors as viable training signals for tasks prone to reward hacking.

Abstract:
Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its tokenby-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all observed tokens while keeping a causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3× on challenging reasoning benchmarks and up to 10× in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings.

Abstract:
Current Large Language Models have achieved Olympiadlevel logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce3ViewSense, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems.

Abstract:
Distilling longform reasoning from teacher models into smaller students requires selecting which candidate solutions to train on. Recent work argues that one should select responses the student model assigns highest probability, i.e., favoring solutions ``natural'' to the student. However, we find that this approach works within a single teacher but fails when scaling to long reasoning traces from multiple diverse teachers. We identify a key cause: this approach scores entire solutions, but students generalize by recombining familiar reasoning steps, not by memorizing complete solutions. Full-trajectory scoring optimizes the wrong target; it rewards global fluency while the transferable signal lies in local step transitions. We propose Local Average Log Probability (LALP), which scores each reasoning step using only a small window of preceding context, measuring whether each step is justified by its immediate premises rather than whether the full response looks natural to the student. LALP enables two practical use cases: selecting the best teacher before fine-tuning and curating training data from diverse teacher pools. Across math, coding, and science reasoning tasks, LALP consistently improves accuracy when selecting the most natural solutions by a large margin.

Abstract:
We study the existence and computation of Nash equilibria in concave games where the players' admissible strategies are subject to shared coupling constraints. Under playerwise concavity of constraints, we prove existence of Nash equilibria. Our proof leverages topological fixed point theory and novel structural insights into the contractibility of feasible sets, and relaxes strong assumptions for existence in prior work. Having established existence, we address the question of whether in the presence of coupling constraints, playerwise independent learning dynamics have convergence guarantees. We address this positively for the class of potential games by designing a convergent algorithm. To account for the possibly nonconvex feasible region, we employ a log barrier regularized gradient ascent with adaptive stepsizes. Starting from an initial feasible strategy profile and under exact gradient feedback, the proposed method converges to an \epsilonapproximate constrained Nash equilibrium within \mathcalO(\epsilon^-3) iterations.

Abstract:
VisionLanguage Models (VLMs) frequently generate self-reflective statements during reasoning, such as ``let me check the figure again.'' Do such statements trigger genuine visual re-examination, or merely represent learned textual patterns? We investigate this question through VisualSwap, an image-swap probing framework: after a model generates reasoning for an image, we replace it with a visually similar but semantically different image and test whether the model detects the change. We introduce VS-Bench, a benchmark of 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments across Qwen3-VL, Kimi-VL, and ERNIE-VL families reveal a striking failure: models overwhelmingly fail to detect image changes, with accuracy dropping by up to 60%. Counterintuitively, thinking models exhibit nearly 3× greater vulnerability than their instructed counterparts, and scaling provides no mitigation. However, multi-turn interaction with user instructions can restore visual grounding, while self-generated reflective statements during continuous generation cannot. Attention analysis reveals the underlying mechanism: self-reflection does not increase attention to visual tokens, whereas user instructions substantially elevate it. Our findings reveal that current VLMs tend to say rather than actually see when claiming visual re-examination.

Abstract:
Multimodal agents offer a compelling path to automating complex documentintensive workflows, yet a critical question remains: do these architectures demonstrate genuine strategic reasoning, or simply conduct stochastic trial-and-error search? To address this, we introduce Agentic Document VQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided byClassical Test Theory, we design it to maximize discriminative power and reliably differentiate between varying levels of agent capability. To rigorously assess agentic behaviour, we introduce a novel evaluation protocol for measuring the accuracy-effort trade-off. Using this framework, we find that humans show strong metacognitive calibration, adapting or abandoning failed strategies, whereas frontier agents often persist in unproductive loops with diminishing returns. We release the dataset, evaluation harness, and leaderboard to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.

Abstract:
We analytically solve the Mountain Car problem, a canonical benchmark in RL, and derive an optimal control solution, closing a gap after 36 years. This enables us to reveal two surprising insights: The optimal control is quite simple, yet modern RL agents display a large gap to optimality. Motivated by the analysis of the optimal control, we introduce Chebyshev policies as a universal (i.e. dense) class of RL policies from first principles. They can be trained as dropin replacements of neural nets, reducing the regret by a factor of 4.18, while requiring 268 times fewer parameters, fostering sample efficiency, explainability and real-time capability. Chebyshev policies are evaluated on further RL environments, including a real-world non-linear motion control testbed. They consistently improve performance over neural nets with PPO, ARS and REINFORCE. Our results demonstrate how Chebyshev policies offer a compelling and lightweight alternative or addition to neural nets for low-dimensional control tasks.

Abstract:
Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of nexttoken-prediction stages (e.g., pretraining and supervised fine-tuning), despite the fundamental differences between RL and these stages emphasized by recent work. One such practice is the use of the AdamW optimizer, which is widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both momentum and adaptive learning rate of AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits less from Adam’s per-parameter adaptive learning rates and momentum. Confirming our hypothesis, our experiments demonstrate that the substantially more memory-efficient SGD, which is known to perform poorly in supervised learning of large-scale transformers, matches or even outperforms AdamW in RL for LLMs. Remarkably, full fine-tuning with SGD updates fewer than 0.02% of model without any sparsity-promoting regularization, more than 1,000 times fewer than AdamW. Our analysis offers potential reasons for this update sparsity. Our findings provide fresh insights into the optimization dynamics of RL in LLMs and demonstrate that RL can be substantially more parameter-efficient than previously recognized.

Abstract:
As highquality public text approaches exhaustion, a phenomenon known as the Data Wall—LLM pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.

Abstract:
We consider the problem of sampling compositional and discrete objects from a given unnormalized posterior distribution. Notably, recent studies have shown that this problem can be efficiently solved by learning a deterministic Markov Decision Process (MDP) that progressively builds each object in proportion to the posterior. In this work, however, we demonstrate that the Markovian assumption can both hamper signal propagation during training and catastrophically reduce the learned sampler's expressivity due to state aliasing. To address these issues, we propose lifting the MDP with a learnable latent dynamics that allows the underlying policy to depend on the entire past trajectoryand not only on the current state. In view of this, we refer to the resulting method as \emphpath-dependent discrete amortized inference. Importantly, we provably extend existing learning algorithms for amortized samplers to our setting. In experiments on standard benchmark problems, we also show that our approach often leads to faster learning convergence and improved state space exploration relatively to prior techniques.

Abstract:
LLMbased agents already operate in production across many industries, yet we lack a clear understanding of which technical methods make these deployments successful. We present the first systematic study of Characterizing Agents in Production (CAP) using first-hand data from agent developers. We conducted 20 in-depth case studies through interviews and surveyed 306 practitioners across 26 domains. We examine why organizations build agents, how they build them, how they evaluate them, and the key challenges they face in deployment. Our findings show that production agents rely on simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models rather than weight tuning, and 74% depend primarily on human evaluation. Reliability—defined as consistent correct behavior over time—emerges as the dominant challenge, which practitioners address through system-level design choices. CAP documents the current state of production agents, providing the research community with visibility into real-world deployment practices and underexplored research opportunities.

Abstract:
We study the fundamental expressivity limits of transformer models by formalizing the notion of accessible sequencesthose that a transformer can produce for some prompt---and characterizing how accessibility depends on prompt length and model parameters. Our analysis provides a theoretical explanation for previously observed empirical failures of transformers on simple sequence tasks---such as copying and cramming---and yields both qualitative and quantitative predictions that hold across a wide range of architectures and model sizes. We prove that (i) the maximal length of accessible sequences grows linearly with the prompt length, (ii) beyond a critical threshold the proportion of accessible sequences decays exponentially with sequence length, and (iii) the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. Notably, these results hold even with unbounded context and computation time. Experiments using a “cramming” procedure confirm the linear scaling, the post-threshold exponential decay, and the tightness of the theoretical upper bound on different sizes of Pythia, Llamma, and Qwen architectures.

Abstract:
We propose a new method for estimating how much a model knows about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have struggled to disentangle memorization from generalization. We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true datageneration process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point unintended memorization decreases as models begin to generalize. We train hundreds of transformer language models ranging from 500K to 1.5B parameters and produce a series of scaling laws relating model capacity and data size to membership inference.

Abstract:
We study equilibrium pricing in oligopolistic data markets with budgetconstrained buyers (e.g., ML companies purchasing data to improve model accuracy) and strategic data sellers. Sellers compete by setting prices for their datasets, giving rise to a pricing game whose pure Nash equilibria correspond to equilibrium prices. While equilibrium prices are guaranteed for rivalrous goods via competitive equilibrium, we show that the non-rivalry of data fundamentally alters this picture: an exact Nash equilibrium need not exist, and in fact no 1.364-approximate equilibrium exists under uniform pricing. We therefore investigate relaxed equilibrium notions. Allowing sellers to use beyond-uniform pricing—specifically, piecewise-linear convex pricing functions—guarantees approximate stability within a constant factor: there exists a pricing profile in which no seller can improve revenue by a factor of two by deviating to any uniform price (a 2-approximate Nash equilibrium). Finally, our simulations demonstrate fast convergence and empirical approximation guarantees that outperform the worst-case bound of 2.

Abstract:
Modern decisionmaking increasingly relies on contextual features (covariates) to improve optimization under uncertainty. In practice, however, such covariates are often only partially observed due to, e.g., data source heterogeneity or costly data collection. Nonetheless, most existing methods assume fully observed historical data and can become unreliable when this assumption is violated. We address this gap by proposing a distributionally robust optimization approach that exploits incomplete covariates to produce robust decisions without imputing a complete dataset. Our method builds ambiguity sets from the observed partial data and incorporates the general structure of the missingness mechanism, ensuring candidate distributions remain consistent with what is observed. Across settings with discrete or continuous covariates and outcomes, we derive tractable reformulations and establish finite-sample out-of-sample performance guarantees. Empirical results across a range of contextual decision-making tasks demonstrate that the proposed integrated approach consistently outperforms state-of-the-art baselines, including various impute-then-optimize pipelines, in both out-of-sample performance and reliability.

Abstract:
This paper proposes an asymmetric perturbation technique for solving bilinear saddlepoint optimization problems, commonly arising in minimax problems, game theory, and constrained optimization. Perturbing payoffs or values is known to be effective in stabilizing learning dynamics and equilibrium computation. However, it requires decreasing perturbation magnitudes to ensure convergence to an equilibrium in the underlying game, resulting in a slower rate. To overcome this, we introduce an asymmetric perturbation approach, where only one player's payoff function is perturbed. Exploiting the near-linear structure of bilinear problems, we show that, for a sufficiently small perturbation, the equilibrium strategy of the asymmetrically perturbed game coincides with an equilibrium strategy of the original game. Building on this property, we develop a perturbation-based learning algorithm with a linear last-iterate convergence rate to an equilibrium strategy of the original game, and we further show how to construct a parameter-free procedure that retains a linear rate. Finally, we empirically demonstrate fast convergence toward equilibria in both normal-form and extensive-form games.

Abstract:
We study continual pretraining (CPT) as a mechanism for adapting general-purpose large language models to specialized domains: mathematics, instruction, code, and natural text. Using singular value decomposition of weight matrices, we find that CPT leaves singular value spectra largely invariant, with adaptation driven mainly by changes in singular vectors. An analysis of attention-head projection matrices reveals strong, domain-dependenthead heterogeneity, which we exploit to define a head-importance criterion: up to60%of head updates can be removed without measurable quality loss. Selectively rewinding low-importance heads to their pre-trained state improves benchmark accuracy by up to4%versus the fully trained baseline. Finally, we identifydomain connectivity—linear interpolation between CPT checkpoints yields smooth domain-quality interpolation without notable degradation on either domain—and release Diffract, an open-source toolkit for scalable spectral analysis of billion-parameter models.

Abstract:
Planning with a generative model aims to estimate state values using minimal oracle calls. For entropyregularized MDPs, SmoothCruiser exploits the smoothness of the \operatornameLogSumExp Bellman operator to achieve \widetilde\mathcalO(\varepsilon^-4) sample complexity, but its first-order Taylor approximation limits the rate. We develop a curvature--complexity theory showing that if a Bellman aggregator has Taylor remainder of order \beta \ge 2, the optimal oracle complexity exponent is 2 + 2/(\beta-1)---recovering \widetilde\mathcalO(\varepsilon^-4) for \beta=2 and predicting \widetilde\mathcalO(\varepsilon^-3) for \beta=3. To achieve \beta=3, we introduce an entropic optimal-transport regularizer over action distributions. The resulting OT-smoothed Bellman operator admits a closed-form expression, explicit gradient policy, and Lipschitz Hessian. We derive an unbiased estimator of the quadratic Taylor term via cross-product debiasing, enabling a second-order SmoothCruiser with \widetilde\mathcalO(\varepsilon^-3) complexity. We further propose gap-dependent variants and provide a complexity analysis and show advantage of our method.

Abstract:
In retrievalaugmented generation, language models can generate incorrect responses if they fail to utilize query-relevant content from the retrieved evidence. This shifts the focus of uncertainty quantification (UQ) toward assessing contextual grounding, i.e., whether predictions are supported by query-relevant tokens. Recent UQ methods unpack language models to characterize how inputs are processed. Nevertheless, these methods focus on a few layers and overlook the whole progressive propagation within the model, thereby failing to fully capture the grounding dynamics essential for reliable uncertainty estimation. We use information flow to build a layer-wise trace that reveals each context token’s contribution to the output, providing an interpretable basis for assessing reliability. From this analysis, we introduce two measures to calibrate prediction confidence. The first, simulatability, posits that a prediction is more likely to be correct when context token contributions align closely with their true relevance. The second, concentration, asserts that a response is more likely to be correct when it is derived from a narrow, focused subset of tokens. Experiments show that our method achieves an average AUROC of 0.70, exceeding the runner-up performance of 0.65, while maintaining moderate computational cost.

Abstract:
Electronic health records (EHR) pose largescale multi-disease modeling problems in which many outcomes are rare and strongly influenced by shared risk factors. While modern approaches achieve strong predictive performance, they often treat diseases independently or rely on black-box architectures, offering limited insight into how risk factors organize disease risk and little principled uncertainty quantification. We introduce a Bayesian hypergraph inference framework that reframes multi-disease modeling aroundlatent, risk-factor-modulated disease pathways. Risk factors act on hyperedges, latent disease subsets with shared risk patterns, allowing diseases to participate in multiple distinct pathways and enabling interpretable, higher-order structure beyond pairwise associations. A repulsion prior encourages parsimonious and identifiable structure, while posterior inference provides calibrated uncertainty over both disease groupings and risk-factor influence. To enable scalable inference on large EHR datasets, we develop a structured variational inference algorithm that preserves logical dependencies among hyperedge existence, disease membership, and pathway-level effects. Experiments on simulated data and UK Biobank demonstrate stable and interpretable disease pathway structure, well-calibrated uncertainty, improved estimation for rare diseases, and competitive predictive performance.

Abstract:
To improve generalization and resilience in human–robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multiagent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process--a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in the policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALyPO uses Lyapunov certification to stabilize decentralized policy learning. HALyPO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.

Abstract:
Counterfactual reasoning is one of the fundamental facets of human cognition, involved in various tasks such as explanation, credit assignment, blame, and responsibility. It describes the queries what would have happened had some intervention been performed given that something else, corresponding to Layer 3 of the Pearl Causal Hierarchy. In this project, we examine a specific type of counterfactual quantities, called counterfactual direct (StrDE), indirect (Str-IE), and spurious (Str-SE) effects for quantifying fairness in a sequential decision-making framework. Building on these measures, we formulate an online causally-fair learning problem with multiple long-term constraints and study it in both non-parametric contextual bandits and parametric logistic bandits settings. We achieve sublinear regret and violations bounds for both bandits settings with round-wise counterfactual fairness constraints (that are a priori unknown) without Slater’s condition. In particular, for logistic bandits, we obtain nearly optimal regret bound with leading term similar to that for unconstrained case (Zhang et al., 2025).

Abstract:
Classifying protein topology is essential for deciphering biological function, but progress is held back by the lack of largescale benchmarks that avoid duplicates and by models that do not scale well. We introduce TEDBench, a large-scale, non-redundant benchmark for protein fold classification constructed from the Encyclopedia of Domains (TED) and Foldseek-clustered AlphaFold structures. We show that on TEDBench, current protein representation learning methods either require very large models or fail to deliver strong performance. To address this challenge, we propose Masked Invariant Autoencoders (MiAE), a self-supervised framework for protein structure representation learning. MiAE uses an extremely high masking ratio of up to 90% with an \mathrmSE(3)-invariant encoder and a lightweight decoder that reconstructs backbone coordinates from the latent representation and mask tokens. MiAE scales well and outperforms supervised counterparts and state-of-the-art baselines on TEDBench, establishing a strong recipe for protein fold classification. To test transfer beyond AlphaFold structures, we further benchmark on a curated dataset from experimental structures of CATH 4.4. We will release TEDBench and model checkpoints.

Abstract:
Physics obeys strict symmetries like rotational equivariance. However, the standard Transformer architectures widely used in physics foundation models do not enforce these constraints by construction. We introduce ReViT, a rotationally equivariant Vision Transformer framework for neural PDE solvers operating on gridbased physical fields that strictly enforces rotational equivariance. ReViT maps scalar and vector inputs into locally invariant representations derived from physics-based canonical bases, enabling the use of standard self-attention without symmetry violations. Built on a hierarchical Swin-style backbone with a precomputed reference basis pyramid, ReViT preserves equivariance across multi-scale operations. We evaluate ReViT on a wide range of 2D and 3D PDE benchmarks, such as Magnetohydrodynamics and Turbulent Channel Flows, demonstrating significant gains over state-of-the-art baselines. ReViT exhibits strong generalization, and reduces MSE by up to 65% compared with the best-performing alternatives.

Abstract:
Statespace models (SSMs) have emerged as efficient linear-time alternatives to attention for long-sequence modeling. However, existing SSMs often suffer from instability and memory degradation over extended horizons due to poorly conditioned first-order updates and uncontrolled spectral geometry. We introduce MuonSSM, a general framework that stabilizes SSM training by explicitly conditioning the geometry of memory updates rather than the recurrent transition matrix. MuonSSM augments standard SSMs with a momentum-based pathway and lightweight Newton–Schulz iterations on low-rank input injections, yielding approximately norm-preserving and spectrally balanced updates while preserving parallel scan complexity. Theoretical analysis demonstrates substantial improvements in gradient propagation and mitigation of vanishing gradients over long horizons. Extensive experiments across language, vision, and time-series benchmarks show consistent gains in accuracy, robustness, and long-context performance when integrated into diverse SSM backbones. These results establish geometric conditioning of updates as a principled pathway to stable, scalable sequence modeling.

Abstract:
Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Modern models are trained on massive datasets, yet these are vanishingly small compared to the full combinatorial space of possible data, raising the question of whether models can reliably generalize to unseen combinations. To formalize what this requires, we propose a set of practically motivated desiderata that any compositionally generalizing system must satisfy, and analyze their implications under standard training with linear classification heads. We show that these desiderata necessitate \emphlinear factorization, where representations decompose additively into perconcept components, and further imply near-orthogonality across factors. We establish dimension bounds that link the number of concepts to the geometry of representations. Empirically, we survey CLIP and SigLIP families, finding strong evidence for linear factorization, approximate orthogonality, and a tight correlation between the quality of factorization and compositional generalization. Together, our results identify the structural conditions that embeddings must satisfy for compositional generalization, and provide both theoretical clarity and empirical diagnostics for developing foundation models that generalize compositionally.

Abstract:
In this paper, we present CATQ,Cost-efficient andAccurateTernaryQuantization, to compress LLMs. Unlike current state-of-the-art ternary quantization methods that rely on data-intensive and costly quantization-aware training to mitigate severe performance degradation, CAT-Q employs a simple yet effective post-training quantization scheme, thereby is easily applicable to LLMs with diverse architectures and model sizes. It has two key components, learnable modulation (LM) and softened ternarization (ST), which are coupled from an optimization perspective. LM leverages a composition of learnable factors to modulate the distribution of high-precision weights and the ternary threshold, making them less sensitive to ternarization. ST further introduces a novel transition function to guide the ternarization process toward stable convergence. We show that, for pre-trained LLMs with 1.7B to 8B parameters, CAT-Q can quantize them into ternary models using merely 512 calibration samples, while achieving competitive performance to the seminal BitNet 1.58-bit v1 and v2 families (with 1.3B to 7B parameters) trained with 100B tokens, yielding about a 100,000x reduction in training tokens. Moreover, we show for the first time that CAT-Q can quantize even larger pre-trained LLMs having 14B to 235B parameters into leading ternary models within 8 to 60 hours on 8 A100-80GB GPUs. Code will be made publicly available.

Abstract:
Multimodal learning aims to integrate multisensor data to exploit their complementary information, embracing a more comprehensive real-world perception and understanding. However, heterogeneous discrepancies across modalities consistently trigger imbalanced multimodal optimization, restricting the joint learning performance. Although existing methods mitigate this issue through optimization modulation and conflict alleviation, they still suffer from entangled optimization and uniform learning pace in conventional monolithic frameworks, limiting the effectiveness of multimodal learning. To address this issue, we propose a novel Multimodal Nested Learning Framework (MoNet), which reformulates the monolithic framework into nested sub-processes, decoupling and coordinating multimodal learning. To achieve this, we present a Decoupled Multimodal Stable Memory block (DMSM) as the outermost nested level, which decouples multimodal learning into independent optimization streams for semantic exploitation across modalities. Additionally, we develop an Adaptive Multimodal Coordinated Fusion block (AMCF), which constitutes the inner nested level. It attempts to coordinate multimodal information integration across multi-timescale nested memories, balancing multimodal fusion. Extensive experimental results on eight datasets across three tasks demonstrate the superiority of MoNet.

Abstract:
We examine the interplay between ordinal, preferencebased solution concepts in games and the outcomes of payoff-driven learning dynamics, asking to what extent the combinatorial data of a game—its preference graph—can predict the long-run behavior of no-regret dynamics such asfollow-the-regularized-leader(FTRL). In one direction, we show that the skeleton of everydynamically stableset, i.e., the set of pure profiles it contains, must bepreferentially stable, that is, closed under pure profitable deviations. We then ask the converse question: when are preferences sufficient to describe long-run behavior? Forsubgames—subsets of pure profiles obtained by restricting players’ action sets—preferences are enough to fully characterize asymptotic stability. Beyond subgames however, we construct a three-player counterexample with a preferentially stable set whose span is dynamicallyunstable, thus establishing that preferences arenot sufficientto describe dynamically stable behavior in general. To restore stability, we introduce the notion ofleaklessness, a measure of aggregate payoff drift away from a set of pure profiles, and use it to identify a payoff-based condition under which the span of a set of pure profiles remains stable and attracting, thereby setting forth a natural cardinal guarantee of dynamic stability.

Abstract:
TimeDependent Differential Equations (TDDEs) model dynamical processes across science and engineering, but time-critical applications require solvers delivering high-fidelity trajectories under stringent latency constraints. Most existing TDDE solvers are limited by time discretization, forcing a latency-accuracy trade-off where smaller step sizes capture high-fidelity trajectories but incur prohibitive runtime, while larger steps meet real-time budgets at the cost of trajectory distortion. Dynamical System Machines (DSMs) offer a promising alternative by computing through continuous-time physical evolution, yet existing DSMs struggle to capture the spatiotemporal complexity of TDDEs. This work introduces DS-TS, a novel TDDE solver that achieves both high-accuracy and ultra-efficiency, leveraging the continuous-time computation of DSMs. DS-TS integrates three key innovations: (1) Excitatory-Inhibitory Inspired Coupling to better model complex spatial interactions; (2) State-aware Dynamic Non-linearity to enable rich inter-node interactions and state-dependent spatiotemporal correlations; and (3) Hierarchical Temporal Integration to capture long-range temporal dependencies. Experiments demonstrate that DS-TS achieves high-fidelity solutions while delivering orders-of-magnitude improvements in speed (～ 10^3×) and energy efficiency (～ 10^5×) compared to baseline solvers.

Abstract:
In closedloop multi-turn agent reinforcement learning, LLM agents exhibit reasoning collapse, where reasoning shift toward generic templates, weakly coupled to the inputs. We firstly identify that such collapse is easy to miss with entropy or surface diversity metrics since reasoning text still varies but becomes input-agnostic. We then propose an information-theoretic decomposition of reasoning variable Z's variation into conditional entropy H(Z \mid X) (randomness under same input) and mutual information (MI) I(X; Z) (input dependence). Template collapse occurs when H(Z \mid X) stays high while I(X; Z) drops, yielding diverse-looking but generic reasoning. To make I(X; Z) a reproducible and sanity-checkable diagnostic, we further introduce an MI-style retrieval protocol treating each reasoning trace Z as a query to retrieve its source X from a minibatch; accuracy degrades toward chance under collapse. We thus provide a signal-to-noise ratio explanation for why I(X; Z) drops: when within-input reward variance \mathrmVar(R \mid X) is low, task gradients weaken and input-agnostic regularizers (KL, entropy) dominate, flattening cross-input differences. Finally, we propose reward-variance-aware filtering to prioritize high-signal updates. Across multi-turn environments, model scales, and modalities (including VLMs), this improves input dependence, stability, and performance while remaining competitive with state-of-the-art stabilization baselines.

Abstract:
Ondevice AI has gained significant attention for enabling efficient, low-latency inference on edge devices. However, tight resource constraints on these platforms make the deployment of accurate and lightweight deep learning models challenging. In particular, advanced activation functions (AFs) like Swish and GELU often incur high inference overhead due to the lack of hardware fast-paths for exponentiation and division, restricting edge-ML applications to simple AFs like ReLU, limiting model accuracy. To address this, we propose FFCC, a compiler that automatically generates efficient approximations of AFs through floating-point reinterpretation. These functions don’t require hardware fast-paths meaning they remain fast on edge devices. They do not incur great accurate losses, and allowing use as post-training replacements without negatively impacting model final accuracy. FFCC takes a specification of AFs using basic floating-point operators and applies derivation rules to lower these expressions into efficient instruction sequences. Our experiments show that we can provide fast approximations of AFs, achieving order-of-magnitude speed ups over accurate baselines on Arm M7, delivering performance on-par with Hardswish, while beating it on accuracy. Additionally, we show that our approximations – unlike Hardswish – can be used as drop-in replacements of exact version post-training without loss of model accuracy.

Abstract:
We study the revenue of approximate correlated equilibrium in discrete first price auctions the set of allowable bids is \mathcalB = \0, 1/k, \dots, 1 - 1/k, 1\ for some k \in \mathbbN. We show that the revenue of any \epsilon-approximate correlated equilibrium is at least v_2 - \Theta(1/k)- \Theta(\epsilon k^2), where v_2 \geq 0 is the second-highest valuation. Our results establish the first polynomial convergence rates on the revenue generated by no-swap regret bidders in first-price auctions. For instance, if bidders admit the optimal swap regret of \mathcalO(\sqrtk T), then the time-averaged revenue is at least v_2 - \Theta(1/k) - \Theta(\epsilon) after \mathcalO(k^5/\epsilon^2) rounds.

Abstract:
We study zeroshot inverse problems, where a clean signal is recovered from a single degraded observation without external training data. Contrary to the common belief that such problems require highly complex models, we show that a lightweight neural network, when combined with entropy and complexity regularization in a compression-based formulation, is sufficient for high-quality restoration. We propose Lottery Prior, a compression-based inverse solver that leverages architectural priors from random networks and induces a family of implicit priors through randomness, enabling ensemble-based refinement. We further derive non-asymptotic error bounds for compression-based maximum-likelihood inverse solvers, revealing how rate–distortion constraints act as implicit regularizers. Experiments on denoising, noisy super-resolution, and inpainting demonstrate that our method achieves state-of-the-art with significantly fewer effective parameters.

Abstract:
With the recent fast development of generative models, instructionbased image editing has shown great potential in generating high-quality images. However, the quality of editing highly depends on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set containing 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent significantly outperforms existing methods in both instruction faithfulness and visual quality across a diverse range of editing scenarios.

Abstract:
Recently, it has received growing attention in building AI Scientist agents with Large Language Models (LLMs). Since scientific discovery fundamentally relies on uncovering causal relationships from observations, the capability of causal thinking that distinguish causation from correlation and hidden biases, is essential to LLM agents. Despite a number of existing benchmarks for AI scientists, none of them are designed with the consideration of hidden biases and confounders, that widely exist in realworld scientific discovery. To this end, we present CausalGame, a benchmark that evaluates the causal thinking capabilities of LLM agents through interactive games. More specifically, we ask LLM agents to actively design experimental protocols, collect observation data and derive a final solution with an explanation report. To emulate realistic scientific discovery challenges, we design 14 game settings with the incorporation of selection bias, noisy measurements, and hidden confounders. The results with 16 frontier LLM agents show that they consistently fail to reason about and recover the underlying causal relationships required to solve the games. CausalGame provides a rigorous measurement of capabilities essential to AI Scientist agents.

Abstract:
Highdimensional datasets often concentrate near low-dimensional structures, but estimating their geometry from samples typically relies on graphs and kernels that scale poorly with dataset size and dimension. We propose Riemannian metric matching: a denoising probabilistic framework for learning the Riemannian geometry of data using neural networks. Specifically, we learn the carré du champ operator, which, using diffusion geometry, gives us access to the Riemannian geometry toolkit for downstream machine learning and statistical tasks. Our key observation is that the carré du champ operator can be formulated as a conditional expectation over random perturbations of the data, which can be exploited for sample-wise training and constant cost, amortized inference without explicit kernel construction. To the best of our knowledge, we provide the first neural surrogate that estimates the underlying Riemannian geometry of data with a provable consistency guarantee in the large data limit. Empirically, metric matching rivals or improves the accuracy of k-NN-based diffusion geometry estimators, while enabling amortized inference that is up to 400× faster, and supports graph-free geometric analysis on high-dimensional images where nearest neighbors break down.

Abstract:
Personalization has become a pivotal field of study in contemporary intelligent systems. While large language models (LLMs) excel at general knowledge tasks, they often struggle with personalization, i.e., adapting their outputs to individual user expectations. Existing approaches that steer LLM behavior to meet users’ implicit preferences and behavior patterns, primarily relying on tunefree methods (e.g., RAG, PAG) or parameter fine-tuning methods (e.g., LoRA), face challenges in effectively balancing effectiveness and efficiency. Moreover, the mechanisms underlying personalized preferences remain underexplored. To address these challenges, we first uncover key patterns of user-specific information embedded in the representation space. Specifically, we find that (1) personalized information lies within a low-rank subspace represented by vectors, and (2) these vectors demonstrate both a collective shift shared across users and a personalized shift unique to each individual user. Building on these insights, we introduce PerFit, a novel two-stage solution that directly fine-tunes interventions in the hidden representation space by addressing both collective and user-specific shifts, thereby achieving precise steering of LLM with minimal parameter overhead. Experimental results demonstrate that \perfit delivers strong performance across six datasets while \cutting the number of parameters by an average of 92.3% compared to the state-of-the-art method.

Abstract:
Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preferencebased RL (PbRL) offers a promising alternative by learning reward functions from human feedback, but its scalability is hindered by high labeling costs. Inspired by advances in Video Foundation Models (ViFMs), we present Video-based Optimal Transport Preference (VOTP), a semi-supervised framework that learns effective reward functions from only a handful of labels. By leveraging optimal transport to align visual trajectories within the rich representation space of ViFMs, VOTP effectively generates high-fidelity pseudo-labels for large amounts of unlabeled data, substantially reducing human supervision. Extensive experiments across locomotion and manipulation benchmarks demonstrate the superiority of VOTP, which outperforms state-of-the-art offline PbRL methods under limited feedback budgets. We also showcase the robustness of VOTP in the presence of visual distractors and validate its utility on real robotic tasks, where it learns meaningful rewards with minimal human input.

Abstract:
VisionLanguage Models (VLMs) perform well on commonsense reasoning tasks but struggle with visual spatial reasoning. Most existing solutions introduce extra 3D priors or external spatial encoders, which increase complexity and degrade the underlying VLMs' general-purpose capabilities after spatial fine-tuning. To this end, we propose a parameter-efficient Spatio-vision Language Models (SpatioLM), that enhances spatial intelligence without extra 3D priors or third-party spatial encoders. Concretely, we design a plug-and-play and non-invasive spatio-vision module that elicits the spatial knowledge inherent in VLMs. Furthermore, we innovatively leverage pseudo depth and camera information as supervision to guide the model in learning physically coherent representations. Extensive experiments show that SpatioLM achieves significant improvements in diverse tasks, including spatial perception and understanding while maintains the general-purpose capabilities. Notably, the model achieves an impressive score of 71.6 on the VSI-Bench (the first model to surpass 70). In addition, it attains competitive performance when transferred to embodied manipulation tasks.

Abstract:
Multimodal brain network analysis aims to predict neuropsychiatric status from functional connectomes with heterogeneous phenotypes. However, most existing methods treat phenotypes as auxiliary features and perform late fusion, implicitly assuming that the connectome representation should be learned in the same way regardless of phenotype. However, in clinical neuroscience the same functional connectivity pattern may support different conclusions under different phenotype contexts. To bridge this gap, we propose PhenoBrain, a novel framework for multi-modal brain network analysis that injects phenotype information at the mechanism level rather than only at the classifier level. Specifically, we propose a phenotype-conditioned long-range routing mechanism, which learns a subject-specific multi-hop communication kernel to model long-range connectome interactions. Furthermore, we propose a phenotypic-guided attention mechanism regulation method, which uses phenotypic information as a conditional prior to regulate the learning process of attention in brain networks. To verify the effectiveness of our method, we constructed two multi-modal brain network analysis datasets based on open-source image data. Extensive experiments demonstrate that PhenoBrain achieves state-of-the-art performance.

Abstract:
Large language models (LLMs) perform inference by following a fixed depth and order, nonrecurrent execution of all layers. We reveal the wide existence of training-free, flexible, dynamic "program-of-layers (PoLar)", where pretrained layers can be packed as modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter program executions can achieve the same or better accuracy, while incorrect predictions of the original LLM can be corrected by alternative programs with fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass. To efficiently achieve PoLar in practice, we propose a lightweight PoLar prediction network, which learns to generate execution programs that dynamically skip or repeat pretrained layers for each input. Experiments on mathematical reasoning benchmarks demonstrate that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers, and that these gains persist under out-of-distribution evaluation. Our results suggest that fixed-depth execution captures only a narrow subset of an LLM’s latent reasoning capacity.

Abstract:
Large Reasoning Models (LRMs) achieve strong problemsolving through long chain-of-thought, but their deployment is constrained by the high cost of full-precision inference and growing KV cache footprints. Microscaled FP4 formats enable efficient FP4 deployment; however, fully quantizing weights, activations, and KV caches (W4A4KV4) causes severe reasoning degradation that existing PTQ and QAT fail to recover. We identify that FP4 failures concentrate on low-entropy tokens—precise symbolic commitments such as digits and operators—where quantization noise inflates sampling errors that cascade through reasoning traces. Based on this insight, we propose ReQAT, a reasoning-centric FP4 training framework with three components: (i) Trace-Aligned QAT (TAQ), which revisits identical reasoning traces to focus updates on critical low-entropy decisions; (ii) Selective Entropy Minimization (SEM), which reinforces confidence at low-entropy positions; and (iii) Q-FIT, a quantization-friendly initialization that jointly calibrates RoPE-consistent KV cache transformations to stabilize QAT. Under the same training budget, ReQAT not only recovers but surpasses BF16 fine-tuning accuracy—achieving while delivering up to 3.9× throughput speedup on NVIDIA DGX Spark and 3.1× on B200. This is the first demonstration that FP4 QAT can exceed full-precision accuracy for LRMs with over 3× speedup on production hardware.

Abstract:
Enhancing Large Reasoning Models (LRMs) for specialized domains remains a critical challenge. While recent industrial frameworks attempt to encapsulate Standard Operating Procedures into modular "skills" for dynamic retrieval, utilizing them via context engineering often proves insufficient for complex workflows, leading to "Cognitive Drift." To mitigate this, we propose Thought GuidanceRetrieval Augmented Generation (TG-RAG), a Retrieval-Augmented framework that effectively steers the generation process without relying solely on the model's self-correction. Built upon an Expert Procedure Graph (EPG) that formalizes unstructured SOPs, the framework uniquely employs a dynamic ``Interrupt-Retrieve-Generate" (IRG) mechanism to actively inject step-specific directives into the model's reasoning process. Extensive evaluations show that TG-RAG achieves competitive performance, demonstrating advantages in specialized domains by ensuring faithful adherence to domain SOPs.

Abstract:
Scientific data annotation, such as tracking animals in video or proofreading neural reconstructions, remains bottlenecked by the “last mile” problem: even with strong automation, verification and correction consume substantial human effort. Standard approaches train models to directly predict annotations, discarding the rich supervision in how experts navigate, click, verify, and correct. We introduce a framework for studying behavioral cloning on scientific annotation: 9 synthetic tasks paired with synthetic annotations that simulate realistic human strategies including exploration, mistake correction, and strategic decisionmaking. Our experiments reveal several findings. First, skills emerge hierarchically: models learn GUI mechanics before task-critical decisions, and commit fewer mistakes than the training data while retaining the ability to correct errors when they occur. Second, scaling models on multi-task behavioral cloning shows that larger models are more data efficient, but exhibit worse decision-making despite similar placement accuracy. Third, multi-task pretraining enables efficient fine-tuning to new tasks, while training from scratch fails entirely. Fourth, linear probes reveal that models internally represent latent variables of the annotation process such as task phase and data position; interestingly, we find a shared mistake representation that generalizes across different annotation tasks. Overall, our framework establishes systematic benchmarks and identifies key bottlenecks, providing a foundation for scaling behavioral cloning to real-world scientific data annotation.

Abstract:
Memory formation is fundamental to intelligence, yet whether deep neural networks preserve identifiable memory traces—analogous to biological memory units—remains an open question. This work introduces a geometric framework to identify such "AI engrams," by formalizing the neuroscientific criteria of specificity, reactivation, sufficiency, and necessity into a constrained inverse problem. We derive a closedform estimator that isolates individual memory traces from globally entangled parameters. Theoretical analysis reveals that this biologically-derived solution corresponds to a natural gradient update on the parameter manifold. AI engrams enable surgical manipulation of learned knowledge: any subset of memories can be composed or erased through linear arithmetic, without iterative optimization. Experiments ranging from simple MLPs to LLMs demonstrate the causal validity and substantial scalability of AI engrams. Together, these results bridge theories of biological memory and artificial representation learning, offering geometric insight into how deep networks simultaneously support functional specificity within distributed storage.

Abstract:
Hypergraph contrastive learning is an effective paradigm for representation learning on higherorder relational data, yet existing methods largely ignore that hyperedges link nodes with multi-level semantics. Standard contrastive objectives emphasize instance discrimination via hyperspherical uniformity and tend to push embeddings apart in an indiscriminate manner. We show that this leads to aHierarchy–Uniformity Conflict, whose geometric manifestation isSemantic Flattening, where the semantic depth of hyperedges collapses into a nearly flat cloud of instances. To address this issue, we introduceHyperDepth, a hypergraph contrastive learning framework that moves representations towards a hierarchy–uniformity equilibrium by jointly coordinating spectral and geometric signals. HyperDepth employs a decoupled spectral encoding scheme with adaptive gating so that high-frequency components focus on local instance discrimination while low-frequency components capture global hierarchical structure. On top of this, an energy-based hierarchical Alignment module attaches a learnable prototype tree to the representation space and minimizes an interpretable energy functional to recover the semantic depth of hyperedges. Theoretically, under a mild frequency-separation assumption, we show that the local contrastive and global hierarchical objectives operate on orthogonal spectral components and admit equilibrium embeddings that preserve semantic depth while still retaining instance-level discrimination. Experiments on 15 hypergraph datasets and 17 supervised and self-supervised baselines, spanning homophilic and heterophilic regimes, show that HyperDepth attains strong performance with the best average rank.

Abstract:
Realworld multivariate time series are inherently in flux: different variables evolve asynchronously and interact in complex, time-varying ways, yet accurate forecasting requires these dispersed signals to converge into a single unified prediction. This structural mismatch between dynamic, heterogeneous inputs and a unified forecasting objective poses a fundamental challenge for building general-purpose multivariate forecasting models, especially in zero-shot and large-scale settings. To this end, inspired by the idea that "all rivers run into the sea", we propose ConFlux, a general-purpose foundation model for multivariate time-series forecasting by learning to adaptively integrate cross-channel information under a unified forecasting objective. Specifically, ConFlux first reorders variables to reduce cross-variable entanglement, then aggregates adjacent variables into compact patches that can be processed by a Vision Transformer-style architecture. This design shortens the effective context, reduces attention complexity, and provides a unified token representation for pre-training and downstream tasks. Experiments on 25 public datasets show that ConFlux achieves state-of-the-art performance in zero-shot, fine-tuning, and from-scratch settings, while offering faster inference and lower memory usage.

Abstract:
We develop orderequivariant neural networks (OENN), which generalize standard graph message passing and sheaf neural networks via the face-poset viewpoint. We (i) characterize all linear order-equivariant maps, (ii) build OENN layers, and (iii) prove a universal approximation theorem (UAT) for continuous order-equivariant maps, which is a new result even when restricted to sheaf neural networks (for which no UAT was known before). We illustrate the framework on graph and sheaf models. Our results can also be seen as extending the UAT for graph neural networks to a more general setting that subsumes sheaf neural networks as well.

Abstract:
Machine learning methods that predict protein fitness from sequence remain sensitive to changes in data distributions, limiting generalization across common conditions encountered in protein engineering. Practically, protein engineers are thus left wondering about the effective utility of ML tools. The FLIP benchmark established protocols for testing generalization under some domain shifts, but it was limited to measurements of stability, binding, and viral capsid viability. We introduce FLIP2, a protein fitness benchmark spanning seven new datasets, including enzymes, proteinprotein interactions, and light-sensitive proteins, as well as splits that measure generalization relevant to real-world protein engineering campaigns. Evaluating a suite of benchmark models across these datasets and suites reveals that simpler models often matched or outperformed fine-tuned protein language models on \ourset, challenging the utility of existing transfer learning techniques. Provenance for all datasets has been recorded and we redistribute all data CC-BY 4.0 to facilitate continued progress.

Abstract:
Each layer of a Transformer refines the hidden state toward a prediction, an iterative process resembling fixedpoint iteration. Yet when should this iteration terminate? Existing early exit methods rely on output confidence as a proxy for internal convergence. We take a more direct approach by examining the geometry of the hidden state trajectory. We find that layer-wise updates exhibit a two-phase structure: large, volatile updates in early layers, followed by small, aligned updates as the model propagates an already-formed representation. The transition is remarkably sharp. This yields a simple criterion: exit when step size vanishes and direction stabilizes. We track the normalized update norm and cosine similarity between consecutive updates, exiting when both indicate convergence. The overhead is O(d) per layer, independent of vocabulary size, requiring no learned components or architectural modifications. On LLaMA-2-7B and LLaMA-2-13B across question answering and commonsense reasoning tasks, this geometric criterion reduces FLOPs by 30--35% while retaining over 98% of full-depth accuracy.

Abstract:
We consider anonymous multiagent path finding (MAPF) where a set of robots is tasked to travel to a set of targets on a finite, connected graph. We show that MAPF can be cast as a special class of multi-marginal optimal transport (MMOT) problems with an underlying Markovian structure, under which the exponentially large MMOT collapses to a linear program (LP) polynomial in size. Focusing on the anonymous setting, we establish conditions under which the corresponding LP is feasible, totally unimodular, and yields min-cost, integral~(\0,1\) transports that do not overlap in both space and time. To adapt the approach to large-scale problems, we cast the MAPF-MMOT in a probabilistic framework via Schrödinger bridges. Under standard assumptions, we show that the Schrödinger bridge formulation reduces to an entropic regularization of the corresponding MMOT that admits an iterative Sinkhorn-type solution. The Schrödinger bridge, being a probabilistic framework, provides a shadow (fractional) transport that we use as a template to solve a reduced LP and demonstrate that it results in near-optimal, integral transports at a significant reduction in complexity. Extensive experiments highlight the optimality and scalability of the proposed approaches.

Abstract:
Federated graph anomaly detection (GAD) aims to identify abnormal nodes in distributed subgraphs through collaborative learning. However, existing methods suffer from two limitations. 1) Their reliance on neighborhood aggregation assumes that anomalous information can be sufficiently captured, which often fails in federated learning with partitioned client subgraphs. 2) They overlook the detection bottleneck caused by weak attribute or structural anomalies. To tackle these challenges, we revisit federated GAD and reveal that weak anomalies exhibit harderto-detect signals compared to strong anomalies. Specifically, we propose MV-FGAD, an efficient and effective federated GAD framework based on multi-view learning designed to mine anomalies of varying strengths. MV-FGAD introduces a federated knowledge learning module to aggregate and broadcast shared knowledge, which is further exploited to optimize local topological structures. Moreover, it designs a multi-view learning mechanism to capture diverse anomaly patterns, and adopts Mahalanobis distance–based scoring strategy to quantify node abnormality across views. Extensive experiments on real-world datasets of varying types and scales demonstrate MV-FGAD's efficiency and effectiveness.

Abstract:
This paper tackles the challenging problem of developing a proactive copyright protection mechanism that cuts off unauthorized use of diffusion generative models. Existing studies largely fall into posthoc attribution (e.g., watermarking and fingerprinting) or degradation-only defenses, which offer only indirect and limited preventive effect. We therefore propose GoodDiffusion, inspired by backdoor mechanisms, to enforce model-level use-time control by internalizing authorization into the generative process through a selectively permissive, otherwise closed behavior. Specifically, GoodDiffusion preserves high-quality generation for authorized queries carrying valid signatures, yet refuses to generate for unauthorized inputs. We further empirically show that naive static-signature designs (like conventional backdoor injection) are fundamentally fragile, since a surrogate signature can be efficiently recovered via gradient-based optimization. To strengthen security, we introduce a Learnable Signature Network (LSN) that assigns sample-specific signatures conditioned on each input. This breaks the universality of signatures and prevents a surrogate from transferring across inputs. Extensive experiments validate that GoodDiffusion effectively blocks unauthorized use while maintaining strong generation quality for authorized users.

Abstract:
Recently, the rapid development of lowaltitude aerial applications has driven the need for drone-based unified detectors. In contrast to task-specific detectors that suffer from poor scalability across diverse scenarios, existing unified detectors leverage the Mixture-of-Experts (MoE) architecture to learn task-aware features from diverse datasets. However, the imbalanced multi-task data distribution leads to over-activation of experts for dominant tasks and under-activation for others. To enable balanced feature learning, this paper combines three detection paradigms (RGB, IR, and RGB-IR) into a unified framework termed DroneDINO. DroneDINO extends DINO by introducing heterogeneous routed MoEs that organize experts into three functional groups: shared, task-specific, and dynamic. Unlike conventional dynamic experts where the top-k experts are activated for each input, the shared expert is activated for all inputs, while each task-specific expert is activated exclusively for the matching task. To ensure inputs are routed to appropriate experts and yield task-discriminative features, we propose a task-recognition auxiliary training strategy to penalize features with low task-discriminability. Experiments demonstrate the effectiveness and generalizability of DroneDINO, which consistently outperforms state-of-the-art unified and task-specific detectors across multiple drone-based detection benchmarks.

Abstract:
Video anomaly detection (VAD) is critical for surveillance systems, but current methods prioritize accuracy while ignoring the ethical risks of encoding sensitive biometric information. This neglect poses significant privacy concerns for realworld deployment. To bridge this gap, we introduce the Guided Orthogonal Projection Layer (G-OPL), a lightweight module designed to geometrically decouple and suppress sensitive attributes from latent features to produce representations focused on anomaly-relevant cues. We specifically target facial information as the primary sensitive attribute. Unlike gait or body pose, faces act as unique biometric identifiers that are tightly regulated and pose immediate risks of misuse, yet are rarely necessary for identifying abnormal behaviors. To achieve this, G-OPL utilizes a stable, QR-decomposition-based orthogonal projection mechanism guided by weak supervision (e.g., face presence) to actively filter privacy-sensitive subspaces while preserving task-relevant anomalies. we further propose a novel privacy-aware evaluation framework to rigorously quantify the trade-off between model utility and ethical alignment. Our analysis uncovers how projection layers filter sensitive information, why this improves transparency, and under what conditions ethical design also enhances robustness. Extensive experiments demonstrate that our approach effectively minimizes privacy risks without compromising anomaly detection performance, offering a principled path toward trustworthy video analysis.

Abstract:
In sampling tasks, it is common for target distributions to be known up to a normalizing constant. However, in many situations, even evaluating the unnormalized distribution can be costly or infeasible. This issue arises in scenarios such as sampling from the Bayesian posterior for tall datasets and the 'doublyintractable' distributions. In this paper, we begin by observing that seemingly different Markov chain Monte Carlo (MCMC) algorithms, such as the exchange algorithm, PoissonMH, and TunaMH, can be unified under a simple common procedure. We then extend this procedure into a novel framework that allows the use of auxiliary variables in both the proposal and the acceptance--rejection step. Several new MCMC algorithms emerge from this framework that uses estimated gradients to guide the proposal moves. They have demonstrated significantly better performance than existing methods on both synthetic and real datasets. We also develop theory for the new framework and use it to simplify and extend results for existing algorithms.

Abstract:
Incorporating textual information into timeseries forecasting holds promise for addressing event-driven non-stationarity; however, a fundamental modality gap hinders effective fusion: textual descriptions express temporal impacts implicitly and qualitatively, whereas forecasting models rely on explicit and quantitative signals. Through controlled semi-synthetic experiments, we show that existing methods over-attend to redundant tokens and struggle to reliably translate textual semantics into usable numerical cues. To bridge this gap, we propose \method, which introduces a Temporal Evolution Semantic Space as an intermediate bottleneck between modalities. This space consists of interpretable, numerically grounded temporal primitives—mean shift, volatility, shape, and lag—extracted from text by an LLM via structured prompting and filtered through confidence-aware gating. Experiments on four real-world datasets demonstrate up to a 29% reduction in forecasting error compared to state-of-the-art uni-modal and multimodal baselines. The code is available at https://anonymous.4open.science/r/MMTSF.

Abstract:
Causal discovery from observational count data poses unique challenges, particularly when the data exhibit inherent branching structures, e.g., an upstream event (e.g., an ad impression) triggers a downstream event (e.g., a purchase) with a certain probability. Such branching dynamics are naturally captured by thinning operators (for the branching structure) and an independent Poisson distribution (for exogenous noise), constituting the Poisson Branching Structural Causal Model (PBSCM). However, existing approaches based on PB-SCM rely on the restrictive assumption of causal sufficiency, failing to account for ubiquitous latent confounders that can bias estimation. In this work, we propose the Latent Confounding Poisson Branching Structural Causal Model (LC-PB-SCM) to bridge this gap. We leverage Probability Generating Functions (PGFs) to characterize the complex dependencies introduced by latent confounding. Then, we establish a Trie representation theorem that maps the branching causal mechanisms to the algebraic properties of PGF monomials. Based on local PGFs, we establish a complete identifiability condition for local 3-variables that covers all causal patterns distinguishable up to monomial equivalence. Finally, we propose a practical algorithm to learn causal structures under latent confounding and demonstrate its effectiveness through experiments on both synthetic and real-world datasets.

Abstract:
Diffusion Large Language Models (dLLMs) break the rigid leftto-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential. Indeed, for specific constraint satisfaction tasks (e.g., sudoku puzzles), this capability has proven to be highly advantageous. However, in this paper, we reveal that for general reasoning tasks (e.g., mathematics and coding), arbitrary order generation may in fact limit the reasoning potential of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of solution coverage. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning can be better elicited by simply forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach,JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs.

Abstract:
Modeling complex dynamics from observational data is fundamental to scientific discovery and artificial intelligence. However, existing approaches ranging from Neural ODEs to diffusion models are often plagued by the entanglement of static state representations and instantaneous motion, leading to accumulated errors and offmanifold hallucinations where predicted trajectories violate intrinsic geometric constraints. To address this, we propose Geometric Flow Grounding, a unified framework that enforces dynamic evolution strictly along the tangent bundle of the learned data manifold via a differentiable Neural Tangent Projection Layer. By geometrically decoupling state representation from tangential dynamics, our method generalizes across diverse data regimes. In the context of scientific discovery, we demonstrate that the projection layer eliminates numerical aliasing in sparse dynamical systems and recovers interpretable gene regulatory motifs from single-cell data by disentangling states from developmental velocities. Bridging to trustworthy AI, we further repurpose the geometric projection residual as a zero-shot metric for deepfake video detection, identifying generative inconsistencies against the implicit flow of pre-trained world models. Our results establish manifold-constrained projection as a universal operator for both discovering natural laws and verifying synthetic content.

Abstract:
Research on the expressive power of neural networks has identified the minimum depth and width of neural networks that enable universal approximation and memorization. However, existing results are derived under exact arithmetic and cannot be directly applied to real implementations on computers, which can only use a finite set of numbers and inexact machine operations with roundoff errors. In this work, we study floating-point ReLU networks that have floating-point parameters and use floating-point operations. Specifically, we investigate their minimum depth and width to represent all functions from the set of floating-point vectors \mathbb F^d to the set of floating-point numbers \mathbb F. We first show that the minimum depth for representing all functions from \mathbb F^d to \mathbb F is exactly three, where two layers can be sufficient if we consider a smaller domain and/or codomain. We further show that the minimum width for representing all functions from \mathbb F^d to \mathbb F lies between 2d and 2d+4. In addition, if we restrict the domain to non-negative floats, it lies between d and d+4, where it can be smaller for a smaller domain, even beyond d. Our results show that the existing results analyzed under exact arithmetic do not extend to the floating-point setup.

Abstract:
Pretrained Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how pre-trained priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM’s familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors (“decision stickiness”), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 36.4%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), a metric measuring alignment between a model’s internal concept and the task definition. After controlling for dataset-level confounds, DSF shows positive association with model performance (partial r = +0.34), while text memorization as measured by ROUGE-L shows no positive association (partial r = −0.19). Overall, these findings suggest clear limits on prompt-based correction in annotation tasks and underscore the importance of definition alignment over text-level memorization.

Abstract:
Despite the fast progress in neural operator learning, longsequence modeling still is a standing challenge whereby latent states have been introduced with techniques well derived. Diverging from existing methods that treat latent states as transient variables or decoupled representations, CoEvol-NO introduces a persistent state to establish a co-evolutionary framework, where the latent state and mesh sequence are updated jointly and bidirectionally. Inspired by classical numerical methods, we model the layer-wise state evolution as a Predictor-Corrector (PC) process. Specifically, aPredictor'' generates a tentative target, followed by aCorrector'' that refines the persistent state via an error-driven update mechanism. Furthermore, our theoretical analysis reveals that the widely used direct substitution and residual update paradigms are essentially first-order approximations of this error-driven correction under different loss assumptions. We theoretically prove that CoEvol-NO achieves strict linear time complexity. Extensive experiments on five standard benchmarks and two large-scale industrial design tasks demonstrate that CoEvol-NO consistently achieves state-of-the-art (SOTA) performance.

Abstract:
Supervised finetuning (SFT) is central to aligning large language models (LLMs) with instruction following and task-specific reasoning. Despite its success, SFT optimizes token-level likelihoods under the implicit assumption that strictly fitting all tokens in expert demonstrations induces the desired downstream behavior. However, in reasoning tasks where correctness is defined by logical validity or final outcomes rather than exact token realizations, this assumption can lead to optimization misalignment. We empirically observe that low-probability tokens in reasoning demonstrations often correspond to realization-specific or stylistic variations, and that reducing their influence during training consistently improves generalization on reasoning benchmarks. Motivated by this insight, we propose theBounded Log-Likelihood Loss(BLL-Loss), a simple and parameter-free alternative to standard likelihood training that bounds gradient contributions from low-probability tokens while preserving conventional optimization behavior. We provide theoretical insights and extensive empirical results demonstrating that BLL-Loss improves reasoning generalization across diverse model scales and challenging benchmarks.

Abstract:
Constraintbased causal discovery is widely used for learning causal structures, but heavy reliance on conditional independence (CI) testing makes it computationally expensive in high-dimensional settings. To mitigate this limitation, many divide-and-conquer frameworks have been proposed, but most assume causal sufficiency, i.e., no latent variables. In this paper, we show that divide-and-conquer strategies can be theoretically generalized beyond causal sufficiency to settings with latent variables. Specifically, we propose a recursive decomposition framework, termed DiCoLa, that enables divide-and-conquer causal discovery in the presence of latent variables. It recursively decomposes the global learning task into smaller subproblems and integrates their solutions through a principled reconstruction step to recover the global structure. We theoretically establish the soundness and completeness of the proposed framework. Extensive experiments on synthetic data demonstrate that our approach significantly improves computational efficiency across a range of causal discovery algorithms, while experiments on a real-world dataset further illustrate its practical effectiveness.

Abstract:
Multimodal Large Language Models (MLLMs) frequently produce hallucinations (i.e., assertions that contradict the image or facts), undermining reliability in highrisk applications. Existing detection approaches typically feed images and texts jointly and estimate hallucination scores by measuring the consistency of model outputs. However, because the visual module often lags behind the language module in understanding and reasoning, MLLMs can repeatedly produce similar yet incorrect answers, yielding deceptively high measured trustworthiness and therefore missed detections. To address this, we propose a simple yet effective model-agnostic method, dubbed Decoupled Object-level Understanding and Bridging via vMF-based Trustworthiness (DOUBT). DOUBT i) elicits richer object-aware responses by decoupling object recognition from relational reasoning via a two-step prompting scheme (Object-level Understanding and Bridging, OUB), and ii) measures reliability with a von Mises–Fisher (vMF)-based trustworthiness metric that is more stable than semantic-entropy metrics under small-sample regimes. Specifically, OUB first prompts the model to list recognized objects, and then conditions chain-of-thought reasoning on those objects to produce object-bridged responses. For trustworthiness estimation, we replace conventional measures with the proposed vMF-based metric, which is robust even under low-sample settings and exhibits smoother behavior than prior techniques. Extensive experiments and ablation studies across multiple benchmarks demonstrate that DOUBT consistently outperforms state-of-the-art baselines, offering a robust and generalizable solution for hallucination detection in MLLMs.

Abstract:
The emergence of Vibe Researching is transforming scientific research into an interactive workflow, where agents orchestrate complex tasks via the Model Context Protocol (MCP). In this ecosystem, scientific tools must evolve from offline simulators into responsive Agent Skills. However, diffusionbased protein docking models—a core component of the current deep learning infrastructure for structural biology—suffer from excessively high latency, rendering them incompatible with real-time agentic interaction. To bridge this gap, we present a compute-efficient vertical foundation model that synergizes architectural optimization with generative consistency. First, we leverage Progressive Consistency Regularization (PCR) to compress complex generative dynamics into a few-step predictor, achieving sub-second latency. Second, we propose Residual Quantization, using mixed-precision on residual streams to alleviate memory bottlenecks while preserving numerical precision. Our approach achieves state-of-the-art (SOTA) docking accuracy while attaining a two-order-of-magnitude speedup (>300×) over AlphaFold3, establishing a new efficiency standard for high-throughput virtual screening. By transforming molecular docking into an interactive, real-time tool, this work establishes a scalable, deep-learning infrastructure for the next generation of AI-driven drug discovery.

Abstract:
Jailbreak attacks bypass LLM safety alignment, yet their mechanisms remain poorly understood. We provide evidence that attacks do not eliminate safety features but selectively suppress specific attention heads. We identify two functionally differentiated types:Adversarially Compromised Heads (ACHs)concentrated in early layers, which are suppressed under attacks; andSafetyAligned Heads (SAHs)in mid-layers, which maintain robust activations even when attacks succeed. Ablation studies support their causal roles: suppressing a small number of ACHs is sufficient to induce jailbreak-like behavior on normally refused inputs, while removing SAHs substantially weakens mid-layer safety activations. Token-level attribution further shows that ACH suppression is driven specifically by attack-template tokens. This provides a mechanistic account of why attacks bypass refusal decisions through ACH suppression, yet may not fully eliminate the internal safety signals sustained by SAHs---a phenomenon we termRobust Harmful Features. To validate the practical significance of this robustness, we show that simply reading these persistent activations---without any training---yields a detection signal competitive with dedicated safety models on most benchmarks.

Abstract:
Normative theories allow one to elicit key parts of a ML algorithm from first principles, which is crucial at a time of championed scrutiny for ML work. Direct Preference Optimization (DPO) cleverly bypasses reward modeling by making an explicit link with a specific normative model of human choice. Our paper elevates this connection to the full generality of DPO's normative framework. Getting there requires reworking social choice theory's textbook path for a better RLHF/ML fit. It elevates the connection to a remarkably broad viewpoint on preference optimization, considering the current panorama of DPO followups. It also unveils unexpected riches for ML, chief among which the support fornon-convexlosses, the fact thatanycompliant ML analytical choice can be embedded withanyhuman choice model, and a normative framework's umbrella wide enough to safeguard DPO'sextensions(margins, length correction, ...). Atoyexperiment ``far away'' from the DPO crowd is given.

Abstract:
Multiagent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings instead of text. Then, a shared latent working memory preserves and transfers each agent's internal representations and latent thoughts, ensuring lossless information exchange without re-encoding. We provide detailed theoretical analyses showing that LatentMAS achieves higher expressiveness and lossless information preservation with lower overall complexity than standard text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS outperforms advanced single agents and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4×-4.3× faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while providing consistent efficiency gains.

Abstract:
Pretrained models with parameter-efficient fine-tuning have shown strong effectiveness in Class-Incremental Learning (CIL), which seeks to balance model plasticity and stability. In this context, orthogonality constraints can significantly enhance model stability, yet their reliance on subspace inevitably compromises model plasticity over long tasks. To address this, we propose Gradient-Recycling Low-Rank Adaptation (GR-LoRA), which reconciles stability and plasticity by recycling the gradients discarded in orthogonal projection. Specifically, GR-LoRA recycles post-decomposition non-orthogonal gradient components into task-specific lightweight modules and selects optimal module via entropy to improve plasticity, while incorporating local and global mismatch suppression to preserve stability by synthesizing out-of-distribution representations across all tasks. Theoretical analysis confirms that this recycling strategy preserves stability and improves plasticity. Experimental results from multiple CIL benchmarks verify the effectiveness and general applicability of GR-LoRA.

Abstract:
Irregular multivariate time series pose a fundamental tradeoff for long-horizon forecasting: discrete methods can distort temporal structure via re-gridding, while continuous-time models often rely on sequential numerical solvers that are prone to drift. To bridge this gap, we present the Latent Laplace Diffusion (LLapDiff), a generative framework that models the target as a low-dimensional latent trajectory, enabling horizon-wide generation without step-by-step integration over physical time. We guide the reverse process using a stable modal parameterization motivated by stochastic port-Hamiltonian dynamics, and parameterize its mean evolution in the Laplace domain via learnable complex-conjugate poles, allowing for direct evaluation over irregular timestamps. Moreover, we link continuous dynamics to irregular observations through renewal-averaging analysis, which maps sampling gaps to effective event-domain poles and theoretically motivates a gap-aware history summarizer for conditioning. Extensive experiments demonstrate that LLapDiff consistently outperforms baselines in long-horizon forecasting, and its continuous-time generative nature also supports missing-value imputation by querying the same model at historical timestamps.

Abstract:
Closedloop evaluation of autonomous-driving policies requires interactive simulation beyond log replay. However, existing generative world models often degrade in closed loop due to (i) history-free initialization that mismatches policy inputs, (ii) multi-step sampling latency that violates real-time budgets, and (iii) compounding kinematic infeasibility over long horizons. We propose VectorWorld, a streaming world model that incrementally generates ego-centric 64 \mathrmm× 64\mathrmm lane--agent vector-graph tiles during rollout. VectorWorld aligns initialization with history-conditioned policies by producing a policy-compatible interaction state via a motion-aware gated VAE. It enables real-time outpainting via solver-free one-step masked completion with an edge-gated relational DiT trained with interval-conditioned MeanFlow and JVP-based large-step supervision. To stabilize long-horizon rollouts, we introduce \DeltaSim, a physics-aligned non-ego (NPC) policy with hybrid discrete--continuous actions and differentiable kinematic logit shaping. On Waymo open motion and nuPlan, VectorWorld improves map-structure fidelity and initialization validity, and supports stable, real-time 1\mathrmkm+ closed-loop rollouts.

Abstract:
Evaluating large language models (LLMs) typically requires thousands of benchmark items, making the process expensive, slow, and increasingly impractical at scale. Existing evaluation protocols rely on average accuracy over fixed item sets, treating all items as equally informative despite substantial variation in difficulty and discrimination. We introduce ATLAS, an adaptive testing framework based on Item Response Theory (IRT) that estimates model ability using Fisher information–guided item selection. ATLAS reduces the number of required items by up to 90% while maintaining measurement precision. For instance, it matches wholebank ability estimates using only 41 items (0.157 MAE) on HellaSwag (5,600 items). We further reconstruct accuracy from ATLAS's ability estimates and find that reconstructed accuracies closely match raw accuracies across all five benchmarks, indicating that ability preserves the global performance structure. At the same time, provides finer discrimination within accuracy-equivalent models: among more than 3,000 evaluated models, 23--31% shift by more than 10 rank positions, and models with identical accuracies receive meaningfully different ability estimates. Code and calibrated item banks available at https://anonymous.4open.science/r/ATLAS-3210/README.md.

Abstract:
Uniform Discrete Diffusion (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively adapting GRPO to UDM leads to unstable training and marginal performance. To address this, we propose \Ours, the first framework that integrates UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample, rather than intermediate predicted sample, as the action provides more accurate and stable optimization signals; and (ii) adopting the forward process to reconstruct the training trajectories helps the model learn probability paths that are more consistent with pretraining. For efficiency, we introduce ReductionStep and CFG-Free training strategies. \Ours significantly improves the performance of the base model across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy improves from 4% to 57%, further validating the effectiveness and generalization capability of our method.

Abstract:
We propose a novel algorithm for incremental Byte Pair Encoding (BPE) tokenization. The algorithm processes each input byte in worstcase \mathcalO(\log^2 t) time, leading to an overall complexity of \mathcalO(n \log^2 t), where n is the input length and t is the maximum token length. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure defined by a fixed set of merge rules. This enables efficient partial tokenization in streaming settings. Functioning as a drop-in replacement for standard BPE, our approach achieves up to ～3× speedups over Hugging Face's tokenizers, and significant latency reductions over OpenAI's tiktoken on pathological inputs. We further introduce an eager output algorithm that enables streaming output, emitting tokens as soon as token boundaries are determined during incremental tokenization. Overall, our results demonstrate that BPE tokenization can be performed incrementally with strong worst-case guarantees, while providing practical latency benefits in modern large language model pipelines.

Abstract:
Imageto-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalizetext-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we proposeRelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks,ExtremeOcc-3DandAmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity. Code and datasets will be released.

Abstract:
Model soups are strange and strangely effective combinations of parameters. They take a model (the stock), finetune it into multiple models (the ingredients), and then mix their parameters back into one model (the soup) to improve predictions. While all known soups require supervised learning, and optimize the same loss on labeled data, our recipes for Self-Soupervision generalize soups to self-supervised learning (SSL). Our Self-Souping lets us flavor ingredients on new data sources, e.g. from unlabeled data from a task for transfer or from a shift for robustness. We show that Self-Souping on corrupted test data, then fine-tuning back on uncorrupted train data, boosts robustness by +3.5% (ImageNet-C) and +7% (LAION-C). Self-Soupervision also unlocks countless SSL algorithms to cook the diverse ingredients needed for more robust soups. We show for the first time that ingredients can differ in their SSL hyperparameters---and more surprisingly, in their SSL algorithms. We cook soups of MAE, MoCoV3, and MMCR ingredients that are more accurate than any one single SSL ingredient.

Abstract:
Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrentdepth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup.

Abstract:
How can we use AI to discover a new state of the art for a scientific problem? Prior work in testtime scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.

Abstract:
Humans easily determine which color belongs to which shape in multiobject scenes, an ability known as concept binding. Vision–language embedding models such as CLIP struggle with binding: they recognize individual concepts but fail to represent which concepts form which objects. Although CLIP behaves like a bag-of-concepts model in cross-modal retrieval, object information is recoverable from its image and text embeddings separately. We study this tension through the binding function, which maps concepts to scene embeddings. We find that scene embeddings decompose additively into object representations, explaining why uni-modal probes can recover object information. However, CLIP’s binding function is high-complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. We then ask whether this limitation is fundamental. We show that it is not. In controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage. These models learn low-complexity binding functions characterized by multiplicative interactions between concepts, enabling systematic generalization.

Abstract:
Differentiable optimization studies how to embed a mathematical program as a differentiable layer in machine learning pipelines. However, existing approaches typically rely on implicit differentiation, involving expensive Hessian computation while differentiating through optimality conditions. To address this challenge, we formulate the differentiable optimization problem as a bilevel optimization instance. We construct a new activeset Lagrangian as a proxy to compute an \epsilon-approximate hypergradient using only near-constant O(\log (1/\epsilon)) first-order information. We also show that applying this efficient hypergradient oracle to constrained bilevel optimization improves the overall gradient complexity to \tildeO(\delta^-1\epsilon^-3) to reach a (\delta, \epsilon)-Goldstein stationary point. We implement our method `FFOLayer`, as a drop-in Python library compatible with existing differentiable optimization solvers. Our algorithm shows significantly faster computation with similar convergence compared to other existing solvers. Our code is available [here](https://anonymous.4open.science/anonymize/FFOLayer-B78B).

Abstract:
Sparse MatrixVector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly under the low, unstructured sparsity (30-90%) commonly observed in pruned LLMs, unstructured pruning provides only limited memory reduction and speedup. We propose DELTA4-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while remaining compatible with the GPU’s execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units or precomputation. We identify memory bandwidth as the primary limiting factor of SpMV and analyze the storage overhead of DELTA4. At 50% sparsity, DELTA4 is the first approach to achieve 1.5× memory reduction and 1.2-1.5× speedup over the dense baseline as well as substantial improvements over other SpMV methods: cuSPARSE (2.8-13.0×), Sputnik (1.9-2.6×), and DASP (2.2-2.5×). An LLM pruned with Wanda to sparsity 50% requires 1.5× less memory and achieves 1.5× faster inference at fp16 precision. As a result, unstructured pruning at 50% sparsity becomes practical for real-world LLM workloads and bridges the efficiency gap with structured 2:4 sparsity.

Abstract:
Standard mixedprecision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7 billion parameter model can be impractical for researchers with less than 100GB of accelerator memory. We introduce FlashOptim, a suite of optimizations that reduces per-parameter memory by over 50% while preserving model quality and API compatibility. Our approach introduces two key techniques. First, we improve master weight splitting by finding and exploiting a tight bound on its quantization error. Second, we design companding functions that greatly reduce the error in 8-bit optimizer state quantization. Together with 16-bit gradients, these techniques reduce AdamW memory from 16 bytes to 7 bytes per parameter, or 5 bytes with gradient release. They also cut model checkpoint sizes by more than half. Experiments with FlashOptim applied to SGD, AdamW, and Lion show no measurable quality degradation on any task from a collection of standard vision and language benchmarks, including Llama-3.1-8B finetuning.

Abstract:
This paper proposes a synergy of amortised and particlebased methods for sampling from distributions defined by unnormalised density functions. We state a connection between sequential Monte Carlo (SMC) and neural sequential samplers trained by maximum-entropy reinforcement learning (MaxEnt RL), wherein learnt sampling policies and value functions define proposal kernels and twist functions. Exploiting this connection, we introduce an off-policy RL training procedure for the sampler that uses samples from SMC -- using the learnt sampler as a proposal -- as a behaviour policy that better explores the target distribution. We describe techniques for stable joint training of proposals and twist functions and an adaptive weight tempering scheme to reduce training signal variance. Furthermore, building upon past attempts to use experience replay to guide the training of neural samplers, we derive a way to combine historical samples with annealed importance sampling weights within a replay buffer. On synthetic multi-modal targets (in both continuous and discrete spaces) and the Boltzmann distribution of alanine dipeptide conformations, we demonstrate improvements in approximating the true distribution as well as training stability compared to both amortised and Monte Carlo methods.

Abstract:
Sparse Mixture of Experts (SMoE) architectures improve the training efficiency of Large Language Models (LLMs) by routing input tokens to a selected subset of specialized experts. Despite their remarkable success, both training and inference in SMoE models suffer from theexpert collapseissue (Chi et al., 2022a), which degrades model performance. Prior studies primarily focus on improving the router; however, such methods rely on training from scratch or finetuning, which requires high computational and data-processing costs. Furthermore, we demonstrate that, despite these efforts, the issue persists when advancing well-pretrained SMoE models, as evidenced by both theoretical and empirical results. To fill that gap, we analyze the advanced SMoE models and observe that the eigenvectors of expert weight matrices encode rich semantic information, pointing to an effective alternative to conventional routing strategies. Building on this insight, we proposeSingular Value Decomposition SMoE (SSMoE), a novel andtraining-freeframework that leverages spectral properties of the expert weights to address the collapse issue and enhance model performance. Extensive experiments across diverse language and vision tasks, under both clean and corrupt data settings, demonstrate the strong generalization and robustness of SSMoE. Our findings highlight how a deeper understanding of model internals can guide the development of more effective SMoE architectures.

Abstract:
In discrete generative modeling, two dominant paradigms demonstrate divergent capabilities: Masked Diffusion Language Models (MDLM) excel at semantic understanding and zeroshot generalization, whereas Uniform-noise Diffusion Language Models (UDLM) achieve strong few-step generation quality, yet neither attains balanced performance across both dimensions. To address this, we propose XDLM, which bridges the two paradigms via a stationary noise kernel. XDLM offers two key contributions: (1) it provides a principled theoretical unification of MDLM and UDLM, recovering each paradigm as a special case; and (2) an alleviated memory bottleneck enabled by an algebraic simplification of the posterior probabilities. Experiments demonstrate that XDLM advances the Pareto frontier between understanding capability and generation quality. Quantitatively, XDLM surpasses UDLM by 5.4 points on zero-shot text benchmarks and outperforms MDLM in few-step image generation (FID 54.1 vs. 80.8). When scaled to tune an 8B-parameter large language model, XDLM achieves 15.0 MBPP in just 32 steps, effectively doubling the baseline performance. Finally, analysis of training dynamics reveals XDLM’s superior potential for long-term scaling.

Abstract:
Training agentic models for terminalbased tasks critically depends on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: \emphExecutability, since each instance requires a suitable and often distinct Docker environment; and \emphVerifiability, because heterogeneous task outputs preclude unified, standardized verification. To address these challenges, we propose TerminalTraj, a scalable pipeline that (i) filters high-quality repositories to construct Dockerized execution environments, (ii) generates Docker-aligned task instances, and (iii) synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains. Models trained on this data with the Qwen2.5-Coder backbone achieve consistent performance improvements on TerminalBench (TB), with gains of up to 20% on TB 1.0 and 10% on TB 2.0 over their respective backbones. Notably, TerminalTraj-32B achieves strong performance among models with fewer than 100B parameters, reaching 35.30% on TB 1.0 and 22.00% on TB 2.0, and demonstrates improved test-time scaling behavior.

Abstract:
Language models are increasingly capable and are being rapidly deployed on a populationlevel scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10–20× fewer samples. For example, we can estimate probability of harmful outputs on the order of 10^−4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that rare-event estimation is both critical and feasible for safety evaluations.

Abstract:
The annealing stage of Large Language Model (LLM) training is a critical phase where model loss drops sharply and downstream capabilities solidify. Despite its importance, current practices rely on empirical heuristics like quality filtering or context extension, lacking a principled understanding of the underlying optimization dynamics. We address this gap by providing a theoretical characterization of the spectral properties targeted during annealing. We demonstrate that effective annealing requires balancing global Hessian geometry with samplewise gradient noise, navigating a landscape of highly anisotropic curvature. Based on these insights, we formulate sample selection as a constrained optimization problem to suppress noise in sharp directions while preserving descent signals in flat subspaces. Our method, solved via Successive Convex Programming (SCP), achieves state-of-the-art results across multiple model scales. Code is available at \urlhttps://anonymous.4open.science/r/LLM-Annealing-Phase.

Abstract:
The Muon optimizer has recently attracted considerable attention for its strong empirical performance and use of orthogonalized updates on matrixshaped parameters, yet its underlying mechanisms and relationship to adaptive optimizers such as Adam remain insufficiently understood. In this work, we aim to address these questions through a unified spectral perspective. Specifically, we view Muon as the ( p = 0 ) endpoint of a family of spectral transformations of the form ( \boldsymbolU \boldsymbol\Sigma^p \boldsymbolV^\top ), and consider additional variants with ( p = \frac12 ), ( p = \frac14 ), and ( p = 1 ). These transformations are applied to both first-moment updates, as in momentum SGD, and to root-mean-square (RMS) normalized gradient updates as in Adam. To enable efficient computation, we develop a coupled Newton iteration that avoids explicit singular value decomposition. Across controlled experiments, we find that RMS-normalized updates yield more stable optimization than first-moment updates. Moreover, while spectral compression provides strong stabilization benefits under first-moment updates, the Muon update (( p = 0 )) does not consistently outperform Adam. These results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method.

Abstract:
Achieving a complete and explorable 360degree visual world is a cornerstone of immersive content creation. While recent advances in video generation have achieved impressive results, they follow a 2D paradigm that treats content generation as transitions of 2D pixels, lacking an intrinsic understanding of the physical 3D world, resulting in frequent geometric inconsistencies. To achieve an explorable and physical-consistent visual world, the generation process should shift to a 3D paradigm: the visual content is governed by the physical relationships of the entire 3D environment together with 3D motion signals. However, under this setting, the conventional modeling methods and control signals, such as spatial attention computation in a 2D space, become unsuitable and ineffective. To address this, we propose PanoWorld-X for explorable 3D scene video generation. Our framework is built on the panoramic representation, which naturally maps a 3D scene into a standard format and provides an ideal basis for consistency. Specifically, we first develop a data curation pipeline to produce high-quality and large-motion 3D scene evolution with movement trajectories. To achieve precise control, we design the Exploration Panoramic Plücker Embedding (PPE), a guidance signal tailored for 3D motion. Furthermore, leveraging the spherical geometric properties of panoramic data, we propose a sphere-aware attention mechanism, which can capture true geometric adjacency by reprojecting features onto a spherical surface. Extensive experiments demonstrate that PanoWorld-X achieves superior performance in motion range, control precision, and visual quality, underscoring its potential for real-world applications.

Abstract:
Large language models (LLMs) are now used to power complex multiturn agentic workflows. Existing services run agentic inference by assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although agentic workflows involve multiple LLM and tool requests, existing services make scheduling decisions on a per-request basis, without end-to-end knowledge of the workflow. This leads to sub-optimal management of KV-caches and tool execution environments. To address the challenges, we propose \ouralg, an inference system that is aware of the end-to-end agent workflow. We abstract agentic workflows as LLM Programs, enabling a unified view of heterogeneous resources, including KV caches, system states, and external tool assets such as disk memory and network ports. \ouralg introduces a program-aware scheduler and a tool resource manager designed to maximize KV cache hit rates, mitigate memory imbalances, and enable asynchronous environment preparation. Evaluations across coding, routing, and scientific discovery agents demonstrate that \ouralg achieves1.5-3.6xthroughput improvements in serving,1.8-3.9xin RL rollout, and up to4.2xdisk memory savings compared to state-of-the-art inference systems.

Abstract:
Model ensembling is a wellestablished technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large language models (LLMs), yielding improved performance but incurring substantial computational cost. This inefficiency stems from directly applying conventional ensemble implementation to LLMs, which require a separate forward pass for each model to explicitly compute the ensemble distribution. In this paper, we propose the Mixture-model-like Ensemble (ME). By reinterpreting the ensemble as a mixture model, ME stochastically selects a single model at each step to generate the next token, thereby avoiding the need to explicitly compute the full ensemble distribution. ME is mathematically equivalent to sampling from the ensemble distribution, but requires invoking only one model, making it 1.78×-2.68× faster than conventional ensemble. Furthermore, this perspective connects LLM ensembling and token-level routing methods, suggesting that LLM ensembling is a special case of routing methods. Our findings open new avenues for efficient LLM ensembling and motivate further exploration of token-level routing strategies for LLMs. Our code is available at https://anonymous.4open.science/r/Mixture-model-like-Ensemble/.

Abstract:
Realworld multimodal learning is often hindered by missing modalities. While Incomplete Multimodal Learning (IML) has gained traction, existing methods typically rely on the unrealistic assumption of full-modal availability during training to provide reconstruction supervision or cross-modal priors. This paper tackles the more challenging setting of IML under training-time incomplete observations, which precludes reliance on a "God's eye view" of complete data. We propose LIMSSR (LLM-Driven Incomplete Multimodal Sequence-to-Score Reasoning), a framework that reformulates this challenge as a conditional sequence reasoning task. LIMSSR leverages the semantic reasoning capabilities of Large Language Models via Prompt-Guided Context-Aware Modality Imputation and Multidimensional Representation Fusion to infer latent semantics from available contexts without direct reconstruction. To mitigate hallucinations, we introduce a Mask-Aware Dual-Path Aggregation to dynamically calibrate inference uncertainty. Extensive experiments on three Action Quality Assessment datasets demonstrate that LIMSSR significantly outperforms state-of-the-art baselines without relying on complete training data, establishing a new paradigm for data-efficient multimodal learning. Code will be released upon acceptance.

Abstract:
Diffusion models generate highly realistic images but often struggle with precise text–image alignment. While recent posttraining methods improve alignment using external rewards or human preference signals, their performance heavily depends on reward quality and does not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods.

Abstract:
Training 3D Gaussian Splatting (3DGS) at billionprimitive scale is fundamentally memory-bound: each Gaussian carries a large attribute vector, and the aggregate parameter table quickly exceeds GPU capacity, limiting prior systems to tens of millions of Gaussians on consumer hardware. We observe that 3DGS training is inherently sparse and trajectory-conditioned: each iteration activates only the Gaussians visible from the current camera batch, so GPU memory can serve as a working-set cache rather than a persistent parameter store. Building on this insight, we introduce TideGS, an out-of-core training framework that manages parameters across an SSD--CPU--GPU hierarchy via three synergistic techniques: block-virtualized geometry for SSD-aligned spatial locality, a hierarchical asynchronous pipeline to overlap I/O with computation, and trajectory-adaptive differential streaming that transfers only incremental working-set deltas between iterations. Experiments show that TideGS enables training with over one billion Gaussians on a single consumer GPU while achieving state-of-the-art reconstruction quality on large-scale scenes, exceeding prior out-of-core baselines (e.g., ～100M Gaussians) and standard in-memory training (e.g., ～11M Gaussians).

Abstract:
We study offline reinforcement learning of styleconditioned policies using explicit style supervision via subtrajectory labeling functions. In this setting, aligning style with high task performance is particularly challenging due to distribution shift and inherent conflicts between style and reward. Existing methods, despite introducing numerous definitions of style, often fail to reconcile these objectives effectively. To address these challenges, we propose a unified definition of behavior style and instantiate it into a practical framework. Building on this, we introduce Style-Conditioned Implicit Q-Learning (SCIQL), which leverages offline goal-conditioned reinforcement learning techniques, such as hindsight relabeling and value learning, and combine it with a new Gated Advantage Weighted Regression mechanism to efficiently optimize task performance while preserving style alignment. Experiments demonstrate that SCIQL achieves superior performance on both objectives compared to prior offline methods.

Abstract:
Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains taskspecific experts. We show that in smaller or insufficiently trained models such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models the density of task-experts increases dramatically, so that diverse specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post-training method that samples N parameter vectors at random, selects the top K, and ensembles them via majority vote to combine complementary expertise. Despite its simplicity, this approach is competitive with standard post-training methods such as PPO, GRPO, and ES for contemporary large-scale models.

Abstract:
Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (ARLLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. However, RL algorithms that are well-suited for dLLMs' unique characteristics have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge in the implementation with a small training batch size and propose several effective solutions through a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, with an accuracy improvement of up to 54.3% over previously SOTA baselines and 66.41% over the base model, underscoring the effectiveness of the distribution matching framework.

Abstract:
Data selection has emerged as a crucial downstream application of data valuation, yet the theoretical foundations for using data values in selection remain underexplored. We reformulate data selection as a sequential decisionmaking problem where the optimal selection sequence arises from dynamic programming, and data values can be understood as encodings of this optimal sequence. This framework unifies and reinterprets existing methods like Data Shapley through the lens of approximate dynamic programming, revealing them as myopic linear approximations to the sequential problem. We further analyze how selection optimality degrades with utility curvature under submodularity, explaining when and why these approximations fail. To bridge theory and practice, we propose an efficient bipartite graph-based surrogate that preserves submodular structure while enabling scalable greedy selection with provable guarantees. Experiments on classical ML benchmarks and large-scale LLM fine-tuning data selection demonstrate substantial improvements over existing methods.

Abstract:
Mean Field Games (MFGs) provide a principled framework for modeling interactions in large populations models: at scale, population dynamics become deterministic, with uncertainty entering only through aggregate shocks, orcommon noise. However, algorithmic progress has been limited since modelfree methods are too high variance and exact methods scale poorly. Recent Hybrid Structural Methods (HSMs) use Monte Carlo rollouts for the common noise in combination with exact estimation of the expected return, conditioned on those samples. However, HSMs have not been scaled to Partially Observable settings. We proposeRecurrent Structural Policy Gradient(RSPG), the first history-aware HSM. We also introduce MFAX, our JAX-based framework for MFGs. By leveraging known transition dynamics, RSPG achieves state-of-the-art performance as well as an order-of-magnitude faster convergence and solves, for the first time, a macroeconomics MFG with heterogeneous agents, common noise and history-aware policies. MFAX is publicly available at: .

Abstract:
While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce JustIn-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods. Crucially, JitRL outperforms the performance of computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://anonymous.4open.science/r/JitRL-D485.

Abstract:
Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for crossparadigm evaluation is lacking. Existing benchmarks test only an individual language/tool (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in each of Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in current models. While frontier models achieve tractable success in Dafny (40.3% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus (24.7%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers.

Abstract:
RL methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: Can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? We explore this with SOAR: A selfimprovement framework designed to surface these pedagogical signals through meta-RL. A teacher model proposes synthetic problems for a student model, and is rewarded with its improvement on a subset of hard problems, thus grounding the curriculum in real student progress rather than proxy rewards. Our study on the hardest subsets of math benchmarks (0/128 success) reveal three core findings. First, it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful problems. Second, grounded rewards outperform intrinsic rewards used in prior LLM self-play, reliably avoiding the typical instability and diversity collapse modes. Third, the structure and well-posedness of questions are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to solve the hard problems, paving a principled path to escape reasoning plateaus without additional curated data.

Abstract:
Harmful finetuning can invalidate safety alignment of large language models, exposing significant safety risks. In this paper, we utilize the attention sink mechanism to mitigate harmful fine-tuning. Specifically, we first measure a statistic namedsink divergencefor each attention head and observe thatdifferent attention heads exhibit two different signs of sink divergence. To understand its safety implications, we conduct experiments and find that the number of attention heads of positive sink divergence increases along with the increase of the model's harmfulness when undergoing harmful fine-tuning. Based on this finding, we propose a separable sink divergence hypothesis --attention heads associating with learning harmful patterns during fine-tuning are separable by their sign of sink divergence. Based on the hypothesis, we propose a fine-tuning-stage defense, dubbed Surgery. Surgery utilizes a regularizer for sink divergence suppression, which steers attention heads toward the negative sink divergence group, thereby reducing the model’s tendency to learn and amplify harmful patterns. Extensive experiments demonstrate that Surgery improves defense performance by 5.90%, 11.25%, and 9.55% on the BeaverTails, HarmBench, and SorryBench benchmarks, respectively. Source code is available on https://anonymous.4open.science/r/Surgery-A69E.

Abstract:
While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering realworld deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis. GUI-RobustEval containing 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates 800k high quality data via a tree-based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, fine-tuned on our dataset, both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, indicating that the enhanced long-horizon error recovery ability synergistically boosts robustness and overall performance. We will release our benchmark, dataset, and models to facilitate future research.

Abstract:
Aligning large language models (LLMs) with human values typically relies on posttraining or inference-time steering that directly manipulates the backbone’s parameters or representation space. However, a critical gap exists: the model’s residual stream is highly dynamic, in which values exist as fragile, low-dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1)independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2)explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative trajectory, ensuring robust adherence across diverse contexts without disrupting the backbone’s internal representations. Experiments across multiple backbones and safety benchmarks show that SVGT consistently reduces harmful scores by over 70% while maintaining generation fluency, demonstrating the efficacy of architecturally grounded value modeling.

Abstract:
Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for BrainComputer Interfaces (BCIs). A major limitation of prior work is the prevailing paradigm of specialized, single-task models, which curtails versatility and neglects inter-task synergies. To address this, we propose Mind-Omni, the first versatile framework that unifies seven distinct encoding and decoding tasks through a discrete diffusion paradigm. At its core is a novel Brain Tokenizer that transforms heterogeneous, continuous brain signals into standardized, discrete tokens. This enables direct, token-level interactions for mutual understanding and generation between any two or more modalities within a shared semantic space. To unlock advanced reasoning capabilities, we further curate a specialized Brain Question Answering (BQA) instruction-tuning dataset. Our model not only establishes a new state-of-the-art among multi-task unified frameworks but also provides strong evidence for multi-task synergy. By demonstrating performance competitive with, and at times superior to, larger specialized models, our work offers a powerful new paradigm for neural modeling and paves the way for foundation models of neural activity.

Abstract:
Large and diverse datasets are needed for training generalist robot policies that have potential to control a variety of robot embodimentsrobot arm and gripper combinations--across diverse tasks and environments. As re-collecting demonstrations and retraining for each new hardware platform are prohibitively costly, we show that existing robot data can be augmented for transfer and generalization. The Open X-Embodiment (OXE) dataset, which aggregates demonstrations from over 60 robot datasets, has been widely used as the foundation for training generalist policies. However, it is highly imbalanced: the top four robot types account for over 85% of its real data, which risks overfitting to robot--scene combinations. We present AugE-Toolkit, a scalable robot augmentation pipeline, and OXE-AugE, a high-quality open-source dataset that augments OXE with 9 different robot embodiments. OXE-AugE provides over 4.4 million trajectories, more than triple the size of the original OXE. We conduct a systematic study of how scaling robot augmentation impacts cross-embodiment learning. Results suggest that augmenting datasets with diverse arms and grippers improves policy performance not only on the augmented robots, but also on unseen robots and even the original robots under distribution shifts. In physical experiments, we demonstrate that generalist policies such as OpenVLA and \pi_0 benefit from fine-tuning on OXE-AugE, improving success rates by 24-45% on previously unseen robot-gripper combinations across four real-world manipulation tasks.

Abstract:
Supervised finetuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. Rather than proposing a single universally superior replacement loss, we systematically study various probability-based objectives and characterize when and why different objectives succeed or fail under varying conditions. Through comprehensive experiments and extensive ablation studies across 8 model backbones, 27 benchmarks, and 7 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., -p, -p^10, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability.

Abstract:
Point supervision has become a scalable solution to address dense annotation for infrared small target detection, but its performance is limited by two coupled bottlenecks: unstable pseudolabel evolution in cluttered, low-contrast infrared imagery and severe sample-distribution imbalance. In this paper, we present a more adaptive and stable framework to address these issues. Leveraging the intrinsic consistency between thermal radiation patterns and heat diffusion, we propose a physics-induced annotation strategy that expands single-point labels into reliable pseudo-masks. To further enhance supervision and alleviate sample imbalance, we develop a bi-level dual-update framework that jointly optimizes detector weights, sample weights, and diffusion parameters. A meta-classifier dynamically predicts sample-wise loss weights, while a differentiable diffusion module refines pseudo-labels with detection feedback, enabling adaptive interaction between training and hyperparameter optimization. Extensive experiments across multiple datasets demonstrate five-fold annotation acceleration, superior detection accuracy, and comparable performance with 30% of the training data, validating the efficiency and practicality of our approach.

Abstract:
Many scientific problems are underdetermined: multiple distinct hypotheses are equally consistent with the same observations. In such settings, effective inference requires not only producing valid explanations, but also systematically exploring and covering the admissible hypothesis set. We introduce HypoSpace, a benchmark that treats large language models (LLMs) as samplers over finite hypothesis spaces and evaluates them on three metrics: Validity, Uniqueness, and Recovery. HypoSpace spans three structured domains (causal graph inference, gravityconstrained 3D voxel reconstruction, and Boolean genetic interaction modeling) with deterministic validators and exactly enumerable solution spaces, plus real-world anchored case studies. Empirically, frontier LLMs exhibit a consistent failure mode: high Validity but sharp degradation in Uniqueness and Recovery as hypothesis spaces grow. We further show that stratified decoding partially mitigates this collapse, demonstrating HypoSpace's utility as a diagnostic benchmark for set-valued inference.

Abstract:
Although multimodal learning has advanced point cloud completion, the theoretical mechanisms remain unclear. Recent works attribute success to the connection between modalities, yet we identify that standard hard projection severs this connection, inducing Cross-Modal Entropy Collapse where sparse support hinders visual prior propagation. To bridge this gap, we propose SplAttN, which maximizes Point-wise Mutual Information via Differentiable Gaussian Splatting. By reformulating projection as continuous density estimation, SplAttN facilitates gradient flow and optimizes connection learnability. Extensive experiments show that SplAttN achieves state-of-the-art performance on PCN and ShapeNet-55/34. Crucially, we utilize the real-world KITTI benchmark as a stress test for multi-modal reliance. Counter-factual evaluation reveals that while baselines degenerate into unimodal template retrievers insensitive to visual removal, SplAttN maintains a robust dependency on visual cues, validating that our method establishes an effective cross-modal connection. Code is available at https://anonymous.4open.science/r/Anonymous-766B/.

Abstract:
Neural operators serve as fast, datadriven surrogates for scientific modeling but typically rely on a monolithic, single-pass inference procedure that struggles to resolve high-frequency details, a limitation known as spectral bias. We introduce the Iterative Refinement Neural Operator (IRNO), which augments pre-trained operators with a learned refinement module iteratively applied via fixed-point iteration. IRNO decomposes the prediction into a coarse initialization followed by successive residual corrections, paralleling classical numerical solvers. Under mild assumptions, we establish contraction of the induced operator, ensuring convergence to a unique fixed point. To explicitly target high-frequency errors, we propose a progressive spectral loss that adaptively increases penalty on high-frequency components over refinement steps during training. Across physical systems, IRNO consistently lowers error, with up to 56.05% improvement on turbulent flow. On Active Matter, spectral analysis reveals that, relative to base operator, the normalized error ratios decrease to 27.72–36.10% in low-, 5.07–6.68% in mid-, and 1.48–2.04% in high-frequencies, remaining stable beyond the trained iteration count.

Abstract:
Recent work has sought to understand Large Language Models (LLMs) reasoning, yet a principled, modelintrinsic signal that captures its layer-wise reasoning dynamics remains underexplored. We bridge this gap by demonstrating that the \ell_2 norm of hidden states serves as an endogenous signal of the model's reasoning intensity. Using Sparse Autoencoders (SAEs) as a diagnostic probe, we observe that LLMs' internal reasoning is marked by a sharp increase in reasoning feature activations concentrated in late layers. Motivated by this pattern, we establish a formal link between reasoning intensity and the model's latent geometry and theoretically prove that the \ell_2 norm of hidden states bounds the activation strength of SAE reasoning features. Empirical correlation analysis and causal interventions further prove \ell_2 norm as a faithful indicator, where heightened norms consistently correspond to critical reasoning steps. We then introduce three test-time scaling techniques guided by \ell_2 norms: Adaptive Layer-wise Reasoning Recursion, (ii) Endogenous Reasoning State Steering, and (iii) \ell_2-guided Response Selection, which requires no additional training or data and is compatible with advanced inference engines. Experiments across model architectures and benchmarks show that \ell_2-norm-based techniques significantly improve reasoning performance, offering a principled yet simple lens to perceive and control LLM latent reasoning dynamics. Our codes are anonymously available at https://anonymous.4open.science/r/The-Tell-Tale-Norm-4E40

Abstract:
An agent must try new behaviors to explore and improve. In highstakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded constraint functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.

Abstract:
Recently, studies exemplified by HyperConnections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.

Abstract:
Randomized selfreductions (RSRs) express f(x) using f evaluated at random correlated points, enabling self-correcting programs, instance-hiding protocols, and applications in complexity theory and cryptography. Yet discovering RSRs has required manual expert derivation for over 40 years, limiting their practical use. We present Bitween for automated RSR learning. First, we formalize RSR learning with sample complexity analysis under correlated sampling. Second, we develop Vanilla Bitween, which integrates multiple backends (linear regression, genetic programming, symbolic regression, and mixed-integer programming). The linear regression backend outperforms the others, discovering RSRs for 43 of 80 functions (54%) in RSR-Bench, our benchmark suite, including the first known reduction for sigmoid. Third, we introduce Agentic Bitween, a neuro-symbolic approach where LLM agents propose novel query functions beyond the fixed set (x+r, x-r, x \cdot r, x, r) in prior work. Agentic Bitween discovers RSRs for 64 of 80 functions (80%), outperforming pure neural baselines in both RSR discovery and verification accuracy.

Abstract:
Workflow optimization for toolusing LLM agents is often cast as global search over candidate graphs, scored by a scalar metric. This collapses rich, multi-step failure traces into binary outcomes, obscuring recurring failure structure and making refinement inefficient. We reframe optimization as \emphdistributional refinement: each workflow induces a density over a Failure Signature Space \mathcalF, and the goal is to minimize its Expected Failure Mass. We propose CE-Graph, which maintains a counterexample pool, estimates dense failure modes, and applies operator-constrained graph edits via a Propose-and-Verify loop with a convergence-aware stopping rule. Across math, code, and QA benchmarks, CE-Graph improves robustness while reducing optimization cost compared to strong workflow-search baselines, suggesting reliability emerges from learning and reshaping failure landscapes rather than merely maximizing aggregate success rates.

Abstract:
Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While onpolicy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.

Abstract:
Reasoning models often spend a lot of time thinking before they generate a visible response. This creates a frustrating, but unfortunately common, experience: the user's time is wasted while the model reasons from a false premise that could have easily been corrected. In contrast, human speakers perform lightweight, incremental checkins to ensure that conversational participants stay on common ground. With this motivation, we propose interleaved reasoning (IR), in which the model alternates between thinking and surfacing intermediate responses, as an alternative to the standard ``think-then-answer'' approach. By providing useful information to the user earlier, IR reduces perceived latency, the time a user waits for an initial output, without compromising the quality of the final response. We focus on a specialization of interleaved reasoning, \method (\textulPlan-\textulThought-\textulAnswer \textulInterleaving), where the first intermediate response is an explicit, step-by-step plan for executing the task. This plan-first strategy allows for user intervention and early feedback for subsequent reasoning steps. \method\ yields an ～6% improvement in pass@1 across several challenging math reasoning and coding benchmarks, while reducing time-to-first-response by over 60% relative to think-then-answer baselines.

Abstract:
Standard onpolicy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. By acting as a distributional ''soft brake'', this approach preserves critical gradient signals from novel discoveries while naturally down-weighting and enabling the reuse of stale, off-policy data. We introduce R^2VPO (Ratio-Variance Regularized Policy Optimization), which implements this constraint via a primal–dual optimization framework. Extensive evaluations across 7 LLM scales, spanning both fast and slow reasoning paradigms, and 10 robotic control tasks demonstrate the generality of the proposed approach. R^2VPO achieves substantial performance gains on mathematical reasoning benchmarks, with particularly pronounced improvements on smaller models, while significantly improving sample efficiency. Furthermore, it consistently outperforms PPO baselines in continuous control domains, particularly in sparse-reward and dynamic environments. Together, these findings establish ratio-variance regularization as a principled foundation for stable and data-efficient policy optimization.

Abstract:
We study the extent to which rotary position encodings (RoPE), a recent transformer position encoding algorithm broadly adopted in large language models (LLMs) and vision transformers (ViTs), can be applied to graphstructured data. We find that rotating tokens depending on the spectrum of the graph Laplacian efficiently injects structural information into the attention mechanism, boosting performance in synthetic and real-world graph learning tasks. This approach, coinedWave-Induced Rotary Encodings(WIRE), enjoys intriguing theoretical properties: it recovers regular RoPE on grids, and depends asymptotically on the graph effective resistance. Unlike bias-based relative position encodings, WIRE is compatible with linear attention.

Abstract:
Transformers have revolutionized deep learning across various domains but understanding the precise token dynamics remains a theoretical challenge. Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point; however, these results rely on deterministic weight assumptions, which fail to capture the standard initialization scheme in Transformers. In this work, we show that accounting for the intrinsic stochasticity of random initialization alters this picture. More precisely, we analyze deep Transformers where noise arises from the random initialization of value matrices. Under diffusion scaling and tokenwise RMS normalization, we prove that, as the number of Transformer layers goes to infinity, the discrete token dynamics converge to an interacting-particle system on the sphere where tokens are driven by a \emphcommon matrix-valued Brownian noise. In this limit, we show that initialization noise prevents the collapse to a single cluster predicted by deterministic models. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension: unlike deterministic attention flows, antipodal configurations become attracting with positive probability. Numerical experiments confirm the predicted transition, reveal that antipodal formations persist for more than two tokens, and demonstrate that suppressing the intrinsic noise degrades accuracy.

Abstract:
Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantizationaware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.

Abstract:
Conditional diffusion models appear capable of compositional generalization, i.e., generating convincing samples for outof-distribution combinations of conditioners, but the mechanisms underlying this ability remain unclear. To make this concrete, we study length generalization, the ability to generate images with more objects than seen during training. In a controlled CLEVR setting (Johnson et al., 2017), we find that length generalization is achievable in some cases but not others, suggesting that models only sometimes learn the underlying compositional structure. We then investigate locality as a structural mechanism for compositional generalization. Prior works proposed score locality as a mechanism for creativity in unconditional diffusion models (Kamb & Ganguli, 2024; Niedoba et al., 2024), but did not address flexible conditioning or compositional generalization. In this paper, we prove an exact equivalence between a specific compositional structure (conditional projective composition) (Bradley et al., 2025) and scores with sparse dependencies on both pixels and conditioners (local conditional scores). This theory also extends to compositions of concepts (such as style+content) in feature-space. We validate our theory empirically: CLEVR models that succeed at length generalization exhibit local conditional scores, while those that fail do not. Furthermore, we show that a causal intervention explicitly enforcing local conditional scores enables length generalization in a previously failing model. Finally, we investigate feature-space compositionality in color-conditioned CLEVR, and find preliminary evidence of compositional structure and corresponding local mechanisms in SDXL.

Abstract:
We initiate the study of structured Stackelberg games, a novel form of strategic interaction between a leader and a follower where contextual information can be predictive of the follower's (unknown) type. Motivated by applications such as security games and AI safety, we show how this additional structure can help the leader learn a utilitymaximizing policy in both the online and distributional settings. In the online setting, we first prove that standard learning-theoretic measures of complexity do not characterize the difficulty of the leader's learning task. Remarkably, we find that there exists a learning-theoretic measure of complexity, analogous to the Littlestone dimension in online classification, that tightly characterizes the leader's instance-optimal regret. We term this the Stackelberg-Littlestone dimension, and leverage it to provide a provably optimal online learning algorithm. In the distributional setting, we provide analogous results by showing that two new dimensions control the sample complexity upper- and lower-bound.

Abstract:
Visual tokenizers map highdimensional raw pixels into a compressed representation for downstream modeling, e.g., conditional video generation. Beyond compression, tokenizers define what information is preserved and how it is organized. Ade factostandard approach is to represent a video with a spatiotemporal 3D grid of tokens, each corresponding to a local patch in the original signal. This requires a downstream model, e.g., a text-to-video model, to learn to predict all low-level details ``pixel-by-pixel'' irrespective of the video's inherent complexity, resulting in high computational cost during training. We present VideoFlexTok, a tokenizer that represents videos with avariable-length sequence of tokens structured in a coarse-to-fine manner, where the first tokens capture abstract information like semantics and motion and later tokens provide fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the tokens count to particular downstream needs and encode videos longer than the 3D grid approach under the same budget. We evaluate VideoFlexTok on class-to-video and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 10x smaller model (0.4B vs 3.6B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.

Abstract:
Concept Bottleneck Models (CBMs) promote interpretability by grounding predictions in humanunderstandable concepts. However, existing CBMs typically fix their task predictor to a single linear or Boolean expression, limiting both predictive accuracy and adaptability to diverse user needs. We propose Mixture of Concept Bottleneck Experts (M-CBEs), a framework that generalizes existing CBMs along two dimensions: the number of experts and the functional form of each expert, exposing an underexplored region of the design space. We investigate this region by instantiating two novel models: Linear M-CBE, which learns a finite set of linear expressions, and Symbolic M-CBE, which leverages symbolic regression to discover expert functions from data under user-specified operator vocabularies. Empirical evaluation demonstrates that varying the mixture size and functional form provides a robust framework for navigating the accuracy-interpretability trade-off, adapting to different user and task needs.

Abstract:
Generalization in generative modelling is defined as the ability to learn an underlying distribution from a finite dataset and produce novel samples, with evaluation largely driven by heldout performance and perceived sample quality. In practice, training is often stopped at the minimum of the test loss, taken as an operational indicator of generalization. We challenge this viewpoint by identifying a phase of \emphbiased generalization during training, in which the model continues to decrease the test loss while favoring samples with anomalously high proximity to training data. By training the same network on two disjoint datasets and comparing the mutual distances of generated samples and their similarity to training data, we introduce a quantitative measure of bias and demonstrate its presence on real images. We then study the mechanism of bias, using a controlled hierarchical data model where access to exact scores and ground-truth statistics allows us to precisely characterize its onset. We attribute this phenomenon to the sequential nature of feature learning in deep networks, where coarse structure is learned early in a data-independent manner, while finer features are resolved later in a way that increasingly depends on individual training samples. Our results show that early stopping at the test loss minimum, while optimal under standard generalization criteria, may be insufficient for privacy-critical applications.

Abstract:
Analogy is a central faculty of human intelligence, enabling abstract patterns discovered in one domain to be applied to another. However, the mechanisms underlying analogical reasoning in Transformers remain poorly understood. In this work, inspired by the notion of functors in category theory, we formalize analogical reasoning as the inference of correspondences between entities across categories. Based on this formulation, we introduce synthetic tasks that evaluate the emergence of analogical reasoning under controlled settings. We find that the emergence of analogical reasoning is highly sensitive to data characteristics, optimization choices, and model scale. Through mechanistic analysis, we show that analogical reasoning in Transformers decomposes into two key components: (1) geometric alignment of relational structure in the embedding space, and (2) the application of a functor within the Transformer. These mechanisms enable models to transfer relational structure from one category to another, realizing analogy. Finally, we quantify these effects and find that the same trends are observed in pretrained LLMs. In doing so, we move analogy from an abstract cognitive notion to a concrete, mechanistically grounded phenomenon in modern neural networks.

Abstract:
We formulate the problem of \emphexact unlearning in reinforcement learning, where the goal is to design an efficient framework that enables the removal of any user’s data upon deletion request, i.e., the online learner’s output after unlearning be \emphindistinguishable from what would have been produced had the deleted user never interacted with the learner. For any \rho >0, we show that there exists a reinforcement learning (RL) algorithm that is \rhoTV-stable and supports an exact unlearning procedure whose expected computational cost is only a \rho \sqrt\ln T fraction of the computational cost of retraining from scratch. We construct such a \rho-TV-stable RL algorithm for tabular Markov decision processes (MDPs), which achieves a regret bound of \mathcalO(H^2 \sqrtSAT + H^3 S^2 A + H^2.5 S^2 A/\rho), where S, A, H, and T denote the number of states, the number of actions, the episode horizon, and the number of episodes, respectively. We also establish a lower bound of \Omega(H\sqrtSAT+SAH/\rho) for \rho-TV-stable RL algorithms, showing that our algorithm is nearly minimax optimal.

Abstract:
Chainof-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identifyintrinsic dimensionalityas a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.

Abstract:
The successor representation (SR) provides a powerful framework for decoupling predictive dynamics from rewards, enabling rapid generalisation across reward configurations. However, the classical SR is limited by its inherent policy dependence: policies change due to ongoing learning, environmental nonstationarities, and changes in task demands, making established predictive representations obsolete. Furthermore, in topologically complex environments, SRs suffer from spectral diffusion, leading to dense and overlapping features that scale poorly. Here we propose the Hierarchical Successor Representation (HSR) for overcoming these limitations. By incorporating temporal abstractions into the construction of predictive representations, HSR learns stable state features which are robust to task-induced policy changes. Applying non-negative matrix factorisation (NMF) to the HSR yields a sparse, low-rank state representation that facilitates highly sample-efficient transfer to novel tasks in multi-compartmental environments. Further analysis reveals that HSR-NMF discovers interpretable topological structures, providing a policy-agnostic hierarchical map that effectively bridges model-free optimality and model-based flexibility. Beyond providing a useful basis for task-transfer, we show that HSR's temporally extended predictive structure can also be leveraged to drive efficient exploration, effectively scaling to large, procedurally generated environments.

Abstract:
Ensuring fairness in matching algorithms is a key challenge in allocating scarce resources and positions. Focusing on Optimal Transport (OT), we introduce a novel notion of group fairness requiring that the probability of matching two individuals from any two given groups in the OT plan satisfies a predefined target. We first propose a modified Sinkhorn algorithm to compute perfectly fair transport plans efficiently. Since exact fairness can significantly degrade matching quality in practice, we then develop two relaxation strategies. The first one involves solving a penalized OT problem, for which we derive novel finitesample complexity guarantees. Our second strategy leverages bilevel optimization to learn a ground cost that induces a fair OT solution, and we establish a bound on the deviation of fairness when matching unseen data. Finally, we present empirical results illustrating the performance of our approaches and the trade-off between fairness and transport cost.

Abstract:
Hierarchical reinforcement learning (RL) has the potential to enable effective decisionmaking over long timescales. Existing approaches, while promising, have yet to realize the benefits of large-scale training. In this work, we identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical policy gradient algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and Mujoco environments, showcasing its general applicability.

Abstract:
Flowbased generative models can be viewed through a physics lens: sampling transports a particle from noise to data by integrating a time-varying velocity field, and each sample corresponds to a trajectory with its own dynamical effort. Motivated by classical mechanics, we introduce Kinetic Path Energy (KPE), an action-like, per-sample diagnostic that measures the accumulated kinetic effort along an Ordinary Differential Equation (ODE) trajectory. KPE exhibits two robust correspondences: (I) higher KPE predicts stronger semantic fidelity; (II) high-KPE trajectories terminate on low-density manifold frontiers. We further provide theoretical guaranties linking trajectory energy to data density. Paradoxically, this correlation is non-monotonic. At sufficiently high energy, generation can degenerate into memorization. Leveraging the closed-form of empirical flow matching, we show that extreme energies drive trajectories toward near-copies of training examples. This yields a Goldilocks principle and motivates Kinetic Trajectory Shaping (KTS), a training-free two-phase inference strategy that boosts early motion and enforces a late-time soft landing, reducing memorization and improving generation quality across benchmark tasks.

Abstract:
Generative models have enjoyed widespread success in a variety of applications. However, they encounter inherent mathematical limitations in modeling distributions where samples are constrained by equalities, as is frequently the setting in scientific domains. In this work, we develop a computationally cheap, mathematically justified, and highly flexible distributional modification for combating known pitfalls in equalityconstrained generative models. We propose perturbing the data distribution in a constraint-aware way such that the new distribution has support matching the ambient space dimension while still implicitly incorporating underlying manifold geometry. Through theoretical analyses and empirical evidence on several representative tasks, we illustrate that our approach consistently enables data distribution recovery and stable sampling with both diffusion models and normalizing flows.

Abstract:
Learning stationary policies in infinitehorizon general-sum Markov games (MGs) remains a fundamental open problem in Multi-Agent Reinforcement Learning (MARL). While stationary strategies are preferred for their practicality, computing stationary forms of classic game-theoretic equilibria is computationally intractable—a stark contrast to the comparative ease of solving single-agent RL or zero-sum games. To bridge this gap, we study Risk-averse Quantal response Equilibria (RQE), a solution concept rooted in behavioral game theory that incorporates risk aversion and bounded rationality. We demonstrate that RQE possesses strong regularity conditions that make it uniquely amenable to learning in MGs. We propose a novel two-timescale Actor-Critic algorithm characterized by a fast-timescale actor and a slow-timescale critic. Leveraging the regularity of RQE, we prove that this approach achieves global convergence with finite-sample guarantees. We empirically validate our algorithm in several environments to demonstrate superior convergence properties compared to risk-neutral baselines.

Abstract:
Reinforcement learning (RL) has achieved impressive results across various domains, yet the resulting policies often fail to generalize beyond the specific tasks encountered during training. This lack of robustness limits their deployment in realworld scenarios where diverse and unpredictable task demands exist. We propose the Adaptive Policy Backbone (APB), a transferable policy backbone that contains a meta-initialization to provide a highly generalizable representation. APB consists of a frozen, meta-trained backbone paired with lightweight task-specific linear layers that are learned from scratch for each new environment. Our results demonstrate that learning only lightweight task-specific linear layers is sufficient to achieve performance on par with standard RL, even, surprisingly, when the backbone is randomly initialized. Furthermore, we find that this structural constraint inherently enhances the generalization capability of the resulting policies. This advantage extends even to out-of-distribution tasks, where existing meta-RL methods typically fail.

Abstract:
Simulating the conditioned dynamics of diffusion processes, given their initial and terminal states, is an important but challenging problem in the sciences. The difficulty is particularly pronounced for rare events, for which the unconditioned dynamics rarely reach the terminal state. In this work, we propose a novel approach for learning diffusion bridges based on a selfconsistency property of the optimal control. The resulting algorithm learns the conditioned dynamics in an iterative online manner, and exhibits strong performance in a range of empirical settings without requiring differentiation through simulated trajectories. Beyond the diffusion bridge setting, we draw connections between our self-consistency framework and recent advances in the wider stochastic optimal control literature.

Abstract:
The longstanding dominance of gradient-boosted decision trees for tabular data has recently been challenged by in-context learning tabular foundation models. In-context learning methods fit and predict in one forward pass without parameter updates by leveraging the training data as context for predicting on query test points. While recent tabular foundation models achieve state-of-the-art performance, their transformer architecture based on the attention mechanism has quadratic complexity regarding dataset size, which in turn increases the overhead on training and inference time, and limits the capacity of the models to handle large-scale datasets. In this work, we propose TACO, an end-to-end tabular compression model that compresses the training dataset in a latent space. We test our method on the TabArena benchmark, where our proposed method is up to 94x faster in inference time, while consuming up to 97% less memory compared to the state-of-the-art tabular Transformer architecture, all while retaining performance without significant degradation. Lastly, our method not only scales better with increased dataset sizes, but it also achieves better performance compared to other baselines.

Abstract:
We present SleepLM, a family of sleeplanguage foundation models that enable human sleep alignment, interpretation, and interaction with natural language. Despite the critical role of sleep, learning-based sleep analysis systems operate in closed label spaces (e.g., predefined stages or events) and fail to describe, query, or generalize to novel sleep phenomena. SleepLM bridges natural language and multimodal polysomnography, enabling language-grounded representations of sleep physiology. To support this alignment, we introduce a multilevel sleep caption generation pipeline that enables the curation of the first large-scale sleep-text dataset, comprising over 100K hours of data from more than 10,000 individuals. Furthermore, we present a unified pretraining objective that combines contrastive alignment, caption generation, and signal reconstruction to better capture physiological fidelity and cross-modal interactions. Extensive experiments on real-world sleep understanding tasks verify that SleepLM outperforms state-of-the-art in zero-shot and few-shot learning, cross-modal retrieval, and sleep captioning. Importantly, SleepLM also exhibits intriguing capabilities including language-guided event localization, targeted insight generation, and zero-shot generalization to unseen tasks. All code and data will be open-sourced.

Abstract:
The standard oracle model for matroid algorithms assumes that each independence query can be answered in constant time, regardless of the size of the queried set. While this abstraction has underpinned much of the theoretical progress in matroid optimization, it masks the true computational effort required by these algorithms. In particular, for natural and widely studied classes such as graphic matroids, even a single independence query can require work linear in the size of the set, making the constanttime assumption implausible. We address this gap by introducing a size-sensitive cost model where the cost of a query Q scales with |Q|. Nearly linear-time oracle implementations exist for broad families of matroids, and this refined abstraction therefore captures the true cost of query evaluation while allowing for a more faithful comparison between general matroids and their natural special cases. Within this framework we study three fundamental algorithmic tasks: finding a basis of a matroid, approximating its rank, and approximating its partition size. We establish tight results, proving nearly matching upper and lower bounds that show the optimal query cost is (up to logarithmic factors) quadratic in the size of the matroid. On the algorithmic side, our upper bounds are realized by explicit procedures that construct the desired solution. On the complexity side, our lower bounds are unconditional and already hold even for weaker distinguishing formulations of the problems. Finally, for matroids with maximum circuit size at most c, we show that the quadratic barrier can be broken, providing an algorithm that calculates the maximum-weight basis with expected query cost \mathcalO(n^2-1/c \log n).

Abstract:
Modern LLMs show mastery over an evergrowing range of skills, as well as the ability to compose them flexibly. However, extending model capabilities to new skills in a scalable manner is an open-problem: fine-tuning and parameter-efficient variants risk catastrophic forgetting, while context-based approaches have limited expressiveness and are constrained by the model's effective context. We explore skill neologisms--i.e., soft tokens integrated in the model's vocabulary and optimized to improve capabilities over a specific skill--as a way to selectively extend model capabilities to new skills without weight updates. We first observe that off-the-shelf pre-trained LLMs already demonstrate tokens associated with procedural knowledge. We then show that skill neologisms can be learned to improve model capabilities on specific skills while being composable with out-of-distribution skills, and that independently trained skill neologisms can be composed zero-shot. These results suggest that skill neologisms may provide a scalable path towards skill-based continual learning.

Abstract:
Understanding potential selection in data is crucial for causal discovery; we argue that "selection" in common narratives takes two forms, which we termstaticandevolutionaryselection, respectively. Static selection refers to a oneshot filtering process where observed data consist of asubsetof the population of interest, as in survey volunteer bias. Evolutionary selection, in contrast, operates through repeated rounds of differential fitness in reproduction, where observed data constitute the latestgenerationshaped by a historical trajectory, as in immune adaptation, antibiotic resistance, and social norm emergence. Existing methods largely conflate these two forms and rely on an identical graphical model of selection. We show that this model is valid for static settings but fails to characterize data under evolution, yielding false discovery results. To address this, we introduce a new model that specifically characterizes evolutionary selection, and develop a sound and complete procedure for identifying such models from data across one or multiple environments or generations. Experimental results validate the method's ability to uncover the relevant mechanisms underlying evolution from data.

Abstract:
Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet producing stable and highresolution attribution maps for these models remains challenging. Architectural components such as patch embeddings and attention routing often introduce structured artifacts in pixel-level explanations, causing many existing methods to rely on coarse patch-level attributions. We introduce DAVE (Distribution-aware Attribution via ViT Gradient DEcomposition), a mathematically grounded attribution method for ViTs based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates locally equivariant and stable components of the effective input–output mapping. It separates these from architecture-induced artifacts and other sources of instability. Consequently, DAVE produces robust, precise and class-consistent attribution maps that reliably highlight visual features used by the model across inputs. Experimental results demonstrate that DAVE attributions are more stable and spatially precise than existing approaches.

Abstract:
How should we analyze memory in deep RL? We introduce tools for analyzing policies under partial observability and revealing how agents use memory to make decisions. To utilize these tools, we present POPGym Arcade, a collection of Atariinspired, hardware-accelerated environments sharing a single observation and action space. Each environment provides fully and partially observable variants, enabling counterfactual studies on observability. We find that controlled studies are necessary for fair comparisons and identify a pathology where value functions smear credit over irrelevant history. Using this pathology, we demonstrate how out-of-distribution scenarios can contaminate memory, perturbing the policy far into the future.

Abstract:
Selfsupervised learning (SSL) excels at finding general-purpose latent representations from complex data, yet lacks a unifying theoretical framework that explains the diverse existing methods and guides the design of new ones. We cast SSL as latent distribution matching (DM): learning representations that maximize their log-probability under an assumed latent model (alignment), while maximizing latent entropy to prevent collapse (uniformity). This view unifies independent component analysis with contrastive, non-contrastive, and predictive SSL methods, including stop-gradient approaches. Leveraging DM, we derive a nonlinear, sampling-free Bayesian filtering model with a Kalman-based predictor for high-dimensional timeseries. We further prove that predictive DM yields identifiable latent representations under mild assumptions, even with nonlinear predictors. Overall, DM clarifies the assumptions behind established SSL methods and provides principled guidance for developing new approaches.

Abstract:
The choice of optimizer significantly impacts the training efficiency and computational costs of large language models (LLMs). Recently, the Muon optimizer has demonstrated promising results by orthogonalizing parameter updates, improving optimization geometry through better conditioning. Despite Muon’s emergence as a candidate successor to Adam, the potential for jointly leveraging their strengths—has not been systematically explored. In this work, we bridge this gap by proposing NorMuon (Neuronwise Normalized Muon), an optimizer that synergistically combines orthogonalization with neuron-level adaptive learning rates. Our analysis reveals that while Muon effectively reduces condition numbers, the resulting updates exhibit highly non-uniform neuron norms, causing certain neurons to dominate the optimization process. NorMuon addresses this imbalance by maintaining second-order momentum statistics for each neuron and applying row-wise normalization after orthogonalization, ensuring balanced parameter utilization while preserving Muon's conditioning benefits. To enable practical deployment at scale, we develop an efficient distributed implementation under the FSDP2 framework that strategically distributes orthogonalization computations across devices. Experiments across multiple model scales demonstrate that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and 11.31% improvement over Muon on 1.1B pretraining setting, while maintaining a comparable memory footprint to Muon. Our findings suggest that orthogonalization and adaptive learning rates are complementary rather than competing approaches, opening new avenues for optimizer design in large-scale deep learning.

Abstract:
bstract: Large language and reasoning models can be prompted to generate wellformed first-order formulas, but we still lack evaluations of their ability to produce correct, compact explanations under fully specified, mechanically checkable semantics. We study finite-structure concept synthesis: given several small finite relational worlds that are labeled extensionally with a unary target predicate T(x), the learner must output a single first-order formula \varphi(x) that recovers (explains) T uniformly across worlds. Because the domains are finite, correctness is solver-verifiable via exact model checking and SMT. We introduce INDUCTION, a benchmark suite that (to our knowledge) provides the first challenging, end-to-end evaluation of first-order definition synthesis from extensional relational evidence in a fully abstract interface. INDUCTION includes three tightly related regimes---FullObs (full observation), CI (contrastive Yes/No worlds), and EC (partial observation under existential completion)---and reports gold-relative, budgeted metrics that penalize formula bloat. Across tasks we observe sharp difficulty gradients and persistent hard structural families; moreover, held-out world evaluation shows that among training-correct solutions, low-bloat formulas generalize far better than highly bloated ones, motivating bloat-aware scoring as a first-class metric for symbolic induction.

Abstract:
Midtraining is increasingly used to improve the reasoning capabilities of large language models (LLMs), yet its design choices and interaction with evaluation and reinforcement learning (RL) remain poorly understood. Prior work often focuses on narrow domain gains, overlooking retention of general abilities, long-context performance, and RL compatibility. We present PRISM (Demystifying Retention and Interaction in Mid-Training), a holistic empirical study that analyzes mid-training design choices, what to evaluate, and how domain mixtures and training stages interact across model families. Experiments on Granite-3.3 8B, LLaMA-3.1 8B, and Mistral-7B/24B base models show that a relatively small, high-quality mid-training phase of ～27B tokens acts as a critical stabilizing stage for reasoning. Across models, PRISM yields consistent gains of ～6–10 points on coding benchmarks and ～17–30 points on mathematical reasoning benchmarks while preserving general performance. RL applied on top of PRISM-mid-trained models produces stable, monotonic improvements, adding a further ～3–8 points across coding and math tasks such as LiveCodeBench, Codeforces, AIME and MATH500, and ～17–20 points on science (GPQA-Diamond), whereas RL applied directly to base models is substantially less effective. Our results demonstrate that retention-aware mid-training is a necessary intermediate step for reliable reasoning enhancement and RL scaling, and provide practical guidance for designing robust mid-training pipelines for modern LLMs.

Abstract:
This paper considers the problem of converting a given dense linear layer into a lowprecision version. The tradeoff between minimizing description length and discrepancy introduced at the output of the layer is analyzed information theoretically (IT). It is shown that the popular GPTQ algorithm may have an arbitrarily large gap to IT limit. To alleviate this problem a novel algorithm, termed ''WaterSIC'', is proposed and is shown to be within a rate gap of 0.255 bit to IT limit, uniformly over all possible covariance matrices of input activations. WaterSIC's key innovation is allocating different quantization rates to different columns (in-features) of the weight matrix, mimicking the classical IT solution known as ''waterfilling''. Applying WaterSIC to real LLMs establishes new state-of-the-art for rates in the range of 1...4 bits per entry.

Abstract:
We present an amortized framework for realtime visual attribution streaming in multimodal thinking models. When these models generate code from a screenshot or solve math problems from images, their long reasoning traces should be grounded in visual evidence. However, verifying this reliance is challenging: faithful causal methods require costly repeated backward passes or perturbations, while raw attention maps offer instant access, they lack causal validity. To resolve this, we introduce an amortized approach that learns to estimate the causal effects of semantic regions directly from the rich signals encoded in attention features. Across five diverse benchmarks and four thinking models, our approach achieves faithfulness comparable to exhaustive causal methods while enabling visual attribution streaming, where users observe grounding evidence as the model reasons, not after. Our results demonstrate that real-time, faithful attribution in multimodal thinking models is achievable through lightweight learning, not brute-force computation.

Abstract:
We propose a simple, scalable algorithm based on stochastic interpolants for sampling from unnormalized densities and for finetuning generative models. The approach, Tilt Matching, arises from a dynamical equation relating the flow matching velocity to one targeting the same distribution tilted by a reward, implicitly solving a stochastic optimal control problem. The resulting velocity inherits the regularity of stochastic interpolant transports while minimizing an objective with strictly lower variance than flow matching itself. The update to the velocity field can be interpreted as the sum of all joint cumulants between the interpolant velocity and the reward, and to first order is their covariance. The method requires neither reward gradients nor backpropagation through trajectories of the flow or diffusion. We empirically demonstrate that the approach is efficient and highly scalable, providing state-of-the-art results on sampling under Lennard-Jones systems and competitive performance for fine-tuning Stable Diffusion, without requiring reward multipliers. The framework also applies directly to tilting few-step flow map models.

Abstract:
Credal predictors are epistemicuncertainty-aware models that produce a convex set of probabilistic predictions. They provide a principled framework for quantifying predictive epistemic uncertainty (EU) and have been shown to improve model robustness across a range of settings. However, most state-of-the-art (SOTA) methods primarily define EU as disagreement induced by random training initializations, which mainly reflects sensitivity to optimization randomness rather than uncertainty from more substantive sources. In response, we formulate EU as disagreement between models trained under different degrees of relaxation of the i.i.d. assumption between the training and test distributions. Building on this idea, we proposeCreDRO, which learns an ensemble of plausible models via distributionally robust optimization. As a result, CreDRO captures EU arising not only from training randomness but also from informative disagreement due to potential train–test distribution shifts. Empirically, CreDRO consistently outperforms SOTA credal approaches on downstream tasks, including out-of-distribution detection on extensive benchmarks and selective classification in medical settings.

Abstract:
Large language models have emerged as powerful zeroshot rerankers for retrieval-augmented generation, offering strong generalization without task-specific training. However, existing LLM reranking methods either rely on heuristics that fail to fully exploit the information revealed by each ranking decision or are inefficient when they do. We introduce a tournament graph framework that provides a principled foundation for k-wise reranking. Our key observation is that each k-document comparison reveals a complete tournament of \binomk2 pairwise preferences. These tournaments are aggregated into a global preference graph, whose transitive closure yields many additional orderings without further model invocations. We formalize when a candidate's rank is certifiably determined and design a query schedule that greedily maximizes information gain towards identifying the top-m items. Our framework also gracefully handles non-transitive preferences -- cycles induced by LLM judgments -- by collapsing them into equivalence classes that yield principled tiered rankings. Empirically, across 14 benchmarks and 5 LLMs, our approach achieves Pareto dominance over existing methods: matching or exceeding accuracy while requiring 25--40% fewer tokens than comparable approaches, and 7× fewer than pairwise methods at near-identical quality.

Abstract:
We introduce a constrained optimization framework for training transformers that behave like optimization descent algorithms. Specifically, we enforce layerwise descent constraints on the objective function and replace standard empirical risk minimization (ERM) with a primaldual training scheme. This approach yields models whose intermediate representations decrease the loss monotonically in expectation across layers. We apply our method to both unrolled transformer architectures and conventional pretrained transformers on tasks of video denoising and text classification. Across these settings, we observe that constrained transformers achieve stronger robustness to perturbations and maintain higher out-of-distribution generalization, while preserving competitive in-distribution performance.

Abstract:
We study offline blackbox optimization (BBO), aiming to discover improved designs from an offline dataset of designs and labels, a problem common in robotics, DNA, and materials science with limited labeled samples. While recent work applies autoregressive LLMs to BBO by formatting tasks as natural-language prompts, their left-to-right design generation struggles to capture the strong bidirectional dependencies inherent in design problems. To address this, we propose adapting diffusion LLMs to offline BBO to leverage their bidirectional modeling capabilities. However, a domain gap exists between the natural text pre-training of diffusion LLMs and the heterogeneous signals in BBO (prompts, designs, and labels). To bridge this gap, we construct a unified prompt–response corpus and introduce delimiter tokens to explicitly mark field boundaries fordomain adaptation. We further propose a two-stagepost-trainingframework to align the diffusion LLM generation with high-label designs. The first stage performs supervised fine-tuning on the unified dataset via masked-response prediction, and the second stage adopts reinforcement learning with rewards defined by label improvements. Our method achieves state-of-the-art results on Design-Bench small-data settings. Code for our work is available here: https://anonymous.4open.science/r/Anonymous-dllm4bbo-D78A/README.md.

Abstract:
Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulationready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages—from architectural layout to furniture placement to small object population—each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.

Abstract:
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce nexttoken diffusion for autoregressive generation of these vectors. Additionally, we develop -VAE to address the challenges of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models in the setting of scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10 fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.

Abstract:
Despite rapid advances in autonomous AI scientists powered by language models, generating publicationready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.

Abstract:
Online conformal prediction (OCP) seeks prediction intervals that achieve longrun 1-\alpha coverage for arbitrary (possibly adversarial) data streams, while remaining as informative as possible. Existing OCP methods often require manual learning-rate tuning to work well, and may also require algorithm-specific analyses. Here, we develop a general regret-to-coverage theory for interval-valued OCP based on the (1-\alpha)-pinball loss. Our first contribution is to identify linearized regret as a key notion, showing that controlling it implies coverage bounds for any online algorithm. This relies on a black-box reduction that depends only on the Fenchel conjugate of an upper bound on the linearized regret. Building on this theory, we propose UP-OCP, a parameter-free method for OCP, via a reduction to a two-asset portfolio selection problem, leveraging universal portfolio algorithms. We show strong finite-time bounds on the miscoverage of UP-OCP, even for polynomially growing predictions. Extensive experiments support that UP-OCP delivers consistently better size/coverage trade-offs than prior online conformal baselines.

Abstract:
Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and finetuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.

Abstract:
Graph Neural Networks (GNNs) learn node representations through iterative networkbased message-passing. While powerful, deep GNNs suffer from oversmoothing, where node features converge to a homogeneous, non-informative state. We re-frame this problem of representational collapse from a \emphbifurcation theory perspective, characterizing oversmoothing as convergence to a stable ``homogeneous fixed point.'' Our central contribution is the theoretical discovery that this undesired stability can be broken by replacing standard monotone activations (e.g., ReLU) with a class of functions. Using Lyapunov-Schmidt reduction, we analytically prove that this substitution induces a bifurcation that destabilizes the homogeneous state and creates a new pair of stable, non-homogeneous \emphpatterns that provably resist oversmoothing. Our theory predicts a precise, nontrivial scaling law for the amplitude of these emergent patterns, which we quantitatively validate in experiments. Finally, we demonstrate the practical utility of our theory by deriving a closed-form, bifurcation-aware initialization and showing its utility in real benchmark experiments.

Abstract:
Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to selffulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities.

Abstract:
A core aspect of human perception issituated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasizeenvironmentcentricspatial relations (relations among objects in a scene), while largely overlookingobserver-centricrelationships that require reasoning relative to agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (SituatedAwareness in the RealWorld), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2071human-annotatedquestion-answer pairs. It probes a model’s observer–environment understanding withsixdifferent awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 31.97%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.

Abstract:
Recent advances in generative models have shown promise in generating behavior plans for longhorizon, sparse reward tasks. While these approaches have achieved promising results, they often lack a principled framework for hierarchical decomposition and struggle with the computational demands of real-time execution, due to their iterative denoising process. In this work, we introduce Hierarchical Diffusion-Flow (\textttHDFlow), a novel hierarchical planning framework that optimally leverages the strengths of diffusion and rectified flow models to overcome the limitations of single-paradigm generative planners. \textttHDFlow employs a high-level diffusion planner to generate sequences of strategic subgoals in a learned latent space, capitalizing on diffusion's powerful exploratory capabilities. These subgoals then guide a low-level rectified flow planner that generates smooth and dense trajectories, exploiting the speed and efficiency of ordinary differential equation (ODE)-based trajectory generation. We evaluate \textttHDFlow on four challenging furniture assembly tasks in both simulation and real-world, where it significantly outperforms state-of-the-art methods. Furthermore, we also showcase our method's generalizability on two long-horizon benchmarks comprising diverse locomotion and manipulation tasks. Project website: https://hdflow-page.github.io/

Abstract:
We introduce VideoKR, the first largescale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 430K video reasoning examples over 126K newly collected, CC-licensed, expert-domain videos. We develop an expert-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning capabilities while ensuring the difficulty, diversity, and reliability of both the examples and their CoT rationales. We also curate VideoKR-Eval, a new expert-annotated benchmark where questions require genuine video understanding and knowledge-intensive reasoning. Our experiments show that, under a standard SFT→GRPO pipeline, models post-trained on VideoKR already outperform prior post-training approaches on both general and knowledge-intensive video reasoning benchmarks, highlighting data design as a key driver of progress in video reasoning. We further conduct comprehensive ablations to isolate the contributions of VideoKR, providing actionable insights for future work.

Abstract:
The work presents an approach for addressing the challenge of robustness in Large Language Models (LLMs) to alterations and potential errors caused by semantically similar but textually different prompts. Recent works have shown that these kinds of prompt variations can significantly impact the performance of LLMs on tasks. The central question is: can LLMs' robustness to semanticallyneutral prompt alterations be acquired without expensive retraining of the entire model? We address this question both theoretically and through experiments. Our theoretical analysis reveals a crucial factor impacting model robustness -- a systematic expected shift or perturbation-induced bias in neural network module outputs. Motivated by this analysis, we show that robustness can be achieved via a simple fine-tuning process: debiasing for robustness. We identify conditions when debiasing helps and when it does not, and demonstrate, through both theory and extensive experiments, that debiasing for robustness may indeed be a quick and efficient tool to enhance robustness and provide certification against random prompt perturbations.

Abstract:
Adam is a widely used optimization algorithm in deep learning, yet the specific class of objective functions where it exhibits inherent advantages remains underexplored. Unlike prior studies requiring external schedulers and \beta_2 near 1 for convergence, this work investigates the ``natural'' autoconvergence properties of Adam. We identify a class of highly degenerate polynomials where Adam converges automatically without additional schedulers. Specifically, we derive theoretical conditions for local asymptotic stability on degenerate polynomials and demonstrate strong alignment between theoretical bounds and experimental results. We prove that Adam achieves local linear convergence on these degenerate functions, significantly outperforming the sub-linear convergence of Gradient Descent and Momentum. This acceleration stems from a decoupling mechanism between the second moment v_t and squared gradient g_t^2, which exponentially amplifies the effective learning rate. Finally, we characterize Adam's hyperparameter phase diagram, identifying three distinct behavioral regimes: stable convergence, spikes, and SignGD-like oscillation.

Abstract:
Reasoning models (e.g., DeepSeekR1) generate long chains of thought to solve harder problems, but they often loop, repeating the same text at low temperatures or with greedy decoding. We study why this happens and what role temperature plays. With open reasoning models, we find that looping is common at low temperature. Larger models tend to loop less, and distilled students loop significantly even when their teachers rarely do. This points to mismatches between the training distribution and the learned model, which we refer to as errors in learning, as a key cause. To understand how such errors cause loops, we introduce a synthetic graph reasoning task and demonstrate two mechanisms. First, risk aversion caused by hardness of learning: when the correct progress-making action is hard to learn but an easy cyclic action is available, the model puts relatively more probability on the cyclic action and gets stuck. Second, even when there is no hardness, Transformers show an inductive bias toward temporally correlated errors, so the same few actions keep being chosen and loops appear. Higher temperature reduces looping by promoting exploration, but it does not fix the errors in learning, so generations remain much longer than necessary at high temperature; in this sense, temperature is a stopgap rather than a holistic solution. We end with a discussion of training-time interventions aimed at directly reducing errors in learning.

Abstract:
Foundation models have transformed machine learning for language and vision, but achieving comparable impact in physical simulation remains a challenge. Data heterogeneity and unstable longterm dynamics inhibit learning from sufficiently diverse dynamics, while varying resolutions and dimensionalities challenge efficient training on modern hardware. Through empirical and theoretical analysis, we incorporate new approaches to mitigate these obstacles, including a harmonic-analysis–based stabilization method, load-balanced distributed 2D-3D training strategies, and compute-adaptive tokenization. Using these tools, we develop \Walrus, a transformer-based foundation model developed primarily for fluid-like continuum dynamics. \Walrus\ is pretrained on nineteen diverse scenarios spanning astrophysics, geoscience, rheology, plasma physics, acoustics, and classical fluids. Experiments show that \Walrus\ outperforms prior foundation models on both short- and long-term prediction horizons on downstream tasks and across the breadth of pretraining data, while ablation studies confirm the value of our contributions to forecast stability, training throughput, and transfer performance over conventional approaches.

Abstract:
Geometric properties of Transformer weights, particularly the unembedding matrix, have been widely useful in language model interpretability research. Yet, their utility for estimating downstream performance remains unclear. In this work, we systematically investigate the relationship between model performance and the unembedding matrix geometry, particularly its effective rank. Our experiments, involving a suite of 108 OLMostyle language models trained under controlled variation, reveal several key findings. While the best-performing models often exhibit a high effective rank, this trend is not universal across tasks and training setups. Contrary to prior work, we find that low effective rank does not cause late-stage performance degradation in small models, but instead co-occurs with it; we find adversarial cases where low-rank models do not exhibit saturation. Moreover, we show that effective rank is strongly influenced by pre-training hyperparameters, such as batch size and weight decay, which in-turn affect the model's performance. Lastly, extending our analysis to other geometric metrics and final-layer representation, we find that these metrics are largely aligned, but none can reliably predict downstream performance. Overall, our findings suggest that the model's geometry, as captured by existing metrics, primarily reflects training choices rather than performance.

Abstract:
Crossvalidation (CV) is known to provide asymptotically exact tests and confidence intervals for model improvement but only when the model comparison isrelatively stable. Surprisingly, we prove that even simple, individually stable models can generate relatively unstable comparisons, calling into question the validity of CV inference. Specifically, we show that the Lasso and its close cousin, soft-thresholding, generate relatively unstable comparisons and invalid CV inferences, even in the most favorable of learning settings and when both models are individually stable. These findings highlight the importance of verifying relative stability before deploying CV for model comparison.

Abstract:
Deep reinforcement learning systems often suffer from unstable training dynamics due to nonstationarity, where learning objectives and data distributions evolve over time. We show that under non-stationary targets, isotropic Gaussian embeddings are provably advantageous. In particular, they induce stable tracking of time-varying targets for linear readouts, achieve maximal entropy under a fixed variance budget, and encourage a balanced use of all representational dimensions. Building on this insight, we propose the use of Sketched Isotropic Gaussian Regularization for shaping representations toward an isotropic Gaussian during training. We demonstrate empirically, over a variety of domains, that this simple and computationally inexpensive method improves performance under non-stationarity while reducing representation collapse, neuron dormancy, and training instability.

Abstract:
Weightspace learning studies neural architectures that operate directly on the parameters of other neural networks. Motivated by the growing availability of pretrained models, recent work has demonstrated the effectiveness of weight-space networks across a wide range of tasks. SOTA weight-space networks rely on permutation-equivariant designs to improve generalization. However, this may negatively affect expressive power, warranting theoretical investigation. Importantly, unlike other structured domains, weight-space learning targets maps operating on both weight and function spaces, making expressivity analysis particularly subtle. While a few prior works provide partial expressivity results, a comprehensive characterization is still missing. In this work, we address this gap by developing a systematic theory for expressivity of weight-space networks. We first prove that all prominent permutation-equivariant networks are equivalent in expressive power. We then establish universality in both weight- and function-space settings under mild, natural assumptions on the input weights, and characterize the edge-case regimes where universality no longer holds. Together, these results provide a strong and unified foundation for the expressivity of weight-space networks.

Abstract:
Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an endto-end training pipeline that jointly optimizes reconstruction and generation, enabling direct supervision from generation results to the tokenizer. This contrasts with prior two-stage approaches that train tokenizers and generative models separately. We further investigate leveraging vision foundation models to improve 1D tokenizers for autoregressive modeling. Our autoregressive generative model achieves strong empirical results, including a state-of-the-art FID score of 1.48 without guidance on ImageNet 256×256 generation.

Abstract:
We propose Bayesian Forest Thompson Sampling (BFTS), which performs Thompson sampling using armwise Bayesian Additive Regression Trees (BART) to model each action's mean reward and generate MCMC-based posterior draws for decision-making. We derive an information-theoretic Bayesian regret bound of order \widetilde\mathcal O(K\sigma\sqrtT) for ideal posterior sampling under a correctly specified Bayesian design. Empirically, BFTS achieves competitive regret on nonlinear synthetic benchmarks with near-nominal uncertainty calibration, attains the best average rank across nine OpenML contextual bandit benchmarks, and yields higher estimated policy values than linear, neural, and tree-ensemble baselines in a Drink Less micro-randomized trial case study. Across OpenML benchmarks, BFTS is robust to hyperparameter choices.

Abstract:
Deep neural networks learn feature representations via complex geometric transformations of the input data manifold. Despite the models' empirical success across domains, our understanding of neural feature representations is still incomplete. In this work we investigate neural feature geometry through the lens of discrete geometry. Since the input data manifold is typically unobserved, we approximate it using geometric graphs that encode local similarity structure. We provide theoretical results on the evolution of these graphs during training, showing that nonlinear activations play a crucial role in shaping feature geometry in feedforward neural networks. Moreover, we discover that the geometric transformations resemble a discrete Ricci flow on these graphs, suggesting that neural feature geometry evolves analogous to Ricci flow. This connection is supported by experiments on over 20,000 feedforward neural networks trained on binary classification tasks across both synthetic and realworld datasets. We observe that the emergence of class separability corresponds to the emergence of community structure in the associated graph representations, which is known to relate to discrete Ricci flow dynamics. Building on these insights, we introduce a novel framework for locally evaluating geometric transformations through comparison with discrete Ricci flow dynamics. Our experimental results further suggest connections between the evolution of feature geometry, and training time and network depth.

Abstract:
We introduce Flower, a neural architecture for learning PDE solution operators built entirely from multihead warps. Aside from pointwise channel mixing and a multiscale scaffold, Flowers use no Fourier multipliers, no dotproduct attention, and no convolutional mixing. Each head predicts a displacement field and warps the mixed input features. Motivated by physics and computational efficiency, displacements are predicted pointwise, without any spatial aggregation, and nonlocality entersonlythrough sparse sampling at source coordinates,oneper head. Stacking warps in multiscale residual blocks yields Flowers, which implement adaptive, global interactions at linear cost. We theoretically motivate this design through three complementary lenses: flow maps for conservation laws, waves in inhomogeneous media, and a kinetic-theoretic continuum limit. Flowers achieve excellent performance on a broad suite of 2D and 3D time-dependent PDE benchmarks, particularly flows and waves. A compact 17M-parameter model consistently outperforms Fourier, convolution, and attention-based baselines of similar size, while a 150M-parameter variant improves over recent transformer-based foundation models with much more parameters, data, and training compute.

Abstract:
Deep visual features are increasingly used as the interface in vision systems, motivating the need to describe feature characteristics and control feature quality for machine perception. Justnoticeable difference (JND) characterizes the maximum imperceptible distortion for images under human or machine vision. Extending it to deep visual features naturally meets the above demand by providing a task-aligned tolerance boundary in feature space, offering a practical reference for controlling feature quality under constrained resources. We propose FeatJND, a task-aligned JND formulation that predicts the maximum tolerable per-feature perturbation map while preserving downstream task performance. We propose a FeatJND estimator at standardized split points and validate it across image classification, detection, and instance segmentation. Under matched distortion strength, FeatJND-based distortions consistently preserve higher task performance than unstructured Gaussian perturbations, and attribution visualizations suggest FeatJND can suppress non-critical feature regions. As an application, we further apply FeatJND to token-wise dynamic quantization and show that FeatJND-guided step-size allocation yields clear gains over random step-size permutation and global uniform step size under the same noise budget. Our code will be released after publication.

Abstract:
Neural networks adapt through firstorder parameter updates, yet it remains unclear whether such updates preserve logical coherence. We investigate the geometric limits of the Linear Propagation Assumption (LPA), the premise that local updates coherently propagate to logical consequences. To formalize this, we adopt relation algebra and study three core operations on relations: negation flips truth values, converse swaps argument order, and composition chains relations. For negation and converse, we prove that guaranteeing direction-agnostic first-order propagation necessitates a tensor factorization separating entity-pair context from relation content. However, for composition, we identify a fundamental obstruction. We show that composition reduces to conjunction, and prove that any conjunction well-defined on linear features must be bilinear. Since bilinearity is incompatible with negation, this forces the feature map to collapse. These results suggest that failures in knowledge editing, the reversal curse, and multi-hop reasoning may stem from common structural limitations inherent to the LPA.

Abstract:
We study differentially private ordinary least squares (DPOLS) with bounded data (X,Y) via sketching-based mechanisms. While Gaussian sketching approaches have been explored for DP-OLS \citepsheffet2017differentially, they are typically viewed as less competitive than the Adaptive Sufficient Statistics Perturbation (AdaSSP) method \citepwang_adassp, which directly perturbs the sufficient statistics (X^\topX, X^\topY) and is information theoretically optimal while also exhibiting strong empirical performance. In this work, we propose the \emphIterative Hessian Mixing (IHM), an algorithm that builds on Gaussian sketching approaches to DP-OLS and is inspired by the Iterative Hessian Sketch of \citetpilanci_hessiansketch. We prove that IHM is differentially private and provide utility guarantees in the form of excess empirical risk bounds. These bounds improve upon those of AdaSSP by removing a multiplicative factor that can be as large as the square root of the data dimension. The design of the IHM is based on new accuracy guarantees that we present for prior Gaussian sketching approaches for DP-OLS, which clarify when these methods are expected to perform well and how IHM circumvents their inherent limitations. We also conduct a rigorous empirical evaluation on a large suite of datasets, demonstrating that IHM consistently outperforms prior baselines, including AdaSSP.

Abstract:
Despite recent progress in calibrationfree monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.

Abstract:
Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause realworld harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories---Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming---validated as realistic by human raters. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioral tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the board. Factor analysis reveals that alignment behaves as a unified construct (analogous to the g-factor in cognitive research) with models scoring high on one category tending to score high on others. We publicly release the benchmark and an interactive leaderboard to support ongoing evaluation, with plans to expand scenarios in areas where we observe persistent weaknesses and to add new models as they are released.

Abstract:
Spiking Neural Networks (SNNs), with their eventdriven and biologically inspired mechanisms, are well-suited for energy-efficient neuromorphic hardware. Neural coding, which is critical to SNNs, determines how information is represented via spikes. While Time-to-First-Spike (TTFS) coding uses a single spike per neuron to offer extreme sparsity and energy efficiency, it often suffers from unstable training and low accuracy due to its sparse firing. To address these challenges, we propose a training framework that incorporates parameter initialization, training normalization, a temporal output decoder, and a re-evaluation of the pooling layer. The proposed parameter initialization and training normalization mitigate signal diminishing and gradient vanishing, which helps stabilize training. Our output decoder aggregates temporal spikes to encourage earlier firing, thereby reducing latency. The re-evaluation of the pooling layer demonstrates that max-pooling violates single-spike constraints, which should be avoided, whereas average-pooling preserves them. Experiments show that our framework stabilizes and accelerates training, reduces latency, and achieves state-of-the-art accuracy for step-by-step TTFS SNNs on MNIST (99.48%), Fashion-MNIST (92.90%), CIFAR10 (90.56%), CIFAR100 (70.27%) and DVS Gesture (95.83%).

Abstract:
Many modern generative models can be viewed as minimizing divergences between probability distributions, yet they rely on different algorithmic and geometric principles. Wasserstein gradient flows provide a continuoustime formulation for optimizing over distributions, and can be approximated through their implicit discretization via the Jordan–Kinderlehrer–Otto (JKO) scheme. In this work, we present a unified theoretical framework for generative modeling based on Wasserstein gradient flows, which we refer to as Generative Wasserstein Flows. We show that a broad class of existing methods can be derived as instances of parametric JKO schemes for f-divergences objectives, and we establish equivalences between several recently proposed algorithms. We extend this framework beyond f-divergences to integral probability metrics, deriving new JKO-based generative algorithms for objectives such as Maximum Mean Discrepancy. We also clarify their connections with GANs. Finally, we analyze parametric Wasserstein flows, where the evolution is restricted to distributions generated by parameterized maps. We characterize the resulting dynamics as projected or preconditioned Wasserstein gradient flows, highlighting the role of the Wasserstein geometry in shaping the learning dynamics of generative models.

Abstract:
Foundational Models pretrained on huge amount of data learn representations that evolve across depth, forming a hierarchy of embeddings with distinct semantic content and geometric structure. Contrary to the widespread practice of using only the final layer or shallow mixtures, we show that taskrelevant information is distributed non-monotonically across layers and cannot be recovered by naive aggregation. Through a geometric and empirical study across multiple modalities, we show that effective transfer depends on identifying which layers encode task-discriminative structure and how their embeddings are geometrically organized. We introduce Layer-wise Optimal Embedding Selection (LOES), a constructive spectral method that identifies task-discriminative subspaces by minimizing residual error under orthogonality and isotropy constraints. To align fine-tuning with this selection principle, we further propose Geometric Regularization Loss (GeoReg), which enforces a simplicial structure on class manifolds and stabilizes representation geometry during fine-tuning. Across a wide range of architectures, depths, modalities, and data regimes, LOES consistently outperforms standard baselines, with gains that grow as model depth increases. Beyond accuracy, our method reveals how semantic factors are distributed across layers, thereby enabling cross-lingual and cross-modal interpretability analyses. Together, our results provide strong evidence that layerwise embedding geometry is not incidental but central to how deep models represent and transfer knowledge.

Abstract:
The remarkable success of Chainof-Thought (CoT), which enhances performance by scaling generation steps at test-time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve the generation of each individual token? To address this, we propose a novel pre-training methodology: Pretraining Language Models with Latent Thoughts (PonderLM-2). Our approach pretrains a language model (LM) to first generate an intermediate latent thought—the last hidden state of the current position—which is then used as input to predict the actual subsequent token. This additional computational step enables the LM to refine its prediction within unconstrained continuous space. Our experiments demonstrate that, at an identical inference cost, a LM that generates one additional latent thought per token outperforms a standard model with double the parameters. For instance, our PonderLM-2-Pythia-1.4B, pretrained on 300B tokens from the Pile, significantly surpasses the vanilla Pythia-2.8B trained on the same data on both language modeling and a range of general downstream tasks. Furthermore, increasing the number of latent thoughts generated before each actual token—forming a chain analogous to CoT—consistently improves the model's performance.

Abstract:
This paper studies the use of kernel density estimation (KDE) for linear algebraic tasks involving the kernel matrix of a collection of n data points in \mathbbR^d. In particular, we improve upon the best existing algorithms for computing the following up to (1+\varepsilon) relative error for a Gaussian kernel matrix and other kernels: matrixvector products, matrix-matrix products, the spectral norm, and sum of all entries. The runtimes of our algorithms depend linearly on the dimension d, sub-quadratically in the number of points n, and polynomially on the target error \varepsilon. Importantly, the dependence on n in each case is far lower when accessing the kernel matrix through KDE queries as opposed to reading individual entries. Our improvements over existing best algorithms (particularly those of [Backurs et al. ICML `21]) for these tasks reduce the polynomial dependence on \varepsilon, and additionally decrease the dependence on n in the case of computing the sum of all entries of the kernel matrix. For example, we reduce the power of 1/\epsilon from \approx 7.7 to \approx 3.2 for a 1-\varepsilon relative error estimation of the spectral norm of a Gaussian kernel matrix. We complement our upper bounds with several lower bounds for related problems, which provide (conditional) quadratic time hardness results and additionally hint at the limits of KDE based approaches for the problems we study.

Abstract:
Residual reinforcement learning (RL) enables stable online refinement of expressive pretrained policies by freezing the base and learning only bounded corrections. However, value learning in residual RL poses unique challenges that remain poorly understood. In this work, we identify two key bottlenecks: cold start pathology, where the critic lacks knowledge of the value landscape around the base policy, and structural scale mismatch, where the residual contribution is dwarfed by the base action. Through systematic investigation, we uncover the mechanisms underlying these bottlenecks, revealing that simple yet principled solutions suffice: basepolicy transitions serve as an essential value anchor for implicit warmup, and critic normalization effectively restores representation sensitivity for discerning value differences. Based on these insights, we propose DAWN (Data-Anchored Warmup and Normalization), a minimal approach targeting efficient value learning in residual RL. By addressing these bottlenecks, DAWN demonstrates substantial efficiency gains across diverse benchmarks, policy architectures, and observation modalities.

Abstract:
We study a setting in which two players play a (possibly approximate) Nash equilibrium of a bimatrix game, while a learner observes only their actions and has no knowledge of the equilibrium or the underlying game. A natural question is whether the learner can rationalize the observed behavior by inferring the players' payoff functions. Rather than producing a single payoff estimate, inverse game theory aims to identify the entire set of payoffs consistent with observed behavior, enabling downstream use in, e.g., counterfactual analysis and mechanism design across applications like auctions, pricing, and security games. We focus on the problem of estimating the set of feasible payoffs with high probability and up to precision \epsilon on the Hausdorff metric. We provide the first minimaxoptimal rates for both exact and approximate equilibrium play, in zero-sum as well as general-sum games. Our results provide learning-theoretic foundations for set-valued payoff inference in multi-agent environments.

Abstract:
In this paper, we propose a general theory of ambiguityaverse MDPs, which treats the uncertain transition probabilities as random variables and evaluates a policy via a risk measure applied to its random return. This ambiguity-averse MDP framework unifies several models of MDPs with epistemic uncertainty for specific choices of risk measures. We extend the concepts of value functions and Bellman operators to our setting. Based on these objects, we establish the consequences of dynamic programming principles in this framework (existence of stationary policies, value and policy iteration algorithms), and we completely characterize law-invariant risk measures compatible with dynamic programming. Our work draws connections among several variants of MDP models and fully delineates what is possible under the dynamic programming paradigm and which risk measures require leaving it.

Abstract:
Backdoor attacks can introduce deceptive behaviors into large language models, causing them to execute prohibited actions only when specific secret triggers appear in the input. Existing safety training methods largely fail to address this vulnerability, due to the inherent difficulty of uncovering hidden triggers embedded within the model. Motivated by recent findings on LLMs’ situational awareness, we propose a novel posttraining framework that cultivates backdoor self-awareness, enabling a poisoned LLM to precisely articulate its own implanted triggers. At its core, our approach introduces an inversion-inspired reinforcement learning framework that encourages models to introspectively reason about their behaviors and gradually reverse-engineer the triggers responsible for misaligned outputs. Building upon precise trigger articulation, we further present two complementary defense strategies for mitigating and detecting backdoor threats. Experiments on five backdoor attacks, compared against six baseline methods, demonstrate that our approach has strong potential to improve the robustness of LLMs against backdoor risks.

Abstract:
Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs). However, we identify three fundamental limitations of purely numerical feedback: performance plateaus, ineffective spontaneous selfreflection, and persistent failures. We show that plateaued RL models can successfully refine failed solutions when given natural language critiques. Motivated by this, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for policy optimization. This approach enables LLMs to learn simultaneously from initial responses and critique-guided refinements, effectively internalizing the exploration benefits of both stages. Extensive experiments show that Critique-GRPO outperforms all compared supervised and RL-based fine-tuning methods, achieving average Pass@1 improvements of approximately +15.0-21.6% on various Qwen models and +7.3% on Llama-3.2-3B-Instruct across eight challenging reasoning tasks. Notably, Critique-GRPO facilitates effective self-improvement through self-critiquing, achieving substantial gains over GRPO, e.g., a +16.7% Pass@1 improvement on AIME 2024.

Abstract:
Can a diffusion model trained on bedrooms recover human faces? Diffusion models are widely used as priors for inverse problems, but standard approaches usually assume a highfidelity model trained on data that closely match the unknown signal. In practice, one often must use a mismatched or low-fidelity diffusion prior. Surprisingly, these weak priors often perform nearly as well as full-strength, in-domain baselines. We study when and why inverse solvers are robust to weak diffusion priors. Through extensive experiments, we find that weak priors succeed when measurements are highly informative (e.g., many observed pixels), and we identify regimes where they fail. Our theory, based on Bayesian consistency, gives conditions under which high-dimensional measurements make the posterior concentrate near the true signal. These results provide a principled justification on when weak diffusion priors can be used reliably. Code is available at Code is available at https://anonymous.4open.science/r/weak-diffusion-priors-inverse-problem-1043.

Abstract:
WhileProverVerifier Games(PVGs) offer a promising path toward verifiability in nonlinear classification models, they have not yet been applied to complex inputs such as high-dimensional images. Conversely, expressiveconcept encodingseffectively allow to translate such data into interpretable concepts but are often utilised in the context of low-capacity linear predictors. In this work, we push towards real-world verifiability by combining the strengths of both approaches. We introduceNeural Concept Verifier (NCV), a unified framework combining PVGs for formal verifiability with concept encodings to handle complex, high-dimensional inputs in an interpretable way. NCV achieves this by utilizing recent minimally supervised concept discovery models to extract structured concept encodings from raw inputs. Aproverthen selects a subset of these encodings, which averifier, implemented as a nonlinear predictor, uses exclusively for decision-making. Our evaluations show that NCV outperforms classic concept-based models and pixel-based PVG classifier baselines on high-dimensional, logically complex datasets and helps mitigate shortcut behavior. Overall, we demonstrate NCV as a promising step toward concept-level, verifiable AI.

Abstract:
The highlevel concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as sparse autoencoders (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as circuit tracing. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of \approx 10^2 MLP neurons is enough to control model behaviour. On the multi-hop city \to state \to capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. 'map city to its state'), and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.

Abstract:
We propose SWING: Space Walks for Implicit Network Graphs, a new class of algorithms for computations involving Graph Random Features on graphs given by implicit representations (igraphs), where edge-weights are defined as bi-variate functions of feature vectors in the corresponding nodes. Those classes of graphs include several prominent examples, such as: \epsilon-neighborhood graphs, used on regular basis in machine learning. Rather than conducting walks on graphs' nodes, those methods rely on walks in continuous spaces, in which those graphs are embedded. To accurately and efficiently approximate original combinatorial calculations, SWING applies customized Gumbel-softmax sampling mechanism with linearized kernels, obtained via random features coupled with importance sampling techniques. This mechanism is of its own interest. SWING relies on the deep connection between implicitly defined graphs and Fourier analysis, presented in this paper. SWING is accelerator-friendly and does not require input graph materialization. We provide detailed analysis of SWING and complement it with thorough experiments on different classes of i-graphs.

Abstract:
Natural language data follows a powerlaw distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data toward a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power law sampling induces a beneficial asymmetry that improves the pathological loss landscape, which enables models to first acquire high-frequency skill compositions with low data complexity, which in turn serves as a stepping stone to efficiently learn rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.

Abstract:
Monte Carlo Tree Search (MCTS) scales poorly in cooperative multiagent domains because expansion must consider an exponentially large set of joint actions, severely limiting exploration under realistic search budgets. We propose \textscNonZero, which keeps multi-agent MCTS tractable by running surrogate-guided selection over a low-dimensional nonlinear representation using an interaction-guided proposal rule, instead of directly exploring the full joint-action space. Our exploration uses an interaction score: single-agent deviations are ranked by predicted gain, while two-agent deviations are scored by a mixed-difference measure that reveals coordination benefits even when no single agent can improve alone. We formalize candidate proposal as a bandit problem over local deviations and derive a proposal rule, \textscNonUCT, with a sublinear local-regret guarantee for reaching approximate graph-local optima without enumerating the joint-action space. Empirically, \textscNonZero improves sample efficiency and final performance on MatGame, SMAC, and SMACv2 relative to strong model-based and model-free baselines under matched search budgets.

Abstract:
We introduce Robust Filter Attention (RFA), an attention mechanism that reformulates selfattention as parallel robust filtering under a latent stochastic differential equation (SDE) prior, where analytically propagated uncertainty defines a time-dependent precision prior over attention weights. This formulation integrates key advantages of existing positional encodings: it preserves RoPE-style rotational structure while achieving long-context stability through explicit modeling of dissipation and diffusion. By imposing isotropic constraints on the dynamics and noise, RFA matches the \mathcalO(N^2 d) time and \mathcalO(N^2 + Nd) memory complexity of standard attention. Empirically, we find that uncertainty-aware weighting induces specialization into distinct filtering regimes across heads, improving temporal consistency and extrapolation across varying context lengths.

Abstract:
The growing scale of deep neural networks, encompassing large language models (LLMs) and vision transformers (ViTs), has made training from scratch prohibitively expensive and deployment increasingly costly. These models are often used as computational monoliths with fixed cost, a rigidity that does not leverage overparametrized architectures and largely hinders adaptive deployment across different cost budgets. We argue that importanceordered nested components can be extracted from pretrained models, and selectively activated on the available computational budget. To this end, our proposedFlexRankmethod leverages low-rank weight decomposition with nested, importance-based consolidation to extract submodels of increasing capabilities. Our approach enables a"train-once, deploy-everywhere"paradigm that offers a graceful trade-off between cost and performance without training from scratch for each budget - advancing practical deployment of large models.

Abstract:
To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or extensive reasoning chains, either weakening the numerical representations of LLMs or limiting the length of problems they can solve. We show that frontier LLMs require excessive amounts of reasoning tokens to solve even basic calculations, which is exacerbated by their tokenization strategies that split single numbers into multiple tokens. This motivates the need for efficient and effective singletoken number encodings. We introduce a set of desiderata for such encodings and show that existing approaches fail to fulfill them. To address these shortcomings, we propose BitTokens, a novel encoding strategy that represents any number as a single token using its IEEE 754 binary floating-point representation. Through extensive experiments we show that our BitTokens allow even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly. This newly gained efficiency could expand the length and complexity of problems language models can solve.

Abstract:
Traditional saliency map methods, popularized in computer vision, highlight individual points (pixels) of the input that contribute the most to the model's output. However, in time series, they offer limited insights, as semantically meaningful features are often found in other domains. We introduce Crossdomain Integrated Gradients, a generalization of Integrated Gradients. Our method enables feature attributions in any domain that can be formulated as an invertible, differentiable transformation of the time domain. Crucially, our derivation extends the original Integrated Gradients into the complex domain, enabling frequency-based attributions. We provide the necessary theoretical guarantees, namely, path independence and completeness. We validate our method via controlled experiments with mechanistic analysis, quantitative faithfulness tests, and real-world case studies. Our approach reveals interpretable, problem-specific attributions that time-domain methods cannot capture in three real-world tasks across a variety of model architectures, machine-learning tasks, and cross-domain transforms: frequency-based attribution for a regression task in wearable heart rate extraction, independent component analysis in a classification task for electroencephalography-based seizure detection, and seasonal-trend decomposition for a forecasting problem with a zero-shot time-series foundation model. We release an open-source TensorFlow/PyTorch library to enable plug-and-play cross-domain explainability for time-series models. These results demonstrate the ability of Cross-Domain Integrated Gradients to provide semantically meaningful insights into time-series models that are impossible to achieve with traditional saliency in the time domain.

Abstract:
Exploring causal relationships in stochastic time series is a challenging yet crucial task with a vast range of applications, including finance, economics, neuroscience, and climate science. Many algorithms for Causal Discovery (CD) have been proposed; however, they often exhibit a high sensitivity to noise, resulting in spurious causal inferences on real data. In this paper, we observe that the frequency spectra of many realworld time series follow a power-law distribution, notably due to an inherent self-organizing behavior. Leveraging this insight, we build a robust CD method based on the extraction of power‑law spectral features that amplify genuine causal signals. Our method consistently outperforms state-of-the-art alternatives on both synthetic benchmarks and real-world datasets with known causal structures, demonstrating its robustness and practical relevance.

Abstract:
We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignmentsensitive complexity measure that depends jointly on the signal, spectrum, and noise level \sigma^2. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most K, the minimax excess risk scales as \sigma^2 K. Furthermore, we analyze overparameterized gradient flow and prove that it can reduce the ESD of a sequence model, which in turn moves the problem into an easier ESD class and lowers the corresponding minimax risk. This analysis suggests a general route to study how adaptive feature learning can improve generalization through signal-kernel alignment: adaptive learning procedures reshape the kernel so that the ESD decreases and the problem enters an easier ESD class. We also extend the ESD framework to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.

Abstract:
Large language models can be steered at inference time through prompting or activation interventions, but activation steering methods often underperform compared to promptbased approaches. We investigate whether activation steering can be improved by learning to mimic the interventions that prompt steering triggers within the model. To this end, we introducePrompt Steering Replacement (PSR)models, a new family of activation steering methods that distill prompt steering behavior into interpretable interventions on model activations. A PSR is an activation steering method that estimates position-specific steering coefficients and is trained to imitate prompt-based interventions. Experiments on persona steering and instruction following across multiple language models demonstrate that PSR models consistently outperform constant-coefficient interventions that are frequently used in the literature and achieve performance close to or exceeding prompt steering while maintaining interpretability.

Abstract:
Graph Neural Networks (GNNs) are bottlenecked by sparse, irregular memory access. Popular frameworks such as DGL and PyTorch Geometric support general message passing, but complex layers often materialize edgewise intermediates, increasing memory traffic and limiting scalability on large graphs. We take an I/O- and arithmetic-intensity--centric view and show that widely used layers fall into three kernel families: SpMM-based convolutions, reduction-based aggregations, and attention-based layers (GATv2/Graph Transformer). For each family, we develop GPU kernels that reduce data movement, improve locality, and remain robust across realistic graphs. We also study graph reordering and find that its impact depends on the kernel mapping: it benefits neighbor-parallel (gather-dominated) kernels more consistently than feature-parallel designs. Empirically, our fused attention kernels reach up to 3.9× speedup for Graph Transformer (median 1.6×), with Tensor Core (block-sparse) variants up to 7.3× on locally dense graphs; for GATv2 we reach up to 8.5× speedup (median 2.0×) while reducing peak memory by up to 76× (median 6×). Our degree-aware reduction kernels achieve up to 10× speedup (median 2.6×). For SpMM-based layers, properly cached cuSPARSE achieves up to 8× speedup over DGL and outperforms evaluated custom baselines in the majority of evaluations. We release our implementations as drop-in replacements to support reproducible, hardware-aware GNN acceleration.

Abstract:
We present Mira, a method for estimating the expected probability that samples from a candidate conditional distribution match the true, unknown conditional distribution, for which only datalabel pairs are available. We derive theoretical bounds obtained when the candidate distribution matches the true one and when the conditional distributions are independent. This framework thus enables model comparison by quantifying the alignment between the conditional distribution of a candidate model and the data-label pairs of the true model. Consequently, Mira enables Bayesian model comparison through direct posterior validation, bypassing the challenging evidence computation. We demonstrate its effectiveness across several toy problems and Bayesian inference tasks.

Abstract:
We study posttraining linear autoregressive models with outcome and process rewards. Given a context \boldsymbolx, the model must predict the response \boldsymboly \in \mathcalY^N, a sequence of length N that satisfies a standard \gamma margin assumption extended to sequences. We prove that on test samples where the base model achieves a non-trivial likelihood \alpha, a variant of policy gradient (PG) can achieve likelihood 1 - \varepsilon with an essentially minimax optimal number of reward queries \tilde\mathcalO((\alpha^-1 + \varepsilon^-1)/\gamma^2). However, a barrier arises for going beyond the support of the base model. We prove that the overall expected error after post-training with outcome rewards is governed by a property of the base model we call the Likelihood Quantile (LQ), and that variants of PG, while minimax optimal, may require a number of reward queries exponential in N to go beyond this support, regardless of the pre-training algorithm. To overcome this barrier, we study post-training with a process reward model, and demonstrate how PG variants in this setting avoid the curse of dimensionality in N via dependence on a token-level LQ. Along the way, we prove that under the margin condition, SGD with adaptive learning rate (LR) achieves a near optimal test error for statistical learning, and PG with adaptive LR achieves a near optimal number of mistakes for online learning while being computationally efficient whenever possible, both of which may be of independent interest.

Abstract:
Neuralnetwork wave functions in Variational Monte Carlo (VMC) have achieved great success in accurately representing both ground and excited states. However, achieving sufficient numerical accuracy of state overlaps requires growing the number of Monte Carlo samples, and consequently computational cost, with the number of states. We present a nearly constant sample size approach named Multi-State Importance Sampling (MSIS), which leverages all states' samples to estimate any pairwise overlap. To efficiently evaluate all states for all samples, we introduce Excited Pfaffians. Inspired by Hartree-Fock, this architecture represents many states within a single neural network. Excited Pfaffians also serve as generalized wave functions, allowing a single model to represent multi-state potential energy surfaces. On the carbon dimer, we match the \mathcalO(N_s^4)-scaling natural excited states while training >200× faster and modeling 50% more states. Our favorable scaling enables us to be the first to use neural networks to find all distinct energy levels of the Beryllium atom. Finally, we demonstrate that a single wave function can represent excited states across various molecules.

Abstract:
Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, yet persistently suffer from hallucinations and erroneous logic. While formal theorem proving (FTP) shows promise in processlevel reliability, it is limited toverification(checking known propositions). This leaves constructive problem-solving (finding unknown terms that satisfy specific conditions) underexplored and disconnected from process-level verifiability. To bridge this gap, we introduceFPS(FormalProblem-Solving), a principled framework to encompass the end-to-end problem-solving process in Lean 4. In FPS, the answer is an unknown metavariable coupled with a proof obligation, forcing it to be mathematically derived and verified. We further presentD-FPS(DeductiveFPS), which enforces a rigorous chain-of-thought structure, aligning formal derivation with human reasoning steps. To support this direction, we construct three benchmarks via the manual refactoring of over 1,000 problems:FormalMath500,MiniF2F-Solving, andPutnamBench-Solving. We further proposeRPE(RestrictedPropositionalEquivalence), a symbolic metric that evaluates semantic correctness beyond brittle string matching. Extensive experiments with state-of-the-art provers reveal that solving is significantly harder than proving, highlighting the ``alignment tax'' required to transition from loose validity checking to constructive, human-aligned reasoning.

Abstract:
ClassifierFree Guidance (CFG) serves as the de facto control mechanism for conditional diffusion, yet high guidance scales notoriously induce oversaturation, texture artifacts, and structural collapse. We attribute this failure to a geometric mismatch: standard CFG performs Euclidean extrapolation in ambient space, inadvertently driving sampling trajectories off the high-density data manifold. To resolve this, we present Manifold-Optimal Guidance (MOG), a framework that reformulates guidance as a local optimal control problem. MOG yields a closed-form, geometry-aware Riemannian update that corrects off-manifold drift without requiring retraining. Leveraging this perspective, we further introduce Auto-MOG, a dynamic energy-balancing schedule that adaptively calibrates guidance strength, effectively eliminating the need for manual hyperparameter tuning. Extensive validation demonstrates that MOG yields superior fidelity and alignment compared to baselines, with virtually no added computational overhead.

Abstract:
While Bayesian inference provides a principled framework for reasoning under uncertainty, its widespread adoption is limited by the intractability of exact posterior computation, necessitating the use of approximate inference. However, existing methods are often computationally expensive, or demand costly retraining when priors change, limiting their utility, particularly in sequential inference problems such as realtime sensor fusion. To address these challenges, we introduce the Distribution Transformer---a novel architecture that can learn arbitrary distribution-to-distribution mappings. Our method can be trained to map a prior to the corresponding posterior, conditioned on some dataset---thus performing approximate Bayesian inference. Our novel architecture represents a prior distribution as a (universally-approximating) Gaussian Mixture Model (GMM), and transforms it into a GMM representation of the posterior. The components of the GMM attend to each other via self-attention, and to the datapoints via cross-attention. We demonstrate that Distribution Transformers both maintain flexibility to vary the prior, and significantly reduces computation times—from minutes to milliseconds—while achieving expected log-likelihood performance on par with or superior to existing approximate inference methods across tasks such as sequential inference, quantum system parameter inference, and Gaussian Process predictive posterior inference with hyperpriors.

Abstract:
Understanding what graph neural networks can learn, especially their ability to learn to execute algorithms, remains a central theoretical challenge. In this work, we prove exact learnability results for graph algorithms under boundeddegree and finite-precision constraints. Our approach follows a two-step process. First, we train an ensemble of multi-layer perceptrons (MLPs) to execute the local instructions of a single node. Second, during inference, we use the trained MLP ensemble as the update function within a graph neural network (GNN). Leveraging Neural Tangent Kernel (NTK) theory, we show that local instructions can be learned from a small training set, enabling the complete graph algorithm to be executed during inference without error and with high probability. To illustrate the learning power of our setting, we establish a rigorous learnability result for the \textscLOCAL model of distributed computation. We further demonstrate positive learnability results for widely studied algorithms such as message flooding, breadth-first and depth-first search, and Bellman-Ford.

Abstract:
Large language models (LLMs) have recently shown strong performance on mathematical benchmarks. At the same time, they are prone to hallucination and sycophancy, often providing convincing but flawed proofs for incorrect mathematical statements provided by users. This significantly limits the applicability of LLMs in theorem proving, as verification of these flawed proofs must be done manually by expert mathematicians. However, existing benchmarks that measure sycophancy in mathematics are limited: they focus solely on finalanswer problems, rely on very simple and often contaminated datasets, and construct benchmark samples using synthetic modifications that create ill-posed questions. To address these issues, we introduce BrokenMath, the first benchmark for evaluating sycophantic behavior in LLMs within the context of natural language theorem proving. BrokenMath is built from advanced 2025 competition problems, which are perturbed with an LLM to produce false statements and subsequently refined through expert review. We evaluate state-of-the-art LLMs and agentic systems and find that sycophancy is widespread, with the best model, GPT-5, producing sycophantic answers 29% of the time. We further investigate several mitigation strategies, including test-time interventions and supervised fine-tuning on curated sycophantic examples. These approaches reduce, but do not eliminate, sycophancy.

Abstract:
Gradientvariation online learning has drawn increasing attention due to its deep connections to game theory, optimization, etc. It has been studied extensively in the full-information setting, but is underexplored with bandit feedback. In this work, we focus on gradient variation in Bandit Convex Optimization (BCO) with two-point feedback. By proposing a refined analysis on thenon-consecutivegradient variation, a fundamental quantity in gradient variation with bandits, we improve the dimension dependence for both convex and strongly convex functions compared with the best known results (Chiang et al., 2013). Our improved analysis for the non-consecutive gradient variation also implies other favorable problem-dependent guarantees, such as gradient-variance and small-loss regrets. Beyond the two-point setup, we demonstrate the versatility of our technique by achieving thefirstgradient-variation bound for one-point bandit linear optimization over hyper-rectangular domains. Finally, we validate the effectiveness of our results in more challenging tasks such as dynamic/universal regret minimization and bandit games, establishing thefirstgradient-variation dynamic and universal regret bounds for two-point BCO and fast convergence rates in bandit games.

Abstract:
Equivariant neural networks exploit underlying task symmetries to improve generalization, but strict equivariance constraints can induce more complex optimization dynamics that can hinder learning. Prior work addresses these limitations by relaxing strict equivariance during training, but typically relies on prespecified, explicit, or implicit target levels of relaxation for each network layer, which are taskdependent and costly to tune. We propose Recurrent Equivariant Constraint Modulation (RECM), a layer-wise constraint modulation mechanism that learns appropriate relaxation levels solely from the training signal and the symmetry properties of each layer's input-target distribution, without requiring any prior knowledge about the task-dependent target relaxation level. We demonstrate that under the proposed RECM update, the relaxation level of each layer provably converges to a value upper-bounded by its symmetry gap, namely the degree to which its input-target distribution deviates from exact symmetry. Consequently, layers processing symmetric distributions recover full equivariance, while those with approximate symmetries retain sufficient flexibility to learn non-symmetric solutions when warranted by the data. Empirically, RECM outperforms prior methods across diverse exact and approximate equivariant tasks, including the challenging molecular conformer generation on the GEOM-Drugs dataset.

Abstract:
As a foundational task in humancentric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis, yet existing approaches predominantly focus on aligning entire motion sequences with global textual representations. This global-centric paradigm overlooks fine-grained interactions between local motion segments and individual body joints and text tokens, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial body joints, and learns cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion, effectively capturing both local semantic details and hierarchical structural relationships. Extensive experiments on multiple public benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, achieving precise alignment between motion segments and body joints and their corresponding text tokens. The code of this work will be released upon acceptance.

Abstract:
Reinforcement learning with verifiable rewards (RLVR) trains a policy by verifying sampled completions and reinforcing higherscoring outputs, but practical verifiers (e.g., incomplete unit tests or noisy judges) are prone to false positives and false negatives. We ask when such noise merely slows learning and when it reverses it. Modeling GRPO-style RLVR as a bandit over recurring \emphreasoning modes, we derive mean-field replicator-style (natural-selection) flow on the probability simplex. The dynamics decouples into within-correct-mode competition and a one-dimensional evolution for the mass on incorrect modes, whose drift is determined solely by Youden's index J=\mathrmTPR-\mathrmFPR. This yields a sharp phase transition: when J>0, the incorrect mass is driven toward extinction (learning); when J=0, the process is neutral; and when J<0, incorrect modes amplify until they dominate (anti-learning and collapse). In the learning regime J>0, noise primarily rescales convergence time (``rate, not fate''). Experiments on verifiable programming tasks under synthetic noise reproduce the predicted J=0 boundary. Beyond noise, the framework offers a general lens for analyzing RLVR stability, convergence, and algorithmic interventions.

Abstract:
Multilingual RetrievalAugmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. Despite their growing use, an open questions is whether the mixture of different document languages impacts generation and citation behavior inunintendedways. To investigate this, we introduce a controlled methodology using model internals to measure language preference while holding other factors such as document relevance constant. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English, with this bias amplified for lower-resource languages and for documents positioned mid-context. More crucially, we find that models sometimes trade-off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone. Our findings shed light on how language models leverage multilingual context and influence citation behavior.

Abstract:
Mixed Integer Linear Programming (MILP) is a fundamental NPhard problem that has garnered significant attention from both academia and industry. The Branch-and-Bound (B&B) algorithm is the dominant approach for solving MILPs, where branching decisions play a critical role and have recently been enhanced by neural methods. However, these methods still struggle with semantic variation across depths, the scarcity of upstream nodes, and the costly collection of strong branching samples. To address these issues, we propose SC-MILP, a Dynamic Stratified Contrastive Training Framework for MILP Branching. Our method groups B&B nodes based on their feature distributions and learns depth-aware, fine-grained node representations through dynamic stratified contrastive training. To address data scarcity and imbalance at upstream nodes, we introduce an upstream-augmented MILP derivation procedure that generates both theoretically equivalent and perturbed instances. Experiments on both synthetic and real-world MILP benchmarks, including large-scale instances, show that SC-MILP significantly improves branching accuracy, reduces solving time, with particularly strong gains at upstream nodes.

Abstract:
Optimaltransport distributionally robust optimization (OT-DRO) robustifies data-driven decision-making under uncertainty by capturing the sampling-induced statistical error via optimal transport ambiguity sets. The standard OT-DRO pipeline consists of a two-step procedure, where the ambiguity set is first designed and subsequently embedded into the downstream OT-DRO problem. However, this separation between uncertainty quantification and optimization may lead to excessive conservatism. We introduce an end-to-end pipeline to automatically learn decision-focused ambiguity sets for OT-DRO problems, where the loss function informs the shape of the ambiguity set, leading to less conservative decisions whose distributional robustness is enforced via data-driven bootstrapping. We formulate the learning problem as a bilevel optimization program and solve it via a hypergradient-based method. By leveraging the recently introduced nonsmooth conservative implicit function theorem, we establish convergence to a critical point of the bilevel problem. We present experiments validating our method on standard portfolio optimization and linear regression tasks.

Abstract:
Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities in visual generation and plotting, we identify a counterintuitive and underexplored phenomenon. Naively applying Supervised Fine-Tuning (SFT) on interleaved plot–solution data leads to a substantial degradation in reasoning performance compared to text-only baselines. We argue that this failure stems from a fundamental limitation of SFT, which primarily induces distributional alignment: the model learns to reproduce the surface format of interleaved plotting but fails to internalize the causal dependency between the generated plot and reasoning steps. To overcome this limitation, we propose Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that enforces three casual constraints to move beyond superficial imitation toward functional alignment. Extensive experiments show that Faire induces a qualitative shift in model behavior in which the plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks.

Abstract:
MoEPEFT methods combine Mixture of Experts with parameter-efficient fine-tuning for multi-task adaptation, but require separate adapters per expert—causing trainable parameters to scale linearly with expert count and limiting applicability to adapter-based architectures. We propose LiME (Lightweight Mixture of Experts), which achieves expert specialization through lightweight modulation rather than adapter replication. Instead of separate adapters, LiME uses a single shared PEFT module and modulates its output with lightweight expert vectors, reducing expert parameters while generalizing to any PEFT method. Notably, LiME introduces zero-parameter routing by leveraging existing frozen and adapted representations—eliminating learned router parameters typically required per layer. Theoretically, we prove that (i) more experts preserve more task-relevant information and (ii) modulation approximates full expert-specific PEFT with bounded error. LiME further incorporates n-gram windowed routing and adaptive expert selection (Auto Top-K) based on routing confidence. Experiments on MMT-47, a multimodal multi-task benchmark with 47 tasks spanning text, image, and video, demonstrate that LiME achieves competitive or superior performance while using up to 4× fewer trainable parameters and up to 29% faster training compared to corresponding MoE-PEFT baselines.

Abstract:
Diffusion models have demonstrated strong generative performance; however, generated samples often fail to fully align with human intent. This paper studies a testtime scaling method that enables sampling from regions with higher human-aligned reward values. Existing gradient guidance methods approximate the expected future reward (EFR) at an intermediate particle \mathbfx_t using a Taylor approximation, but this approximation at each time step incurs high computational cost due to sequential neural backpropagation. We show that the EFR at any \mathbfx_t can be computed using only marginal samples from a pre-trained diffusion model. The proposed EFR formulation detaches the neural dependency between \mathbfx_t and the EFR, enabling closed-form guidance computation without neural backpropagation. To further improve efficiency, we introduce lookahead sampling to collect marginal samples. For final sample generation, we use an accurate solver that guides particles toward high-reward lookahead samples. We refer to this sampling scheme as LiDAR sampling. LiDAR achieves substantial performance improvements using only three samples with a 3-step lookahead solver, exhibiting steep performance gains as lookahead accuracy and sample count increase; notably, it reaches the same GenEval performance as the latest gradient guidance method for SDXL with a 9.5× speedup.

Abstract:
The attention mechanism in a Transformer architecture matches key to query based on both content—the what—and position in a sequence—the where. We present an analysis indicating that what and where are entangled in the popular rotary position embedding (RoPE). This entanglement can impair performance particularly when decisions require independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embedding or PoPE, that eliminates the whatwhere confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to evaluation loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities compared not only to RoPE but even a method designed for extrapolation, YaRN, which requires additional fine tuning and frequency interpolation.

Abstract:
Standard practice across domains from robotics to language is to first pretrain a policy on a largescale demonstration dataset, and then finetune this policy, typically with reinforcement learning (RL), in order to improve performance on deployment domains. This finetuning step has proved critical in achieving human or super-human performance, yet while much attention has been given to developing more effective finetuning algorithms, little attention has been given to ensuring the pretrained policy is an effective initialization for RL finetuning. In this work we seek to understand how the pretrained policy affects finetuning performance, and how to pretrain policies in order to ensure they are effective initializations for finetuning. We first show theoretically that standard behavioral cloning (BC) can fail to ensure coverage over the demonstrator's actions, a minimal condition necessary for effective RL finetuning. We then show that if, instead of exactly fitting the observed demonstrations, we train a policy to model the posterior distribution of the demonstrator's behavior given the demonstration dataset, we do obtain a policy that ensures coverage over the demonstrator's actions, enabling more effective finetuning. Furthermore, this policy achieves this while ensuring pretrained performance is no worse than that of the BC policy. We then show this approach is practically implementable with modern generative models and leads to significantly improved RL finetuning performance on both realistic robotic control benchmarks and real-world robotic manipulation tasks, as compared to standard behavioral cloning.

Abstract:
We develop a general mathematical framework to analyze scaling regimes and derive explicit analytic solutions for gradient flow (GF) in large learning problems. Our key innovation is a formal power series expansion of the loss evolution, with coefficients encoded by diagrams akin to Feynman diagrams. We show that this expansion has a welldefined large-size limit that can be used to reveal different learning phases and, in some cases, to obtain explicit solutions of the nonlinear GF. We focus on learning Canonical Polyadic (CP) decompositions of high-order tensors, and show that this model has several distinct extreme lazy and rich GF regimes such as free evolution, NTK and under- and over-parameterized mean-field. We show that these regimes depend on the parameter scaling, tensor order, and symmetry of the model in a specific and subtle way. Moreover, we propose a general approach to summing the formal loss expansion by reducing it to a PDE; in a wide range of scenarios, it turns out to be 1st order and solvable by the method of characteristics. We observe a very good agreement of our theoretical predictions with experiment.

Abstract:
VisionLanguage-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed—insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory—requiring no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty, while focusing on exploitation when confident—enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that \method improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.

Abstract:
Causal representation learning (CRL) has garnered increasing interests from the causal inference and artificial intelligence community, due to its capability of disentangling potentially complex datagenerating mechanism into causally interpretable latent features, by leveraging the heterogeneity of modern datasets. In this paper, we further contribute to the CRL literature, by focusing on the stylized linear structural causal model over the latent features and assuming a linear mixing function that maps latent features to the observed data or measurements. Existing linear CRL methods often rely on stringent assumptions, such as accessibility to single-node interventional data or restrictive distributional constraints on latent features and exogenous measurement noise. However, these prerequisites can be challenging to satisfy in certain scenarios. In this work, we propose a novel linear CRL algorithm that, unlike most existing linear CRL methods, operates under weaker assumptions about environment heterogeneity and data-generating distributions while still recovering latent causal features up to an equivalence class. We further validate our new algorithm via synthetic experiments and an interpretability analysis of large language models (LLMs), demonstrating both its superiority over competing methods in finite samples and its potential in integrating causality into AI. Source code is available at \hrefhttps://anonymous.4open.science/r/creator-883D/the anonymous link.

Abstract:
Understanding why gradientbased training in deep networks exhibits strong implicit bias remains challenging, in part because tractable singular-value dynamics are typically available only for balanced deep linear models. We propose an alternative route based on two theoretically grounded and empirically testable signatures of deep Jacobians: depth-induced exponential scaling of ordered singular values and strong spectral separation. Adopting a fixed-gates view of piecewise-linear networks, where Jacobians reduce to products of masked linear maps within a single activation region, we prove the existence of Lyapunov exponents governing the top singular values at initialization, give closed-form expressions in a tractable masked model, and quantify finite-depth corrections. We further show that sufficiently strong separation forces singular-vector alignment in matrix products, yielding an approximately shared singular basis for intermediate Jacobians. Together, these results motivate an approximation regime in which singular-value dynamics become effectively decoupled, mirroring classical balanced deep-linear analyses without requiring balancing. Experiments in fixed-gates settings validate the predicted scaling, alignment, and resulting dynamics, supporting a mechanistic account of emergent low-rank Jacobian structure as a driver of implicit bias.

Abstract:
Why do thinking language models outperform their base counterparts, and what exactly do they learn during training? We introduce constructive model diffing, a framework for understanding finetuned models by explicitly constructing the base-to-fine-tuned difference from interpretable components to produce hybrid models, and measuring how well they recover the fine-tuned model's performance. For thinking models, we decompose the diff into two components: reasoning mechanisms (steering vectors that activate specific behaviors in the base model) and reasoning heuristics (a classifier that determines when each mechanism should fire). To ground this decomposition, we develop an unsupervised methodology using Sparse Autoencoders to discover interpretable taxonomies of reasoning behaviors. Evaluating nine model configurations (five RL-trained, four distilled), we find a striking difference between training methods: the hybrid models for the five RL-trained models achieve much higher performance recovery compared to the four distilled models. This indicates RL-trained models primarily learn sophisticated heuristics for deploying pre-existing base model mechanisms, while distillation affects the mechanisms themselves. These results provide a new lens for understanding what different training paradigms teach, with potential implications for efficient reasoning model development.

Abstract:
Reinforcement learning (RL) with combinatorial action spaces remains challenging because feasible action sets are exponentially large and governed by complex feasibility constraints, making direct policy parameterization impractical. Existing approaches embed taskspecific value functions into constrained optimization programs or learn deterministic structured policies, sacrificing generality and policy expressiveness. We propose a solver-induced \emphlatent spherical flow policy that brings the expressiveness of modern generative policies to combinatorial RL while guaranteeing feasibility by design. Our method, LSFlow, learns a \emphstochastic policy in a compact continuous latent space via spherical flow matching, and delegates feasibility to a combinatorial optimization solver that maps each latent sample to a valid structured action. To improve efficiency, we train the value network directly in the latent space, avoiding repeated solver calls during policy optimization. To address the piecewise-constant and discontinuous value landscape induced by solver-based action selection, we introduce a smoothed Bellman operator that yields stable, well-defined learning targets. Empirically, our approach outperforms state-of-the-art baselines by an average of 20.6% across a range of challenging combinatorial RL tasks.

Abstract:
The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement learning. We provide a theoretical justification for their empirical effectiveness by constructing and studying two linear filters: (i) the first exactly reproduces the pre–softmax logits of the belief vector in a hidden Markov model (HMM) under a deterministic transition matrix, thereby serving as a sufficient statistic for optimal policy learning, (ii) the second achieves vanishing statedecoding error under a nearly deterministic transition matrix, thus reducing state ambiguity to near zero. The results extend to action-controlled HMMs, where the corresponding linear filters become time-varying with action-dependent dynamics. We illustrate our main results through numerical experiments and further show that the constructed linear filter serves as a strong feature extractor in a small reinforcement learning game.

Abstract:
Baggingbased ensembles, most notably Adaptive Random Forests, are among the strongest performers for learning from data streams. A common denominator across these methods is their reliance on Hoeffding Trees as base learners, which grow incrementally by testing whether a candidate split is significantly better than its alternatives using concentration inequalities. Despite their empirical success, existing Hoeffding Trees variants lack valid statistical guarantees. Current analyses rely on fixed-sample concentration bounds, while split decisions are made using data-dependent stopping rules, which invalidates their guarantees and can drive the probabilty of incorrect splits to one. We introduce a principled alternative based on \emphanytime-valid inference. Our method provides: (i) anytime-valid control of false splits under arbitrary data streams, including non-stationary settings; (ii) finite commitment time under a predictive advantage; and (iii) under stationary i.i.d.\ data, risk is monotone decreasing and strictly improves at every split. Empirically, we evaluate both standalone trees and their use within Adaptive Random Forests on non-stationary streams. Our method improves performance while producing substantially smaller trees.

Abstract:
Most current paradigms in visual mechanistic interpretability (MI) remain confined to interpreting internal units of the vision model via heuristic methods (e.g., topK activation retrieval or optimization with regularization). In this work, we establish a theoretical distributional view for visual MI, which models the influence of a feature activation on the natural image distribution, thereby formulating a Kullback-Leibler (KL)-minimal optimization problem to model the MI task. Under this framework, statistical biases are identified within previous MI paradigms, which reveal that they may either be perceptually uninterpretable to humans (i.e., deviate from the natural image distribution), or mechanistically unfaithful to the vision models (i.e., unable to activate model features). To resolve the biases under the distributional view, we propose a model with a KL-minimal soft-constraint principle for visual MI that theoretically balances interpretability and faithfulness. We realize this principle via energy-guided diffusion posterior sampling. Extensive experiments validate the theoretical soundness of the proposed distributional view and demonstrate the practical effectiveness of our paradigm on the DINOv3 vision model.

Abstract:
While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synthesis, deepfakes, and copyright infringements. To address these challenges, concept erasure has emerged as a prospective safeguard. However, as the field gradually transitions from UNet-based diffusion models to Rectified Flow Transformers, erasure research has struggled to keep pace. In this work, we introduce GEM, a simple but highly effective erasure framework for Rectified Flow models. As part of our contribution, we establish a principled bridge between trajectory-based unlearning grounded in Generative Flow Networks and classic teacher-guided erasure: we translate trajectory-based signals into a teacher-guided flow-matching setup that unifies the strengths of both paradigms. Concretely, a teacher provides complementary attraction and repulsion signals that we combine into a single geometric guidance objective, yielding targeted suppression of unwanted concepts while preserving benign generation.

Abstract:
Large visionlanguage models (LVLMs) often hallucinate content that is fluent yet unsupported by the image, limiting their reliability in real-world deployment. We show that a key failure mode arises from route competition: even when visual tokens receive attention, the final token decision can be dominated by the textual pathway, causing the decoder to follow linguistic priors over visual evidence. To mitigate this, we propose a training-free, decision-aligned intervention that decomposes each attention head into a visual route and a text route, and estimates their token-level effects using an efficient one-forward/one-gradient approximation. These estimates reveal route conflict within heads and identify prior-dominant ones, enabling selective suppression of only the text route while keeping the visual route intact. Across five benchmarks spanning discriminative and generative settings, our method consistently reduces hallucination-related errors across models with limited impact on overall multimodal performance, while incurring a modest inference-time overhead.

Abstract:
When a learner is faced with a new task, but is given very few samples, it must leverage any available sideinformation. In practice, this often comes in the form of benchmarks, where there is abundant data to evaluate model performance on related tasks. Though task relatedness is difficult to formalize theoretically, it can be empirically observed throughweak monotonicity: if a model dominates another on all benchmarks, it also tends to outperform the other on the new task. We explore the statistical complexity of learning under weak monotonicity, leveraging it within two learning paradigms: transfer learning and model selection aggregation. We show that not only can we prune the model class based on monotonicity, but that we can further adapt to the geometry of the available trade-offs byhedging on the frontier.

Abstract:
Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation, but generation quality depends critically on the generation order. Prior work either hardcodes an ordering (e.g., blockwise left-to-right) or learns an ordering policy for a pretrained MDM, which incurs extra cost and can yield suboptimal solutions due to the two-stage optimization. Motivated by this, we propose order-expressive masked diffusion model (OeMDM) for a broad class of diffusion generative processes with various generation orders, enabling the interpretation of MDM, ARM, and block diffusion in a single framework. Furthermore, building on OeMDM, we introduce learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and diffusion backbone through a single objective from scratch, enabling the diffusion model to generate text in context-dependent ordering. Empirically, we confirm that LoMDM outperforms various discrete diffusion models across multiple language modeling benchmarks.

Abstract:
This paper is concerned with learning the inputoutput mapping of general nonlinear dynamical systems. While the existing literature focuses on Gaussian inputs and benign disturbances, we significantly broaden the scope of admissible control inputs and allow correlated, nonzero-mean, adversarial disturbances. With our reformulation as a linear combination of basis functions, we prove that the \ell_2-norm estimator overcomes the challenges posed by an adversary with access to the full information history, provided that the attack times are sparse, i.e., the probability that the system is under adversarial attack at a given time is smaller than a certain threshold. We provide an estimation error bound that decays with the input memory length and prove its optimality by constructing a problem instance that suffers from the same bound under probabilistic adversarial attacks. Our work provides a sharp input-output analysis for a generic nonlinear and partially observed system under significantly generalized assumptions compared to existing works.

Abstract:
Building scalable and reusable multiagent decision policies from offline datasets remains a challenge in offline multi-agent reinforcement learning (MARL), as existing methods often rely on fixed observation formats and action spaces that limit generalization. In contrast, large language models (LLMs) offer a flexible modeling interface that can naturally accommodate heterogeneous observations and actions. Motivated by this, we propose the Decision Language Model (DLM), which formulates multi-agent decision making as a dialogue-style sequence prediction problem under the centralized training with decentralized execution paradigm. DLM is trained in two stages: a supervised fine-tuning phase, which leverages dialogue-style datasets for centralized training with inter-agent context and generates executable actions from offline trajectories, followed by a group relative policy optimization phase to enhance robustness to out-of-distribution actions through lightweight reward functions. Experiments on multiple benchmarks show that a unified DLM outperforms strong offline MARL baselines and LLM-based conversational decision-making methods, while demonstrating strong zero-shot generalization to unseen scenarios across tasks.

Abstract:
Trajectory world models play a crucial role in robotic dynamics learning, planning, and control. While recent works have explored trajectory world models for diverse robotic systems, they struggle to scale to a large number of distinct system dynamics and overlook domain knowledge of physical structures. To address these limitations, we introduceWestWorld, a knoWledgeEncodedScalableTrajectoryWorldmodel for diverse robotic systems. To tackle the scalability challenge, we propose a novel system-aware Mixture-of-Experts (Sys-MoE) that dynamically combines and routes specialized experts for different robotic systems via a learnable system embedding. To further enhance zero-shot generalization, we incorporate domain knowledge of robot physical structures by introducing a structural embedding that aligns trajectory representations with morphological information. After pretraining on 89 complex environments spanning diverse morphologies across both simulation and real-world settings,WestWorldachieves significant improvements over competitive baselines in zero- and few-shot trajectory prediction. Additionally, it shows strong scalability across a wide range of robotic environments and significantly improves performance on downstream model-based control for different robots. Finally, we deploy our model on a real-world Unitree Go1, where it demonstrates stable locomotion performance (see our demo on the website).

Abstract:
Data attribution estimates how a trained model would change if a subset of training points were removed, and is a central primitive for tasks such as interpretability, data valuation, and machine unlearning. Despite its widespread use, our theoretical understanding of key data attribution methods Influence Functions (IF) and a single Newton Step (NS) -- remains limited: existing guarantees heavily rely on global strong convexity and yield bounds with pessimistic dependence on the parameter dimension d and the number of removed samples k. We give a new analysis of IF and NS for convex ERM that replaces global assumptions with local conditions: it suffices that the loss is strongly convex and sufficiently smooth only in a neighborhood of the first Newton step. As a concrete validation, we analyze logistic regression with Gaussian features and show that our bounds capture the correct scaling up to polylogarithmic factors, yielding matching upper and lower bounds and explaining observed regimes in which NS is markedly more accurate than IF, thereby resolving open questions raised by (Koh et al., 2019).

Abstract:
Despite their theoretical advantages, spectral methods based on the graph Fourier transform (GFT) are seldom used in graph neural networks (GNNs) due to the cost of computing the eigenbasis and the lack of vertexdomain locality in spectral representations. As a result, most GNNs rely on local approximations such as polynomial Laplacian filters or message passing, which limit their ability to model long-range dependencies. In this paper, we introduce a novel factorization of the GFT into operators acting on subgraphs, which are then combined via a sequence of Cauchy matrices. We use this factorization to propose a new class of spectral GNNs, which we term L2G-Net (Local-to-Global Net). Unlike existing spectral methods, which are either fully global (when they use the GFT) or local (when they use polynomial filters), L2G-Net operates by processing the spectral representations of subgraphs and then combining them via structured matrices. Our algorithm avoids full eigendecompositions, exploiting graph topology to construct the factorization with quadratic complexity in the number of nodes, scaled by the subgraph interface size. Experiments on benchmarks stressing non-local dependencies show that L2G-Net outperforms existing spectral techniques and is competitive with the state-of-the-art with orders of magnitude fewer learnable parameters.

Abstract:
Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, endto-end website development remains limited. To address this gap, we introduce \benchname, a hierarchical benchmark for visual website development, spanning from static UI-to-code generation, interactive multi-page frontend reproduction, to long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,256 test cases. To support flexible, thorough and reliable evaluation, we propose workflow-based agent verification paradigm based on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.

Abstract:
Learning with shared representation is widely recognized as an effective way to separate commonalities from heterogeneity across various heterogeneous sources. Most existing work includes all related data sources via simultaneously training a common feature extractor and sourcespecific heads. It is well understood that data sources with low relevance or poor quality may hinder representation learning. In this paper, we further dive into the question of which data sources should be learned jointly by focusing on the traditionally deemed "good" collection of sources, in which individual sources have similar relevance and qualities with respect to the true underlying common structure. Towards tractability, we focus on the linear setting where sources share a low-dimensional subspace. We find that source screening can play a central role in statistically optimal subspace estimation. We show that, for a broad class of problem instances, training on a carefully selected subset of sources suffices to achieve minimax optimality, even when a substantial portion of data is discarded. We formalize the notion of an informative subpopulation, develop algorithms and practical heuristics for identifying such subsets, and validate their effectiveness through both theoretical analysis and empirical evaluations on synthetic and real-world datasets.

Abstract:
Survival analysis is a cornerstone of clinical research by modeling timeto-event outcomes such as metastasis, disease relapse, or patient death. Unlike standard tabular data, survival data often come with incomplete event information due to dropout, or loss to follow-up. This poses unique challenges for synthetic data generation, where it is crucial for clinical research to faithfully reproduce both the event-time distribution and the censoring mechanism. In this paper, we propose SurvDiff, an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. SurvDiff is tailored to capture the data-generating mechanism by jointly generating mixed-type covariates, event times, and right-censoring, guided by a survival-tailored loss function. The loss encodes the time-to-event structure and directly optimizes for downstream survival tasks, which ensures that SurvDiff (i) reproduces realistic event-time distributions and (ii) preserves the censoring mechanism. Across multiple datasets, we show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and survival model evaluation metrics across multiple medical datasets. To the best of our knowledge, SurvDiff is the first end-to-end diffusion model explicitly designed for generating synthetic survival data.

Abstract:
Realworld machine learning (ML) pipelines rarely produce a single model; instead, they produce a Rashomon set of many near-optimal ones. We show that this multiplicity reshapes key aspects of trustworthiness. At the individual-model level, sparse interpretable models tend to preserve privacy but are fragile to adversarial attacks. In contrast, the diversity within a large Rashomon set enables reactive robustness: even when an attack compromises one model, others often remain accurate. Rashomon sets are also stable under small distribution shifts. However, this same diversity increases information leakage, as disclosing more near-optimal models provides an attacker with progressively richer views of the training data. Through theoretical analysis and empirical studies, we characterize this robustness–privacy trade-off and highlight the dual role of Rashomon sets as both a resource and a risk for trustworthy ML.

Abstract:
Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply stepwise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 interactions per task with the environment. This approach outperforms previous methods and even the manually designed environment dense reward on both the final success rate and sample efficiency. Moreover, we show that TimeRewarder pretraining can exploit real-world human videos, highlighting its potential as a scalable approach to rich reward signals from diverse video sources.

Abstract:
Distributionally robust optimisation (DRO) minimises the worstcase expected loss over an ambiguity set that can capture distributional shifts in out-of-sample environments. While Huber (linear-vacuous) contamination is a classical minimal-assumption model for an \varepsilon-fraction of arbitrary perturbations, including it in an ambiguity set can make the worst-case risk infinite and the DRO objective vacuous unless one imposes strong boundedness or support assumptions. We address these challenges by introducing bulk-calibrated credal ambiguity sets: we learn a high-mass bulk set from data while considering contamination inside the bulk and bounding the remaining tail contribution separately. This leads to a closed-form, finite \mathrmmean+\sup robust objective and tractable linear or second-order cone programs for common losses and bulk geometries. Through this framework, we highlight and exploit the equivalence between the imprecise probability (IP) notion of upper expectation and the worst-case risk, demonstrating how IP credal sets translate into DRO objectives with interpretable tolerance levels. Experiments on heavy-tailed inventory control, geographically shifted house-price regression, and demographically shifted text classification show competitive robustness-accuracy trade-offs and efficient optimisation times, using Bayesian, frequentist, or empirical reference distributions.

Abstract:
What fundamentally distinguishes an adversarial attack from a misclassification due to limited model expressivity or finite data? In this work, we investigate this question in the setting of highdimensional binary classification, where statistical effects due to limited data availability play a central role. We introduce a new error metric that precisely capture this distinction, quantifying model vulnerability to consistent adversarial attacks --- perturbations that preserve the ground-truth labels. Our main technical contribution is an exact and rigorous asymptotic characterization of these metrics in both well-specified models and latent space models, revealing different vulnerability patterns compared to standard robust error measures. The theoretical results demonstrate that as models become more overparameterized, their vulnerability to label-preserving perturbations grows, offering theoretical insight into the mechanisms underlying model sensitivity to adversarial attacks.

Abstract:
Large Language Model RedTeaming, which proactively identifies vulnerabilities of large language models, is an essential process for ensuring safety. Finding effective and diverse attacks in red team activities is important, but achieving both is challenging. Generative Flow Networks (GFN) that perform distribution matching are a promising method, but they are notorious for training instability and mode collapse. In particular, unstable reward functions in red team activities accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates Z estimation in GFN and reduces training instability. S-GFN avoids Z-estimation through pairwise comparisons and employs a robust masking methodology against noisy rewards. Additionally, we propose a fluency stabilizer to prevent the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while maintaining the optimal policy of GFN. We demonstrate the overwhelming attack performance and diversity of S-GFN across various settings.

Abstract:
Multifidelity (MF) regression often operates in regimes of extreme data imbalance, where the commonly-used Gaussian-process surrogates struggle with cubic scaling costs and overfit to sparse high-fidelity observations, limiting efficiency and generalization in real-world applications. We introduce FIRE, a training-free MF framework that couples tabular foundation models (TFMs) to perform zero-shot in-context Bayesian inference via a high-fidelity correction model conditioned on the low-fidelity model's posterior predictive distributions. This cross-fidelity information transfer via distributional summaries captures heteroscedastic errors, enabling robust residual learning without model retraining. Across 31 benchmark problems spanning synthetic functions and real-world tasks (e.g., DrivAerNet, LCBench), FIRE delivers a stronger performance–time trade-off than seven state-of-the-art GP-based or deep learning MF regression methods, ranking highest in accuracy and uncertainty quantification with runtime advantages. Limitations include context window constraints and dependence on the quality of the pre-trained TFM’s.

Abstract:
Scaling up data, parameters, and testtime computation has been the mainstream methods to improve LLM systems (LLMsys), but their upper bounds are almost reached due to the gradual depletion of high-quality data and marginal gains obtained from larger computational resource consumption. Inspired by the abilities of human and traditional AI systems in learning from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. Yet, existing benchmarks for LLM memory often focus on evaluating the system on homogeneous reading comprehension tasks with long-form inputs rather than testing their abilities to learn from accumulated user feedback in service time. Therefore, we propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys. Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfying, and we hope this benchmark could pave the way for future studies on LLM memory and optimization algorithms.

Abstract:
The rapid progress of generative models such as GANs and diffusion models has led to the widespread proliferation of AIgenerated images, raising concerns about misinformation, privacy violations, and trust erosion in digital media. Although large-scale multimodal models like CLIP offer strong transferable representations for detecting synthetic content, fine-tuning them often induces catastrophic forgetting, which degrades pre-trained priors and limits cross-domain generalization. To address this issue, we propose the Distillation-guided Gradient Surgery Network (DGS-Net), a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components. Specifically, we introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. By projecting task gradients onto the orthogonal complement of harmful directions and aligning with beneficial ones distilled from a frozen CLIP encoder, DGS-Net achieves unified optimization of prior preservation and irrelevant suppression. Extensive experiments on 50 generative models demonstrate that our method outperforms state-of-the-art approaches by an average margin of 6.6%, achieving superior detection performance and generalization across diverse generation techniques.

Abstract:
Simulating the longtime evolution of Hamiltonian systems is limited by the small timesteps required for stable numerical integration. To overcome this constraint, we introduce a framework to learn Hamiltonian Flow Maps by predicting the mean phase-space evolution over a chosen time span \Delta t, enabling stable large-timestep updates far beyond the stability limits of classical integrators. To this end, we impose a Mean Flow consistency condition for time-averaged Hamiltonian dynamics. Unlike prior approaches, this allows training on independent phase-space samples without access to future states, avoiding expensive trajectory generation. Validated across diverse Hamiltonian systems, our method in particular improves upon molecular dynamics simulations using machine-learned force fields (MLFF). Our models maintain comparable training and inference cost, but support significantly larger integration timesteps while trained directly on widely-available trajectory-free MLFF datasets.

Abstract:
Modern posttrained language models are increasingly capable, but remain prone to extrinsic hallucinations. We target the utility degradation issue that prior hallucination-reduction methods often struggle to avoid, and propose online RL with Binary Retrieval-Augmented Reward (Binary RAR) to reduce hallucinations while preserving general capabilities. Binary RAR assigns a reward of 1 if a response contains no factual contradictions with retrieved evidence, and 0 otherwise. We theoretically show that this method reduces the probability of error-containing responses while preserving the distribution of error-free responses. This helps preserve the model’s capabilities, whereas other methods often degrade them. We evaluate Binary RAR on multiple widely used models. On Qwen3-8B, it reduces long-form hallucination rates by 39.3% and short-form hallucination rates by 54.4%, outperforming supervised learning and preference optimization baselines. Our error analysis shows that continuous factuality rewards (e.g., VeriScore) can be exploited via reward hacking by producing fewer or more generic claims, whereas Binary RAR is more robust and better preserves general capabilities, including instruction following, math, and coding.

Abstract:
Recent advances in reinforcement learning (RL) have significantly improved the complex reasoning capabilities of large language models (LLMs). Despite these successes, existing methods mainly focus on singledomain RL (e.g., mathematics) with verifiable rewards (RLVR), and their reliance on purely online RL frameworks restricts the exploration space, thereby limiting reasoning performance. In this paper, we address these limitations by leveraging rubrics to provide both fine-grained reward signals and offline guidance. We propose RGR-GRPO (Reward and Guidance through Rubrics), a rubric-driven RL framework for multi-domain reasoning. RGR-GRPO enables LLMs to receive dense and informative rewards while exploring a larger solution space during GRPO training. Extensive experiments across 14 benchmarks spanning multiple domains demonstrate that RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance. Compared with verifiable online RL baseline, RGR-GRPO achieves average improvements of +7.0%, +5.4%, +8.4%, and +6.6% on mathematics, physics, chemistry, and general reasoning tasks, respectively. Notably, RGR-GRPO maintains stable entropy fluctuations during off-policy training and achieves superior pass@k performance, reflecting sustained exploration and effective breakthrough beyond existing performance bottlenecks.

Abstract:
Large Language Models (LLMs) have achieved significant success across a wide range of tasks, serving as the cognitive backbone for MultiAgent Systems (MAS) designed to orchestrate complex practical workflows. Given that MAS performance is highly sensitive to input prompts and many deployment scenarios preclude MAS architecture modifications, prompt optimization emerges as a critical strategy for performance enhancement. However, real-world deployment is impeded by three key challenges: (1) the need for high sample efficiency due to prohibitive evaluation costs, (2) topology-induced coupling among prompts, and (3) the combinatorial explosion of the search space. To address these challenges, we introduceMASPOB(Multi-AgentSystemPromptOptimization viaBandits), a novel sample-efficient framework based on bandits. By leveraging Upper Confidence Bound (UCB) to quantify uncertainty, the bandit framework balances exploration and exploitation, maximizing gains within a strictly limited budget. To handle topology-induced coupling, MASPOB integrates Graph Neural Networks (GNNs) to capture structural priors, learning topology-aware representations of prompt semantics. Furthermore, it employs coordinate ascent to decompose the optimization into univariate sub-problems, reducing search complexity from exponential to linear. Extensive experiments across diverse benchmarks demonstrate that MASPOB achieves state-of-the-art performance, consistently outperforming existing baselines.

Abstract:
While VisionLanguage-Action (VLA) models hold promise in embodied intelligence, their large parameter counts lead to substantial inference latency that hinders real-time manipulation, motivating parameter sparsification. However, as the environment evolves during VLA execution, the optimal sparsity patterns change accordingly. Static pruning lacks the adaptability required for environment dynamics, whereas fixed-interval dynamic layer pruning suffers from coarse granularity and high retraining overheads. To bridge this gap, we propose EcoVLA, a training-free, plug-and-play adaptive pruning framework that supports orthogonal combination with existing VLA acceleration methods. EcoVLA comprises two components: Environment-aware Adaptive Pruning (EAP) and Interleaved Inference Orchestration (I^2O). EAP is a lightweight adaptive channel pruning method that incorporates the temporal consistency of the physical environment to update sparsity patterns. I^2O leverages the FLOPs bubbles inherent in VLA inference to schedule the pruning method in parallel, ensuring negligible impact on latency. Evaluated on diverse VLA models and benchmarks, EcoVLA delivers state-of-the-art performance, achieving up to 1.60× speedup with only a 0.4% drop in success rate, and further reaches 2.18× speedup with only a 0.5% degradation when combined with token pruning. We further validate the effectiveness of EcoVLA on real-world robots.

Abstract:
Tailend risk measures such as static conditional value-at-risk (CVaR) are used in safety-critical applications to prevent rare, yet catastrophic events. Unlike risk-neutral objectives, the static CVaR of the return depends on entire trajectories without admitting a recursive Bellman decomposition in the underlying Markov decision process. A classical resolution relies on state augmentation with a continuous variable. However, unless restricted to a specialized class of admissible value functions, this formulation induces sparse rewards and degenerate fixed points. In this work, we propose a novel formulation of the static CVaR objective based on augmentation. Our alternative approach leads to a Bellman operator with: (1) dense per-step rewards; (2) contracting properties on the full space of bounded value functions. Building on this theoretical foundation, we develop risk-averse value iteration and model-free Q-learning algorithms that rely on discretized augmented states. We further provide convergence guarantees and approximation error bounds due to discretization. Empirical results demonstrate that our algorithms successfully learn CVaR-sensitive policies and achieve effective performance-safety trade-offs.

Abstract:
Finetuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduceTrojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.

Abstract:
Large language models excel at shorthorizon reasoning tasks, but performance drops as reasoning horizon lengths increase. Existing approaches to combat this rely on inference-time scaffolding or step-level supervision, neither of which scales easily. In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multi-step dependency chains of arbitrary length. We train models on this data using outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating. Empirically, our method generalizes remarkably well: curriculum training on composed 6th-grade level math problems (GSM8K) improves accuracy on longer, competition-level benchmarks (GSM-Symbolic, MATH-500, AIME). It also transfers significantly to diverse out-of-distribution ReasoningGym domains and long-context benchmarks, indicating broader generalization. Importantly, our long-horizon improvements are significantly higher than baselines even at high pass@k, showing that models can learn new reasoning paths under RL. Theoretically, we show that curriculum RL with outcome rewards could achieve an exponential improvement in sample complexity over full-horizon training, providing training signal comparable to dense supervision. h1 therefore introduces an efficient path towards scaling RL for long-horizon problems using only existing data.

Abstract:
The quadratic cost of attention limits the scalability of longcontext LLMs, especially under limited hardware memory budgets. While attention is often sparse, existing static sparse methods cannot adapt to task- or input-dependent variations, and recent dynamic approaches rely on predefined templates or heuristics that may sacrifice generality. We propose Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that predicts attention sparsity online while keeping the LLM backbone frozen. DHSA performs hierarchical routing by estimating importance at the chunk level and propagating it to token-level interactions, preserving causally important dependencies while enabling efficient sparsification. Across Needle-in-a-Haystack and LongBench, DHSA maintains near-dense accuracy in highly sparse regimes, achieving 12-20% relative accuracy gains over Block Sparse Attention at comparable prefill cost. With a memory-efficient tiled backend, DHSA delivers up to 10× prefill speedup at 128K context length. On LLaMA-3.1-8B (4-bit), DHSA scales to 100K context on a single 24GB GPU, where dense attention fails. We provide complementary GPU and CPU backends, enabling DHSA to run across diverse hardware environments and multiple open-weight model families. These results demonstrate DHSA as an efficient and adaptable solution for memory-constrained long-context LLM inference.

Abstract:
Correlation clustering is a classic approach for summarizing signed graphs, where the goal is to cluster the graph while minimizing positive intercluster edges plus negative intra-cluster edges. On complete signed graphs, correlation clustering is closely related to the bad triangle traversal (BTT) problem of finding the smallest number of edges that need to be removed such that the remaining graph does not have a bad triangle. Here, a bad triangle is a triangle with exactly one negative edge. A known result states that a feasible bad triangle cover F on a complete signed graph can be transformed into a correlation clustering with at most 2|F| mistakes. In this paper we improve this ratio to \frac32|F| mistakes using a pivot-based method. We also propose novel 2-approximations for BTT. Using a recent result on approximating the bad triangle cover LP, we obtain an (2+\epsilon) approximation in time almost equal to the time needed to find a maximal set of edge-disjoint bad triangles (which would give a standard 3-approximation). Additionally, several inapproximability results are provided. For general signed graphs, a better than 2-approximation is unlikely as our problem can be used to approximate vertex cover. For complete signed graphs, it is NP-hard to approximate with factor better than \frac21372136. This result also holds for several other related problems.

Abstract:
The population dynamics of molecules, cells, and organisms are governed by a number of unknown internal and external forces. In the last decade, population dynamics have predominately been modeled with Wasserstein gradient flows. However, since gradient flows minimize free energy, they fail to capture important dynamical properties, such as periodicity. In this work, we propose a change in perspective by considering population dynamics that minimize Wasserstein Lagrangian action, rather than free energy. As our main theoretical contributions, we derive the Hamiltonian equations of motion from the principle of least populationlevel action and we show that these mechanics encompass classical mechanics, quantum mechanics, and gradient flows. We further leverage the Hamiltonian perspective to propose an algorithm that learns the population mechanics from observed marginals, without specifying the Lagrangian. We demonstrate that by directly learning the population mechanics, our method forecasts and interpolates unseen marginals without a reference process, and outperforms gradient flow and flow matching methods across a wide range of real and simulated experiments.

Abstract:
While onpolicy distillation offers dense supervision for training small reasoning models, its optimization dynamics in the multimodal domain remain under-explored. In this work, we challenge the standard monolithic view of Vision-Language Model (VLM) distillation by mathematically decomposing the loss into two distinct components: the language prior and visual grounding. Our analysis uncovers that gradient vectors for these components are nearly orthogonal, indicating that the objective of aligning with the teacher's language distribution is geometrically independent from the objective of matching its visual perception. Consequently, standard optimization passively follows a suboptimal compromise trajectory that implicitly balances the two objectives. Hypothesizing that visual grounding constitutes the primary bottleneck for vision-language reasoning, we introduce Visual Gradient Steering (VGS), a method that dynamically reorients the update vector to prioritize the visual subspace. Experimental results on multiple distillation settings and complex multimodal benchmarks demonstrate that VGS significantly outperforms the standard monolithic formulation of on-policy distillation, achieving superior grounding with minimal training overhead. Code will be released.

Abstract:
Generative modeling within constrained sets is essential for scientific and engineering applications involving physical, geometric, or safety requirements (e.g., molecular generation, robotics). We present a unified framework for constrained diffusion models on generic nonconvex feasible sets \Sigma that simultaneously enforces equality and inequality constraints throughout the diffusion process. Our framework incorporates both overdamped and underdamped dynamics for forward and backward sampling. A key algorithmic innovation is a computationally efficient landing mechanism that replaces costly and often illdefined projections onto \Sigma, ensuring feasibility without iterative Newton solves or projection failures. By leveraging underdamped dynamics, we accelerate mixing toward the prior distribution, effectively alleviating the high simulation costs typically associated with constrained diffusion. Empirically, this approach reduces function evaluations and memory usage during both training and inference while preserving sample quality. On benchmarks featuring equality and mixed constraints, our method achieves comparable sample quality to state-of-the-art baselines while significantly reducing computational cost, providing a practical and scalable solution for diffusion on nonconvex feasible sets.

Abstract:
Large language models often hallucinate with high confidence on "random facts" that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous logloss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination: even with perfect training, perfect data, and a "closed world" assumption, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on synthetic data, showing that hallucinations persist as a natural consequence of lossy compression.

Abstract:
Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during posttraining. Across several different models, we find an “Assistant Axis" in their activation space, which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model’s tendency to identify as other entities. Measuring deviations along the Assistant Axis predicts “persona drift,” a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model’s processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios—and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.

Abstract:
Follow the regularized leader (FTRL) is the premier algorithm for online optimization. However, despite decades of research on its convergence in constrained optimizationand potential games in particular---its behavior remained hitherto poorly understood. In this paper, we establish that FTRL can take exponential time to converge to a Nash equilibrium in two-player potential games for any (permutation-invariant) regularizer and potentially vanishing learning rate. By known equivalences, this translates to an exponential lower bound for certain mirror descent counterparts, most notably multiplicative weights update. On the positive side, we establish the potential property for FTRL and obtain an exponential upper bound \exp(O_\epsilon(1/\epsilon^2)) for any no-regret dynamics executed in a lazy, alternating fashion, matching our lower bound up to factors in the exponent. Finally, in multi-player potential games, we show that fictitious play---the extreme version of FTRL---can take doubly exponential time to reach a Nash equilibrium. This constitutes an exponentially stronger lower bound for the foundational learning algorithm in games.

Abstract:
Transformerbased multimodal large language models often exhibit in-context learning (ICL) capabilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) can delay the onset of ICL circuits. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation.

Abstract:
Reasoning with LLMs increasingly unfolds inside a broader verification loop. Internally, systems use cheap checks, such as selfconsistency or proxy rewards, which we callweak verification. Externally, users inspect outputs and steer the model through feedback until results are trustworthy, which we callstrong verification. These signals differ sharply in cost and reliability: strong verification can establish trust but is resource-intensive, while weak verification is fast and scalable but noisy and imperfect. We formalize this tension throughweak-strong verification policies, which decide when to accept or reject based on weak verification and when to defer to strong verification. We introduce metrics capturing incorrect acceptance, incorrect rejection, and strong-verification frequency. Over population, we show that optimal policies admit a two-threshold structure and thatcalibrationandsharpnessgovern the value of weak verifiers. Building on this, we develop an online algorithm that provably controls acceptance and rejection errors without assumptions on the query stream, the language model, or the weak verifier. Experiments on mathematical reasoning and sequential decision-making demonstrate that our algorithm achieves reliability comparable to exhaustive strong verification while significantly reducing verification cost.

Abstract:
Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether posttraining truly extends a model’s reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL’s effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model’s edge of competence, tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL only, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.

Abstract:
We study the computational cost of differential privacy in terms of memory efficiency. While the tradeoff between accuracy and differential privacy is well-understood, the inherent cost of privacy regarding memory use remains largely unexplored. This paper establishes for the first time an unconditional space lower bound for user-level differential privacy by introducing a novel proof technique based on a multi-player communication game. Central to our approach, this game formally links the hardness of low-memory private algorithms to the necessity of ``contribution capping''---tracking and limiting the users who disproportionately impact the dataset. We demonstrate that winning this communication game requires transmitting information proportional to the number of over-active users, which translates directly to memory lower bounds. We apply this framework, as an example, to the fundamental problem of estimating the number of distinct elements in a stream and we prove that any private algorithm requires almost \widetilde\Omega(T^1/3) space to achieve certain error rates in a promise variant of the problem. This resolves an open problem in the literature (by Jain et al. and Cummings et al.) and establishes the first exponential separation between the space complexity of private algorithms and their non-private \widetildeO(1) counterparts for a natural statistical estimation task. Furthermore, we show that this communication-theoretic technique generalizes to broad classes of problems, yielding lower bounds for private medians, quantiles, and max-select.

Abstract:
Large language models (LLMs) must balance diversity and creativity against logical coherence in openended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring semantic geometry of the token space. We present Top-W , a geometry-aware truncation rule that uses Wasserstein distance—defined over token-embedding geometry—to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass–entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches achieving up to 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance, but also boosts creativity under judge-based open-ended evaluation. We will release all code upon acceptance.

Abstract:
Understanding the multidimensional attributes and intensity nuances of image-evoked emotions is pivotal for advancing machine empathy and empowering diverse human-computer interaction applications. However, existing models are still limited to coarse-grained emotion perception or deficient reasoning capabilities. To bridge this gap, we introduce EEmoDB, the largest image-evoked emotion understanding dataset to date. It features 5 analysis dimensions spanning 5 distinct task categories, facilitating comprehensive interpretation. Specifically, we compile 1.2M question-answering (QA) pairs (EEmoDB-QA) from 125k images via automated generation, alongside a 36k dataset (EEmoDB-Assess) curated from 25k images for fine-grained assessment. Furthermore, we propose EEmo-Logic, an all-in-one multimodal large language model (MLLM) developed via instruction fine-tuning and task-customized group relative preference optimization (GRPO) with novel reward design. Extensive experiments demonstrate that EEmo-Logic achieves robust performance in in-domain and cross-domain datasets, excelling in emotion QA and fine-grained assessment. The code is available at [https://anonymous.4open.science/r/EEmoLogic](https://anonymous.4open.science/r/EEmoLogic).

Abstract:
Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their ability to perceive perceptuallevel image features remains limited. In this work, we present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, Structure and Texture. We establish a hierarchical definition system and construct large-scale datasets to evaluate perceptual-level image understanding. Based on this foundation, we develop a strong baseline UniPercept trained via Domain-Adaptive Pre-Training and Task-Aligned RL, enabling robust generalization across both Visual Rating (VR) and Visual Question Answering (VQA) tasks. UniPercept outperforms existing MLLMs on perceptual-level image understanding and can serve as a plug-and-play reward model for text-to-image generation. This work defines perceptual-level image understanding in the era of MLLMs and, through the introduction of a comprehensive benchmark together with a strong baseline, provides a solid foundation for advancing perceptual-level multimodal image understanding.

Abstract:
Data selection studies the problem of identifying highquality subsets of training data. While some existing works have considered selecting the subset of data with top-m Data Shapley or other semivalues as they account for the interaction among every subset of data, other works argue that Data Shapley can sometimes perform ineffectively in practice and select subsets that are no better than random. This raises the questions: (I) Are there certain "Shapley-informative" settings where Data Shapley consistently works well? (II) Can we strategically utilize these settings to select high-quality subsets consistently and efficiently? In this paper, we propose a novel data selection framework, NASH (Non-linear Aggregation of SHapley-informative components), which (I) decomposes the target utility function (e.g., validation accuracy) into simpler, Shapley-informative component functions, and selects data by optimizing an objective that (II) aggregates these components non-linearly. We demonstrate that NASH substantially boosts the effectiveness of Shapley/semivalue-based data selection with minimal additional runtime cost.

Abstract:
We study linear contextual bandits under rare parameter updates: the learner may incorporate reward feedback into its parameter estimate only at a small number of update times, while still observing contexts online and selecting actions sequentially. This viewpoint clarifies a practical distinction that is often blurred in the literature: many "strictly batched" methods additionally restrict withininterval context adaptivity, meaning that the action rule inside an interval cannot depend on the sequence of realized contexts/actions in that interval (beyond the current round's context). For linear contextual bandits, we propose two practical algorithms with only O(\log\log T) parameter updates. Our first algorithm BLCE-G attains minimax-optimal regret (up to polylogarithmic factors in T) simultaneously in both the small-K and large-K regimes under a static schedule. Our second algorithm BLCE removes the near G-optimal design step---a dominant computational bottleneck in prior strictly batched static-grid methods---yet preserves minimax-optimal regret and achieves the lowest known runtime complexity among optimal algorithms. We further extend these rare-update and computational principles to generalized linear contextual bandits. Overall, our results yield statistically optimal algorithms under O(\log\log T) parameter updates that are also computationally efficient in practice.

Abstract:
While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in general video understanding, their capacity to interpret involuntary, and spatiotemporally evolving pathologic motor behaviors such as seizure semiology remains largely untested. To address this gap, we introduce Seizure-Semiology-Suite (S³), a clinically grounded dataset and benchmark for fine-grained, structured seizure semiology understanding. The dataset includes 438 seizure videos annotated with over 35,000 dense labels covering 20 ILAE-defined semiological features. Building on this dataset, we propose a seven-task hierarchical benchmark that systematically evaluates MLLMs from low-level visual perception to temporal sequencing, narrative report generation, and seizure diagnosis. To enable clinically meaningful evaluation of generated reports, we further introduce the Report Quality Index for Seizure Semiology (Seizure-RQI). Extensive baselines across 11 open-weight MLLMs reveal systematic weaknesses in laterality reasoning, temporal localization, symptom sequencing, and clinically faithful reporting. We show that seizure-specific fine-tuning substantially improves performance across tasks, and that a two-stage neuro-symbolic framework achieves an F1 score of 0.96 on epileptic versus non-epileptic seizure classification. Seizure-Semiology-Suite establishes a rigorous benchmark for evaluating multimodal models in safety-critical medical video understanding and guides the development of clinically reliable, domain-adaptive multimodal intelligence.

Abstract:
Scaling laws have played a major role in modern AI, providing predictive power over how model performance will improve with increasing resources. This has spurred intense interest in their origin, with a common suggestion being that they arise from power laws already present in the data. Here we study scaling laws for transformers trained to predict random walks on graphs with tunable complexity. We show that this simplified setting already yields scaling laws even in the absence of power laws in the data correlations. We further consider dialing down the complexity of language by training on sequences sampled from increasingly simplified generative language models, from 4,2,1layer transformer language models down to language bigrams, revealing a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdös-Renyi and scale-free Barabási-Albert ensembles. Finally, we revisit scaling laws for language modeling, demonstrating that several essential results can be reproduced using 2 layer transformers with context length of 100, demonstrate an alternative method for obtaining compute optimal curves, and provide preliminary evidence that maximal update parameterization may be more parameter efficient than standard parameterization.

Abstract:
The field of ComputerAided Design (CAD) generation has made significant progress in recent years. Existing methods typically fall into two separate categorie: parametric CAD modeling and direct boundary representation (B-Rep) synthesis. In modern feature-based CAD systems, parametric modeling and B-Rep are inherently intertwined, as advanced parametric operations (e.g.,filletandchamfer) require explicit selection of B-Rep geometric primitives, and the B-Rep itself is derived from parametric operations. Consequently, this paradigm gap remains a critical factor limiting AI-driven CAD modeling for complex industrial product design. This paper presentFutureCAD, a novel text-to-CAD framework that leverages large language models (LLMs) and a B-Rep grounding transformer (BRepGround) for high-fidelity CAD generation. Our method generates executable CadQuery scripts, and introduces a text-based query mechanism that enables the LLM to specify geometric selections via natural language, whichBRepGroundthen grounds to the target primitives. To train our framework, we construct a new dataset comprising real-world CAD models. For the LLM, we apply supervised fine-tuning (SFT) to establish fundamental CAD generation capabilities, followed by reinforcement learning (RL) to improve generalization. Experiments show thatFutureCADachieves state-of-the-art CAD generation performance.

Abstract:
In many applications, practical constraints prevent measuring covariates and outcomes on the same units, resulting in unpaired data. We study the problem of estimating causal effects under hidden confounding in the following unpaired data setting: we observe some covariates X and an outcome Y under different experimental conditions (environments) but do not observe them jointly we either observe X or Y. Under appropriate regularity conditions, the problem can be cast as an instrumental variable (IV) regression with the environment acting as a (possibly high-dimensional) instrument. When there are many environments but only a few observations per environment, standard two-sample IV estimators fail to be consistent. We propose a GMM-type estimator based on cross-fold sample splitting of the instrument–covariate sample that also applies in standard IV settings. We prove that it is consistent as the number of environments grows but the sample size per environment remains constant. We further extend the method to sparse causal effects via \ell_1-regularized estimation and post-selection refitting.

Abstract:
Sampling from unnormalized densities is analogous to the generative modeling problem, but the target distribution is defined by a known energy function instead of data samples. Evaluating the energy function is often costly, and thus a primary challenge is to learn an efficient sampler. We introduceFlow Sampling, a framework built on diffusion models and flow matching for the datafree setting. Our training objective is conditioned on a noise sample and regresses onto adenoisingdiffusion drift constructed from the energy function. In contrast, diffusion models' objective is conditioned on a data sample and regresses onto anoisingdiffusion drift. We utilize the interpolant process to minimize the number of energy function evaluations during training, resulting in an efficient and scalable method for sampling unnormalized densities. Furthermore, our formulation naturally extends to Riemannian manifolds, enabling diffusion-based sampling in geometries beyond the Euclidean space. We derive a closed-form formula for the conditional drift on constant curvature manifolds, including hyperspheres and hyperbolic spaces. We evaluate Flow Sampling on synthetic energy benchmarks, large-scale amortized molecular conformer generation, and distributions supported on the sphere, demonstrating strong empirical performance.

Abstract:
Recent studies have extended diffusionbased instruction-driven 2D image editing pipelines to 3D Gaussian Splatting (3DGS), enabling faithful manipulation of 3DGS assets and greatly advancing 3DGS content creation. However, it also exposes these assets to serious risks of unauthorized editing and malicious tampering. Although imperceptible adversarial perturbations against diffusion models have proven effective for protecting 2D images, applying them to 3DGS encounters two major challenges: view-generalizable protection and balancing invisibility with protection capability. In this work, we propose the first editing safeguard for 3DGS, termed AdLift, which prevents instruction-driven editing across arbitrary views and dimensions by lifting strictly bounded 2D adversarial perturbations into 3D Gaussian-represented safeguard. To ensure both adversarial perturbations effectiveness and invisibility, these safeguard Gaussians are progressively optimized across training views using a tailored Lifted PGD, which first conducts gradient truncation during back-propagation from the editing model at the rendered image and applies projected gradients to strictly constrain the image-level perturbation. Then, the resulting perturbation is backpropagated to the safeguard Gaussian parameters via an image-to-Gaussian fitting operation. We alternate between gradient truncation and image-to-Gaussian fitting, yielding consistent adversarial-based protection performance across different viewpoints and generalizes to novel views. Empirically, qualitative and quantitative results demonstrate that AdLift effectively protects against state-of-the-art instruction-driven 2D image and 3DGS editing.

Abstract:
Understanding when learning is statistically possible yet computationally hard is a central challenge in highdimensional statistics. In this work, we investigate this question in the context of single- and multi-index models, classes of functions widely studied as benchmarks to probe the ability of machine learning methods to discover features in high-dimensional data. Our main contribution is to show that a Noise Sensitivity Exponent (NSE)—a simple quantity determined by the activation function—governs the existence and magnitude of statistical-to-computational gaps within a broad regime of these models. We first establish that, in single-index models with large additive noise, the onset of a computational bottleneck is fully characterized by the NSE. We then demonstrate that the same exponent controls a statistical-computational gap in the specialization transition of large separable multi-index models, where individual components become learnable. Taken together, our results identify the NSE as a unifying property linking noise robustness, computational hardness, and feature specialization in high-dimensional learning.

Abstract:
With the rapid advancement of Large Language Models (LLMs) in code generation, humanAI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we termMiniApps. These applications require models to not only render visual interfaces but also construct customized interaction logic that adheres to real-world principles. However, existing benchmarks primarily focus on algorithmic correctness or static layout reconstruction, failing to capture the capabilities required for this new paradigm. To address this gap, we introduceMiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. Sourced from a real-world application with10M+generations, MiniAppBench distills 500 tasks across six domains (e.g., Games, Science, and Tools). Furthermore, to tackle the challenge of evaluating open-ended interactions where no single ground truth exists, we proposeMiniAppEval, an agentic evaluation framework. Leveraging browser automation, it performs human-like exploratory testing to systematically assess applications across three dimensions: Intention, Static, and Dynamic. Our experiments reveal that current LLMs still face significant challenges in generating high-quality MiniApps, while MiniAppEval demonstrates high alignment with human judgment, establishing a reliable standard for future research. Our code is available in https://anonymous.4open.science/r/MiniAppBench.

Abstract:
Lowrank recurrent neural networks (lrRNNs) are a class of models that uncover low-dimensional latent dynamics underlying neural population activity. Although their functional connectivity is low-rank, it lacks independence interpretations, making it difficult to assign distinct computational roles to different latent dimensions. To address this, we propose the Factored Recurrent Neural Network (FacRNN), a generative lrRNN framework that assumes group-wise independence among latent dynamics while allowing flexible within-group entanglement. These independent latent groups allow latent dynamics to evolve separately, but are internally rich for complex computation. We reformulate the lrRNN under a variational autoencoder (VAE) framework, enabling us to introduce a partial correlation penalty that encourages independence between groups of latent dimensions. Experiments on synthetic, monkey M1, and mouse voltage imaging data show that FacRNN consistently improves the disentanglement and interpretability of learned neural latent trajectories in low-dimensional space and low-rank connectivity over baseline lrRNNs that do not encourage group-wise independence.

Abstract:
Recent work has shown that scaling large language models (LLMs) improves their alignment with human brain activity, yet it remains unclear what drives these gains or which representational properties are responsible. Although larger models often yield better task performance and brain alignment, they are increasingly difficult to analyze mechanistically. This raises a fundamental question: \emphwhat is the minimal model capacity required to capture brainrelevant representations? To address this question, we systematically investigate how constraining model scale and numerical precision affects brain alignment. We compare full-precision LLMs, small language models (SLMs), and compressed variants (quantized and pruned) by predicting fMRI responses during naturalistic language comprehension. Across model families up to 14B parameters, we find that 3B SLMs achieve brain predictivity indistinguishable from larger LLMs, whereas 1B models degrade substantially, particularly in semantic language regions. Brain alignment is remarkably robust to compression: most quantization and pruning methods preserve neural predictivity, with GPTQ as a consistent exception. Linguistic probing reveals a dissociation between task performance and brain predictivity: compression degrades discourse, syntax, and morphology, yet brain predictivity remains largely unchanged. Overall, brain alignment saturates at modest model scales and is resilient to compression, challenging common assumptions about neural scaling and motivating compact models for brain-aligned language modeling.

Abstract:
Historydependent sampling can reduce long-run Monte Carlo variance by discouraging redundant revisits, but existing schemes typically encode history through empirical measure on finite state spaces, which is infeasible in high-dimensional discrete configuration spaces or ill-posed in continuous domains. We propose Score-Repellent Monte Carlo (SRMC) framework that summarizes trajectory history by a fixed, d-dimensional running average of score evaluations and converts it into a history-dependent surrogate target via an exponential score tilt. The resulting surrogate family is normalization-free in the standard MCMC sense, yielding a generic wrapper: at each iteration, any standard base kernel designed for the target \pi can be run on the current surrogate \pi_\theta_n while updating the history online. We analyze the coupled evolution of any estimator and the history recursion using stochastic approximation with controlled Markovian noise, establishing almost sure convergence and a joint central limit theorem. We identify regimes where the asymptotic covariance decreases as the repellence strength \alpha increases, exhibiting a scaling of O(1/\alpha), reproducing near-zero variance effect but now on general state spaces with constant memory. Empirical results across continuous targets and discrete energy-based models demonstrate that SRMC delivers notable improvements in estimator variance and in effectively covering modes in Gaussian mixtures, all while retaining O(d) memory usage and minimal overhead per iteration.

Abstract:
Reasoning over table images remains challenging for Large VisionLanguage Models (LVLMs) due to complex layouts and tightly coupled structure–content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question:how to adapt LVLMs to table reasoning with minimal annotation and no external tools?Specifically, we first introduce DiSCo, a Disentangled Structure–Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to tables structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLM's table understanding and reasoning capabilities, particularly generalizing to unseen table structures.

Abstract:
Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledgeintensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. To bridge this gap, we propose a novel meta-cognitive framework for reliable knowledge augmentation via differentiated intervention and alignment. Our approach leverages internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, guiding targeted knowledge expansion. Furthermore, we introduce a cognitive consistency mechanism to synchronize subjective certainty with objective accuracy, ensuring calibrated knowledge boundaries. Extensive experiments demonstrate the our framework consistently outperforms strong baselines, validating its rationality in not only enhancing knowledge capabilities but also fostering cognitive behaviors that better distinguish knowns from unknowns.

Abstract:
Recent progress in reasoning models suggests that generating plausible attempts for researchlevel mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution should contain enough method-level information that, when applied to a neighborhood of related questions, it should yield better downstream performance than incorrect solutions. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar in solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM-Judges, it also exhibits a larger solver–evaluator gap, maintaining stronger correct–wrong separation even on instances the underlying solver often fails to solve.

Abstract:
Multiagent debate (MAD) systems improve LLM reasoning through iterative deliberation, but remain vulnerable to debate collapse, a failure type where final agent decisions are compromised on erroneous reasoning. Existing methods lack principled mechanisms to detect or prevent such failures. To address this gap, we first propose a hierarchical metric that quantifies behavioral uncertainty at three levels: intra-agent (individual reasoning uncertainty), inter-agent (interactive uncertainty), and system-level (output uncertainty). Empirical analysis across several benchmarks reveals that our proposed uncertainty quantification reliably indicates system failures, which demonstrates the validity of using them as diagnostic metrics to indicate the system failure. Subsequently, we propose a mitigation strategy by formulating an uncertainty-driven policy optimization to penalize self-contradiction, peer conflict, and low-confidence outputs in a dynamic debating environment. Experiments demonstrate that our proposed uncertainty-driven mitigation reliably calibrates the multi-agent system by consistently improving decision accuracy while reducing system disagreement.

Abstract:
Large Language Models (LLMs) training is prohibitively expensive, driving interest in lowprecision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers, 2) OsciReset, an algorithm to suppress weight oscillation, and 3) OutControl, an algorithm to retain outlier accuracy. TetraJet-v2 outperforms prior methods on FP4 pre-training for LLMs across models up to 370M parameters trained up to 212B tokens, reducing the performance gap to BF16 by an average of 51.3% while enabling an 1.67× end-to-end speedup over FP8.

Abstract:
Sequence learning is dominated by Transformers and parallelizable recurrent neural networks such as statespace models, yet learning long-term dependencies remains challenging, and state-of-the-art designs trade power consumption for performance. The Bistable Memory Recurrent Unit (BMRU) was introduced to enable hardware–software co-design of ultra-low power RNNs: quantized states with hysteresis provide persistent memory while mapping directly to analog primitives. However, BMRU performance lags behind parallelizable RNNs on complex sequential tasks. In this paper, we identify gradient blocking during state updates as a key limitation and propose a cumulative update formulation that restores gradient flow while preserving persistent memory, creating skip-connections through time. This leads to the Cumulative Memory Recurrent Unit (CMRU) and its relaxed variant, the \alphaCMRU. Experiments show that the cumulative formulation dramatically improves convergence stability and reduces initialization sensitivity. The CMRU and \alphaCMRU match the performance of Linear Recurrent Units (LRUs) and minimal Gated Recurrent Units (minGRUs) on standard benchmarks at small model sizes, while the CMRU retains quantized states, persistent memory, and noise-resilient dynamics essential for analog implementation.

Abstract:
Causal effect estimation is a fundamental task in many scientific fields. Selecting appropriate covariates for adjustment is crucial for obtaining unbiased causal effects. However, most existing methods either rely on learning the global causal structure, assume the absence of latent variables, or impose the pretreatment assumptionrestricts covariates to those unaffected by the treatment or outcome. These assumptions are often unrealistic in real-world scenarios, and global structure learning can be computationally intensive and inefficient. To address these challenges, we first characterize the local existence boundary of adjustment sets for causal effect estimation. Based on this characterization, we develop a novel local learning method for covariate selection in nonparametric causal effect estimation. This method accommodates the presence of latent variables and eliminates the need for the pretreatment assumption. We prove that the proposed method is both sound and complete under standard assumptions. Its effectiveness is validated through extensive experiments on both synthetic and real-world datasets.

Abstract:
To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities directly into their policies via explicit CoT reasoning with reinforcement learning (RL). However, mere passive exploitation of reasoning on visited states is insufficient for sparsereward agentic tasks, as it lacks the epistemic drive to actively uncover theknown unknownrequired for robust generalization. We ask:Can VLM agents actively find signals that challenge and update their internal world model through curiosity-driven exploration?In this work, we proposeGLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially,GLANCEleverages the discrepancy between linguistic prediction and visual reality as an intrinsic curiosity signal within reinforcement learning, steering the agent to actively explore areas where its internal model is uncertain. Extensive experiments across a series of agentic tasks show the effectiveness ofGLANCE, and demonstrate that aligningwhat the agent thinkswithwhat the agent seesis key to solving complex or sparse agentic tasks.

Abstract:
Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty that distinguishes online RL from standard generative modeling is the lack of direct samples from the target Boltzmann distribution defined by the Qfunction. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which uses a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. However, it remains unclear how these objectives are formally related, or whether they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that share the same expectation. We show that existing noise-expectation and gradient-expectation methods are simply two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and it enables the principled combination of Q-value and Q-gradient information to form an effective estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.

Abstract:
DiracFrenkel instantaneous residual minimization evolves nonlinear parametrizations of PDE solutions in time, but ill-conditioning can render the parameter dynamics non-unique. We interpret this non-uniqueness as a gauge freedom: nullspace directions that leave the time derivative unchanged can be used to select better-conditioned parameter velocities. Building on Onsager's minimum-dissipation principle, we introduce a history variable---interpretable as momentum---and inject it only along the nullspace directions. The resulting Dirac-Frenkel-Onsager dynamics preserve instantaneous residual minimization, in contrast to standard regularization that can introduce bias, while promoting temporally smooth parameter evolution. Examples demonstrate that the approach leads to increased robustness in singular and near-singular regimes.

Abstract:
Speech Large Language Models (SLLMs) enable highlevel emotion reasoning, but often produce ungrounded, text-biased judgments without verifiable acoustic evidence. In contrast, SSL encoders such as WavLM yield strong acoustic representations yet remain opaque discriminative models that offer limited interpretability. To bridge this gap, we introduce the Agentic Decoding of Emotion via Probing Tools (ADEPT) framework, which reframes emotion recognition as a multi-turn inquiry process rather than a single-pass prediction. ADEPT transforms an SLLM into an agent that maintains an evolving candidate set and adaptively invokes dedicated semantic and acoustic probing tools within a structured pipeline of candidate generation, evidence collection, and adjudication. Crucially, ADEPT enables a paradigm shift from consensus learning to ambiguity-driven emotion reasoning. Since human affect exhibits complexity and co-occurrence of emotions, we leverage minority annotations as informative signals instead of discarding them as noise. Finally, we integrate Group Relative Policy Optimization (GRPO) with the Evidence Trust Gate to explicitly couple tool-usage behaviors with prediction quality and enforce evidence-based reasoning. Experiments demonstrate that ADEPT improves in most cases the primary emotion accuracy while substantially improving minor emotion characterization, producing explanations grounded in auditable evidence.

Abstract:
We introduce Deep Flow Networks (DFNs), a new class of discrete function approximators. DFNs are inspired by and generalize minimumcost flow value functions that map node imbalances on a subset of nodes to the optimal flow cost. Such functions are known to be M-convex (Murota2003) and admit efficient optimization. On the theoretical side, we prove that DFNs are universal approximators for discrete functions on \mathbbZ^d that admit convex extensions to \mathbbR^d, and characterize their optimization complexity in terms of their deviation from the M-convex regime. Guided by these results, we develop a practical DFN implementation for learning from data. Finally, we evaluate our implementation empirically on data from different ground-truth functions, showing that DFNs achieve strong approximation accuracy while being substantially faster to optimize than benchmark approaches.

Abstract:
Efficient sampling of molecular systems at thermodynamic equilibrium is a hallmark challenge in statistical physics. This challenge has driven the development of Boltzmann Generators (BGs), which allow rapid generation of uncorrelated equilibrium samples by combining a generative model with exact likelihoods and an importance sampling correction. However, modern BGs predominantly rely on Normalizing Flows (NFs), which either suffer from limited expressivity due to strict invertibility constraints (discrete time) or computationally expensive likelihoods (continuous time). In this paper, we propose Autoregressive Boltzmann Generators (ArBG), a novel autoregressive modelling framework that overcomes these limitations by departing from the flowbased BG paradigm. ArBG circumvents the topological constraints of flows and enables sequential inference-time interventions, while offering enhanced scalability by leveraging architectures effective in Large Language Models. We empirically demonstrate that ArBG leads to significant improvements over flow-based models across all benchmarks, but particularly in larger peptide systems such as the 10-residue Chignolin. Furthermore, we introduce Robin, a 132M parameter transferable model trained with the ArBG framework which improves over the previous state-of-the-art, reducing the zero-shot energy error, \mathcalE-\mathcalW_2, on 8-residue systems by ～ 60%.

Abstract:
Unified multimodal models aim to input and output both vision and language data within a single system. In this work, we explore the design space of Unified Multimodal Pretraining through a controlled, fromscratch study. We find that leveraging a single high-dimensional semantic encoder (e.g. SigLIP 2) achieves the best combined performance for both visual understanding and generation. Furthermore, we observe that integrating diverse visual data---including raw video and image-text pairs---has minimal impact on language capabilities, suggesting that vision and text are compatible within a single unified model. We identify positive synergy where joint pretraining enhances downstream capabilities such as Visual Question Answering (VQA) and World Modeling. Turning to architecture, we investigate Mixture-of-Experts (MoE) design choices, such as granularity and sparsity, to identify an effective training recipe. Finally, we quantify scaling dynamics via IsoFLOP analysis and uncover a scaling asymmetry: language scaling is parameter-hungry, while vision scaling is significantly more data-hungry. We demonstrate that MoE architectures help address this imbalance by decoupling total parameter capacity from active compute, enabling the high capacity required for language while also accommodating the data-intensive nature of vision.

Abstract:
Deploying robots in open‑ended real‑world environments demands continual learning capabilities to adapt to an everexpanding range of tasks. This requires retaining previously acquired skills without forgetting while effectively leveraging prior knowledge to learn new ones. Inspired by neuroscience, we propose Neuro-evolutionary Continual Reinforcement Learning (Nevo-CRL). Nevo-CRL maintains a fixed-capacity monolithic policy network, solving tasks by optimizing inter-layer connectivity and neuron parameter. For each new task, Nevo-CRL constructs a mask population to selectively activate the outputs of each hidden layer, thereby forming a task-specific policy population. Upon completing each task, the best-performing mask is stored, and its activated neurons are frozen to prevent catastrophic forgetting. To facilitate knowledge transfer, Nevo-CRL reuses neurons from acquired skills based on semantic similarity between tasks, while dynamically allocating additional neurons for task-specific adaptation. In the learning process, Nevo-CRL iteratively adjusts masks via importance-based crossover to optimize the policy network connectivity. To improve neuron utilization, we prune low-activity connections to recycle neurons. The experiments demonstrate that Nevo-CRL significantly outperforms existing continual RL methods and multi-task learning methods in terms of overall performance, forgetting reduction, generalization ability.

Abstract:
Metaphorical videos are prevalent across various realworld scenarios to convey complex ideas, and understanding them typically requires high-order cognitive capabilities. The lack of systematic studies on metaphorical video understanding not only constrains the real-world applicability of MLLMs but also impedes the thorough assessment of their high-order cognitive capabilities. To bridge this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Through experiments, we find current MLLMs struggle with accurate metaphorical video understanding, lagging far behind human level, primarily due to defective cross-domain mapping. Motivated by this finding, we construct a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework achieving consistent performance improvement. Our benchmark, analysis, and method provide useful insights and a foundation for future research on advancing MLLMs.

Abstract:
AI developers face a dualuse dilemma. The same capability that helps one user cure a disease can help another synthesize one. This dilemma could be resolved by access control, granting different users access to different AI capabilities. A gold standard for access control would be to serve models with different capabilities to different users. However, training and deploying multiple models is prohibitively expensive. We address this challenge by developing gradient-routed mixture-of-experts (GR-MoE), a pretraining method that selectively updates experts to induce specialization. Ablating an expert at inference time removes its capability, approximating a model trained on filtered data. We evaluate GR-MoE on synthetic stories and realistic dual-use data spanning biology, cybersecurity, nuclear physics, and code. On realistic data, GR-MoE preserves performance on retained capabilities while achieving 30% compute efficiency on forget capabilities. GR-MoE limits recovery more effectively than post-hoc unlearning and preserves capabilities better than LoRA. GR-MoE's advantages improve when scaled from 48M to 2B parameters, approaching multiple data filtered models in a single training run.

Abstract:
Nonmonotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive modeling by allowing tokens to be generated in non-fixed and prescribed orders. Despite their practical advantages, most existing non-monotonic models are order-agnostic and rely on a fixed-length masked token grid, limiting their ability to support variable-length generation and adaptive insertion order. In this work, we introduce a probabilistic framework for learning insertion order in variable-length insertion models. We formalize a bijective correspondence between insertion trajectories and permutations, which enables an exact reparameterization of the data likelihood as a sum over permutations. Building on this result, we propose theInsertion Process (IP), a stochastic generative model that jointly learnswhereto insert,whatto insert, andwhento terminate, trained via permutation-based variational inference. Unlike prior masked or fixed-canvas approaches, IP natively supports variable-length generation and learns data-driven preferences over insertion orders. Experiments on planning benchmarks and molecular SMILES generation demonstrate that learning insertion order improves both modeling quality and generalization in domains without a canonical left-to-right structure.

Abstract:
We introduce FlowOptimizer, a deep unfolded, flowbased framework for learned iterative optimization. Motivated by the expressiveness of flow models, we represent each optimization iteration via a velocity field that operates on a population of candidate solutions, i.e., a set of parallel iterates, conditioned on contextual information including their objective values and gradients, as well as population-level statistics. The velocity field is initially trained in a simulation-free manner by matching displacements from source populations to improved target ones obtained through sampling the objective. Subsequently, we unfold this velocity field as the internal iteration of an optimization sequence, and fine-tune it in an end-to-end manner by directly optimizing objective values over a targeted class of problems. Notably, FlowOptimizer is a self-supervised framework whose training relies solely on objective evaluations without requiring knowledge of solutions. We evaluate our approach on a series of tasks from standard non-convex optimization benchmarks to real-world problems from supply chain, robotics and power grid applications. FlowOptimizer consistently outperforms well-established sampling-based/gradient-based traditional optimization and learning-to-optimize methods, often by orders of magnitude in terms of solution quality. We further highlight its ability to be trained on low-dimensional problems and successfully generalize to substantially higher-dimensional (× 10) ones.

Abstract:
Treatment responder classification seeks to learn a rule to classify individuals who will benefit from the treatment. This paper studies a new scenario in treatment responder classification when abstention is allowed, i.e., practitioners can opt out of making uncertain classification on some individuals for further investigation. By revealing the implicit relation between causal misclassification risk with abstention and Conditional Value at Risk (CVaR), we develop a doubly robust method named TRECA to learn the classification rule under loose convergence conditions on nuisance parameters, and further extend it to deal with possible violation on key assumptions such as monotonicity and unconfoundedness. Rigorous theories and extensive experiments on two realworld datasets demonstrate the theoretical and experimental guarantee on our methods in learning treatment responders classification rules with low regret at the cost of limited abstention.

Abstract:
Many realworld decisions require prioritizing high-risk cases, such as clinicians prioritizing high-risk patients before lower-risk ones. Falling rule lists (FRLs), which are ordered if--then rules with monotonically decreasing risks, provide an interpretable framework for such tasks; however, their single-path structure yields a highly restricted model class. We introduce falling trees, a new family of interpretable models that enforces the same monotonic risk constraint while permitting tree-structured branching. We present GraviTree, a novel dynamic-programming-with-bounds algorithm for learning the Rashomon set of falling trees under depth and branching constraints, together with bounds that use the falling constraint to provably reduce the search space. Our formulation can interpolate between rule lists and full decision trees, enabling user-desired model expressivity. Across clinical and public-risk datasets, falling trees match or outperform FRLs and other interpretable baselines, often producing lower-sparsity decisions for high-risk instances. Our results show that falling trees strike a practical balance between interpretability, expressiveness, and risk prioritization for high-stakes settings.

Abstract:
Prediction algorithms are increasingly used to inform decisions about humans, but maximizing accuracy—the standard learning objective—is not necessarily optimal for this purpose. Instead, we propose optimizing social welfare, defined as the average gain users receive from correct predictions. Welfare enables to express, and therefore account for, heterogeneity in how much users benefit from accuracy. But since these valuations are private and users can benefit from overreporting them, learning must simultaneously elicit truthful values and optimize welfare with respect to them. To this end, we propose a novel learning algorithm that incorporates a truthful auction. We show how to compute allocations and prices efficiently, and bound the number of paying users—which surprisingly is independent of the sample size. We conclude with experiments on real and synthetic data that demonstrate our algorithm and explore the connections between welfare and accuracy.

Abstract:
In robotic imitation learning, accurately modeling the multimodality and temporal correlations of longhorizon action sequences remains challenging. Long-horizon tasks require preserving global task intent while executing precise low-level control; otherwise, local errors can accumulate and lead to failure. While recent coarse-to-fine autoregressive models have improved action generation, they struggle to maintain consistency across hierarchies, leading to suboptimal performance in long-horizon tasks. To address these shortcomings, we propose Probabilistically Consistent Multi-Scale Action Generation (MSP), a novel coarse-to-fine approach that promotes cross-scale consistency. MSP adopts a streamlined multi-scale design by directly downsampling in a continuous latent space. A scale-wise autoregressive Transformer is used to generate semantic conditions at each scale, which guide a lightweight MeanFlow model to capture multi-scale latent distributions, enabling probabilistically consistent refinement across scales. Through extensive simulation and real-world experiments, including long-horizon, multi-task, and few-shot generalization settings, we show that MSP outperforms existing coarse-to-fine methods, achieving state-of-the-art performance with high efficiency. Our code will be publicly available upon publication.

Abstract:
Modeling graphs demands a careful balance between longrange propagation of information across nodes and the controlled dissipation of noisy or redundant signals to ensure stable learning and generalization. This challenge is exacerbated in dynamic graphs, where structural and temporal information interact, leading to uncontrolled information accumulation and amplifying noise, thereby affecting generalization. We introduce LAMP, a dynamic graph model for snapshot-based dynamic graphs that incorporates adaptive, learned dissipation within a principled dynamical systems framework. Our architecture combines impulsive neural ODEs with antisymmetric parameterization to model conservative information flow, alongside data-driven dissipative dynamics that regulate information retention over space and time. This formulation yields stable yet expressive representations and enables effective long-range dependency modeling while avoiding pathological information buildup. We provide a theoretical analysis establishing stability guarantees and characterizing the representational power. Extensive experiments on synthetic and real-world benchmarks demonstrate state-of-the-art performance, particularly on tasks requiring extended-range dependency modeling.

Abstract:
The increasing situational awareness of language models raises safety concerns: models might be aware when they are evaluated, and adjust their behavior to evade monitoring and resist modification, e.g., pretending to be aligned only in evaluation. This \emphalignment faking behavior is often interpreted as scheming: an intentional effort of strategic deception. In this paper, we examine an alternative interpretation, \emphperformative misalignment, which explains the change in behavior as a result of \emphsycophancy towards AI researchers. To back up this hypothesis, we present three empirical findings. First, we show that evaluation awareness persists even when we tell models they are deployed, which contradicts the scheming story which predicts less misalignment when the model perceives evaluation. Second, we use probing and steering to show that our current methods cannot mechanistically distinguish sycophancy and scheming in alignment faking evaluations. Third, we finetune models to be more sycophantic and observe increased sensitivity to evaluation cues. To conclude, we emphasize deconfounding sycophancy from scheming for future work on evaluations and mitigations of intent misalignment.

Abstract:
Embedding translation enables interoperability across embedding models, allowing embedding vectors to be reused without costly reembedding. However, existing methods are typically evaluated under simplified pairwise and i.i.d. settings and behave as black boxes at inference time, leading to unreliable performance under out-of-distribution (OOD) inputs, multi-model mixing, and composed translations. We analyze embedding translation from a geometric perspective and derive an interpretable error bound that explains systematic error amplification under OOD inputs, mixing and chaining. Building on this, we propose a geometry-aware confidence metric and a Hierarchical Mixture of Experts (HMoE) framework with localized, parameter-efficient adaptation. Following the MTEB leaderboard, we conduct large-scale experiments over 10 embedding models and 6 datasets across 90 pairwise translation settings. HMoE outperforms every baseline for every model pair over every dataset under OOD scenarios. Furthermore, multi-model mixing and chaining only degrade our performance in Recall@100 by 0.5% -- 2.6%, compared to 7.2% -- 92.3% recall drop by existing methods.

Abstract:
Multimodal federated learning (MFL) has emerged as a pivotal paradigm for leveraging distributed data to enhance model performance. However, existing methods predominantly rely on idealized assumptions of model homogeneity and balanced modality distributions, rendering them illsuited for practical scenarios characterized by heterogeneous client architectures and severe modality imbalance. To address these challenges, we propose a Multimodal Federated learning Prototype-guided Bilateral Alignment (MFedPBA) framework. MFedPBA facilitates robust knowledge synergy through a dual alignment mechanism: (i) at the feature level, it aligns heterogeneous feature spaces via a projection encoder optimized by contrastive learning and the Gromov-Wasserstein distance; (ii) at the decision level, it employs an entropy-weighted aggregation of naturally aligned logit prototypes. This novel design achieves robust MFL by jointly tackling heterogeneous feature spaces and collectively aggregating decisions. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines under conditions of model heterogeneity and modality imbalance.

Abstract:
Understanding gradient descent dynamics is key to explaining the success of overparameterized models, where implicit bias manifests through conservation laws in gradient flow. While such laws are well understood for linear and ReLU networks, they remain largely unexplored for modern architectures. This work develops a unified framework to characterize conservation laws for contemporary models, including feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention with sinusoidal and rotary positional encodings, and Mixture-of-Experts architectures under diverse gating designs. Our theoretical findings are supported by experiments that validate the predicted invariants.

Abstract:
Multimodal large language models (MLLMs) have recently made strong progress in visionlanguage reasoning, yet their performance often degrades as generations grow longer. A key factor is that they frequently lose track of earlier visual evidence and intermediate constraints under a monolithic growing context. Inspired by how humans separately recall what they see and what they infer when solving complex tasks, we propose DLMR, a parameter-efficient mechanism that equips MLLMs with dual latent memories: a visual memory that compresses image evidence and a reasoning memory that tracks intermediate conclusions and constraints. A router then dynamically decides which memory and how much to reuse during inference, preserving visual grounding while maintaining coherent long-horizon reasoning. DLMR is trained in three stages from latent memory construction to selective router learning while keeping the base MLLM frozen, yielding substantial gains on both general and reasoning benchmarks with only a small number of additional trainable parameters. Further analyses reveal interpretable, state-dependent routing in which the visual and reasoning memories specialize as intended, and demonstrate that this design reduces redundant decoding and improves token efficiency over long generations.

Abstract:
Learningaugmented paging has been extensively studied in recent years. A key advantage over naive ML-based approaches is \emphbounded robustness, which guarantees worst-case performance even when predictions are inaccurate, making these algorithms valuable for real-world systems. Prior work achieves robustness bounds of 2H_k + O(1) in the randomized setting, leaving a gap to the optimal competitive ratio H_k. We are the first to study how to close this gap. In this paper, we begin by analyzing online optimality and provide a new proof of the latest H_k-competitive algorithm, which facilitates analysis in the learning-augmented setting. Then, we review existing learning-augmented paging algorithms and introduce a unifying primitive, the \emphrelative prediction budget, which captures the essence of how to establish robustness and reveals that prior algorithms either overuse or underutilize predictions. Guided by the above analysis, we develop a new framework that achieves the best-possible robustness for learning-augmented paging: H_k + O(1). Experiments further demonstrate strong practical performance.

Abstract:
Although transformers are most commonly optimized as statistical sequence models, it is unclear to what extent they can implement and learn exact algorithmic computations. Here, we specify a transformer implementation from first principles that executes a fundamental and widely used method for kmeans clustering: Lloyd's algorithm. We theoretically prove and empirically demonstrate that this implementation of a transformer architecture, which we term the _k-means transformer_, exactly implements Lloyd's algorithm for k-means clustering using the standard circuit mechanisms of modern transformers: attention block, residual connections, and feed-forward block. In learning experiments, we find that training this base architecture on k-means clustering yields a generalizable clustering algorithm that surpasses Lloyd's algorithm in terms of clustering quality. Finally, we demonstrate that interpretable alterations (e.g., inclusion of layer normalizations) to this architecture yields diverse and novel variants of clustering algorithms, including soft k-means, spherical k-means, trimmed k-means. Overall, our results show that transformer circuit mechanisms can instantiate exact algorithmic routines for clustering, while simultaneously providing an effective learnable model.

Abstract:
Large language models can exhibit emergent reasoning behaviors, often manifested as recurring lexical patterns (e.g., “wait,” indicating verification). However, complex reasoning trajectories remain sparse in unconstrained sampling, and standard RL often fails to guarantee the acquisition of diverse reasoning behaviors. We propose a systematic discovery and reinforcement of diverse reasoning patterns through structured reasoning, a paradigm that requires targeted exploration of specific reasoning patterns during the RL process. To this end, we propose CtrlR, a framework for learning structured reasoning via tractable trajectory control that actively guides the rollout process, incentivizing the exploration of diverse reasoning patterns that are critical for complex problem-solving. The resulting behavior policy enables accurate importance-sampling estimation, supporting unbiased on-policy optimization. We further introduce a power-scaling factor on the importance-sampling weights, allowing the policy to selectively learn from exploratory, out-of-distribution trajectories while maintaining stable optimization. Experiments demonstrate that Ctrl-R enables effective exploration and internalization of previously unattainable reasoning patterns, yielding consistent improvements across language and vision–language models on mathematical reasoning tasks.

Abstract:
We establish a fundamental connection between optimal structure learning and optimal conditional independence testing by showing that the minimax optimal rate for structure learning problems is determined by the minimax rate for conditional independence testing in these problems. This is accomplished by establishing a general reduction between these two problems in the case of polyforests, and demonstrated by deriving optimal rates for several examples, including Bernoulli, Gaussian and nonparametric models. Furthermore, we show that the optimal algorithm in these settings is a suitable modification of the PC algorithm. This theoretical finding provides a unified framework for analyzing the statistical complexity of structure learning through the lens of minimax testing.

Abstract:
The forward pass of a Transformer can be seen as an interacting particle system on the unit sphere: time plays the role of layers, particles that of token embeddings, and the unit sphere idealizes layer normalization. In some weight settings the system can even be seen as a gradient flow for an explicit energy, and one can make sense of the infinite context length (\emphmeanfield) limit thanks to Wasserstein gradient flows. In this paper we study the effect of the perceptron block in this setting, and show that critical points are generically atomic and localized on subsets of the sphere.

Abstract:
Learning actionvalues efficiently is central to reinforcement learning (RL), as they underpin many control algorithms such as Q-learning. However, action-value learning can be slow, requiring many updates to move values from their initialization, typically near zero, to their true values, which may be far from zero. Moreover, action-value learning algorithms typically update each state–action pair independently, without learning shared value structure across actions within a state. In this paper, we address these inefficiencies by introducing the mean-expansion transformation, which accelerates action-value learning by sharing values across actions within a state and by changing the problem from directly learning potentially large action-values to learning a lower-norm representation of them. In deep RL, this transformation can be applied as a parameter-free modification to Q-network architectures without altering the underlying algorithm. Empirically, we show that it improves DQN's performance in aggregate across 57 Atari games while increasing action gaps and dramatically reducing value overestimation.

Abstract:
The DeepWalk algorithm has been widely used for learning node embeddings in graphs. Combined with the idea ofnegative sampling, the DeepWalk algorithm has been shown to be implementable at scale, easily handling graphs with millions of nodes. However, theoretical guarantees on the resulting embeddings are much less understood. Recent results have studied the minimizers of the objective and have shown interesting guarantees for certain graph classes. However, the optimizationtrajectory, i.e., what happens when we start at a random initialization and run gradient descent, remains poorly understood. This is especially true for the implementation of DeepWalk using Skipgram with negative sampling (SGNS), since the variance of the stochastic updates turns out to be very large. In this work, we make progress on this question. We show that for "small norm" initialization, under a spectral gap assumption on the graph, the DeepWalk embeddings align with the column space of a fixed low-rank matrix. For graphs generated from Stochastic Block Models with certain separation conditions, our results imply that the DeepWalk embeddings recover cluster structure. To the best of our knowledge, our results give the first analysis of the optimization trajectory of DeepWalk with negative sampling on non-trivial graph classes.

Abstract:
Spiking Neural Networks (SNNs) that leverage sparse binary spikes and temporal dynamics have emerged as energyefficient alternatives to Artificial Neural Networks (ANNs). However, SNNs suffer from limited representational capacity due to the discrete nature of spikes. Existing solutions extending spike levels often overlook the constraints of the simulation time window, leading to a critical issue we identify as spike saturation-induced information homogenization. In this phenomenon, distinct high-amplitude inputs result in identical maximized spike counts, truncating the dynamic range and hindering the model’s ability to capture fine-grained semantic differences. To address this, we propose SmoothSpike, a novel method designed to enhance representational capacity by suppressing spike saturation. We first introduce a randomized Hadamard transformation to smooth neuronal inputs, theoretically proving its efficacy in constraining extreme values and reducing both saturation probability and input variability among saturated neurons. To further improve adaptability, we evolve this into a learnable orthogonal transformation. Initialized with Hadamard matrices and maintained orthogonal via Newton-Schulz iteration, this module dynamically adapts to varying input distributions during training. Extensive experiments on language modeling tasks show that SmoothSpike effectively mitigates the information homogenization problem and improves task performance. This positions SmoothSpike as a robust solution to bridge the performance gap between SNNs and ANNs.

Abstract:
SourceFree Domain Adaptation (SFDA) enables model adaptation under distribution shifts without access to source data, providing a practical solution for privacy-sensitive applications and having shown substantial progress in classification. In contrast, regression involves ordered and continuous target variables, posing unique challenges for representation adaptation and pseudo-label refinement in the SFDA setting. To address this gap, we propose a novel algorithm for continuous target prediction in SFDA that leverages instance-dependent, discretized density–informed supervisory signals to refine pseudo-labels within an uncertainty-aware paradigm. By incorporating auxiliary discretized distribution learning, our method also promotes more compact and structured feature representations, mitigating the inherent difficulties of adapting regression models under distribution shift. We theoretically demonstrate that the resulting density structure is robust to potential perturbations, supporting reliable SFDA for regression. Extensive experiments across multiple benchmarks validate the effectiveness of the proposed approach.

Abstract:
Deep MultiView Clustering (MVC) aims to extract a unified semantic consensus from diverse data sources without supervision. However, current approaches relying on flat Euclidean embeddings often fail to model data uncertainty, resulting in rigid alignment where high-quality views are forced to drift toward corrupted ones. To address these challenges, we propose the Hyperbolic Asymmetric Multi-view Clustering (HAMC) framework. By embedding features into the Poincaré ball model, HAMC leverages the exponential volume growth of hyperbolic geometry to optimize cluster separability. It pushes high-confidence representations toward the boundary while retaining noisy ones near the origin. To mitigate noise, we introduce an asymmetric view alignment mechanism, enabling reliable views to unidirectionally guide unreliable ones. Furthermore, a consensus-aware cluster learning strategy is designed to construct robust global pseudo-labels via a confidence-based screening scheme, refining the cluster structure. Extensive experiments against 13 baselines demonstrate that HAMC achieves state-of-the-art performance.

Abstract:
We revisit the problem of sequentially testing the mean of bounded distributions in a level\alpha power-one framework. We study a \mathrmKL_inf-based sequential test that is known to attain the information-theoretic lower bound on the expected stopping time with exact constants as \alpha \to 0. Going beyond first-order asymptotics, we establish a central limit theorem (CLT) for the stopping time of this test. Our analysis proceeds in two steps. First, we prove a novel CLT for the \mathrmKL_inf statistic itself, characterizing its fluctuations around its deterministic linear growth. We then leverage this result to show that the stopping time, centered appropriately, and scaled by \sqrt\log(1/\alpha), converges in distribution to a Gaussian limit with an explicit variance. This yields a second-order characterization of an asymptotically optimal sequential test for bounded distributions. Finally, we present numerical experiments that corroborate our theoretical findings.

Abstract:
Multimodal large language models introduce attack surfaces absent in unimodal systems: adversaries can distribute malicious intent across modalities to evade unimodal safeguards. This motivates using crossmodal consistency as a detection signal rather than inspecting each modality in isolation. Our key observation is that benign inputs induce compatible predictive behavior from text-only and vision-only reasoning that stabilizes when fused, whereas adversarial manipulation disrupts this consistency, causing abnormal multimodal behavior. Existing defenses that examine raw inputs or outputs overlook this internal fusion process, rendering them brittle and computationally expensive. We propose FlowGuard, a lightweight inference-time framework that detects harmful inputs by monitoring internal multimodal consistency. Unlike approaches that rely on scalar confidence metrics, FlowGuard derives FlowVectors inspired by Partial Information Decomposition that quantify cross-modal redundancy, synergy, and modality-specific dominance, capturing whether multimodal fusion aligns with unimodal semantic evidencebetween unimodal and fused multimodal output distributions. In a one-class classification problem trained solely on benign data, FlowGuard reduces Attack Success Rates from >90% to <15% on unseen attacks, with <3% utility loss and up to a 6× latency reduction. Our results demonstrate that monitoring cross-modal consistency offers an efficient and effective defense for multimodal reasoning.

Abstract:
Time Series Foundation Models (TSFMs) leverage extensive pretraining to accurately predict unseen time series during inference, without the need for taskspecific fine-tuning. Through large-scale evaluations on standard benchmarks, we find that leading transformer-based TSFMs exhibit redundant components in their intermediate layers. We introduce a set of tools for mechanistic interpretability of TSFMs, including ablations of specific components and direct logit attribution on the residual stream. Our findings are consistent across several leading TSFMs with diverse architectures, and across a diverse set of real-world and synthetic time-series datasets. We discover that all models in our study are robust to ablations of entire layers. Furthermore, we develop a theoretical framework framing transformers as kernel regressors, motivating a purely intrinsic strategy for ablating heads based on the stable rank of the per-head projection matrices. Using this approach, we uncover the specific heads responsible for degenerate phenomena widely observed in TSFMs, such as parroting of motifs from the context and seasonality bias. Our study sheds light on the universal properties of this emerging class of architectures for continuous-time sequence modeling.

Abstract:
Large language model (LLM) agents increasingly perform commercial tasks that involve retrieving external content such as images and, where appropriate, reproducing that content. LLM agents should comply with the law, including the laws of copyright. Yet today we lack adequate tools to assess whether they do so. To that end, we introduce CopyrightBench, a benchmark designed to evaluate copyright law compliance of LLM agents. Copyright-Bench is comprised of realistic commercial tasks---website development, merchandise design, and corporate content production---that involve agents selecting between freely licensed content (the use of which is legal) and copyrighted content (the use of which is illegal at least in this setting). Notably, the evaluation introduces prompt variations that simulate different levels of user intent and time pressure. Comparing state-of-the-art agents against a human baseline, we find that: (1) LLM agents take actions that violate copyright law despite the availability of lawful alternatives; and (2) violation rates increase in response to user intent and under simulated time pressure.

Abstract:
The prohibitive memory footprint of the KeyValue (KV) cache imposes a critical bottleneck for efficient long-context LLM serving. Current compression techniques typically rely on static or uniform budget allocation, overlooking the significant heterogeneity in information density across attention heads. To address this, we introduce \textscEntroKV, an entropy-driven dynamic budget allocation framework. Our method enables dynamic and rational allocation across layers, attention heads, and different tasks. We demonstrate that attention entropy serves as a robust proxy for compression sensitivity: heads with high entropy require larger retention budgets, whereas low-entropy heads can be aggressively compressed without accuracy degradation. Functioning as a lightweight, plug-and-play module, \textscEntroKV optimizes budget scheduling in real-time and is compatible with diverse compression operators. Extensive experiments demonstrate that \textscEntroKV consistently outperforms baselines, retaining ～98% of full-cache performance at a 30% budget ratio with negligible computational overhead. Our code is available at \urlhttps://anonymous.4open.science/r/EntroKV-D0C8/.

Abstract:
We identify a fundamental tension in securing LLMs: the securityfidelity tradeoff. While defenses against indirect prompt injection are becoming more robust, we show that they inevitably impair the model's ability to process benign, instruction-like text. Current evaluations miss this cost because they conflate utility with fidelity. We address this gap with \textscSecFid, a benchmark that uses behaviorally separable probes to unambiguously distinguish between resisting an attack, succumbing to it, and faithfully processing it as data. Our evaluation reveals this tradeoff across a diverse set of models and highlights how the strongest defenses achieve security often by aggressively suppressing valid content, causing fidelity failure rates up to 50% on translation. We ground these results in a decision-theoretic framework, proving that when benign and adversarial inputs overlap, no universal defense exists. Therefore, optimal robustness is strictly task-dependent, determined by an application’s tolerance for fidelity errors versus security failures.

Abstract:
Graph neural networks (GNNs) are widely used, but how parameter sparsity affects the expressivity of relational (RGNNs) and temporal (TGNNs) variants is poorly understood. The Strong Expressive Lottery Ticket Hypothesis (SELTH) posits the existence of sparse GNNs that preserve WeisfeilerLeman (WL) expressivity on static graphs. We generalize this existence result to a probabilistic statement for multi-relational and temporal domains via the relational WL (RWL). We prove that sufficiently parameterized RGNNs contain sparse subnetworks that maintain 1-RWL expressivity and derive a lower bound on the probability that a random pruning yields such a subnetwork. We show that common TGNNs and cross-graph message passing schemes admit RGNN reformulations such that they inherit these guarantees and, moreover, that the expressivity of a sparse RGNN is connected to its optimization behavior under common update regimes. Experiments instantiate the bound, compare it to empirical probabilities on synthetic data, and study how pre-training expressivity relates to optimization and prediction quality metrics on temporal and molecular benchmarks.

Abstract:
Mixtureof-Experts (MoE) models scale compute efficiently, yet they remain expensive to deploy due to substantial memory footprint and inference overhead. Prior methods mainly operate at the expert level, either removing whole experts or ranking experts by importance. However, such expert-wise decisions are too coarse to identify redundancy, and often misallocate pruning budgets and limits compression. This issue worsens in large MoEs with dynamic routing and heterogeneous experts. To alleviate this dilemma, we for the first time observe that information in MoE experts is highly concentrated in a few channels, leaving substantial redundancy even in "high importance" experts. Accordingly, we propose a structural pruning framework tailored for MoEs, reforming the prune-ratio objective to maximizing channel-score coverage via an efficient attribution-based approximation. Experiments on DeepSeek and Qwen MoEs retain accuracy under 50% or 25% pruning joinly with 4-bit quantization, reducing the memory footprint of Qwen3-30B-A3B by 5.27×, and outperforming state-of-the-art baselines under diverse benchmarks.

Abstract:
Computational protein design typically employs a sequential workflow of structure generation followed by sequence (re)design. While structure generators can be explicitly conditioned on functional objectives, inverse folding models are constrained by their functionagnostic nature and sequence-structure degeneracy. More critically, the associated training objectives do not account for theBest-of-N(BoN) inference protocol, resulting in a fundamental training-inference misalignment. Here, we propose FIDIA, a reinforcement learning framework that enablesFunction-Informed sequenceDesign viaInference-Aligned policy optimization. Specifically, FIDIA integrates functional constraints into composite rewards and explicitly optimize the induced policy under BoN toward high-fitness sequence regions. We achieve this via a grounded gradient estimator that directly maximizes the expected maximum reward. FIDIA consistently outperforms both standard and RL-optimized baselines in success rate and precision on a general motif scaffolding benchmark. Further experiments on realworld cases including vaccine and affinity-enhancing enzyme design validate FIDIA’s efficacy in complex therapeutic and biocatalytic contexts.

Abstract:
Offline metareinforcement learning (meta-RL) aims to train agents that can generalize to unseen tasks using pre-collected data from related tasks. Recent approaches leverage the scalability of transformer architectures to model behavior sequences and support task adaptation using target task demonstrations. However, such data is often unavailable in real-world settings, where the task objective may be known but cannot be easily demonstrated. In contrast, humans routinely interpret and perform new tasks based solely on natural language instructions. In this work, we explore the potential of using natural language task descriptions to enable zero-shot task adaptation in offline meta-RL without requiring any data from the target task. We propose the Text-Guided Decision Transformer (TG-DT), a framework that enables zero-shot generalization by grounding policy learning in natural language. TG-DT learns a shared embedding space between task descriptions and behavioral trajectories via a dual contrastive and matching-based objective, ensuring robust alignment. A transformer-based policy is then conditioned on these aligned representations to generate task-appropriate actions. At test time, TG-DT synthesizes policies for unseen tasks using only their text descriptions and can optionally leverage a description-guided data sharing strategy to enhance adaptation. Experiments on standard offline meta-RL benchmarks, including MuJoCo and Meta-World, demonstrate that TG-DT achieves strong generalization to unseen tasks.

Abstract:
Simultaneous Localization and Mapping (SLAM) is increasingly expected to provide reusable spatial representations for downstream perception. However, existing approaches often struggle with scaleconsistency and producing maps that lack the geometric fidelity required for reliable perception. We proposeUniMapping, a unified SLAM framework that constructs a persistent neural-descriptor map from multimodal observations. We introduce aSpatial-Aware Deformable Transformerthat injects explicit geometric inductive bias to ensure scale-invariant feature extraction, alongside aSpatial Fusionstrategy that decouples feature aggregation from temporal sequences. Extensive experiments on both indoor and outdoor benchmarks demonstrate competitive SLAM performance. Notably, our method significantly enhances downstream tasks (mAP +3.1% and mIoU +7.1%) by leveraging accumulated multi-view context.

Abstract:
Protein function is often controlled by ligands that bias the direction of state transitions, such as agonists and antagonists, rather than stabilizing a single conformation. This is especially important for clinically relevant G proteincoupled receptors (GPCRs), where therapeutic efficacy depends on functional directionality. Structure-based design methods optimize binding to static conformations and cannot represent non-reversible, directional effects or systematically distinguish agonist from antagonist behavior. To address this gap, we introduceTransition-DirectedDiscreteDiffusion for allostericBinder design (TD3B), a sequence-based generative framework that designs binders with specified agonist or antagonist behavior via a directional transition control objective. TD3B combines a target-aware Direction Oracle, a soft binding-affinity gate, and amortized fine-tuning of a pre-trained discrete diffusion model, enabling targeted agonist and antagonist generation decoupled from binding affinity and unattainable by equilibrium-based or inference-only guidance baselines.

Abstract:
Contrastive training with InfoNCE loss and inbatch negatives is the standard approach for learning dual-encoder models. Its effectiveness, however, critically depends on the availability of hard negatives; in their absence, learning quickly saturates. Existing methods address this via explicit hard-negative mining, which is often costly or heuristic-driven. We introduce HOBIT, a principled mini-batch construction method that improves in-batch negative quality by reordering training examples at every epoch. \mathrm\textttHOBIT solves an optimization problem motivated by the InfoNCE objective to yield mini-batches such that each query in the batch is exposed to hard yet non-contradictory, informative negative examples. We show that the optimization objective is monotone and submodular which in turn leads us to a greedy algorithm that admits the standard \mathcalO(1 - 1/e) approximation guarantee. Empirically, we show that \mathrm\textttHOBIT incurs negligible computational overhead while significantly outperforming state-of-the-art batching methods, and remains complementary to existing hard negative mining techniques.

Abstract:
We proposeProgressive Structure Adjustment for Homophily Shift(PSAHS), a lightweight method forGraph Domain Adaptation(GDA) that explicitly addresses crossdomain mismatch in node-level homophily. PSAHS enhances node homophily in the source graph to a prescribed level by reweighting edges and introducing additional intra-class connections for low-homophily nodes, and conservatively refines the target graph using agreement-consistent predictions from a structure-awareGraph Neural Network(GNN) and an attribute-onlyMulti-Layer Perceptron(MLP) to ensure reliability under label scarcity. After each structural refinement, domain-adversarial training is employed to align node representations across domains. PSAHS employs a progressive training scheme that alternates between structure adjustment and representation alignment, where increasingly informative representations enable safer homophily correction, and the refined structure in turn improves representation learning. Extensive experiments on multiple GDA benchmarks demonstrate that PSAHS consistently outperforms strong baselines, with particularly large gains under severe homophily mismatch, highlighting the importance of explicit homophily alignment for effective cross-graph transfer.

Abstract:
The Hierarchical Dirichlet Process Hidden Markov Model (HDPHMM) is a Bayesian nonparametric extension of the classical Hidden Markov Model, well-suited for learning from (spatio-)temporal data. To relax the restrictive geometric assumption on state durations, the HDP Hidden Semi-Markov Model was introduced. However, both models assume stationary state durations, which limits their expressive power. In this work, we extend the HDP-HMM framework by incorporating recurrent explicit duration modeling, resulting in a more general and flexible model: the Recurrent Explicit Duration HDP-HMM (RED-HDP-HMM). We propose a Gibbs sampling method for efficient inference in this model. Empirical results on both synthetic and real-world segmentation tasks demonstrate that RED-HDP-HMM consistently outperforms the disentangled sticky HDP-HMM and the standard sticky HDP-HMM. We provide theoretical results on truncation error, expressiveness relative to HDP-HSMM. Empirically, RED-HDP-HMM yields consistent gains: a 2.6 percentage point accuracy increase on honey bee waggle dance data (89.9% vs.~87.3%) and 4–10 percentage point improvements on neural segmentation tasks over sticky and disentangled sticky HDP-HMM baselines.

Abstract:
Modeling dynamic dependencies from irregularly sampled event sequences is a fundamental challenge in modern machine learning. In many realworld systems, individual-level states evolve continuously over time while being simultaneously influenced by population-level distributional dynamics. However, existing methods typically model these processes in isolation or rely on discrete-time approximations that fail to capture long-range temporal irregularities and sparse observations. This paper studies the problem of learning coupled continuous-time latent dynamics from irregular events, where individual event sequences and global distributional processes evolve asynchronously and interact over time. We propose a Coupled Continuous-Time Latent Dynamics (CoCLD) framework that jointly models individual latent dynamics and population-level distributional shifts, and aligns them in a continuous-time latent space. CoCLD integrates a Diffusion-based Latent Interpolator with Neural Ordinary Differential Equations (Neural ODEs), enabling principled interpolation, generation, and alignment of latent states across arbitrary time points. We show that the proposed coupling mechanism yields a consistent estimator of continuous-time latent dynamics under sparse and irregular observations. Empirical evaluations demonstrate that CoCLD effectively captures dynamic dependencies and generalizes across diverse tasks, including next-event prediction, mobility trajectory generation, and sequential behavior modeling. These results suggest that learning coupled continuous-time latent dynamics provides a powerful paradigm for irregular event sequence modeling.

Abstract:
Mechanistic interpretability of Transformer models commonly relies on training auxiliary proxy models, such as Sparse Autoencoders or CrossLayer Transcoders. While effective, these post-hoc approaches introduce approximation bias and incur substantial computational overhead. We propose an alternative, training-free interpretability framework that directly exploits the Singular Value Decomposition (SVD) of weight matrices in Transformer MLP sublayers. By operating natively on model parameters, our method improves scalability while preserving fidelity to the original weights. We show that the projection matrices of MLP sublayers admit a natural decomposition into orthogonal, interpretable rank-1 subspaces, which we termDetector-Effector Units(DEUs). Within each unit, a singular vector functions as a detector of input patterns and modulates a coupled effector vector that encodes output semantics. Building on this structure, we introduceSubspace Contribution Analysis(SCA), a diagnostic method that quantifies the direct causal contribution of individual native subspaces to model predictions. Experiments across the GPT-2 family demonstrate that our framework,Native Network Anatomy(NaNA), identifies dominant functional pathways with orders-of-magnitude efficiency gains over training-based interpretability baselines, while maintaining weight fidelity. Our results suggest that SVD-based analyses provide a scalable and faithful alternative to learned proxy approaches for mechanistic interpretability.

Abstract:
Safe reinforcement learning (Safe RL) aims to maximize expected return while satisfying safety constraints, typically modeled as constrained Markov decision processes. While primaldual methods scale well to deep RL, they often suffer from delayed constraint correction, leading to oscillatory behavior and prolonged safety violations. In this paper, we proposeConstraint-Sensitive Policy Optimization (CSPO), a first-order primal-dual method that incorporates local constraint sensitivity into policy updates. CSPO augments the primal objective with a constraint-sensitive correction derived from the shortest signed distance to the safety boundary, enabling smarter recovery steps back to safety, compensating for delayed Lagrange multiplier updates, and reducing oscillations near the boundary, while preserving the KKT solutions of the original constrained problem. Extensive experiments on navigation and locomotion benchmarks demonstrate that CSPO achieves faster safety recovery and high reward preservation, resulting in higher constrained returns (+15.6% average improvement) compared to state-of-the-art primal-dual and penalty-based methods.

Abstract:
Existing quantization methods are fundamentally limited by rigid, integerbased bit-widths (e.g., 2, 3-bit), creating a "deployment gap" where LLMs cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit-width control for true Pareto-optimal deployment. The core innovation is a "lift-then-project" mechanism: we represent d-dimensional weight vectors by projecting a simple 1-bit lattice from a tunable D-dimensional "lifted" space. By adjusting the lifted dimension D, LiftQuant naturally yields an effective bit-width of D/d, allowing for seamless, continuous resolution adjustment rather than discrete steps. This projection generates a structured yet non-uniform codebook, capturing the expressive power of Vector Quantization. Crucially, its decoding path relies solely on linear transformations and 1-bit uniform quantizers, retaining hardware-friendly efficiency. This flexibility is transformative: LiftQuant enables a 70B LLM to be compressed to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses state-of-the-art 2-bit models. With a decoding throughput up to 6.7x faster than FP16, LiftQuant redefines compression as a continuous optimization problem, paving the way for a new generation of hardware-aware LLM deployment.

Abstract:
Diffusion Flow Matching (DFM) has recently emerged as a versatile framework for generative modeling, yet its theoretical convergence properties remain only partially understood. In this work, we provide refined and novel convergence guarantees for Brownian motion based DFMs, focusing on the discretization error. Our analysis is conducted under the Kullback–Leibler (KL) divergence and the 2Wasserstein distance. Under finite-moment and mild integrability assumptions, we derive KL convergence bounds with improved dimensional dependence compared to prior work, achieving, up to our knowledge, state-of-the-art scaling under minimal conditions. We further extend the analysis to the 2-Wasserstein distance: assuming weak log-concavity and one-sided Lipschitz continuity, we obtain convergence guarantees with dimensional dependence consistent with the KL case.

Abstract:
Embeddingbased representations in Euclidean space \mathbbR^d are a cornerstone of modern machine learning, where a major goal is to use the \emphsmallest dimension that faithfully captures data relations. In this work, we prove sharp dimension--accuracy tradeoffs and identify a fundamental information-theoretic limitation: unless the embedding dimension d is chosen close to the ground-truth dimension D, accuracy undergoes a sudden collapse. Our main result shows that this phenomenon arises even in standard contrastive learning settings, where supervision is limited to a set of m anchor--positive--negative triplets (i,j,k) encoding distance comparisons \mathrmdist(i,j) < \mathrmdist(i,k). Specifically, given triplets realizable by an unknown ground-truth embedding in D dimensions, we prove that there exists constant c < 1, such that \emphevery embedding of dimension at most cD violates half of the triplets, yielding accuracy as low as a trivial one-dimensional solution that ignores the input. We complement our information-theoretic bounds with strong computational hardness results: under the Unique Games Conjecture, even if the given triplets are nearly realizable in D=1 dimension, no polynomial-time algorithm---regardless of its dimension---can achieve accuracy above the trivial 50% baseline.

Abstract:
Large language models (LLMs) handle many tasks with one set of parameters, but under KVcached inference it is unclear what task-general structure, if any, is used at decode time rather than during prefill. We propose DecodeShare, a protocol that identifies a low-dimensional subspace that is consistently shared across tasks in decode-time hidden states, and then tests its causal role by removing that subspace only during decoding. In our experiments, disturbing the discovered shared subspace degrades decision performance far more than disturbing either a prefill-derived subspace or a random subspace under the same intervention budget. We further find that this decode-shared subspace overlaps common steering vectors, enabling a simple offline adjustment: projecting steering vectors away from the shared subspace can reduce template sensitivity while preserving non-random task utility, with task-dependent trade-offs. Despite being compact, the shared subspace can serve as a high-leverage causal channel at decode time.

Abstract:
Large language models are increasingly evaluated as interactive agents, yet standard agent benchmarks conflate two qualitatively distinct sources of success: semantic tooluse and interface-specific interaction pattern memorization. Because both mechanisms can yield identical task success on the original interface, benchmark scores alone are not identifiable evidence of environment-invariant capability. We proposePIPE, a protocol-level evaluation augmentation for diagnosing interface reliance by minimally rewriting environment interfaces while preserving task semantics and execution behavior. Across 16 environments from AgentBench and AgentGym and a range of open-source and API-based agents, PIPE reveals that trajectory-SFT substantially amplifies interface shortcutting: trained agents degrade sharply under minimal interface rewrites, while non-trajectory-trained models remain largely stable. We further introduce Interface Reliance (IR), a counterbalanced alias-based metric that quantifies preference for training-time interfaces, and show that interface shortcutting exhibits environment-dependent, non-monotonic training dynamics that remain invisible under standard evaluation. Our code is available at https://anonymous.4open.science/r/What-Do-Agents-Learn-from-Trajectory-SFT-Semantics-or-Interfaces--0831/.

Abstract:
While large models achieve impressive results, their learning dynamics are far from understood. Many domains of interest such as natural language syntax, coding languages, arithmetic problems -- are captured by context-free grammars (CFGs). In this work, we extend prior work on neural language modeling of CFGs in a novel direction: how language modeling behaves with respect to CFG substructure, namely "subgrammars". We first define subgrammars, and prove a set of fundamental theorems regarding language modeling and subgrammars. We show that language modeling loss (or equivalently the Kullback-Leibler divergence) recurses linearly over its top-level subgrammars; applied recursively, the loss decomposes into losses for "irreducible" subgrammars. We also prove that the constant in this linear recurrence is a function of the expected "recursion", a notion we introduce. We show that under additional assumptions, parametrized models learn subgrammars in parallel. Empirically, we confirm that small transformers learn subgrammars in parallel, unlike children, who first master simple substructures. We also briefly explore several other questions regarding subgrammars. We find that subgrammar pretraining can improve final performance, but only for tiny models relative to the grammar, while alignment analyses show that pretraining consistently lead to internal representations that better reflect the grammar’s substructure in all cases; we also observe persistent difficulty with deeper recursion, a limitation that appears even of large language models.

Abstract:
Searching for effective policies in highdimensional action spaces is notoriously challenging. This difficulty is compounded in overactuated musculoskeletal systems, where multiple muscles span each joint, and individual muscles actuate multiple joints. Although this redundancy complicates naive policy search, it also implies that effective control can be captured by a low-dimensional action manifold. To identify such a manifold, we introducejoint-space empowerment (JSE), a novel information-theoretic principle that quantifies how much control an agent has over its body. We use JSE to discover high-empowerment action manifolds, and demonstrate that manipulation policies learned on these manifolds show significantly enhanced dexterity, sample efficiency and improved generalization. These results suggest a general principle for motor coordination in high-dimensional, overactuated systems, with implications for both biological motor control and embodied artificial agents.

Abstract:
Posthoc controllability of fair machine learning models, the ability to control the trade-off between fairness and accuracy after training, is valuable for practical deployment. Existing post-processing methods provide such post-hoc controllability but often suffer from significant accuracy degradation, whereas in-processing methods achieve efficient trade-offs but require computationally expensive retraining for each change in trade-off ratio. To achieve both post-hoc controllability and efficient trade-offs, we propose a novel fair classification algorithm that learns effective feature representations to improve the trade-off efficiency of post-processing fair classifiers, by a gradient-based optimization approach. Experimental results on real-world datasets demonstrate that our method achieves trade-off efficiency comparable to, or even surpassing, in-processing methods, without requiring any retraining.

Abstract:
Transformers can perform incontext classification from a few labeled examples, yet the inference-time algorithm remains opaque. We study multi-class linear classification in the hard no-margin regime and make the computation identifiable by enforcing feature- and label-permutation equivariance at every layer. This enables interpretability while maintaining functional equivalence and yields highly structured weights. From these models we extract an explicit depth-indexed recursion---an end-to-end identified, emergent update rule inside a standard softmax transformer, to our knowledge the first of its kind. Attention matrices formed from mixed feature--label Gram structure drive coupled updates of training points, labels, and the test probe. The resulting dynamics implement label-aware mean-shift, which provably amplifies class separation and yields robust expected class alignment.

Abstract:
Modern foundation models are evaluated through broad capabilities such as arithmetic, reasoning, safety, and robustness, yet it remains unclear in a principled sense when finite tests can meaningfully certify such claims. We develop a rigorous theory of capability evaluation by formalizing evaluation as inference over a task family and asking when guarantees over the full family can be inferred from a strict subset of tests. We analyze two canonical regimes. In stochastic multienvironment evaluation, we characterize when uniform certification is possible across multiple environments and show that the sample complexity is governed by a \chi^2-radius of the environment family, yielding near-optimal evaluation protocols with matching lower bounds under a natural overlap condition. In contrast, for worst-case, rule-like capabilities, we establish fundamental impossibility results. Even for structured model classes such as Boolean circuits of bounded size, black-box evaluation cannot, in general, certify global properties. Together, these results provide a principled framework for understanding when finite evaluation can and cannot certify capabilities.

Abstract:
Spectrally sparse signal reconstruction arises in a wide range of applications and can be formulated as a lowrank Hankel matrix completion problem. We develop a Jacobi-preconditioned gradient descent method that preserves the low per-iteration complexity of first-order algorithms while achieving linear convergence at a rate independent of the condition number. By introducing a generator that maps factor-based iterates to matrix space, we establish equivalence with manifold-based methods, enabling direct convergence analysis while avoiding the need to define distances under complex-symmetric factorization ambiguity. Extensive experiments demonstrate that the proposed algorithm outperforms state-of-the-art methods in both iteration count and computational time across a broad range of problem settings.

Abstract:
Model robustness against natural image corruptions is essential for safetycritical applications. While existing methods primarily focus on implicit representation learning, we provide the first systematic exploration of computational pathways to explicitly characterize internal robustness. We identify a progressive decay of robust features across network layers and establish a functional dependency between the prevalence of these features and model performance. To exploit these insights, we propose Suppress and Diversify (S&D), a non-intrusive refinement approach that enhances robustness by dynamically selecting robust pathways and diversifying them through symmetry-preserving transformations. S&D is architecture-agnostic, parameter-free, and incurs zero test-time overhead. Extensive evaluations across eight benchmarks demonstrate that S&D consistently improves performance across multiple vision tasks, diverse backbones, and complex real-world scenarios, highlighting its broad efficacy and scalability.

Abstract:
Expressivity has been a major focus in the design of Graph Neural Networks (GNNs), yet a significant gap persists between theoretical universal expressivity and practical performance. While many expressive GNNs are efficient and achieve strong results, they often focus on specific graph properties and lack theoretical expressivity for general graph tasks. Conversely, theoretically universalexpressive models often suffer from high computational costs or poor generalization, limiting their real-world applicability. To bridge this gap, we introduce Equivariant Noise GNNs (ENGNNs), a framework that utilizes random noise features to enhance the expressivity of GNNs. Crucially, unlike prior methods that naively use noise, we enforce equivariance to nodewise noise transformations, such as orthogonal transformations. We prove that this property reduces the model's theoretical sample complexity, thereby improving generalization. Our framework simultaneously reaches theoretical universal expressivity, maintains the linear scalability of standard Message-Passing Neural Networks in practice, and achieves performance comparable to computationally expensive, high-expressivity models. Extensive experiments confirm strong performance across node, link, subgraph, and graph-level prediction tasks, demonstrating that the equivariant use of noise provides a powerful and practical pathway for building expressive GNNs. Our code is available at https://anonymous.4open.science/r/EquivNoiseGNN.

Abstract:
Posttraining endows pretrained LLMs with a variety of desirable skills, such as instruction-following, reasoning, and others. However, these post-trained LLMs only encode knowledge up to a cut-off date, necessitating continual adaptation. Unfortunately, existing solutions cannot effectively learn new knowledge from adaptation document corpora and simultaneously mitigate the forgetting of earlier learned capabilities. To address this, we introduce Distillation via Split Contexts (DiSC), a simple context-distillation based approach for continual knowledge adaptation. DiSC derives student and teacher distributions by conditioning on distinct segments of the training example and minimizes the KL divergence between them for the common tokens. This insight allows us to efficiently apply context-distillation without requiring explicit generation steps during training. We run experiments on three post-trained models and two adaptation domains. Compared to prior finetuning and distillation methods for continual adaptation, DiSC consistently reports the best trade-off between learning new knowledge and mitigating forgetting of previously learned skills like instruction-following and reasoning, or factual knowledge.

Abstract:
The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduceMEnvAgent, aMultilanguage framework for automatedEnvironment construction that facilitates scalable generation of verifiable task instances. MEnvAgent employs a multi-agent Planning-Execution-Verification architecture to autonomously resolve construction failures and integrates a novel Environment Reuse Mechanism that reduces computational overhead by incrementally patching historical environments. Evaluations on MEnvBench, a new benchmark comprising 1,000 tasks across 10 languages, demonstrate that MEnvAgent outperforms baselines, improving Fail-to-Pass (F2P) rates by8.6%while reducing time costs by43%. Additionally, we demonstrate the utility of MEnvAgent by constructing MEnvData-SWE, the largest open-source polyglot dataset of realistic verifiable Docker environments to date, alongside solution trajectories that enable consistent performance gains on SWE tasks across a wide range of models.

Abstract:
Image Aesthetic Assessment (IAA) predicts an image’s overall aesthetic score, yet aesthetic is influenced by multiple attributes whose relative importance varies with image content and usage scenarios. Under endto-end training with only overall-score supervision, attribute signals are blended, which can cause gradient conflict across samples dominated by different attributes, resulting in gradient cancellation and persistent systematic bias. To address these issues, we propose AGREE (Attribute-guided Gradient Routing for Establishing Agreement), which learns attribute-specific subspaces and performs gradient routing based on sample-wise attribute sensitivity estimated via perturbation analysis. AGREE further reduces feature coupling across attributes with semantic anchors and improves robustness via error-aware reweighting. Experiments on AVA, LAPIS, AADB, TAD66K, and PARA show consistent improvements over diverse IAA baseline models, and AGREE is plug-and-play for existing end-to-end IAA methods without modifying their original architectures. To our knowledge, this work is among the early efforts in IAA to systematically study gradient conflict and provide an effective solution.

Abstract:
For lowdimensional problems (d\leq3), spectral methods can achieve exceptionally high accuracy. For middle-dimensional problems (4 \leq d \lesssim 10), spectral methods remain feasible through specific techniques such as sparse grids or hyperbolic cross. However, for high-dimensional problems (d\gg 10), spectral methods suffer from the curse of dimensionality. Physics-informed neural networks (PINNs) have emerged as a promising approach to overcome this challenge, offering scalability to high dimensions, but often suffer from limited accuracy and efficiency. Recently proposed spectral-informed neural networks (SINNs) combine spectral methods with PINNs, operating directly in the spectral domain to avoid spatial derivative computations and to reduce memory consumption. In this work, we introduce Modified SINNs, which integrate coefficient decay scaling and basis embeddings motivated by harmonic analysis to enhance accuracy in high-dimensional problems and enable accurate approximation of unknown spectral coefficients. Numerical experiments on steady and time-dependent partial differential equations demonstrate that Modified SINNs outperform sparse grid spectral methods on middle-dimensional problems with incomplete spectral information and achieve superior accuracy compared to PINNs on high-dimensional problems.

Abstract:
We study the problem of verifying whether certain properties such as robustness or fairness hold in an ensemble of decision trees. This problem is known to be NPhard, with most research targeting a solution to a specific verification task. We explore the problem through the lens of an ensemble's OC-space: the set of all possible combinations of individual trees' predictions. This provides a unifying view that yields more a generic and flexible approach to verification. We show that a wide variety of existing verification tasks can be (1) framed as simple searches through OC-space, and (2) answered in time linear or quadratic in the size of the OC-space. Moreover, the search can be made more efficient by using spatial index structures. Interestingly, while the OC-space can grow exponentially with the ensemble's size, in practice it is often feasible to enumerate all output configurations. Empirically, we show that our generic approach can be faster than approaches targeting a single verification task.

Abstract:
Large language models are beginning to introduce a new paradigm for compilation: instead of only assisting at the source level, they can operate directly on intermediate representations (IRs), the compiler’s internal code representation, Early studies suggest that LLMguided optimization can sometimes rival traditional compiler optimizations on selected programs, but evidence remains fragmented. Yet the community still lacks a rigorous IR-level benchmark that tests whether a model not only understands IR but can rewrite it under compiler-grade semantic constraints with meaningful performance impact. We present CIRBench, a benchmark of 800 curated IR instances spanning four compiler-oriented tracks: Analysis infers IR properties, Repair fixes invalid IR, Refactor applies a single semantics-preserving compiler optimization, and Transform performs performance-oriented rewrites, together mirroring core optimization responsibilities in modern compilers. CIRBench combines verifier, equivalence checking, and end-to-end performance measurement into a unified, layered correctness-aware evaluation of LLMs on IR. On six mainstream LLMs, CIRBench shows that current models fail on many IR analysis and rewriting instances and on median underperform the compiler baseline, but we also observe a maximum speedup of 4.96× over -O3. These findings highlight both the opportunities and the remaining challenges of using LLMs inside optimizing compilers.

Abstract:
Road network data provides rich information about cities, but processing a large volume of worldwide OpenStreetMap (OSM) data is computationally intensive, and the resulting graphs are often difficult to unify for benchmarking downstream tasks. Existing graph learning benchmarks fail to capture the billionscale and unique topological properties of real-world road networks, leading to a gap in our understanding of model scalability. To study and close this gap, we process OpenStreetMap data with distributed cloud computing using 5,000 cores and release OSM+, a structured worldwide 1-billion-vertex road network graph dataset designed for high accessibility and usability. OSM+ is open source and globally downloadable, and it provides an open-box graph structure together with an easy spatial query interface. We demonstrate the utility of OSM+ through three illustrative use cases: city boundary detection, traffic prediction, and traffic policy control. For traffic prediction, we construct a new 31-city benchmark by processing traffic data and combining it with OSM+, enabling broader spatial coverage and more comprehensive evaluation than previously frequently-used datasets, while scaling from hundreds of road network intersections to thousands. For traffic policy control, we release a new six-city dataset at a much larger scale, introducing challenges for thousand-scale multi-agent coordination. In addition, we provide comprehensive data processing tools that support integrating multimodal spatial-temporal data with OSM+ for geospatial foundation model training, thereby expediting the discovery of compelling scientific insights.

Abstract:
Textto-image diffusion models remain vulnerable to adversarial prompts that elicit disallowed content, motivating reliable inference-time controls. A popular approach is negative guidance, which subtracts a negative-prompt direction with a fixed weight. However, it often forces a safety–fidelity trade-off, causing artifacts or prompt drift when over-applied and failing under attacks when under-applied. Recent dynamic variants reweight guidance using posterior-odds signals, which can be brittle for open-vocabulary compositional prompts, while lightweight similarity-based methods do not leverage the evolving image evidence along the denoising trajectory. We introduce Concept Removal Guidance (CRG), a training-free, plug-and-play method that estimates unwanted-concept presence at each diffusion step using only the noise predictions from the model, and then adaptively gates and calibrates negative guidance via a closed-form constrained update that enforces a target presence threshold while minimally perturbing the conditional trajectory. Across multiple red-teaming benchmarks, CRG significantly reduces attack success rates while improving benign fidelity, and additional suppression targets such as artist style and violence without fine-tuning or external classifiers.

Abstract:
Sparse Mixtureof-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, this very Top-k expert selection that enables conditional routing also renders the SMoE map inherently discontinuous. In the vicinity of these discontinuity surfaces, even inputs that are arbitrarily close may activate substantially different sets of experts resulting in significantly different outputs. In this work we give a rigorous geometric and stochastic analysis of these discontinuities. We first classify them by order, determined by the number of tied experts at a switching event. Using measure-theoretic slicing arguments, we establish asymptotic volume estimates for the thickened discontinuity surfaces, showing that lower-order discontinuity sets dominate, whereas higher-order ones occupy a vanishingly small relative volume. Next, modeling random perturbations in the input space via a diffusion process, we prove that the path eventually encounter a discontinuity, and moreover that the first hit almost surely occurs on an order-1 discontinuity with explicit finite-time probability bounds. We further derive occupation-time bounds that quantify the duration the random path spend in the neighborhoods of each discontinuity order. These theoretical results imply that inputs are more likely to lie near lower order discontinuities. Motivated by this insight, we propose a simple smoothing mechanism that can be directly applied to existing SMoEs, softly incorporating experts near discontinuities; our analysis guarantees that the added computational overhead remains small while providing localized smoothing near discontinuities, and experiments across language and vision tasks show that smoothing not only enforces continuity of the SMoE map but also enhances empirical performance.

Abstract:
With the widespread availability of pretrained diffusion models, there are many options for which models to use and how to use them together. Making these decisions depends highly on both the user's goals and the expertise of each model. Taking this into account, we propose coordinating models as one would a specialized workforce--through a fair yet efficient division of labor. Divide-and-Denoise uses multiple pre-trained diffusion models, each defined over the same space, to refine a noisy sample over time. At every timestep, we alternate between (i) dividing the sample into regions in a way that satisfies our game-theoretic criteria and (ii) denoising a region with the assigned model in a way that respects our alignment criteria. This leads to a new composite denoising process that evolves together with a division process. Since ground truth for how models should interact is typically not available in our setup, we measure how well Divide-and-Denoise coordinates a team of single-concept text-to-image diffusion models relative to a multi-concept model. Across several image quality metrics including the GenEval benchmark, our method generates images that capture the strengths of each model, outperforming baselines and resolving common failures like missing objects and mismatched attributes.

Abstract:
Recent advances in toolintegrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking auxiliary objectives such as tool-use efficiency, which are essential for practical deployment. To address this gap, we introduce ParetoPO, a two-stage multi-objective optimization framework for aligning tool-using large language models (LLMs) under competing objectives. In the first stage, ParetoPO leverages hypervolume-guided dynamic scalarization to adapt reward weights based on global Pareto frontier progress. In the second stage, it replaces scalarized learning signals with Pareto-ranking-based advantage computation, promoting nondominated trajectories through dominance-aware credit assignment. This design enables fine-grained, action-level optimization across multiple conflicting objectives. Experimental results on mathematic reasoning and deep search tasks show that ParetoPO consistently discovers policies with superior accuracy-efficiency trade-offs compared to static and heuristic baselines.

Abstract:
Estimating fullhand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing often relies on intrusive hardware. Existing vision-based methods predominantly rely on planar surfaces or fingertip contacts, failing to generalize to complex 3D object interactions. Therefore, we introduce EgoTactile, a benchmark pairing egocentric video with full-hand pressure supervision for diverse everyday objects, incorporating a bare-hand transfer subset to enable generalization to natural scenarios. Leveraging this benchmark, we first establish EgoPressureFormer as a discriminative baseline. Beyond this, to explicitly address the uncertainty in partial observations, we propose EgoPressureDiff, a conditional diffusion framework that adapts a large-scale pre-trained video diffusion backbone. By combining rich world knowledge priors with a Physically-Informed Feature Rectification layer to inject semantic constraints, our approach effectively hallucinates plausible contact patterns and resolves visual-physical ambiguities. Extensive experiments demonstrate that our method achieves superior performance on the benchmark and robust transferability to in-the-wild scenarios. Our project page is at https://egotactile.github.io/.

Abstract:
Plasticity—a neural network's ability to adapt to new tasks—is critical for continual and transfer learning. Existing measures, such as effective rank, dead neuron fraction, and weight norm, lack theoretical grounding and correlate poorly with performance on new tasks. We introducelocal redundancy, an informationtheoretic measure derived from universal compression theory. We define local redundancy as the worst-case redundancy of a local model family—parameters in an infinitesimal neighborhood along gradient directions—and show this is a principled measure of plasticity. Although local redundancy is intractable to compute exactly, we prove that the expected squared gradient norm on a synthetic memorization task provides an efficiently computable lower bound. Experiments on continual image classification and time series transfer learning demonstrate that local redundancy predicts downstream performance better than existing measures and enables pretraining checkpoint selection where validation loss plateaus.

Abstract:
Scaling network remains a bottleneck in deep reinforcement learning (RL): simply enlarging actor–critic networks destabilizes training and soon saturates performance. Although recent monolithic architectures such as SimBa and BRC have shown that carefully designed inductive biases can enable positive scaling up to a certain size, their improvements plateau soon as model parameters grow further. This work introduces ScaleMoE, a scalable RL architecture that integrates Mixtureof-Experts (MoE) modules into both the actor and critic of modern continuous control algorithms. Two complementary gating schemes are studied: output-level aggregation of per-expert policies and Q-functions, and feature-level fusion of expert representations before a shared head. We instantiate ScaleMoE on two representative monolithic RL baselines: the single-task method SimBa and the multi-task method BRC. Experiments across the DeepMind Control Suite, MetaWorld, and HumanoidBench show that progressively increasing the number of experts (up to 64) yields substantial improvements in returns, significantly outperforming monolithic networks of comparable or even greater parameter counts. Results demonstrate that ScaleMoE provides an efficient and effective scaling axis for deep RL in continuous control.

Abstract:
Precise prediction of spatiotemporal dynamics over predictive horizons is constrained by the computational cost of highfidelity solvers and the sparsity, noise, and irregularity of data. We introduce MERLIN, a Koopman-based framework that lifts dynamics to the evolution of learnedobservation functionalswith near-linear progression, enabling full-field reconstruction at arbitrary resolutions. Theoretically, we develop a functional Koopman theory for PDEs and compensate for the loss of finite-dimensional linear invariance via the Mori–Zwanzig formalism, which augments the linear backbone with non-Markovian memory terms to improve predictive accuracy. Practically, MERLIN employs discretization-invariantfunction encodersthat map partial, irregular observations to observables, and resolution-freefunction decodersthat reconstruct states at arbitrary query points. Training under linear constraints yields an interpretable, low-dimensional model that captures principal modes, supports reduced-order modeling, and—augmented with memory correction—delivers stable long-horizon rollouts even in ultra-low-dimensional latent spaces.

Abstract:
The Rapid Response (RR) framework (Peng et al., 2024), deployed in production systems including Anthropic’s ASL3 safeguards (Anthropic, 2025), dynamically adapts jailbreak detection classifiers by generating synthetic training data from emerging attacks. We reveal that prompt injection can infiltrate this pipeline to deliver poisoned samples into the classifier’s training set, enabling two attack objectives: (I) targeted poisoning attacks that create false positives on harmless samples by categorizing them as a jailbreak, with a specific desired feature (e.g., certain formatting, subject, or keyword), (II) concept-based backdoor attacks that induce false negatives on jailbreak inputs, generalizing even to jailbreaks from attack strategies the defender explicitly trained against, when the backdoor trigger is present. Importantly, our threat model restricts adversaries to modify- ing only jailbreak samples (not benign data or labels), a constraint unexplored by prior work that makes the second objective particularly challeng- ing. We address this with Omission Attack, which exploits a new phenomenon: when training on concept-absent unsafe samples, the classifier mis- associates that concept’s presence with the safe label. Both attacks flip nearly all target labels with only 1% poisoning rate. Code: anonymous.tbd.

Abstract:
This article initiates the study of a basic question about model pruning. Given a vector s of importance scores assigned to model components, how many of the scored components could be discarded without sacrificing performance? We propose Effective Model Pruning (EMP), which derives the desired sparsity directly from the score distribution using the notion of effective sample size from particle filtering, also known as the inverse Simpson index. Rather than prescribe a pruning criterion, EMP supplies a universal adaptive threshold derived from the distribution of the score s over the model components: EMP maps s to a number N_eff = N_eff (s), called the effective sample size. The N − N_eff lowest scoring components are discarded. A tight lower bound on the preserved mass fraction seff (the sum of retained normalized scores) in terms of N_eff is derived. This process yields models with a provable upper bound on the loss change relative to the original dense model. Numerical experiments are performed demonstrating this phenomenon across a variety of network architectures including MLPs, CNNs, Transformers, LLMs, and KAN. It is also shown that EMP addresses a rich set of pruning criteria such as weight magnitude, attention score, KAN importance score, and even featurelevel signals such as image pixels.

Abstract:
Large language models (LLMs) commonly boost reasoning via sampleevaluate-ensemble decoders (e.g., majority voting), achieving label free gains without ground truth. However, prevailing strategies score candidates using only external outputs such as token probabilities, entropies, or self evaluations, and these signals can be poorly calibrated after post training. We instead analyze internal behavior based on neuron activations and uncover three findings: (1) external signals are low dimensional projections of richer internal dynamics; (2) correct responses activate substantially fewer unique neurons than incorrect ones throughout generation; and (3) activations from correct responses exhibit stronger cross sample agreement, whereas incorrect ones diverge. Motivated by these observations, we propose Neuron Agreement Decoding (NAD), an unsupervised best of N method that selects candidates using activation sparsity and cross sample neuron agreement, operating solely on internal signals and without requiring comparable textual outputs. NAD enables early correctness prediction within the first 32 generated tokens and supports aggressive early stopping. Across math and science benchmarks with verifiable answers, NAD matches majority voting; on open ended coding benchmarks where majority voting is inapplicable, NAD consistently outperforms Avg@64. By pruning unpromising trajectories early, NAD reduces token usage by 99% with minimal loss in generation quality, showing that internal signals provide reliable, scalable, and efficient guidance for label free ensemble decoding.

Abstract:
Graph neural networks (GNNs) excel in graph analyzing tasks but often suffer from poor generalization under Outof-Distribution (OOD) environments. Although this problem has attracted increasing attention, most solutions primarily rely on empirical designs, lacking effective mechanisms to characterize and quantify invariance for graph representation learning. To address these limitations, we propose DIGL, a novel graph learning method that improves the OOD generalization of GNNs. Our work makes an initial attempt to geometrize invariance for graphs by introducing computational optimal transport (OT) theory to characterize invariance principle. Specifically, we formulate the underlying invariant prototype shared by graphs across different environments as a distribution barycenter, and consider graph representations in each specific environment as distortions of the prototype. Building on this idea, we establish an invariant learning framework to promote the model to learn purely invariant graph representations for downstream tasks. Moreover, we derive a unified optimization objective for model implementation and provide theoretical analysis to justify our method. Extensive experiments on a broad range of benchmark datasets demonstrate the superior generalization ability of our method compared with baseline methods under various OOD settings.

Abstract:
Neuromodulation can be viewed as closedloop control of high-dimensional spatiotemporal fields on irregular 3D morphologies, coupling membrane electrophysiology with ionic reaction–diffusion. This view supports high-rate feedback and systematic in-silico evaluation, yet is difficult in practice. Unlike classical PDE control with known equations on regular domains, neuronal microenvironments exhibit complex, often unknown biophysics on irregular shapes. High-fidelity simulators are too costly for real-time control with repeated planning. The discretized field is sparsely observed and must satisfy hard full-field safety constraints. We introduceNeuronCtrl, a modular operator-level framework for safe, closed-loop generative control of neuronal microenvironment dynamics. Given measurements, actions, and morphology, a history-conditioned observer infers the latent field, a morphology-aware neural operator predicts one-step dynamics, and a flow-matching conditional flow proposes actions conditioned on user preferences. Safety is enforced via complementary barrier-based mechanisms at both the action and field levels, ensuring constraint satisfaction with minimal intervention. When latency is critical, the multi-step generator is distilled into a single-step policy while retaining the same safety filter. Experiments across three high-fidelity 3D neuromodulation benchmarks spanning deep brain stimulation, extracellular reaction--diffusion control, and astrocytic potassium regulation, demonstrate superior trade-offs among cost, safety, and latency. Code is available at https://anonymous.4open.science/r/NeuronControl-D900.

Abstract:
Although learned representations underlie neural networks' success, their fundamental properties remain poorly understood. A striking example is the emergence of simple geometric structures in LLM representations: for example, calendar months organize into a circle, years form a onedimensional manifold, and the latitude and longitude of cities can be decoded by low-dimensional linear probes. We show that the statistics of language exhibit a translation symmetry---e.g,. the co-occurrence probability of two months depends only on the time interval between them---and we prove that the latter governs the aforementioned geometric structures in high-dimensional word embedding models. Moreover, we find that these structures persist even when the co-occurrence statistics are strongly perturbed (for example, by removing all sentences in which two months appear together) and at moderate embedding dimension. We show that this robustness naturally emerges if the co-occurrence statistics are collectively controlled by an underlying continuous latent variable. We empirically validate this theoretical framework in word embedding models, text embedding models, and large language models.

Abstract:
Multimodal Large Language Models (MLLMs) show promise for Multimodal Emotion Recognition (MER) but often remain unreliable because sparse emotional cues could be easily overwhelmed and affected by redundant context. While finetuning is effective, it is usually costly when using large models. Training-free methods like chain-of-thought reasoning provide a practical alternative, but they mostly rely on heuristic prompting to influence the model behaviors and do not explicitly focus on emotion relevant tokens internally, which would allow decision-relevant emotional tokens to be diluted by environmental noise, resulting in unstable predictions. To address this limitation without training, we rethink MER from a world-model perspective that treats emotion as a latent state inferred from noisy and redundant multimodal observations. Under frozen parameters, this view suggests that robustness depends on constraining why and how tokens contribute to inference. Based on this insight, we propose WETR (World-Model inspired Emotion-aware Token Refinement), a training-free, plug-and-play regulator that reshapes token usage through two mechanisms: Noise-suppressed Token Selection (NTS), which suppresses redundant intra-modal noise, and State-strengthened Token Reweighting (STR), which amplifies decision-relevant emotional tokens. Experiments on multiple MER benchmarks demonstrate that WETR consistently improves accuracy and stability under frozen parameters, which also improves token-level interpretability.

Abstract:
The rapid evolution of generative models has unlocked new potentials in protein binder design, a pivotal task in structural biology, by facilitating endto-end generation via joint sequence-structure modeling or hallucination. However, existing approaches are predominantly implemented under a single-target, single-state assumption, limiting their ability to model multi-target or multi-state interactions required for advanced function-oriented protein design. Here, we introduce Chamaileon, which unifies multi-target and multi-state binder design by formulating the problem as cross-context binding landscape modeling. The framework is underpinned by a training paradigm termed In-Context Complex Co-Design (I3CD) for context-aware sequence-structure co-modeling. During inference, we employ Mixture-of-Paths Sampling (MoPS), a scalable strategy that optimizes a single sequence across contexts while alleviating the scarcity of high-quality multi-conformational paired data. Extensive evaluation on our newly constructed benchmark, CROSS, demonstrates that Chamaileon effectively generates sequences adaptable to diverse conformational landscapes and multi-target requirements.

Abstract:
Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After posttraining on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.93 FPS and further improves consistency to the context. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.

Abstract:
Recent progress in largescale generative models has substantially advanced video generation, yet existing methods remain constrained by a rigid inference paradigm. Bidirectional diffusion models excel at global coherence and visual fidelity but suffer from slow inference, while autoregressive models offer efficient and streaming generation at the cost of long-range consistency and exposure bias. We introduce Flex-Forcing, a unified training and inference framework that enables a video diffusion model to seamlessly operate under both bidirectional and autoregressive generation regimes. The core idea is a flexible chunking mechanism jointly defined over the temporal axis and denoising steps. This design allows the model to (1) perform flexible chunking according to different device budgets, (2) perform bidirectional inference across chunks for global structure planning, while generating frames autoregressively within each chunk for efficient and fine-grained synthesis, and (3) perform any-order, any-timestep autoregressive generation without the strict causal constraint. Extensive experiments on multiple video generation benchmarks demonstrate that Flex-Forcing achieves consistently better video quality, long-video stability than strong baselines with a rigid inference schedule, while offering faster inference.

Abstract:
Spiking Neural Networks (SNNs) offer an energyefficient route to 3D spatio--temporal perception, yet they lag behind Artificial Neural Networks (ANNs) due to weak pretraining and heavy inference stacks, limiting generalization and multimodal reasoning (e.g., zero--shot 3D classification and open--world QA). We present a universal Spike--based Vision--Language pretraining framework (SVL) that equips SNNs with open--world 3D understanding while preserving end--to--end spike efficiency. SVL comprises two core components: (i) Multi--scale Triple Alignment (MTA), a label--free triplet contrastive objective aligning 3D, image, and text; and (ii) Re--parameterizable Vision--Language Integration (Rep--VLI), which converts offline text embeddings into lightweight weights for text--encoder--free inference. Moreover, we present the first fully spike--driven point Transformer, Spike-driven PointFormer, whose 3D spike--driven self--attention (3D-SDSA) reduces interactions to sparse additions, enabling faster, more efficient training. Extensive experiments show that SVL attains strong zero--shot 3D classification (85.4% top--1) and consistently outperforms prior SNNs on downstream tasks (e.g., +6.1% 3D cls, +2.1% DVS actions, +1.1% detection, +2.1% segmentation) while enabling open--world 3D question answering, sometimes outperforming ANNs. To the best of our knowledge, SVL represents the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively bridging the gap between SNNs and ANNs in complex open-world understanding tasks.

Abstract:
Reinforcement learning from human feedback is the leading approach to aligning powerful AI systems so that they can be safe and helpful for humanity. While RLHF is typically modelled as a problem of learning a single preference ranking from noisy feedback, true human preferences are complex and often conflicting, representing substantive disagreements stemming from the diversity of individual human values. With this motivation, a recent line of research has studied RLHF from the perspective of social choice theory, which provides a set of wellestablished desirable properties for aggregating diverse preferences. Seen through this lens, the standard learning objective in RLHF is equivalent to aggregating diverse human preferences via the Borda count rule. At the same time, several new RLHF algorithms have been proposed, which turn out to be equivalent to the von Neumann winner social choice rule. However, the connection between social choice theory and RLHF has thus far ignored the critical role of regularization to prevent divergence from a reference policy, which is utilized in essentially all practical RLHF algorithms. In this paper, we study how regularization affects the social choice axioms satisfied by different RLHF algorithms, and prove that regularization improves the axiomatic properties of the von Neumann winner rule. In contrast, the Borda count rule still fails to satisfy key social choice axioms even when regularized. These results provide a principled argument grounded in social choice theory for utilizing practical RLHF algorithms that correspond to the von Neumann winner, rather than the standard RLHF objective.

Abstract:
Evaluating generalist robot manipulation policies is costly and difficult to scale in the real world. While emerging world models (e.g., WorldEval, CtrlWorld) offer a promising alternative, the reliability of such evaluation remains a critical bottleneck. Specifically, their visual predictions can undermine policy assessment by "self-correcting" failures into false positives or yielding artifacts under out-of-distribution controls. Even with failure-enriched data, current architectures struggle to capture action-causal dynamics, as they typically treat actions as passive conditions rather than causal drivers. To address this, we propose dWorldEval, an action-centric discrete-diffusion world model that maps visual observations, language instructions, and action chunks into a shared unified token space and denoises them with a single self-attention backbone where actions function as first-class tokens. To realize reliable policy-world interaction, dWorldEval introduces a sparse keyframe memory that anchors global scene state while preserving fine-grained multi-view interaction cues, and leverages Progress-as-text to jointly generate future observations and success indicators. Extensive experiments on LIBERO, RoboTwin, and real-robot tasks demonstrate that dWorldEval significantly outperforms video diffusion baselines in action controllability, stabilizes long-horizon multi-view rollouts, enabling accurate policy ranking via automatic success estimation.

Abstract:
Expertannotated time series datasets often suffer from low agreement, especially in medical applications where decisions rely on subjective criteria and inconsistent thresholds. Such variability degrades annotation quality and thus limits the reliability of supervised classification models. To address this, we propose to rely on a pairwise comparison-based approach, which provides a more robust alternative to individual annotation, since relative judgments are typically easier and yield higher consistency. The problem is thus transformed into a ranking problem and we introduce an ideal point model adapted to time series data using elastic similarity measures such as Dynamic Time Warping (DTW) and Time Warp Edit Distance (TWED). We prove Lipschitz continuity of these distances and demonstrate several convergence guarantees for this model. To facilitate gradient-based optimization, we also introduce a differentiable version of the TWED. Finally, we show through multiple experiments that our approach produces accurate and robust rankings under noisy annotation conditions.

Abstract:
Federated learning (FL) allows clients to collaboratively train models without exposing private data, but practical FL is simultaneously challenged by data heterogeneity and model heterogeneity. Prior heterogeneous FL (HtFL) approaches often fail to handle finegrained feature shifts, leading to weak representation alignment and limited cross-client knowledge transfer, which degrades both personalization and generalization. We propose FedARC, an HtFL framework that couples a shared lightweight extractor with client-specific fusion: a trainable projector integrates local and global embeddings, while adaptive residual compensation dynamically corrects feature-level mismatches. To further stabilize aggregation, FedARC performs semantic anchor alignment across clients, and we theoretically prove FedARC converges with a non-convex convergence rate \mathcalO(1/T). Experiments on five public benchmarks show that FedARC outperforms nine state-of-the-art HtFL baselines by up to 2.63% in average accuracy, while maintaining efficient communication and computation.

Abstract:
Tabular foundation models, exemplified by TabPFN, perform prediction via incontext learning, inferring test labels directly from labeled training examples. They have demonstrated competitive performance, particularly on small-to-medium datasets. However, recent tabular foundation models often improve accuracy with increasingly complex architectures, incurring higher inference cost and limiting practical deployment. In this work, we revisit the original TabPFN design and show that a lightweight row-wise attention–only backbone can remain highly competitive with two simple enhancements: a gated attention stabilization mechanism and a small set of learnable register tokens that provide global context and improve pretraining quality. The resulting model, SwiftPFN, supports both classification and regression, and is competitive with stronger tabular foundation models (e.g., TabPFN v2 and TabICL) while being more efficient at inference. For latency-sensitive serving, we further introduce an adaptive layer-wise early-exit mechanism that dynamically adjusts inference depth per sample. Experiments show that many samples can be reliably predicted using shallow layers, reducing average computation with negligible performance degradation. Overall, SwiftPFN enables efficient and anytime tabular in-context learning for practical deployments.

Abstract:
Autoregressive large language models (LLMs) have achieved remarkable success in many complex tasks, yet they can still fail in very simple logical reasoning such as the "reversal curse" when trained on forward knowledge data of the form "A \rightarrow B" (e.g., Alice's husband is Bob), the model is unable to deduce the reversal knowledge "B \leftarrow A" (e.g., Bob's wife is Alice) during test. Extensive prior research suggests that this failure is an inherent, fundamental limit of autoregressive causal LLMs, indicating that these models tend to memorize factual-level knowledge rather than capture higher-level rules. In this paper, we challenge this view by showing that this seemingly fundamental limit can be mitigated by slightly tweaking the training data with a simple regularization data recipe called the Identity Bridge of the form "A \to A" (e.g., The name of Alice is Alice). Theoretically, we prove that under this recipe, even a one-layer transformer can break the reversal curse by analyzing the implicit bias of gradient descent. Empirically, we show that a 1B pretrained language model finetuned with the proposed data recipe achieves a 40% success rate on reversal tasks, in stark contrast to a near-zero success rate when trained solely on forward-knowledge data. Our work provides a novel theoretical foundation for the reversal curse and offers a principled, low-cost path to encouraging LLMs to learn higher-level rules from data.

Abstract:
We study the relation between the total variation (TV) and Hellinger distances between two Gaussian location mixtures. Our first result establishes a general upper bound: for any two mixing distributions supported on a compact set, the Hellinger distance between the two mixtures is controlled by the TV distance raised to a power 1o(1), where the o(1) term is of order 1/\log\log(1/\mathrmTV). We also construct two sequences of mixing distributions that demonstrate the sharpness of this bound. Taken together, our results resolve an open problem raised in Jia et al. (2023) and thus lead to an entropic characterization of learning Gaussian mixtures in total variation. Our inequality also yields optimal robust estimation of Gaussian mixtures in Hellinger distance, which has a direct implication for bounding the minimax regret of empirical Bayes under Huber contamination.

Abstract:
Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoningintensive tasks by relying on task-specific verifiers that provide automated correctness signals. However, many realistic language tasks are difficult to equip with reliable verifiers, motivating a growing reliance on reinforcement learning from human feedback (RLHF). In this setting, we argue that a closer examination of how human feedback should be interpreted is essential. We introduce Regret-based Preference Optimization (RePO), which reframes RLHF throughregret minimizationrather than reward maximization. Human preferences are often shaped byprospectiveanticipation of outcomes andcounterfactualcomparisons to alternative behaviors, rather than by immediate, outcome-independent utility. RePO captures this structure by modeling preferences as behavior-conditioned assessments of relative suboptimality. Within a KL-regularized reinforcement learning framework, RePO admits a closed-form policy update compatible with direct preference optimization. Experiments on mathematical reasoning benchmarks and human-annotated preference datasets demonstrate consistent performance gains, indicating that regret-based preference learning is an effective and human-aligned approach for training large language models.

Abstract:
Learning distribution families over \mathbbR^d is a fundamental problem in unsupervised learning and statistics. A central question in this setting is whether a given family of distributions possesses sufficient structure to be (at least) informationtheoretically learnable and, if so, to characterize its sample complexity. In 2018, Ashtiani et al. (2018) reformulated sample compressibility as a structural property of distribution classes, proving that it guarantees PAC-learnability. This discovery subsequently enabled a series of recent advancements in deriving nearly tight sample complexity bounds for various high-dimensional open problems. It has been further conjectured that the converse also holds: every learnable class admits a sample compression scheme, making the two notions to be equivalent. In this work, we establish that sample compressible families remain learnable even from perturbed samples, subject to a set of minimax-necessary and sufficient conditions. In particular, we assume samples are corrupted by an additive independent noise model, and theoretically derive sample complexity bounds for general sample compressible classes in arbitrary dimensions with respect to both \ell_2-norm and total variation distance.

Abstract:
Physicsinformed neural networks (PINNs) can be unreliable on PDEs with oscillatory, multiscale, stiff, or long-time solutions, and these difficulties worsen in high dimensions where collocation-based training yields large numerical integration error and high-variance gradients. We propose Low-Rank Fourier Sums (LoRFS), representing the solution as a low-rank sum of separable Fourier expansions (products of one-dimensional Fourier series across coordinates). This makes high-frequency structure explicit and enables closed-form evaluation of common physics-based objectives and their gradients (e.g., L^2 residual and variational losses), replacing sampling-based collocation estimates with analytic loss evaluation and eliminating sampling noise. We further provide theoretical results that clarify why LoRFS is particularly well suited to high-dimensional regimes. Across canonical PINN failure-mode benchmarks and their high-dimensional extensions, LoRFS consistently outperforms strong PINN baselines and remains stable in regimes where competing methods degrade.

Abstract:
Diffusion policies have achieved remarkable success in robotic manipulation, yet they often fail to satisfy strict physical constraints required for safe deployment. Existing approaches impose safety either prematurely during training or reactively via external guardrails at test time, limiting policy expressivity and overall scalability. We propose Physical safety Alignment for Constrained Trajectories (PACT), a selfevolving post-training framework that projects pretrained diffusion policies onto constraint-feasible regions without accessing demonstration data or task rewards. PACT distills constraint gradients into the diffusion model through a reverse-KL objective with dense supervision across timesteps. It incorporates a curriculum that progressively tightens constraints while maintaining theoretically bounded policy shift and monotone improvement, mitigating the safety-performance trade-off from catastrophic forgetting. On simulated and real-world embodied manipulation benchmarks, PACT significantly reduces safety violations by 31.0% on average while improving task success by 30.7%.

Abstract:
We present a general theoretical framework for analyzing how information about past inputs is encoded in recurrent networks into evolving dynamics rather than being represented as convergence to static attractors. Using dynamic meanfield theory and diffusion from physics, we derive a Fisher information diffusion operator that links network connectivity structure to the time-resolved propagation of information across interacting subpopulations. The analysis reveals that operating near criticality (spectral radius near one) is necessary but not sufficient for reliable memory in structured or non-normal recurrent networks; effective information retention requires alignment between input–output structure and stable dynamical subspaces. The theory yields principled initialization rules that balance stability and sensitivity, mitigating vanishing and exploding gradients. Experiments on the copy task and sequential MNIST show faster convergence and higher accuracy than standard random initialization. Together, these results provide both principled design guidelines for recurrent networks and new theoretical insight into how information can be preserved over time in their dynamics.

Abstract:
This paper studies a fundamental yet often overlooked premise in evolutionary multiview classification (EMVC): the impact of label noise on EMVC, such as distorting fitness landscapes shaped by individual fitness values (e.g., test accuracy). Traditional EMVC assumes training labels are noise-free, yet this often fails in practice. As a result, label noise introduces harmful supervision during the training phase, resulting in distorted fitness landscapes and the emergence of fitness evaluation bias (FEB). This bias misguides the evolutionary trajectory, causing the search process to stagnate in local optima. Given that label noise largely stems from the mislabeling of samples near their decision boundaries by human annotators, we thus compared the decision boundaries of human annotators and models, and found discrepancies between the two. Based on this observation, we propose a simple yet effective ``detect-then-calibrate" data purification framework that leverages outlier analysis in the gradient space (i.e., treating outliers as noisy samples) and prototype calibration in the feature space (i.e., utilizing feature prototypes of noise-free samples to correct the labels of noisy samples). Experimental results demonstrate that this strategy can effectively purify the data and alleviate FEB; furthermore, it can improve the performance of various multi-view learning paradigms in label noise scenarios. (The code will be published.)

Abstract:
Automating formal proofs of combinatorial identities is challenging for LLMbased provers, as long-horizon proof planning is required and unconstrained search quickly explodes. Symbolic methods such as the Wilf--Zeilberger (WZ) method can achieve a mechanized proof of combinatorial identities by constructing special auxiliary functions and demonstrating that they satisfy specific recurrence relations. We propose WZ-LLM, a neuro-symbolic framework that turns WZ proof plans into executable proof sketches in Lean~4 and uses an LLM-based prover to discharge the resulting machine-checkable subgoals. We also train a dedicated WZ-Prover via a Lean-kernel-verified bootstrapping loop with expert-verified iteration, followed by DAPO-based refinement. Experiments show that WZ-LLM achieves a 34% proof success rate on LCI-Test (100 classical combinatorial identities), outperforming strong baselines such as DeepSeek-V3 and Goedel-Prover-V2; moreover, on LCI-Test it proves 5 identities on which the symbolic-only baseline fails. WZ-LLM also improves performance on CombiBench and PutnamBench-Comb, suggesting the effectiveness of coupling symbolic proof sketches with learned formal reasoning. Experiments show that WZ-LLM achieves a 34% proof success rate on LCI-Test (100 classic combinatorial identities), outperforming strong baselines such as DeepSeek-V3 and Goedel-Prover-V2, and delivering consistent gains on CombiBench and PutnamBench-Comb. These results indicate that our framework provides two complementary strengths: improved direct proving for identities beyond the scope of WZ, and substantially higher end-to-end success when WZ sketches guide a specialized prover.

Abstract:
Incomplete multiview clustering is challenging due to view missingness and the entanglement of shared semantics with view-specific factors in latent representations. Existing methods often rely on heuristic fusion or direct completion strategies, which suffer from error propagation and unreliable generation under missing views. In this paper, we propose anInformation-guidedDisentangled latent modeling framework withConditionalDiffusion for incomplete multi-view clustering (IDCD). Specifically, we first encode each view into a latent representation that is variationally decomposed into a view-wise semantic latent and a view-specific factor. Information-theoretic objectives are introduced to guide the disentanglement of view-wise latents, preserving essential multi-view information while reducing the dependency between semantic and view-specific factors and encouraging cross-view semantic consistency. Besides, we aggregate the semantic latents via a mixture of Wasserstein distributions to obtain a unified global representation, where we impose a Gaussian mixture prior to explicitly couple representation learning with clustering. Based on the learned disentangled latent space, a conditional diffusion model guided by both the global semantic latent and view-specific factors is employed to generate missing views in a consistent manner. Extensive experiments on benchmark datasets demonstrate superior clustering performance and robust missing-view generation compared to state-of-the-art methods.

Abstract:
Large kernel depthwise convolutions achieve strong performance but suffer from significant degradation as kernel size grows due to irregular memory access from gatherbased computation. While Large Kernel Acceleration (LKA) helps on small feature maps, it becomes counterproductive on large feature maps, even slower than non-accelerated implementations. We propose Windowed Batch Matrix Multiplication (WBMM), which \emphpartitions input into contiguous windows and \emphindexes a compact relative position bias table to construct weight matrices, enabling regular memory access via batched matrix multiplication; this yields a unique property where WBMM's throughput improves with larger windows, opposite to depthwise convolutions that degrade with larger kernels. Operator-level benchmarks show WBMM with 14 × 14 windows outperforms 5 × 5 depthwise convolution baselines in speed while providing 7.8× larger receptive field, and combined with inter-block cross-window communication and hierarchical window reparameterization, achieves comparable or higher accuracy on ImageNet-1K, COCO, and ADE20K with 1.31--1.88× training speedup. WBMM also demonstrates consistent advantages across diverse hardware platforms including GPU, CPU, and edge devices, without requiring specialized acceleration kernels. Code and models will be publicly available.

Abstract:
Irregular time series arise ubiquitously in realworld systems, where observations are sparse, asynchronous, and governed by underlying continuous-time dynamics. Existing continuous–discrete state-space models typically rely on path-based variational inference, which is computationally expensive or constrained by restrictive posterior assumptions. We propose SDEVI, a novel framework that performs variational inference directly on the joint distribution over discrete-time observations, while guaranteeing consistency with an underlying continuous process governed by a Stochastic Differential Equation(SDE). SDEVI employs a variational posterior induced by linear time-varying SDEs as a scalable inference backbone. To enable intricate dynamics modeling for real-world data, we introduce non-linear-SDE-induced variational inference and generalize our framework to the complex domain. Extensive experiments across healthcare, physics, climate, and IoT benchmarks demonstrate state-of-the-art performance on interpolation, extrapolation, regression, and classification tasks.

Abstract:
Deep neural networks currently provide the leading quantitative models of neural responses in sensory systems. However, these networks remain implausible as models of sensory development, largely because they rely on supervised training with label efficiency far exceeding that of biological learning. Furthermore, these models are typically trained on manually curated datasets that lack the statistical properties of the natural environments to which the brain is exposed. Here, we demonstrate that models trained with unsupervised objectives on realworld data significantly outperform supervised models in predicting brain responses across both human auditory and visual cortex. We show that this performance advantage is not driven by network architecture or dataset size, but rather by the data distribution. Crucially, we find that unsupervised models trained on real-world data exhibit remarkable out-of-distribution generalization: a model trained exclusively on Mandarin speech accurately predicts English-driven brain responses, and a model trained on infant head-cam footage predicts adult visual responses to curated object images. Together, our results illustrate how deep neural networks can be used to reveal the real-world statistics that shape neural representations in the brain.

Abstract:
Chainof-Thought (CoT) prompting enables multi-step reasoning in large language models, yet long-horizon generation remains brittle under distribution shift and context interference: irrelevant cues persist, small deviations compound into inference drift, and late-stage corrections can destabilize the trajectory. We recast autoregressive decoding as a perturbed long-horizon dynamical system and introduce aninference-time stabilization operatorthat targetstrajectory-levelreliability rather than token-level fluency. Specifically, we proposeODE-guided language models, which augment a base Transformer with a persistent continuous-timethought statewhose dynamics are explicitly designed to be dissipative, enabling stable evidence accumulation with controlled forgetting. Instantiating this framework,Thinking in Flow(TiF) equips the model with a lightweight Neural ODE controller and injects its output through post-norm residual updates to achieve numerically stable, low-intrusion steering. A demand--supply (uncertainty--capacity) gate determineswhenintervention is warranted, while a direction gate determineshowto steer in representation space, yielding selective, do-no-harm corrections instead of persistent bias. We establish well-posedness, dissipativity, and incremental stability of the controlled thought dynamics, implying bounded interventions over arbitrarily long contexts, and empirically demonstrate improved robustness to distractions and semantic perturbations, while matching or improving accuracy on mathematical reasoning benchmarks across both the Llama and Qwen model families; we further observe gains on non-mathematical BBH reasoning tasks when training TiF on Llama.

Abstract:
Multicompartment Hodgkin–Huxley (HH) models provide a principled framework for predicting neural dynamics and responses to electrical stimulation. However, fitting HH biophysical parameters typically requires intracellular recordings, which are invasive and low-throughput, limiting the ability to capture the geometry and cell-specific properties of many neurons in a given neural circuit. Multi-electrode arrays (MEAs) offer a scalable alternative—high-density extracellular measurements from full neural populations—but HH model complexity has so far precluded reliable biophysical inference from extracellular data alone. Here, we introduce a framework that rapidly infers HH parameters from designed features of extracellular MEA measurements by leveraging differentiable biophysical simulation and simulation-based inference, unlocking a wide range of downstream applications. In this work, we focus on a central goal of translational neuroengineering: predicting neural spiking responses to candidate neurostimulation patterns that would take hours to measure clinically. To validate our approach, we collected hundreds of hours of stimulation and recording data from isolated macaque retina with a 30 µm-pitch 512-electrode array. Our framework predicted previously unseen multi-electrode stimulation responses with 90.4% accuracy using HH models fit from only a few minutes of recording, replacing hours of stimulus testing.

Abstract:
Kleinberg and Mullainathan showed that language generation in the limit is always possible at the level of computability: given enough positive examples, a learner can eventually generate data indistinguishable from a target language. However, such existence results do not address feasibility. We study the sample complexity of language generation in the limit for several canonical classes of formal languages. Our results show that infeasibility already appears for contextfree and regular languages, and persists even for strict subclasses such as locally threshold testable languages, as well as for incomparable classes such as non-erasing pattern languages, a well-studied class in the theory of language identification. Overall, our results establish a clear gap between the theoretical possibility of language generation in the limit and its computational feasibility.

Abstract:
While Large Language Models (LLMs) have achieved remarkable success in dyadic (oneon-one) instruction, they face significant challenges in One-to-Many alignment, such as clinical ward rounds, where an instructor must simultaneously guide a diverse group of trainees. Current models often suffer from context dilution and goal misalignment, failing to balance individual scaffolding with collective learning progress. To address this, we introduce ClinEdu, a multi-agent pedagogical simulator that model the complexity of group dynamics. Leveraging this platform, we construct ClinTeach, a large-scale dataset of Socratic teaching dialogues, and propose ClinTutor-R1, the first multimodal agent explicitly architected to achieve one-to-many alignment in clinical education, employing an explicit internal thinking mechanism to model both individual belief states and group consensus. We validate our framework through a comprehensive protocol covering both standard static benchmarks and rigorous in-situ interactive evaluation within ClinEdu. Experimental results demonstrate that ClinTutor-R1 outperforms base models by over 20% and achieves parity with proprietary reasoning models , while exhibiting exceptional scalability in maintaining instructional quality across expanding student cohorts.

Abstract:
By incorporating testtime compute scaling, large reasoning models (LRMs) are able to solve complex problems by generating explicit chain-of-thought (CoT) reasoning processes. However, they often suffer from overthinking during generation, resulting in redundant token outputs and degraded accuracy. Existing methods to mitigate this issue remain limited: training-based approaches incur substantial training costs, while training-free methods often rely on well-crafted prompting or unreliable confidence signals. In this work, we study early stopping through attention distributions and propose a simple method, ASAG, that infers the model's reasoning state and adaptively adjusts the generation strategy. The proposed method is training-free and plug-and-play, enabling seamless integration into existing LRMs. Extensive experiments on nine benchmarks demonstrate consistent improvements across mainstream LRMs with varying parameter scales, including the Deepseek-R1-Distill and Qwen3 series. In particular, ASAG achieves a 4.4% relative improvement in accuracy while reducing the number of generated tokens by over 40% across all reasoning tasks on Qwen3-8B.

Abstract:
Recently, crossmodal hashing (CMH) has garnered significant attention due to its low storage costs and high retrieval efficiency. most existing CMH methods implicitly assume the availability of high-quality annotations, which is often violated in real-world scenarios as label noise inevitably arises from human errors or non-expert annotations. To cope with noisy supervision, current noise-robust CMH methods mainly follow two paradigms, i.e., noise separation and label smoothing. They often discard the predicted noisy instances or smooth discriminative signals to mitigate the impact of noisy labels. However, aggressive separation leads to reduced data utilization, while smoothing weakens the discriminative capability regarding the true distribution of clean instances. To address these limitations, we propose a novel Robust Fuzzy Cross-modal Hashing framework (RFCMH) that introduces fuzzy set theory to endow the labels with admissibility, thereby obtaining reliable discriminative supervision from noisy labels. Specifically, we first leverage possibility and necessity measures to model the noisy labels. Subsequently, we propose Fuzzy Admissibility Refinement (FAR) to dynamically calibrate supervision signals, thereby preventing the model from being misled by false positives. Furthermore, we introduce Dual-Granularity Structural Alignment (DGSA) to enforce both cross-modal alignment and instance-level uniformity, ensuring stable and diverse representations. Extensive experiments on multiple benchmarks demonstrate that RFCMH achieves state-of-the-art retrieval performance.

Abstract:
Lowrank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive compression with minimal accuracy degradation. We propose STAR-KV, an adaptive low-rank KV cache compression framework with fine-grained rank control. STAR-KV encompasses 1) a differentiable thresholding mechanism that enables optimal rank selection at both attention-head and block levels, 2) a hybrid decomposition strategy that applies different low-rank factorizations according to the sensitivity of key and value projections, and 3) a low-rank--aware mixed precision quantization that leverages data statistics for near lossless low-bit quantization. Evaluated across multiple LLMs and benchmarks, STAR-KV achieves up to 75% KV cache compression and up to 20× overall KV cache reduction when combined with quantization. Enabled by custom Triton-based GPU kernels, STAR-KV delivers up to 6.9× speedup for the attention module and 3.1× end-to-end generation throughput. The source code will be publicly available in the future.

Abstract:
This paper investigates a pivotal yet debated component of reinforcement learning (RL) for training large language models (LLMs): controlling entropy (increasing or decreasing it) during RL finetuning. The existing literature presents a dichotomy: some studies posit that increasing entropy facilitates exploration, whereas others argue that decreasing entropy enhances performance. To reconcile these conflicting observations, we provide a theoretical framework showing that the effect of entropy is governed by \emphEntropy Discrepancy, the distributional divergence between positive and negative samples. Guided by this insight, we derive a principled dynamic scheduling method that adaptively modulates the entropy coefficient, effectively switching between entropy maximization and minimization as training evolves. Extensive experiments confirm the correlation between Entropy Discrepancy and the efficacy of entropy control. Furthermore, our adaptive method yields substantial improvements, boosting Pass@K by 6.7% on AIME24 and 17.52% on puzzle tasks compared to vanilla RL, while consistently outperforming recent state-of-the-art reasoning methods.

Abstract:
LoRA efficiently adapts large pretrained models via low-rank updates, making it a strong parameter-efficient fine-tuning (PEFT) method. When integrated with Federated Learning (FL), it enables collaborative fine-tuning across distributed clients, leveraging rich downstream data without exposing private information. However, this strategy is hindered by data heterogeneity and limits personalization performance. To address this, personalized FedLoRA approaches have been proposed and employ a dual-LoRA architecture, e.g., one branch for global knowledge and another for client-specific adaptation. Nevertheless, this dual-LoRA design introduces additional computational overhead and structural redundancy. To address this limitation, we propose FedPissa, the first framework that rethinks single-LoRA via selective aggregation and subspace decorrelation. We selectively aggregate LoRA components based on their aggregation dynamics, and further apply a decorrelated subspace projection to mitigate heterogeneous update conflicts, reducing cross-client interference and improving personalized adaptation. Experiments on textual and visual scenarios show that FedPissa not only achieves up to 35% lower communication and computation cost, but also improves overall accuracy by up to 8% compared to its counterparts.

Abstract:
Modern cloud systems rely on architectures with many interconnected microservices, which enable scalability and flexibility but make troubleshooting failures difficult. Identifying the root cause requires navigating complex dependencies, often beyond the capacity of domain experts. Causal models offer a principled approach to root cause analysis (RCA), but prior methods are typically sample inefficient, as they assume access to the full causal graph or require large numbers of postfailure interventions. We introduce Bayesian Root Cause Discovery (BRCD), which leverages a partial causal structure (a CPDAG learned during the pre-failure period) and performs Bayesian inference without enumerating all DAGs from each interventional Markov equivalence class (\mathcalI-MEC) for each root cause candidate. Using a recent uniform DAG sampling framework (Wienöbst et al., 2023), BRCD provides the first statistical consistency guarantees for nonparametric RCA, with both identifiability and finite-sample posterior bounds under \varepsilon-vanishing approximation. Empirically, across synthetic benchmarks and three microservice systems (Online Boutique, Sockshop, Petshop), BRCD achieves state-of-the-art top-l accuracy while remaining effective in low-failure-sample regimes and scaling to large graphs.

Abstract:
As large language models are increasingly deployed in highstakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned, explaining a single prompt paired with a chosen completion. This target-conditioned setup obscures mechanistic heterogeneity and hinders scalable discovery. We introduce distribution-level unsupervised feature discovery, which discovers interpretable clusters across a prompt’s continuation distribution and provides a knob to trade off semantic granularity against mechanistic specificity, without manual target selection. Our method samples continuations, represents each with (i) a semantic embedding and (ii) a mechanistic signature derived from sparse feature attributions, and clusters them via a rate–distortion objective that trades off semantic coherence and mechanistic consistency. We also show that our method has cluster-level causality, which validates the discovery of cluster-level mechanistic representation. Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable, unsupervised audit of the mechanisms underlying a model’s continuation distribution.

Abstract:
Emerging Omnimodal Large Language Models (OmniLLMs) enable real-time interaction across video, audio, and text but suffer from prohibitive computational costs due to the quadratic complexity of processing continuous streaming inputs. Existing token compression strategies remain suboptimal as they typically rely on biased modality-centric priors or enforce uniform retention policies, neglecting the heterogeneity across layers and the critical role of cross-modality alignment. To address these challenges, we propose OmniFit, a training-free framework that decouples interaction profiling from inference execution. OmniFit incorporates Layer-Adaptive Heterogeneity Profiling (LAHP) to dynamically allocate computational budgets based on layer-wise redundancy and modality preferences, preserving tokens according to the characteristics of each layer. Furthermore, we introduce Alignment-Rectified Token Selection (ARTS), a lightweight mechanism that efficiently identifies tokens semantically aligned with cross-modal cues. Extensive experiments on 3 model series across 10 benchmarks demonstrate that OmniFit establishes a new Pareto frontier, retaining 98% of model performance with only 20% token usage and achieves up to 2.31× end-to-end inference speedup and 2.5× VRAM saving, significantly outperforming state-of-the-art methods.

Abstract:
Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence isconditionalrather than universal, depending on an implicit assumption frequently violated in practice: the RLHFoptimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizesrelative advantageover the reference policy rather thanabsolute alignmentwith human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPOs' guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance.

Abstract:
Although Federated Learning (FL) offers advantages in privacypreserving for cross-device collaborative learning, its practical deployment remains severely constrained by heterogeneous hardware resources and non-IID (non-independent and identically distributed) data across devices. Sub-model extraction has emerged as a widely adopted strategy for enabling collaborative training among devices with heterogeneous models. However, existing sub-model extraction methods in FL typically rely on coarse-grained stochastic selection or rigid rule-based neuron selection, which severely limits training performance. Specifically, stochastic strategies lead to severe parameter conflicts under non-IID data distributions, while rule-based approaches lack diversity in neuron selection per device, preventing comprehensive parameter optimization. To address this problem, this paper presents a novel sub-model extraction-based FL framework, named SpineFL, which adopts a backbone-sharing mechanism and an activation-guided pruning strategy for sub-model extraction. Specifically, SpineFL decomposes each global model layer into two portions: i) a mandatory backbone shared by all the sub-models to maintain model generalization, and ii) a dynamic portion for sub-model extraction. SpineFL adopts the activation-guided selection strategy to probabilistically select neurons according to their activation frequency from the dynamic portion to generate sub-model, where neurons exhibiting higher historical activation are more likely to be included, thereby simultaneously addressing parameter conflicts while preserving selection diversity. Experimental results demonstrate that compared with state-of-the-art heterogeneous FL methods, SpineFL can achieve up to 3.28% accuracy improvement.

Abstract:
Large Language Models (LLMs) have demonstrated sophisticated comprehension of sequential code, yet their capacity for reasoning about concurrent programs remains largely unquantified. We introduce DRPBench, a benchmark designed to evaluate the concurrent code comprehension of LLMs by measuring their data race prediction performance. To address the challenge of runtime nondeterminism for evaluation on concurrent programs, we frame the evaluation as a fine-grained static prediction task using 1,003 programs from the SV-COMP suite, featuring 549 manually annotated data races with precise variable- and line-level granularity. Our evaluation of 15 state-of-the-art LLMs—spanning standard, reasoning, and agentic variants—reveals that DRPBench effectively differentiates concurrent code comprehension capabilities of LLMs. While the top-performing model (Gemini 3 with test-time reasoning) achieves an F1 score of 74.89%, most models struggle significantly (scoring less than 60%), with Llama 3 70B achieving only 8.80%. Beyond benchmarking, we characterize two primary failure modes: (1) shared-variable distraction, where multiple variable appearances degrade comprehension accuracy, and (2) synchronization-logic myopia, the inability to interpret non-standard synchronization implementations. Our findings provide a diagnostic roadmap for enhancing concurrent code comprehension of LLMs in future development.

Abstract:
Differentially private stochastic gradient descent (DPSGD) achieves privacy by clipping per-sample gradients and injecting Gaussian noise, but its utility is highly sensitive to the choice of the clipping threshold C. A fixed C often degrades performance and necessitates repeated empirical calibration. Existing adaptive clipping methods either modify the gradient update in vanilla DP-SGD, causing additional tuning or optimization overhead, or introduce separate query mechanisms to monitor gradient statistics. In contrast, we leverage the slack information induced by the standard clipping operation, an overlooked signal in prior work, and show that it provides an effective indication for adapting C. In light of this, we propose SlaClip, a privacy-preserving adaptive clipping strategy using a post-hoc Slack Indicator. Under the same training configuration, both SlaClip-DP-SGD and vanilla DP-SGD instantiate the identical Gaussian mechanism, and therefore incur equivalent privacy cost. Moreover, it requires minimal task-specific hyperparameter tuning and exhibits robust performance improvement across diverse datasets and model architectures.

Abstract:
Large language models generate text through probabilistic token sampling, a mechanism increasingly leveraged for inferencetime watermarking to verify AI-generated content. We present an information-theoretic framework that characterizes the trade-off between robustness to text editing and detectability by keyless observers, where detectability bounds are information-theoretic and computational attainability depends on detector access. Central to our analysis is an additive, usable Kullback-Leibler (KL) information budget that governs hypothesis testing separability between watermarked and unwatermarked outputs subject to a stealth constraint. This budget induces a hierarchy of detectability across watermark families: distribution-preserving schemes exhibit zero statistical drift, while probability-modifying schemes at both token and sentence levels accumulate detectable signal with sequence length. When text editing is modeled as a noise process, the usable KL budget contracts quadratically with edit rate for token-level schemes and according to an induced semantic flip rate for sentence-level schemes. These contraction laws reveal an irreducible trilemma among robustness, stealth, and reliable verification. Guided by these limits, we propose a hybrid watermarking strategy that selects among distribution-preserving, semantic-level, and token-level methods based on anticipated editing regimes. Experiments on Llama-2-7B and Mistral-7B under paraphrasing attacks corroborate theoretical predictions and confirm that the hybrid strategy is empirically near-Pareto across evaluated edit regimes.

Abstract:
Despite the remarkable fidelity of generative models, they frequently suffer from mode collapse. Existing strategies for enhancing diversity predominantly focus on intervening during the generation trajectory. We identify a critical oversight that the standard Gaussian initialization often causes trajectories to collapse into dominant modes because it is agnostic to the guidance potential landscape. In this work, we formulate selecting the initial noise from aguidance potential posterior, which effectively reweights the prior towards diversity-rich regions. To sample from this distribution efficiently, we introduceDiversity-inducing Initialization(DivIn), which leverages Langevin dynamics to actively navigate the initialization landscape, steering initial noise away from collapsing regions while anchoring them to the valid data manifold. Our method serves as an inference-time diversity enhancement compatible with both diffusion and flow matching models. Extensive experiments show that DivIn exhibits a superior performance in both class-to-image and text-to-image scenarios. Furthermore, we highlight that as DivIn is orthogonal to trajectory-based methods, combining them significantly expands the diversity-quality Pareto frontier beyond what either achieves in isolation.

Abstract:
Temporal graphs are ubiquitous in realworld applications such as social networks and finance, where Temporal Graph Networks (TGNs) capture both structural and temporal dependencies, achieving in superior predictive accuracy. Understanding which historical events drive specific model predictions can enhance trustworthiness of TGNs. Existing explanation methods overlook the memory module, the core component that records and updates node histories, leaving the influence of past events unexplored. To address this challenge, we propose a method that attributes TGNs predictions through the topology attribution tree and memory backtracking tree. The topology attribution tree captures the influence of neighbors and their memory vectors, then the memory backtracking tree quantifies how historical events shape node memory vectors. We apply the LRP in TGNs, ensuring that the total contribution of events equals the model’s logits. Finally, top-k selection may lack faithfulness due to the nonliear relationship between logits and probabilities. We design optimization objectives to map logits to probabilities and identify the important events. Experiments on nine temporal graph datasets, spanning node property prediction, link prediction tasks and graph classification tasks, show that our method provides faithful explanations and outperforms state-of-the-art baselines.

Abstract:
Reinforcement Learning (RL) refines large language models (LLMs) by directly optimizing model behavior with reward signals. Although accurate state value estimation is essential for stable training in classical RL settings, it remains an understudied challenge in LLM posttraining. In this work, we demonstrate that accurate value estimation can stabilize and improve post-training. First, we construct State Value Estimation Benchmark (SVEB) and show that critics of standard approaches like PPO simply degenerate toward a coarse group-average baseline. To overcome this, we propose two techniques. One is a heuristic methodNumca, which uses numbers in responses as state representation to calculate state value. Another is a general hidden-state-based frameworkHista, which utilize the semantic information in hidden states to group disjoint responses. Experiments show that when equipped with these improved estimates, training gains better performance consistently with different RL algorithms.

Abstract:
Time series imputation benefits from leveraging crossfeature correlations, yet existing attention based methods re-discover feature relationships at each layer, lacking persistent anchors to maintain consistent representations. To address this, we propose HELIX, which assigns each feature a learnable feature identity, a persistent embedding that captures intrinsic semantic properties throughout the network. Unlike graph-based methods that rely on predefined topology and assume homogeneous spatial relationships, HELIX learns arbitrary feature dependencies end-to-end from temporal co-variation, naturally handling datasets where features mix spatial locations with semantic variables. Integrated with hybrid temporal-feature attention, HELIX achieves the state-of-the-art performance, ranking first among 17 methods across 21 experimental settings. Furthermore, our mechanistic analysis reveals that feature attention progressively aligns with underlying physical structure across layers, demonstrating more effectively exploits cross-feature dependencies for imputation.

Abstract:
Long video understanding faces a critical dilemma: cloudbased Large Multimodal Models (LMMs) offer superior reasoning but suffer from prohibitive bandwidth costs and latency, while edge-based solutions sacrifice perception accuracy for speed. Current collaborative approaches attempt to bridge this gap via similarity-based filtering, yet they treat complex queries as flat semantic vectors. We identify this as a fundamental flaw leading to "Semantic Submergence," where dominant visual features drown out subtle but logically critical cues. To solve this, we introduce SCOPE (Semantic Cloud-Orchestrated Perception at Edge). Shifting the paradigm to "Think in Cloud, Look at Edges," SCOPE utilizes a cloud LMM to decompose complex queries into a structured Directed Acyclic Graph (DAG). This "observation plan" guides the edge to retrieve evidence based on logical necessity rather than mere statistical similarity. Experiments on Video-MME and LongVideoBench demonstrate that SCOPE redefines the Pareto frontier, matching cloud-level accuracy with significantly lower transmission costs and outperforming state-of-the-art baselines on complex reasoning tasks.

Abstract:
LLMdriven agents demonstrate strong performance in sequential decision-making but often rely on on-the-fly reasoning, re-deriving solutions even in recurring scenarios. This insufficient experience reuse leads to computational redundancy and execution instability. To bridge this gap, we propose ProcMEM, a framework that enables agents to autonomously learn procedural memory from interaction experiences without parameter updates. By formalizing a Skill-MDP, ProcMEM transforms passive episodic narratives into executable Skills defined by activation, execution, and termination conditions to ensure executability. To achieve reliable reusability without capability degradation, we introduce Non-Parametric PPO, which leverages semantic gradients for high-quality candidate generation and a PPO Gate for robust Skill verification. Through score-based maintenance, ProcMEM sustains compact, high-quality procedural memory. Experimental results across in-domain, cross-task, and cross-agent scenarios demonstrate that ProcMEM achieves superior reuse rates and significant performance gains with extreme memory compression. Visualized evolutionary trajectories and Skill distributions further reveal how ProcMEM transparently accumulates, refines, and reuses procedural knowledge to facilitate long-term autonomy.

Abstract:
Emerging LLM workloads demand extreme memory agility. However, state-of-the-art inference systems (e.g., vLLM) rely on software-defined paging, which sacrifices the contiguous tensor abstraction. This rigid interface exposes fragmen- tation complexity to developers, imposing a se- vere engineering burden that stifles algorithmic innovation. We introduce CONTINUUM, a tensor memory virtualization subsystem implemented as a PyTorch extension. By bypassing serialized OS bottlenecks via a lightweight GPU driver ex- tension, CONTINUUM can significantly reduce the mapping costs by orders of magnitude—from milliseconds to microseconds. Built atop this low-latency API, CONTINUUM provides Elastic Tensor, with a set of flexible tensor operations that natively supports complex memory dynamics and zero-copy topological aliasing. Evaluations demonstrate that CONTINUUM achieves signifi- cantly higher throughput across diverse dynamic scenarios, effectively democratizing the imple- mentation of next-generation LLM applications.

Abstract:
Finetuning-as-a-Service (FaaS) enables personalization of large language models (LLMs) but poses significant safety risks, as fine-tuning user-provided data degrades the model's safety-alignment. Prior works addressing this issue typically rely on explicit regularization, which leads to practical limitations. In this paper, we propose a different paradigm that neutralizes harmful updates via harmful gradient saturation rather than explicit suppression. Our key observation is that, in a jailbroken LLM, safety-degrading gradients are largely saturated, while gradients unrelated to safety remain active during fine-tuning. Based on this insight, we introduce aBufferLoRA-based fine-tuning framework. BufferLoRA is a removable adapter that temporarily jailbreaks the model during user fine-tuning, saturating harmful updates while allowing a UserLoRA to learn user-specific tasks. After fine-tuning, BufferLoRA is removed to restore the base model’s original safety-alignment. To further reinforce safety, we additionally train a SafetyLoRA and integrate its safety components into the UserLoRA via QR decomposition-based merging strategy. Extensive experiments show that our framework achieves superior performance in both safety and utility, without requiring additional safety data during fine-tuning and with minimal computational cost.

Abstract:
Gastrointestinal (GI) motility assessment via bowel sounds (BS) offers a noninvasive alternative to resource-intensive clinical standards. However, the diagnostic utility of BS is often compromised by its spectral overlap with non-stationary speech interference. While generative models have advanced signal restoration, traditional Gaussian-based diffusion frameworks struggle with the impulsive, heavy-tailed nature of real-world clinical noise. In this paper, we propose a novel Cauchy-driven Diffusion Bridge framework to isolate high-fidelity bowel sounds from complex interference. Our contributions are three-fold: (1) We introduce ClinBS, a large-scale clinical dataset (over 25 hours) containing rare pathological transients verified by experts; (2) We mathematically formulate a Cauchy bridge driver, deriving closed-form expressions for the score and density to better model heavy-tailed perturbations; and (3) We implement an efficient sampling procedure via Gaussian scale-mixture reparameterization. Extensive experiments show our framework achieves state-of-the-art performance, outperforming baselines by 13.4%–49.8% across core metrics and elevating abnormal BS recognition accuracy to 88.01%. These results demonstrate the system's potential for robust clinical GI monitoring and diagnosis.

Abstract:
The additive noise mechanism is a foundational tool for differential privacy (DP) of Tdimensional real-valued vector queries. The Gaussian mechanism, utilizing Gaussian noise, is the mostly widely used such mechanism, due to its simplicity and strong privacy guarantees. In this work, we provide justification for this choice, showing that as the dimension T\to\infty, the Gaussian mechanism has the lowest error among all additive noise mechanisms for all meaningful privacy regimes. We also develop a new family of Spherical Generalized Gamma DP mechanisms, which contains both the Gaussian mechanism and the recently studied \ell_2 mechanism (Joseph et al., ICML 2025). We identify members of this family that outperform both the Gaussian and \ell_2 mechanisms in certain low-dimensional settings, and show tight composition of all mechanisms in this family, answering an open question of Joseph et al. regarding the \ell_2 mechanism.

Abstract:
Pruning is a practical approach to compress large language models (LLMs), but it can amplify text degeneration, especially repetition loops, even when perplexity and task accuracy remain largely unchanged. In this work, we present a tokenlevel analysis of this failure mode by viewing decoding as a dynamical process that enters and persists in a small set of recurrent contexts. Our analysis decomposes degeneration into loop entry risk and loop persistence, and shows that persistence is controlled by the escape mass assigned to plausible alternatives within the token sampling set. Motivated by these findings, we propose two token-level guidance objectives for post-pruning fine-tuning. FOCUS reweights distillation toward high-confidence teacher regions to suppress leakage, while RePAIR uses onset-centered positive/negative continuation pairs with a margin loss to promote plausible alternatives and prevent early commitment to repetition loops. Experiments on open-ended continuation and instruction-based generation show that both methods consistently reduce repetition and improve generation quality.

Abstract:
Trained attention layers exhibit striking and reproducible spectral structure of the weights, including lowrank collapse, bulk deformation, and isolated spectral outliers, yet the origin of these phenomena and their implications for generalization remain poorly understood. We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks generated from the attention-indexed model. Using tools from random matrix theory, spin-glass theory, and approximate message passing, we obtain an exact high-dimensional characterization of training and test error, interpolation and recovery thresholds, and the spectrum of the key and query matrices. Our theory predicts the full singular-value distribution of the trained query–key map—including low-rank structure and isolated spectral outliers—in qualitative agreement with observations in more realistic transformers. Finally, for targets with power-law spectra, we show that learning proceeds through sequential spectral recovery, leading to the emergence of power-law scaling laws.

Abstract:
Recent advances in diffusion models show promise for StructureBased Drug Design (SBDD), which aims to generate 3D ligand molecules that bind tightly to specific protein targets. This involves jointly optimizing the ligand's 3D conformation and its binding position within the protein pocket. However, existing diffusion-based SBDD methods diffuse conformation and binding position simultaneously within a high-dimensional joint space, leading to inefficient exploration and suboptimal generation quality in both aspects. To address this, we proposeDeCoDe, a novel diffusion framework thatdecouplesthe diffusion processes of the binding position and molecular conformation. Our key insight is to prioritize the perturbation of the ligand's internal conformation in the early stages of the forward (noising) process, while accelerating the perturbation of its global binding position later. This design guides the reverse (denoising) process tofirst coarsely positionthe ligand within the pocket before refining its detailed structure, mimicking a more efficient, step-wise generation strategy. Extensive experiments on the CrossDocked2020 benchmark show that DeCoDe achieves significantly higher structural fidelity (with an average improvement of 18%), while maintaining competitive binding affinity and overall molecular properties compared to state-of-the-art baselines. Code will be released after acceptance.

Abstract:
We study the problem of recovering the parameters of a multivariate Ornstein–Uhlenbeck (OU) process from steadystate observational and interventional data. In many applications, such as large-scale gene perturbation experiments, only stationary “snapshot” measurements are available, making standard stochastic differential equation estimation methods that rely on time-series trajectories inapplicable. We first establish an identifiability result: one intervention per strongly connected component (SCC) of the drift graph suffices to recover all OU process parameters generically up to a global scaling factor. This holds provided that the SCC condensation graph is connected with a single root and certain spectral nondegeneracy assumptions hold. We propose a recursive learning algorithm that orders SCCs topologically and, for each component, isolates its marginal dynamics and solves a linear system derived from the steady-state moment equations, leveraging parameters recovered for upstream components. Building on this theoretical foundation, we propose a regularized least-squares estimator that jointly minimizes residuals of the steady-state mean and covariance equations across observational and interventional data. Experiments on synthetic and real datasets demonstrate the effectiveness of our method in recovering parameters and predicting unseen interventions.

Abstract:
In recent years, wireless signal recognition (WSR), which leverages artificial intelligence (AI) to identify properties of passively received radio signals, has garnered significant attention due to its broad applications, such as spectrum management. Existing WSR methods typically learn directly from received signals, which are distorted by physical wireless channel effects such as fading, and current denoising diffusion models lack dechanneling capabilities, which leads to performance degradation. Therefore, we propose PWC-Diff, a novel framework that integrates prior Physical Wireless Channels into the denoising Diffusion process. The framework employs a dedicated architecture named FusedFormer, which contains a fusion module and a self-attention module that jointly capture the temporal and spectral characteristics of the signals throughout the diffusion trajectory. By leveraging prior wireless channels, PWC-Diff learns to progressively “de-channel” the received signal and recover a representation closer to the transmitted signal. Extensive experiments on several datasets across three WSR tasks have achieved state-of-the-art (SOTA) performance, which demonstrates the rationality of our theory, and ablation experiments further illustrate the effectiveness of our proposed PWC-Diff.

Abstract:
Federated finetuning presents a promising avenue for adapting Large Language Models (LLMs) to downstream tasks while preserving data privacy. However, the prohibitive computational and communication overhead of LLM adaptation inhibits its deployment on resource-constrained edge devices. In this paper, we propose SmartFed, a resource-efficient framework that circumvents expensive training from scratch by intelligently reusing knowledge embedded in existing LoRA modules. To fully exploit this potential and ensure scalability, we introduce the Mixture of Rank-Wise Experts (MoRE). MoRE decomposes LoRA modules into fine-grained rank-level experts, which are selectively activated based on input semantics and resource budgets. Furthermore, to optimize resource utilization, we propose Elastic Expert Quota Allocation (EEQA), a strategy that adaptively distributes expert capacity across parameter matrices based on their contribution to model performance. Extensive evaluations across multiple benchmarks demonstrate that SmartFed significantly outperforms state-of-the-art methods in both model performance and training efficiency.

Abstract:
We propose a simple strategy for training a foundation model on functional MRI (fMRI) data: we adapt the standard Vision Transformer to fMRI by first converting each 3D fMRI volume to a 2D map using a standard cortical flat map projection. We train spatiotemporal masked autoencoders (MAE) on 2.3K hours of fMRI flat map videos. Our model (CortexMAE) outperforms identical MAE models trained on parcelaveraged or native volume data. We perform the first quantitative scaling analyses for fMRI and observe strict power law scaling. Finally, we develop the first open evaluation suite for fMRI foundation models and use it to perform a comprehensive comparison. On cognitive state decoding, our model outperforms all models by a wide margin. On clinical trait prediction, however, we report an important mixed result: all models show inconsistent performance (including our own). We hope that by introducing reproducible benchmarks and a strong, simple baseline, we can help establish a clear frontier for fMRI foundation models. Code is available at \urlhttps://anonymous.4open.science/r/cortex_mae.

Abstract:
Simulationbased data synthesis has emerged as a powerful paradigm for enhancing real-world robotic manipulation. However, existing synthetic datasets remain insufficient for robust bimanual manipulation due to two key challenges: (1) the lack of an autonomous self-correcting mechanism to resolve execution failures in complex coordination tasks, and (2) the scarcity of diverse visual and spatial variations required to bridge the sim-to-real gap. To this end, we present RoboTwin 2.0, a scalable simulation framework that enables closed-loop, automated, large-scale generation of diverse and realistic data, along with unified evaluation protocols for dual-arm manipulation. Built upon RoboTwin-OD, a foundational library of 731 instances across 147 categories with rich semantic annotations, our framework integrates Multimodal Large Language Models (MLLMs) with simulation-in-the-loop verification. This integration forms an automated feedback mechanism that significantly boosts the success rate of expert task program generation. To enhance robust sim-to-real transfer, RoboTwin 2.0 incorporates structured domain randomization along five axes: clutter, lighting, background, tabletop height and language instructions, thereby maximizing data diversity. We instantiate this framework across 50 dual-arm tasks spanning five robot embodiments. Empirical evaluations demonstrate that Vision-Language-Action (VLA) models pre-trained on our synthetic data achieve a 3.6x improvement in few-shot real-world transfer (over a 10-demo baseline) and a 2.2x gain in zero-shot generalization. We release the data generator, benchmark, pre-collected dataset, and code to support scalable research in robust bimanual manipulation.

Abstract:
Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECGR1, the first reasoning MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using Protocol-Guided Instruction Data Generation, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with Interleaved Modality Dropout to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we propose Reinforcement Learning with ECG Diagnostic Evidence Rewards to explicitly supervise diagnostic evidence and strengthen reasoning quality. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are pervasive in these MLLMs, suggesting that their outputs should not be relied upon by the public. Code will be released upon acceptance.

Abstract:
Differentiable optimization layers are traditionally integrated in predictthen-optimize frameworks where a neural model estimates parameters that subsequently serve as fixed inputs to downstream decision-making optimization problems. In this work, we introduce the concept of a ``fairness layer'': a differentiable optimization layer appended to a model's output layer that guarantees a chosen notion of output parity is satisfied when integrated into a neural network. Additionally, we introduce an online primal-dual inference algorithm that provides provable aggregate fairness guarantees for streaming predictions with arbitrarily small batch sizes, where traditional per-batch constraints become overly restrictive. Numerical experiments demonstrate the effectiveness of the fairness layer and associated algorithm, and theoretical analysis characterizes the layer's differentiability and stability properties during model training and backpropagation. Our code for these experiments is publicly available on GitHub: https://github.com/anonymouspapersubmission012345/icml2026submission and our public Python package documentation can be found online: https://anonymouspapersubmission012345.github.io/fairnesstraininganonymous/ .

Abstract:
Dataset pruning remains underexplored for 3D modalities, where inherent class imbalance persists acrossbothtraining and test sets. This creates a divergence in evaluation: overall accuracy favors natural frequency, reflecting practical usage; while mean accuracy demands balanced generalization. Instead of forcing a premature tradeoff, we advocate for base principles that remain universally robust and beneficial across diverse priors. We cast pruning as a quadrature approximation on population risk and decompose the error bound intorepresentation error(fidelity to the underlying manifold) andprior-mismatch bias(distribution shift), clarifying what can be improved jointly across priors. To address prior-mismatch bias, we decouple likelihood from prior in the posterior and transfer the structural likelihood via distillation with a calibrated teacher and geometry-preserving constraints. Simultaneously, to reduce representation error, we audit common pruning signals and choose geometric embedding, which exhibits greater robustness given the high inductive bias of 3D models. We also prioritize a safety floor before selection, capturing high-reward regions beneficial across priors. Finally, acknowledging that no single subset optimally satisfies divergent evaluation priors, we augment these principles with a steering wrapper that interpolates between stratified seeding and global selection. Empirical results demonstrate that our framework elevates the performance floor while offering flexibility for different prior preferences.

Abstract:
We study programlearning methods that are efficient in both samples and computation. Classical learning theory suggests that when the target admits a short program description (for example, a short piece of ``Python code''), it can be learned from relatively few examples by performing ERM over the program class. However, this approach relies on enumerating candidate programs, which is typically exponential in the description length. In contrast, gradient-based training avoids explicit search, but for some families of short programs it can require exponentially many samples to succeed. We propose \textscLLM-PV, a propose-and-verify recipe that enables ERM-style selection over a discrete program class without exhaustive enumeration. A pretrained LLM induces a proposal distribution over candidate programs; each proposal is executed, scored on a held-out validation set, and the best program is selected. The method uses no gradient updates and does not use validation feedback to adapt the sampling distribution. Across algorithmic tasks including parity variants, pattern matching, and primality testing, \textscLLM-PV often recovers the exact underlying rule from a small labeled set and generalizes far beyond the training sequence lengths. In the same regimes, SGD-trained transformers and standard adaptation baselines (fine-tuning and in-context learning), as well as classical ML baselines, can fit the training data yet fail to generalize reliably. Together, these results suggest that pretrained LLM priors can serve as effective search biases for ERM, narrowing the gap between statistical and computational efficiency.

Abstract:
Autoregressive decoding becomes bandwidthlimited at long contexts, as generating each token requires reading all n_k key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling S \ll n_k indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, demonstrating 1.5× decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli qK^\mathsfT sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference.

Abstract:
Audio selfsupervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data and has achieved remarkable progress in recent years. While most prior work relies on generative reconstruction objectives, contrastive approaches remain relatively underexplored, in part due to the high computational cost of designing effective augmentation strategies and the large batch sizes typically required for pre-training. In this work, we introduceAudioMosaic, an audio encoder for general audio understanding. During pre-training, AudioMosaic applies time–frequency masking to spectrogram patches to form paired inputs, employing an elegant and efficient augmentation strategy that significantly reduces computational cost while supporting large-batch training. The AudioMosaic encoder learns discriminative utterance-level representations that exhibit strong transferability across datasets, domains, and acoustic conditions. Extensive experiments demonstrate that AudioMosaic achieves state-of-the-art performance on multiple standard benchmarks. Moreover, we show that the pretrained AudioMosaic encoder enhances audio perception when integrated with large language models (LLMs).

Abstract:
BuildingVirtual Cellsthat can accurately simulate cellular responses to perturbations is a longstanding goal in systems biology. A fundamental challenge is that high-throughput single-cell sequencing is destructive: the same cell cannot be observed both before and after a perturbation. Thus, perturbation prediction requires mapping unpaired control and perturbed populations. Existing models address this by learning maps between distributions, but typically assume a single fixed response distribution when conditioned on observed cellular context (e.g., cell type) and the perturbation type. In reality, responses vary systematically due to unobservable latent factors such as microenvironmental fluctuations and complex batch effects, forming amanifoldof possible distributions for the same observed conditions. To capture this variability, we introduce PerturbDiff, which shifts modeling from individual cells to entire distributions. By embedding distributions as points in a Hilbert space, we define a diffusion-based generative process operating directly over probability distributions. This allows PerturbDiff to capture population-level response shifts across hidden factors, improving generalization. Benchmarks on established datasets show that PerturbDiff achieves state-of-the-art performance in single-cell response prediction and generalizes substantially better to unseen perturbations. All code and data will be released upon acceptance.

Abstract:
Despite significant recent progress of Multimodal Large Language Models (MLLMs), current MLLMs are challenged by "spatiotemporal" prompts, i.e., prompts that refer to 1) the entirety of an environment encoded in a point cloud that the MLLM should consider; and simultaneously also refer to 2) actions that happened in part of the environment and are encoded in a short ego-centric video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this challenge, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent MLLMs indeed struggle to correctly answer "spatio-temporal" prompts. Building on this dataset, we study two spatio-temporal LLM (STLLM) baselines: 1) STLLM-3D, which directly fuses point cloud, video, and text representations as inputs to the LLM; and 2) STLLM-Aligner, which aligns spatial context with video and text before LLM decoding. Both baselines aim to enhance spatial understanding of environments and temporal grounding of egocentric observations. On REA, the STLLM baselines outperform existing models, demonstrating the effectiveness of our designs.

Abstract:
3D CT analysis spans a continuum from lowlevel perception to high-level clinical understanding. Existing 3D-oriented analysis methods adopt either isolated task-specific modeling or task-agnostic end-to-end paradigms to produce one-hop outputs, impeding the systematic accumulation of perceptual evidence for downstream reasoning. In parallel, recent multimodal large language models (MLLMs) exhibit improved visual perception and can integrate visual and textual information effectively, yet their predominantly 2D-oriented designs fundamentally limit their ability to perceive and analysis volumetric medical data. To bridge this gap, we propose 3DMedAgent, an unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. 3DMedAgent coordinates heterogeneous visual and textual tools through a flexible MLLM agent, progressively decomposing complex 3D analysis into tractable subtasks that transition from global to regional views, from 3D volumes to informative 2D slices, and from visual evidence to structured textual representations. Central to this design, 3DMedAgent maintains a long-term structured memory that aggregates intermediate tool outputs and supports query-adaptive, evidence-driven multi-step reasoning. We further introduce the DeepChestVQA benchmark for evaluating unified perception-to-understanding capabilities in 3D thoracic imaging. Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs, highlighting a scalable path toward general-purpose 3D clinical assistants. Anonymized code and data are available \hrefhttps://anonymous.4open.science/r/3DMedAgent-D7D7here.

Abstract:
We address the challenge of automatically generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety finetuning. While several prior approaches train LLMs with reinforcement learning (RL) to generate such prompts using only a toxicity classifier as a reward, existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce Active Attacks, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, we naturally induce an \empheasy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. We observe that this simple plug-and-play module, which seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods, improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain of more than 400×) with only a 6% increase in computation.

Abstract:
Longsequence modeling faces a fundamental trade-off between the efficiency of compressive fixed-size memory in RNN-like models and the fidelity of lossless growing memory in attention-based Transformers. Inspired by the Multi-Store Model in cognitive science, we introduce a memory framework of artificial neural networks. Our method maintains a sliding window of the Transformer's KV cache as lossless short-term memory, while a learnable module termed Artificial Hippocampus Network (AHN) recurrently compresses out-of-window information into a fixed-size compact long-term memory. To validate this framework, we instantiate AHNs using modern RNN-like architectures, including Mamba2, DeltaNet, and GatedDeltaNet to augment open-weight base LLMs. We also propose an efficient self-distillation method where the base model' all parameters are frozen and only the parameters from AHNs are optimized. For inference, our method sets a default large sliding window size of 32k for attention, and AHNs activate only when the sequence length exceeds the 32k window, addressing the quadratic-complexity issue of attention that emerges at that scale. Extensive experiments on long-context benchmarks LV-Eval and InfiniteBench demonstrate that AHN-augmented models consistently outperform sliding window baselines and achieve performance comparable or even superior to full-attention models, while substantially reducing computational and memory requirements. For instance, augmenting the Qwen2.5-3B-Instruct with AHNs reduces inference FLOPs by 40.5% and memory cache by 74.0%, while improving its average score on LV-Eval (128k sequence length) from 4.41 to 5.88.

Abstract:
LLMas-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM judges exhibit correlated errors caused by shared latent confounders---such as verbosity, stylistic preferences, or training artifacts---causing standard aggregation rules like majority vote or averaging to provide little gain or even amplify systematic mistakes. To address this, we introduce CARE, a confounder-aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors. Rather than heuristically re-weighting judges, CARE separates quality from confounders without access to ground-truth labels. We provide theoretical guarantees for identifiability and finite-sample recovery under shared confounders, and we quantify the systematic bias incurred when aggregation models omit confounding latent factors. Across 12 public benchmarks spanning continuous scoring, binary classification, and pairwise preference settings, CARE improves aggregation accuracy, reducing error by up to 26.8%.

Abstract:
In scene graph generation (SGG), a central challenge is modeling polysemous predicates whose meanings shift across contexts. Prior approaches address this issue by decomposing predicates into multiple static prototypes or retrieving semantically similar exemplars. However, these strategies keep predicate representations static and cannot reorganize semantics to reflect the evidence of a given image, leading to systematic confusions in ambiguous contexts. We propose AlignG, which learns contextconditioned predicate semantics via prototype feedback. AlignG infers image-conditioned predicate semantics from the set of relations within each image and feeds the adapted semantics back to recalibrate relation representations while preserving dataset-level semantic coherence. The learning objective anchors context adaptation to global semantic centers, preventing semantic drift while still allowing selective semantic reorganization when the scene provides consistent relational cues. Experiments on VG-150 and GQA-200 show consistent improvements over strong baselines, with F@100 improvements of +1.4 and +2.7 under SGDet. We further visualize per-image prototype similarity shifts and observe coherent context-dependent reorganization where prototypes selectively merge or separate predicates according to scene evidence.

Abstract:
Realworld videos often extend over thousands of frames. Existing generative video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor consistency hindered by temporal decomposition that causes artifacts and discontinuities. To break these limits, we propose InfVSR, which reformulates VSR as an autoregressive-one-step-diffusion paradigm, and enables streaming inference with video diffusion priors. First, we adapt the pretrained DiT into a causal structure, maintaining both local and global coherence via rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Code will be released soon.

Abstract:
Federated Learning (FL) enables collaborative training of distributed clients while protecting privacy. To enhance generalization capability in FL, prototypebased FL is in the spotlight, since shared global prototypes offer semantic anchors for aligning client-specific local prototypes. However, existing methods update global prototypes at the prototype-level via averaging local prototypes or refining global anchors, which often leads to semantic drift across clients and subsequently yields a misaligned global signal. To alleviate this issue, we introducehyper-prototypes, defined by a set of learnable global class-wise prototypes to preserve underlying semantic knowledge across clients. The hyper-prototypes are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples, rather than prototype-level descriptors. We further proposeFedHPro, a Federated Hyper-Prototype Learning framework, to leverage hyper-prototypes to promote inter-class separability via mutual-contrastive learning with client-specific margin, while encouraging intra-class uniformity through a consistency penalty. Comprehensive experiments under diverse heterogeneous scenarios confirm that 1) hyper-prototypes produce a more semantically consistent global signal, and 2) FedHPro achieves state-of-the-art performance on several benchmark datasets.

Abstract:
Realworld data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet, existing benchmarks often suffer from code-sandbox confinement, single-language creation-only tasks, and assumption of perfect intent. To bridge these gaps, we introduce DV-World, a benchmark of 260 tasks designed to evaluate DV agents across real-world professional lifecycles. DV-World spans three domains: DV-Sheet for native spreadsheet manipulation including chart and dashboard creation as well as diagnostic repair; DV-Evolution for adapting and restructuring reference visual artifacts to fit new data across diverse programming paradigms and DV-Interact for proactive intent alignment with a user simulator that mimics real-world ambiguous requirements. Our hybrid evaluation framework integrates Table-value Alignment for numerical precision and MLLM-as-a-Judge with rubrics for semantic-visual assessment. Experiments reveal that state-of-the-art models achieve less than 50% overall performance, exposing critical deficits in handling the complex challenges of real-world data visualization. DV-World provides a realistic testbed to steer development toward the versatile expertise required in enterprise workflows. Data and code are available at \urlhttps://anonymous.4open.science/r/DV-World-50D2.

Abstract:
While video generation holds promise for embodied intelligence, current video models struggle with physical realism, and progress is hindered by the lack of standardized benchmarks. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robotoriented video generation across five task domains and four distinct embodiments. By assessing task correctness and visual fidelity through reproducible metrics, our evaluation of 25 models reveals significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a 0.96 Spearman correlation with human judgment, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with physical property annotations. Extensive experiments demonstrate that finetuning on RoVid-X yields consistent performance gains. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward physical intelligence. The code and video demos are available in the supplementary materials.

Abstract:
Lowbit model quantization for image super-resolution (SR) is a longstanding task that is renowned for its surprising compression and acceleration ability. However, accuracy degradation is inevitable when compressing the full-precision (FP) model to ultra-low bit widths (2～4 bits). Experimentally, we observe that the degradation of quantization is mainly attributed to the quantization of activation instead of model weights. Considering that the activation quantization error is hard to minimize, minimizing the impact of the error emerges as a good choice, which is described by the condition number. Therefore, we propose CondiQuant, a condition number-based low-bit post-training quantization for image super-resolution. Specifically, we formulate the quantization error of activation as the condition number of weight metrics. By decoupling the representation ability and the quantization sensitivity, we design an efficient proximal gradient descent algorithm to iteratively minimize the condition number and maintain the output. With comprehensive experiments, we demonstrate that CondiQuant outperforms existing state-of-the-art post-training quantization methods in accuracy without computation overhead and gains the theoretically optimal compression ratio in model parameters. Our code will be released soon.

Abstract:
World models are essential for autonomous robotic planning. However, the substantial computational overhead of existing dense Transformerbased models significantly hinders real-time deployment. To address this efficiency-performance bottleneck, we introduce DDP-WM, a novel world model centered on the principle of Disentangled Dynamics Prediction (DDP). We hypothesize that latent state evolution in observed scenes is heterogeneous and can be decomposed into sparse primary dynamics driven by physical interactions and secondary context-driven background updates. DDP-WM realizes this decomposition through an architecture that integrates efficient historical processing with dynamic localization to isolate primary dynamics. By employing a cross-attention mechanism for background updates, the framework optimizes resource allocation and provides a smooth optimization landscape for planners. Extensive experiments demonstrate that DDP-WM achieves superior efficiency and performance across diverse tasks, including navigation, precise tabletop manipulation, and complex deformable or multi-body interactions. Specifically, on the challenging Push-T task, DDP-WM achieves an approximately 9 times inference speedup and improves the MPC success rate from 90% to 98% compared to state-of-the-art dense models.

Abstract:
Textto-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data will be available.

Abstract:
World models require robust relational understanding to support prediction, reasoning, and control. While objectcentric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object’s state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning over the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive bias via latent interventions. Code will be available atanonymous.

Abstract:
Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a critical frontier. This task presents two fundamental challenges: multiscale context entanglement, where monolithic generation fails to balance local object structures with global environmental layouts; and a semantic-physical execution gap, where open-loop code generation leads to physical hallucinations lacking dynamic fidelity. We introduce Code2Worlds, a framework that formulates 4D generation as language-to-simulation code generation. First, we propose a dual-stream architecture that disentangles retrieval-augmented object generation from hierarchical environmental orchestration. Second, to ensure dynamic fidelity, we establish a physics-aware closed-loop mechanism in which a PostProcess Agent scripts dynamics, coupled with a VLM-Motion Critic that performs self-reflection to iteratively refine simulation code. Evaluations on the Code4D benchmark show Code2Worlds outperforms baselines with a 41% SGS gain and 49% higher Richness, while uniquely generating physics-aware dynamics absent in prior static methods.

Abstract:
To cope with uncertain changes of the external world, intelligent systems must continually learn from complex, evolving environments and respond in real time. This ability, collectively known as general continual learning (GCL), encapsulates practical challenges such as online datastreams and blurry task boundaries. Although leveraging pretrained models (PTMs) has greatly advanced conventional continual learning (CL), these methods remain limited in reconciling the diverse and temporally mixed information along a single pass, resulting in suboptimal GCL performance. Inspired by meta-plasticity and reconstructive memory in neuroscience, we introduce here an innovative approach namedMetaPost-Refinement (MePo) for PTMs-based GCL. This approach constructs pseudo task sequences from pretraining data and develops a bi-level meta-learning paradigm to refine the pretrained backbone, which serves as a prolonged pretraining phase but greatly facilitates rapid adaptation of representation learning to downstream GCL tasks. MePo further initializes a meta covariance matrix as the reference geometry of pretrained representation space, enabling GCL to exploit second-order statistics for robust output alignment. MePo serves as a plug-in strategy that achieves significant performance gains across a variety of GCL benchmarks and pretrained checkpoints in a rehearsal-free manner (e.g., 15.10%, 13.36%, and 12.56% on CIFAR-100, ImageNet-R, and CUB-200 under Sup-21/1K).

Abstract:
While recent generative models can synthesize videos in pixel space, they often fail to produce educational videos with precise structures, domain knowledge, and coherent transitions. We argue that this setting is better served by operating in a renderable environment that is explicitly controlled by code. We proposeCode2Video, a codecentric agent framework that generates educational videos by writing executable Python programs. Code2Video includes three agents: aPlannerthat converts lecture content into a temporal storyboard, aCoderthat turns the storyboard into runnable code with scope-guided auto-fix, and aCriticthat refines layout using a VLM guided byvisual anchor prompting,i.e., mappings from target visual outcomes to code edits. For evaluation, we buildMMMC, a benchmark of professionally produced, discipline-specific educational videos. We assess Code2Video using aesthetic scores (VLM-as-a-Judge), code efficiency, andTeachQuiz, an end-to-end metric that measures how well anunlearnedVLM can recover knowledge after watching generated videos. Code2Video improves performance by 40% over direct code generation and produces videos comparable to human-crafted tutorials.

Abstract:
We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt'', diverging from the prevailing pipeline that leverages the fixed last hidden layer as "\texttt\ as prompt''. UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layerby-layer propagation without intermediate correction, and (2) its use of \texttt\ as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each \texttt\ token to slide across unrolled transformer layers, enabling dynamic layer selection at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the \texttt\ token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, have unified visual grounding within a single framework from an attribute perspective, spanning from traditional refer expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, positive query to false premise (empty target). All codes are provided in the supplementary material.

Abstract:
While fewstep generative models have enabled powerful image and video generation at significantly lower cost, generic reinforcement learning (RL) paradigms for few-step models remain an unsolved problem. Existing RL approaches for few-step diffusion models strongly rely on back-propagating through differentiable reward models, thereby excluding the majority of important real-world reward signals, e.g., non-differentiable rewards such as humans' binary likeness, object counts, etc. To properly incorporate non-differentiable rewards to improve few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we developed practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models' ability with generic rewards. We conduct extensive experiments ranging from text-rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art reinforcement learning performances on both in-domain and out-of-domain metrics.

Abstract:
Scaling actioncontrollable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq\Delta-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.

Abstract:
Histopathology wholeslide images (WSIs) are routinely acquired in clinical practice and contain rich tissue morphology but lack direct molecular architecture and functional programs defining pathological states, whereas RNA sequencing (RNA-seq) provides genome-wide transcriptional profiles at substantial cost, thereby motivating WSI-based genome-wide transcriptomic prediction. Existing approaches for predicting gene expression from WSIs predominantly rely on deterministic regression with one-to-one mapping, limiting their ability to capture biological heterogeneity and predictive uncertainty. We propose RNA-FM, a flow-matching generative framework for genome-wide bulk RNA-seq prediction from histopathology images. RNA-FM formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a velocity field that maps a simple prior to the target gene expression distribution conditioned on morphological features. By incorporating pathway-level structure, RNA-FM enables scalable, biologically interpretable, and genome-wide gene expression imputation. Extensive experiments across multiple anatomical regions, pathway-level analysis, and external validation cohorts demonstrate that RNA-FM consistently outperforms state-of-the-art approaches while effectively capturing both inter-patient and intra-tumoral heterogeneity.

Abstract:
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but realworld deployment requires them to continually expand their capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. Recent methods leverage sparse expert routing to promote task specialization, but we find that the expert routing process suffers from drift as the data distribution evolves. For example, a grounding query that previously activated localization experts may instead be routed to irrelevant experts after learning OCR tasks. Meanwhile, the grounding-related experts can be overwritten by new tasks and lose their original functionality. Such failure reflects two problems: router drift, where expert selection becomes inconsistent over time, and expert drift, where shared experts are overwritten across tasks. Therefore, we propose StAbilized Mixture-of-Experts (SAME) for MCIT. To address router drift, SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task-relevant directions. To mitigate expert drift, we regulate expert updates via curvature-aware scaling using historical input covariance in a rehearsal-free manner. SAME also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross-task interference. Extensive experiments demonstrate its SOTA performance.

Abstract:
Current appearancebased gaze estimation suffers from poor generalization due to the scarcity of annotated samples and insufficient diversity of datasets. Leading methods have explored weakly supervised learning to generate large-scale pseudo-labeled data collected by unconstrained scenarios to mitigate domain shift in the wild. In this work, we devise a simple yet efficient semi-supervised contrastive learning framework to exploit unlabeled data for generalized gaze estimation, thereby reducing reliance on manual annotations. Our key insight is to leverage the Jacobian regularization constraint to disentangle representation into identifiable subspaces dedicated to specific gaze components, e.g., pitch and yaw angles. Then we exploit the inner ordinal ranking relationship for contrastive learning in each specific subspace to learn a robust gaze representation from labeled and unlabeled samples, which leads to our Disentangled Subspace Contrastive Learning (shortened to DSCL) framework. Extensive experiments across multiple benchmarks demonstrate that the proposed method is plug-and-play, which achieves competitive performance with 20%, 10%, and even 5% annotated data in both in-domain and cross-domain evaluations.

Abstract:
With recent advancements in largescale pre-trained text-to-image (T2I) models, training-free image editing methods have demonstrated remarkable success. Typically, these methods involve adding noise to a clean image via an inversion process, followed by separate denoising steps for the reconstruction and editing paths during the forward process. However, since the reconstruction path is approximated using noisy latents from mismatched timesteps, existing methods inevitably suffer from accumulated drift, which fundamentally limits reconstruction fidelity. To address this challenge, we systematically analyze the inversion process within the flow transformer and propose DirectEdit, a simple yet efficient editing method that eliminates the inherent reconstruction error without introducing additional neural function evaluations (NFEs). Unlike most prior works that attempt to rectify the inversion path, DirectEdit focuses on directly aligning the forward paths, enabling precise reconstruction and reliable feature sharing. Furthermore, we introduce a preservation mechanism based on attention feature injection and multi-branch mask-guided noise blending, which effectively balances fidelity and editability. Extensive experiments across diverse scenarios demonstrate that DirectEdit achieves efficient and accurate image editing, delivering superior performance that outperforms state-of-the-art methods. Code will be released.

Abstract:
We present an ultrafast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram, infini-gram mini, and SoftMatcha. As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in six languages.

Abstract:
Scaling video generation from seconds to minutes faces a critical bottleneck: while shortvideo data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. While multi-resolution image training works because higher resolution is largely an interpolation of the same underlying patch distribution, training across video lengths is fundamentally different: a longer video is an extrapolation that must invent new events and causal structure beyond the short-clip horizon. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence from a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos that learns long-range coherence and motions from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to a frozen short-video teacher.

Abstract:
Openended learning frames intelligence as emerging from continual interaction with an ever-expanding space of environments. While recent advances have utilized foundation models to programmatically generate diverse environments, these approaches often focus on discovering isolated behaviors rather than orchestrating sustained progression. In complex open-ended worlds, the large combinatorial space of possible challenges makes it difficult for agents to discover sequences of experiences that remain consistently learnable. To address this, we propose Dreaming in Code (DiCode), a framework in which foundation models synthesize executable environment code to scaffold learning toward increasing competence. In DiCode, “dreaming” takes the form of materializing code-level variations of the world. We instantiate DiCode in Craftax, a challenging open-ended benchmark characterized by rich mechanics and long-horizon progression. Empirically, DiCode enables agents to acquire long-horizon skills, achieving a 16% improvement in mean return over the strongest baseline and non-zero success on late-game combat tasks where prior methods fail. Our results suggest that code-level environment design provides a practical mechanism for curriculum control, enabling the construction of intermediate environments that bridge competence gaps in open-ended worlds.

Abstract:
Global weather forecasting has recently been revolutionized by AI, outperforming traditional physicsbased ensembles. However, these state-of-the-art (SOTA) models rely on massive computational resources and increasingly specialized architectures, creating a high barrier to entry. In this work, we demonstrate that such complexity is not a prerequisite for SOTA performance. We introduce U-Cast, a streamlined probabilistic forecaster based on a standard U-Net, trained with Monte Carlo Dropout and the Muon optimizer. By leveraging a novel curriculum—deterministic pre-training followed by probabilistic fine-tuning on the Continuous Ranked Probability Score (CRPS)—our model achieves performance on par with or exceeding GenCast (e.g., up to 15% CRPS improvement on short-range winds) while reducing training and/or inference compute by an order of magnitude compared to leading baselines. Our 1^\circ model trains in less than 15 H200 GPU-days and generates a 60-step forecast in just 12 seconds, suggesting a "Bitter Lesson" for AI weather forecasting: scalable, general-purpose architectures can outperform complex domain-specific designs.

Abstract:
Symbolic Regression (SR) aims to discover interpretable analytical expressions that accurately describe observed data. Amortized SR promises to be much more efficient than the predominant genetic programming SR methods, but currently struggles to scale to realistic scientific complexity. We find that the central obstacle is thesimplification bottleneck, i.e. its inability to quickly reduce equivalent expressions to a concise normalized form. Amortized SR has addressed this by generalpurpose Computer Algebra Systems (CAS) like SymPy, but the high computational cost severely limits training and inference speed. We proposeSimpliPy, a rule-based simplification engine achieving a 100-fold speed-up over SymPy at comparable quality. This enables substantial improvements in amortized SR, including scalability to much larger training sets, more efficient use of the per-expression token budget, and systematic test-set decontamination with respect to equivalent training expressions. We demonstrate these advantages in ourFlash-ANSRframework, which achieves much better accuracy than amortized baselines (NeSymReS, E2E) on the FastSRB benchmark. Moreover, it performs on par with state-of-the-art direct optimization (PySR) while recovering more concise instead of more complex expressions with increasing inference budget.

Abstract:
Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose \textscLight Forcing, the first sparse attention solution tailored for AR video generation models. It incorporates a ChunkAware Growth mechanism to quantitatively estimate the contribution of each chunk, which determines their sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge in earlier chunks during generation. Additionally, we introduce a Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. Such two-level mask selection strategy (\ie, frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention in quality (\eg, 84.5 on VBench) and efficiency (\eg, 1.2～1.3× end-to-end speedup). Combined with FP8 quantization and LightVAE, \textscLight Forcing further achieves a 2.3× speedup and 19.7\,FPS on an RTX~5090 GPU.

Abstract:
Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and userspecific data. A common approach is retrieval-augmented generation (RAG), which stores verbatim documents externally (as chunks) and retrieves only a relevant subset at inference time for an LLM to reason over. However, this results in inefficient usage of test-time compute (LLM repeatedly reasons over the same documents); moreover, chunk retrieval can inject irrelevant context that increases unsupported generation. We propose a human-like non-parametric continual learning framework, where the base model remains fixed, and learning occurs by integrating each new experience into an external semantic memory state that accumulates and consolidates itself continually. We present PANINI, which realizes this by representing documents as Generative Semantic Workspaces (GSW)—an entity- and event-aware network of question–answer (QA) pairs, sufficient for an LLM to reconstruct the experienced situations and mine latent knowledge via reasoning-grounded inference chains on the network. Given a query, PANINI only traverses the continually-updated GSW (not the verbatim documents or chunks), and retrieves the most likely inference chains. Across six QA benchmarks, PANINI achieves the highest average performance, 5%–7% higher than other competitive baselines, while using 2–30× fewer answer-context tokens, supports fully open-source pipelines, and reduces unsupported answers on curated unanswerable queries. The results show that efficient and accurate structuring of experiences at write time—as achieved by the GSW framework—yields both efficiency and reliability gains at read time.

Abstract:
A wide range of optimization problems can often be written in terms of generalized convex functions (GCFs). When this structure is present, it can convert certain nested bilevel objectives into singlelevel problems amenable to standard first-order optimization methods. We provide a new differentiable layer with a convex parameter space and show (Theorems 5.1 and 5.2) that it and its gradient are universal approximators for GCFs and their gradients. We demonstrate how this parameterization can be leveraged in practice by (i) learning optimal transport maps with general cost functions and (ii) learning optimal auctions of multiple goods. In both these cases, we show how our layer can be used to convert the existing bilevel or min-max formulations into single-level problems that can be solved efficiently with first-order methods.

Abstract:
With the current surge in spatial reasoning, researchers have made significant progress in understanding indoor scenes, but still struggle with more diverse applications. This paper aims to advance allscale spatial reasoning by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive annotations for dataset curation; 2) the absence of all-scale modeling, which often leads to overfitting to single scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, as thefirst attemptto broaden the scope of all-scale spatial intelligence. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to createSpaceVista-1M, a dataset comprising 1M spatial QAs spanning 19 diverse tasks. While specialist models offer valuable domain knowledge, they are often unreliable evaluators. Therefore, we build an all-scale benchmark with precise annotations by manually recording and retrieving videos. Nevertheless, naive training with SpaceVista-1M often yields suboptimal results due to the potential knowledge conflict. Accordingly, we introduceSpaceVista-7B, a spatial reasoning model that accepts inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including ourSpaceVista-Bench, demonstrate competitive performance, showcasing generalization across all scales and scenarios. All materials will be released at https://mm2km.github.io/.

Abstract:
Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful path planning to achieve highquality generation. While existing samplers typically greedily select positions with the lowest uncertainty, we identify their fundamental limitations through failure case analysis, showing they overlook the impact of current actions on subsequent steps and fail to optimize cumulative uncertainty. To bridge this gap, we propose theInfo-Gain Sampler, a principled decoding framework that balances immediate costs with information gain. Our method features a simple but effective objective and an efficient implementation that ensures practical overhead is minimal. Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info-Gain Sampler consistently outperforms existing samplers, significantly raising the performance ceiling of MDMs. For instance, it achieves a 5.5% improvement in average accuracy on reasoning tasks and a 63.1% win-rate in creative writing; notably, on reasoning tasks it reduces cumulative uncertainty from 78.4 to 48.6, outperforming the best baseline by a large margin.

Abstract:
The explosive growth of opensource model repositories has created a Model Jungle, where checkpoints are frequently shared without adequate documentation or metadata. While weight-space learning offers a pathway to identify and analyze these models directly from their parameters, processing full-scale weights is computationally prohibitive. Probing-based methods have emerged as a lightweight alternative, extracting permutation-equivariant representations via learnable probe vectors. However, existing probing methods are limited by a single-view design: they capture first-order structures but fail to encode the rich, higher-order correlation patterns inherent in row--column interactions. To bridge this gap, we introduce MVProbe, a multi-perspective probing framework that synthesizes first-order signals with interaction-aware (Gram-based) views. Our approach is theoretically grounded; we analyze the scaling laws of different probing orders to derive a principled standardization and fusion strategy that ensures balanced contributions from all branches. On the Model Jungle benchmark, MVProbe consistently outperforms the state-of-the-art ProbeX across diverse architectures, including ResNet, SupViT, MAE, and DINO.

Abstract:
Large reasoning models achieve strong performance by scaling inferencetime chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.

Abstract:
Protein sequence generation for engineering requires samples that are biophysically plausible and, when targeting a family/domain, remain recognizable members while exploring withinfamily diversity. Current discrete generative models typically start from uniform or masked-token noise, which discards strong position-specific constraints induced by evolution and forces the model to reconstruct conserved residues from scratch, leading to weak family control and low foldability. We propose \emphLineageFlow, a simplex-valued flow-matching model that initializes generation from lineage priors derived from ancestral sequence reconstruction, turning generation into structured mutation from an evolved scaffold. Across diverse protein families, LineageFlow achieves family validity close to held-out natural sequences and improves predicted structural confidence over uniform-/mask-initialized baselines while maintaining substantial within-family novelty and diversity, even surpassing a large pretrained baseline trained on substantially more data. Finally, we introduce \emphrerouting, a single intermediate-time mutate--select--amplify intervention that enables objective-guided sampling without per-step predictor guidance and yields further gains in plausibility, including a zero-shot enzyme generation case study.

Abstract:
Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, motivating the need to find adversarial prompts that realistically elicit such failures. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing approaches struggle to solve this problem. On the one hand, attacks that optimize directly over the discrete prompt space can enforce both semantic equivalence and coherence, but are limited to a finite set of prompt variations. This constraint reduces attack diversity and often leads to suboptimal optimization. On the other hand, attacks that optimize over the continuous LLM latent space enable powerful continuous optimization methods, but typically fail to produce prompts that are both semantically equivalent and coherent. To address these limitations, we propose REALISTA, an adversarial attack framework that bridges the semantic diversity of continuous attacks with the semantic realism of discrete attacks. REALISTA operates in the LLM latent space, expressing adversarial perturbations as continuous combinations of editing directions. By construction, solutions to the optimization problem correspond to valid rephrasings, which naturally encourage semantic equivalence and coherence. Experiments demonstrate that REALISTA achieves superior or comparable performance to stateof-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail.

Abstract:
Reinforcement learning (RL) has shown promising results in active flow control (AFC), yet progress in the field remains difficult to assess as existing studies rely on heterogeneous observation and actuation schemes, numerical setups, and evaluation protocols. Current AFC benchmarks attempt to address these issues but heavily rely on external computational fluid dynamics (CFD) solvers, are not fully differentiable, and provide limited 3D and multiagent support. To overcome these limitations, we introduce FluidGym, the first standalone, fully differentiable benchmark suite for RL in AFC. Built entirely in PyTorch on top of the GPU-accelerated PICT solver, FluidGym runs in a single Python stack, requires no external CFD software, and provides standardized evaluation protocols. We present baseline results with PPO and SAC and release all environments, datasets, and trained models as public resources. FluidGym enables systematic comparison of control methods, establishes a scalable foundation for future research in learning-based flow control, and is available at https://anonymous.4open.science/r/fluidgym (anonymized mirror of our public repository).

Abstract:
OneStep Diffusion Models have demonstrated promising capability and fast inference in real-world Video Super-Resolution (VSR). However, the substantial model size and high computational cost of Diffusion Transformers (DiTs) hinder their practical deployment. While low-bit quantization is a common approach for model compression, the effectiveness of quantized models is challenged by the high dynamic range of input latent and diverse layer behaviors. To address these limitations, we introduce LSGQuant, a layer-sensitivity guided quantization framework for one-step diffusion-based real-world VSR. Our method incorporates a Dynamic Range Adaptive Quantizer (DRAQ) to fit video token activations. Furthermore, we estimate layer sensitivity and implement a Variance-Oriented Layer Training Strategy (VOLTS) by analyzing layer-wise statistics in calibration. We also introduce Quantization-Aware Optimization (QAO) to jointly refine the quantized branch and a retained high-precision branch. Extensive experiments demonstrate that our approach maintains performance comparable to the full-precision model and significantly exceeds existing quantization techniques. All models and code will soon be publicly available.

Abstract:
We consider the task of predicting Hamiltonian matrices to accelerate electronic structure calculations, which plays an important role in physics, chemistry, and materials science. Motivated by the inherent relationship between the offdiagonal blocks of the Hamiltonian matrix and the SO(2) local frame, we propose a novel and efficient network, called QHNetV2, that achieves global SO(3) equivariance without the costly SO(3) Clebsch–Gordan tensor products. This is achieved by introducing a set of new efficient and powerful SO(2)-equivariant operations and performing all off-diagonal feature updates and message passing within SO(2) local frames, thereby eliminating the need of SO(3) tensor products. Moreover, a continuous SO(2) tensor product is performed within the SO(2) local frame at each node to fuse node features. Extensive experiments on the large QH9 and MD17 datasets demonstrate that our model achieves superior performance across a wide range of molecular structures and trajectories, highlighting its strong generalization capability. The proposed SO(2) operations on SO(2) local frames offer a promising direction for scalable and symmetry-aware learning of electronic structures.

Abstract:
We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decisionmaking. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.

Abstract:
VisionLanguage-Action (VLA) models benefit from Chain-of-Thought (CoT) reasoning, but existing approaches incur high inference overhead and rely on discrete reasoning representations that mismatch continuous perception and control. We propose Latent Reasoning VLA (LaRA-VLA), a unified VLA framework that internalizes multi-modal CoT reasoning into continuous latent representations for embodied action. LaRA-VLA performs unified reasoning and prediction in latent space, eliminating explicit CoT generation at inference time and enabling efficient, action-oriented control. To realize latent embodied reasoning, we introduce a curriculum-based training paradigm that progressively transitions from explicit textual and visual CoT supervision to latent reasoning, and finally adapts latent reasoning dynamics to condition action generation. We construct two structured CoT datasets, LIBERO-LaRA and Bridge-LaRA, and evaluate LaRA-VLA across simulation benchmarks and long-horizon real-robot manipulation tasks. Experimental results show that LaRA-VLA outperforms existing state-of-the-art VLA methods while achieving up to a 90% reduction in inference latency compared to explicit CoT-based VLA approaches, highlighting latent reasoning as an effective and efficient paradigm for real-time embodied control.

Abstract:
Monocular normal estimation for transparent objects is critical for laboratory automation, yet it remains challenging due to complex light refraction and reflection. These optical properties often lead to catastrophic failures in conventional depth and normal sensors, hindering the deployment of embodied AI in scientific environments. We propose TransNormal, a novel framework that adapts pretrained diffusion priors for single-step normal regression. To handle the lack of texture in transparent surfaces, TransNormal integrates dense visual semantics from DINOv3 via a cross-attention mechanism, providing strong geometric cues. Furthermore, we employ a multi-task learning objective and wavelet-based regularization to ensure the preservation of fine-grained structural details. To support this task, we introduce TransNormal-Synthetic, a physics-based dataset with high-fidelity normal maps for transparent labware. Extensive experiments demonstrate that TransNormal significantly outperforms state-of-the-art methods: on the ClearGrasp benchmark, it reduces mean error by 24.4% and improves 11.25^\circ accuracy by 22.8%; on ClearPose, it achieves a 15.2% reduction in mean error. The code and dataset will be made publicly available.

Abstract:
Domain Incremental Learning is a critical scenario that requires models to continuously adapt to new data domains without retraining. However, domain shifts often cause severe performance degradation. To address this, we propose Hybrid EnergyDistance Prompt, a domain-incremental framework inspired by Helmholtz free energy. HEDP introduces an energy regularization loss to enhance the separability of domain representations and a hybrid energy-distance weighted mechanism that fuses energy-based and distance-based cues to improve domain selection and generalization. Experiments on multiple benchmarks, including CORe50, show that HEDP achieves superior performance on unseen domains with a 2.57% accuracy gain, effectively mitigating catastrophic forgetting and enhancing open-world adaptability. Our code is \hrefhttps://anonymous.4open.science/r/HEDP-C879/available here.

Abstract:
Recent advances in video diffusion models have significantly improved visual quality, yet ultrahigh-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose LUVE, a Latent-cascaded UHR Video generation framework built upon dual frequency Experts. LUVE employs a three-stage architecture comprising low-resolution motion generation for motion-consistent latent synthesis, video latent upsampling that performs resolution upsampling directly in the latent space to mitigate memory and computational overhead, and high-resolution content refinement that integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that our LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component.

Abstract:
Due to the difficulty of obtaining groundtruth data for 4D radar scene flow estimation, previous methods typically rely on either self-supervised losses or cross-modal supervision using 3D LiDAR data, 2D images, and odometry. However, self-supervised approaches often yield suboptimal results due to radar’s inherently low-fidelity measurements, while existing cross-modal supervised methods introduce complex multi-task architecture and require costly LiDAR sensors to generate pseudo radar scene flow labels from pretrained 3D tracking models. To overcome these limitations, we propose a task-specific iterative framework for weakly supervised radar scene flow learning, using only images and odometry for auxiliary supervision during training. Specially, we establish two novel instance-aware self-supervised losses by exploiting off-the-shelf 2D tracking and segmentation algorithms to obtain tracked instance masks, which are back-projected into 3D space to provide instance-level semantic guidance; for static regions, we integrate vehicle odometry with radar’s intrinsic motion cues to construct a rigid static loss. Extensive experiments on the real-world View-of-Delft (VoD) dataset demonstrate that our method not only surpasses state-of-the-art cross-modal supervised approaches that rely on 3D multi-object tracking on dense LiDAR point clouds but also outperforms existing fully supervised scene flow estimation methods. The source code will be released upon acceptance.

Abstract:
We propose nonparametric estimators for the average run length (ARL) and average detection delay (ADD) in quickest changepoint detection (QCD) under finite and irregular sequence lengths. Although ARL and ADD are widely used as optimality criteria in theoretical and simulation studies, their application to real-world datasets is hindered by limited and irregular sequence lengths. To address this issue, we propose non-parametric estimators for the ARL and ADD, termedKM-ARL and KM-ADD, by drawing an analogy between QCD and survival analysis to model detection probabilities under sequence truncation. We derive estimation bias bounds and prove that they are asymptotically unbiased unless extrapolation is required. Experiments on simulated and real-world datasets demonstrate their practical utility, enhancing robustness against limited and irregular sequence lengths, improving interpretability, and facilitating empirical, intuitive model selection. Our Python code is provided in the supplementary material, offering ready-to-use implementations for practitioners.

Abstract:
Quantization Error Reconstruction (QER) reduces accuracy loss in PostTraining Quantization (PTQ) by approximating weights as \mathbfW \approx \mathbfQ + \mathbfL\mathbfR, using a rank-r correction to reconstruct quantization error. Prior methods devote the full rank budget to error reconstruction, which is suboptimal when \mathbfW has intrinsic low-rank structure and quantization corrupts dominant directions. We propose Structured Residual Reconstruction (SRR), a rank-allocation framework that preserves the top-k singular subspace of the activation-scaled weight before quantization, quantizes only the residual, and uses the remaining rank r-k for error reconstruction. We derive a theory-guided criterion for selecting k by balancing quantization-exposed energy and unrecoverable error under rank constraints. We further show that resulting \mathbfQ+\mathbfL\mathbfR parameterization naturally supports Quantized Parameter-Efficient Fine-Tuning (QPEFT), and stabilizes fine-tuning via gradient scaling along preserved directions. Experiments demonstrate consistent perplexity reductions across diverse models and quantization settings in PTQ, along with a 5.9 percentage-point average gain on GLUE under 2-bit QPEFT.

Abstract:
Large language models (LLMs) have enabled web agents that follow natural language goals through multistep browser interactions. However, agents fine-tuned on specific trajectories and domain often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, solving efficiently with a greedy algorithm. We further improve efficiency with action-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across AgentTrek and NNetNav training datasets, evaluations in WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7-12.5× training speedups over standard fine-tuning.

Abstract:
Recently, Diffusion Transformers (DiTs) have emerged in RealWorld Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by 5.8× and computational operations by over 60×. We will release our code and models to facilitate reproducibility and future research.

Abstract:
As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental tradeoff: manual benchmarks are costly, while LLM-based simulators are scalable but suffer from logic hallucination. We present AUTOCONTROL ARENA, an automated framework for frontier AI risk evaluation built on the principle of logic-narrative decoupling. By grounding deterministic state in executable code while delegating generative dynamics to LLMs, we mitigate hallucination while maintaining flexibility. This principle, instantiated through a three-agent framework, achieves over 98% end-to-end success and 60% human preference over existing simulators. To elicit latent risks, we vary environmental Stress and Temptation across X-BENCH(70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.4% to 52.9% under pressure, with capable models showing disproportionately larger increases; (2) Scenario-Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non-malicious harm while stronger models develop strategic concealment.

Abstract:
Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) posttraining. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@n accuracy. Empirically, the final-layer posterior of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Relevant code is included in the supplementary material and will made be fully public after this paper is accepted.

Abstract:
With the continuous advancement of reasoning abilities in Large Language Models (LLMs), their application to scientific reasoning tasks has gained significant research attention. Current research primarily emphasizes boosting LLMs' performances on scientific QA benchmarks by training on larger, more comprehensive datasets with extended reasoning chains. However, these approaches neglect the essence of scientific reasoning process logicality, which is the rational foundation to ensure the validity of reasoning steps leading to reliable conclusions. In this work, we make the first systematic investigation into the internal logicality underlying LLM scientific reasoning, and develop a scientific logicality enriched methodology, including a set of assessment criteria and data sampling methods for logicality-guided training, to improve the logical faithfulness as well as task performance. Further, we take physics, characterized by its diverse logical structures and formalisms, as an exemplar discipline to practise the above methodology. For data construction, we extract scientific problems from academic literature and sample a high-quality dataset exhibiting strong logicality. Experiments based on three different backbone LLMs reveal that: 1) the training data we constructed can effectively improve the scientific logicality in LLM reasoning; and 2) the enriched scientific logicality plays a critical role in solving scientific problems.

Abstract:
Preference alignment aims to guide generative models by learning from comparisons between preferred and nonpreferred samples. In practice, most existing approaches rely on preference pairs constructed from model-generated images. Such supervision is inherently relative and can be ambiguous when both samples exhibit artifacts or limited visual quality, making it difficult to infer what constitutes a truly desirable output. In this work, we investigate whether real data can serve as an alternative source of supervision for preference alignment. We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples, without requiring manually annotated preference pairs. Through empirical analysis, we show that real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods. Our results suggest that real data offers a practical and complementary source of supervision for preference alignment and highlight directions of label-efficient alignment strategies.

Abstract:
The advancement of Medical VisionLanguage Models (VLMs) for 3D Computed Tomography (CT) analysis is hindered by a misalignment between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms rely on lexical proxy signals that induceevaluation hallucinations'', where models prioritize linguistic fluency over factual accuracy, leading to fatal clinical errors. To bridge this gap, we introduce the Clinical Abnormality Benchmarking Substrate (CABS), a structured system that decomposes radiology reports into verifiable semantic units. Using CABS, we identify amechanistic divergence'' in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose Trajectory-Integral Feedback GRPO (TIF-GRPO), a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs.

Abstract:
Large reasoning models achieve strong performance through testtime scaling, but this incurs substantial computational overhead due to long decoding from short prompts. While sparse attention can reduce latency and memory usage, existing methods often degrade reasoning accuracy because selection errors accumulate over long generation horizons, or require costly retraining. We introduce LessIsMore, a training-free sparse attention mechanism for long-horizon reasoning. Our key insight is that token importance in reasoning is global and stable: critical tokens are largely shared across attention heads and remain stable over decoding steps. Guided by this structure, LessIsMore enforces cross-head unified token selection and preserves recent context via a stable recency window, yielding a globally consistent token set that can be reused across layers. Across multiple model families and challenging reasoning benchmarks, LessIsMore matches or improves accuracy while attending to substantially fewer tokens. With kernel-level optimizations, LessIsMore achieves up to 1.6× end-to-end decoding speedup and up to 1.72× faster sparse attention computation, with additional long-context results demonstrating the generality of our approach.

Abstract:
Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether trainingtime safety guarantees transfer to deployment under distribution shift, using diabetes management as a safety-critical testbed. We benchmark safe RL algorithms on a unified clinical simulator and reveal a safety generalization gap: policies satisfying constraints during training frequently violate safety requirements on unseen patients. We demonstrate that test-time shielding, which filters unsafe actions using learned dynamics models, effectively restores safety across algorithms and patient populations. Across eight safe RL algorithms, three diabetes types, and three age groups, shielding achieves Time-in-Range gains of 13-14% for strong baselines such as PPO-Lag and CPO while reducing clinical risk index and glucose variability. Our simulator and benchmark provide a platform for studying safety under distribution shift in safety-critical control domains.

Abstract:
Dataefficient learning aims to eliminate redundancy in large training datasets by train- ing models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of existing methods can outperform random selection at different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tun- ing have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singu- lar values of their attention matrices obtained from fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments across 4 target models, 2 proxy models, and 2 datasets show that XMAS consistently outperforms 10 baseline methods. Moreover, XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2×. This is 30% more data reduction compared to the best baseline for LLaVA-665k.

Abstract:
To achieve realtime video generation, current approaches distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models. This process involves anarchitectural gap, as it converts full attention into causal attention. In this paper, we demonstrate that existing methods fail to bridge this gap theoretically, leading to suboptimal performance. Specifically, these methods employ ODE distillation to initialize the AR student, where a key requirement isinjectivity. We figure out that for an AR student,frame-level injectivitymust hold: each noisy frame must map to a unique clean frame under the PF-ODE of theAR teacher. We theoretically prove that existing methods, which distill an AR student from a bidirectional teacher, violate this frame-level injectivity. Consequently, the student fails to recover the teacher's flow map and instead learns a conditional expectation, resulting in subpar performance. To address this issue, we proposeCausal Forcing, which employs an AR teacher for ODE initialization, thereby effectively bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self-Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following.

Abstract:
Inferencetime compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding, which is ill-suited to discrete diffusion language models (dLLMs) due to their parallel decoding over the entire sequence. As a result, developing effective and efficient TTS methods to unlock dLLMs' full generative potential remains an underexplored challenge. To address this, we propose LLaDA-S, an efficient TTS framework for dLLMs that (i) performs Hierarchical Trajectory Search (HTS) which dynamically prunes and reallocates compute in an early-to-mid denoising window, (ii) replaces external verifiers with Self-Verified Feedback (SVF) obtained via self-evaluation prompts on intermediate completions, and (iii) introduces Local branching with partial remasking to explore diverse implementations while preserving a high-confidence tokens. Across four mathematical reasoning and code generation benchmarks on three dLLMs, including LLaDA 8B Instruct, Dream 7B Instruct, and LLaDA 2.0-mini, our LLaDA-S achieves a favorable performance-efficiency trade-off, matching best-of-N performance with substantially fewer function evaluations (NFE). The code will be released.

Abstract:
Unified multimodal models (UMMs) emerge as a promising paradigm for generalpurpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While knowledge editing methods have matured for text-only models, a fundamental question remains unexplored: do knowledge edits that successfully modify textual outputs transfer to image generation for UMMs? To this end, we introduce UniKE, the first benchmark for cross-modality knowledge editing in UMMs, comprising 3,005 instances across attribute edits and relation edits. We propose an automated VQA-based evaluation protocol to assess factual consistency between edited knowledge and generated images. Our evaluation reveals a striking modality gap: parameter-editing methods achieving high text-side efficacy (up to 93%) fail to produce visual changes, with VQA accuracy below 6% under direct generation. We propose Reasoning-augmented Parameter Editing, which explicitly activates edited knowledge before generation, improving visual verification to 10-27% for attributes. Through mechanistic analysis, we identify the root cause: edit-affected pathways exhibit near-random overlap with visual attribute-conditioning channels, indicating a fundamental pathway mismatch. These findings demonstrate that textual knowledge edits do not guarantee cross-modality transfer, motivating future work on modality-aware editing methods.

Abstract:
The layoutto-image (L2I) task enables fine-grained control over image generation via object categories and spatial layouts. However, existing L2I methods yield fragmented and distorted generations under few-shot atypical settings. We term this failure as representation fragmentation, arising from a granularity mismatch that entangles semantic identity with visual details. To address this issue, we propose a representation-driven framework that disentangles semantics from primitives for robust few-shot adaptation. Specifically, Semantic Anchoring aggregates categorical semantics into anchors for stable identity, while Primitive Imbuing models recomposable primitives for robust local detail modeling. Conceptual Steering further regulates optimization with a saliency-aware objective to preserve foreground semantic consistency. Extensive experiments demonstrate consistent improvements in the 5-shot regime over state-of-the-art L2I methods in both visual fidelity and alignment across diverse atypical domains.

Abstract:
Tactile sensing is essential for robots to achieve humanlike gentle manipulation capabilities. However, existing Vision-Language-Action (VLA) models struggle to exploit tactile feedback for gentle manipulation due to the scarcity of aligned vision-tactile-language data and the lack of effective closed-loop force feedback mechanisms. To address these challenges, we introduce Tabero, a benchmark and model suite for gentle, language-conditioned robotic manipulation that demands fine-grained contact force perception. First, the Tabero benchmark addresses the scarcity of tactile data by presenting a data-efficient pipeline that repurposes open-source robot manipulation trajectories to generate a diverse set of vision-tactile-language tasks, and establishes a multidimensional evaluation protocol that measures task success alongside physical interaction quality. Second, we propose Tabero-VTLA, a Vision-Tactile-Language-Action architecture featuring a decoupled force-position command interface; the resulting force-position commands are executed by a fixed hybrid controller to enable real-time, force-aware manipulation. Evaluated on Tabero, our model maintains high task success while reducing average grip force by over 70% under gentle instructions, demonstrating its ability to modulate interaction forces based on multimodal experience.

Abstract:
Recently, the physics reasoning capabilities of (M)LLMs have attracted growing attention. However, existing physics benchmarks suffer from two major gaps: they neither provide systematic and upto-date coverage of physics Olympiads, nor enable direct performance comparison with humans. To bridge these gaps, we presentHiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. HiPhO highlights three key innovations.(1) Comprehensive data:it compiles 13 latest Olympiads from 2024-2025, covering both international and regional competitions and spanning mixed modalities from text-only to diagram-based problems.(2) Professional evaluation:it adopts official rubrics to perform fine-grained grading at both the answer and step levels, ensuring alignment with human examiners.(3) Human-level comparison:models are awarded gold, silver, and bronze medals based on official score thresholds, enabling direct comparison with human contestants. Our large-scale evaluation of 30 state-of-the-art (M)LLMs shows that across 13 exams, most open-source MLLMs remain at or below the bronze level, open-source LLMs demonstrate notable progress with multiple gold medals, and closed-source MLLMs achieve 6-13 gold medals, while most models still fall well short of full marks. These results underscore the substantial gap between current (M)LLMs and top human contestants, as well as the considerable room for further improvement.

Abstract:
Face filllight enhancement (FFE) brightens underexposed faces by adding virtual fill light while keeping the original scene illumination and background unchanged. Most face relighting methods aim to reshape overall lighting, which can suppress the input illumination or modify the entire scene, leading to foreground–background inconsistency and mismatching practical FFE needs. To support scalable learning, we introduce LightYourFace-160K (LYF-160K), a large-scale paired dataset built with a physically consistent renderer that injects a disk-shaped area fill light controlled by six disentangled factors, producing 160K before-and-after pairs. We first pretrain a physics-aware lighting prompt (PALP) that embeds the 6D parameters into diffusion-compatible conditioning tokens, using an auxiliary planar-light reconstruction objective. Building on a pretrained diffusion backbone, we then train FiLitDiff, an efficient one-step model conditioned on these physically grounded lighting codes, enabling fast, controllable, and high-fidelity fill lighting at low computational cost. Experiments on held-out paired sets demonstrate strong perceptual quality and competitive full-reference metrics, while better preserving background illumination. The dataset and model will be released.

Abstract:
In offline reinforcement learning (RL), singlestep temporal-difference (TD) learning can suffer from bootstrapping error accumulation over long horizons. Action-chunked TD methods mitigate this by backing up over multiple steps, but can introduce suboptimality by restricting the policy class to open-loop action sequences. To resolve this trade-off, we present Chunk-Guided Q-Learning (CGQ), a single-step TD algorithm that guides a fine-grained single-step critic by regularizing it toward a chunk-based critic trained using temporally extended backups. This reduces compounding error while preserving fine-grained value propagation. We theoretically show that CGQ attains tighter critic optimality bounds than either single-step or action-chunked TD learning alone. Empirically, CGQ achieves strong performance on challenging long-horizon OGBench tasks, often outperforming both single-step and action-chunked methods.

Abstract:
Flowbased methods have achieved significant success in various generative modeling tasks, capturing nuanced details within complex data distributions. However, few existing works have exploited this unique capability to resolve fine-grained structural details beyond generation tasks. This paper presents a flow-inspired framework for representation learning. First, we demonstrate that a rectified flow trained using independent coupling is zero everywhere at t=0.5 if and only if the source and target distributions are identical. We term this property the \emphzero-flow criterion. Second, we show that this criterion can certify conditional independence, thereby extracting \emphsufficient information from the data. Third, we translate this criterion into a tractable, simulation-free loss function that enables learning amortized Markov blankets in graphical models and latent representations in self-supervised learning tasks. Experiments on both simulated and real-world datasets demonstrate the effectiveness of our approach.

Abstract:
Lowdiscrepancy points are designed to efficiently fill the space in a uniform manner. This uniformity is highly advantageous in many problems in science and engineering, including in numerical integration, computer vision, machine perception, computer graphics, machine learning, and simulation. Whereas most previous low-discrepancy constructions rely on abstract algebra and number theory, Message-Passing Monte Carlo (MPMC) was recently introduced to exploit machine learning methods for generating point sets with lower discrepancy than previously possible. However, MPMC is limited to generating point sets and cannot be extended to low-discrepancy sequences (LDS), i.e., sequences of points in which every prefix has low discrepancy, a property essential for many applications. To address this limitation, we introduce Neural Low-Discrepancy SequencesNeuroLDS, the first machine learning-based framework for generating LDS. Drawing inspiration from classical LDS, we train a neural network to map indices to points such that the resulting sequences exhibit minimal discrepancy across all prefixes. To this end, we deploy a two-stage learning process: supervised approximation of classical constructions followed by unsupervised fine-tuning to minimize prefix discrepancies. We demonstrate thatNeuroLDSoutperforms all previous LDS constructions by a significant margin with respect to discrepancy measures. Moreover, we demonstrate the effectiveness ofNeuroLDSacross diverse applications, including numerical integration, robot motion planning, and scientific machine learning. These results highlight the promise and broad significance of Neural Low-Discrepancy Sequences.

Abstract:
Selfattention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Experiments across vision (ViT on CIFAR/ImageNet), autoregressive generation (MNIST/CIFAR-10), and large language models (Llama/Qwen) demonstrate consistent gains with substantially reduced computation, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.

Abstract:
Diffusion Transformers (DiT) have demonstrated remarkable generative capabilities but remain highly computationally expensive. Previous acceleration methods, such as pruning and distillation, typically rely on a fixed computational capacity, leading to insufficient acceleration and degraded generation quality. To address this limitation, we propose Elastic Diffusion Transformer (EDiT), an adaptive acceleration framework for DiT that effectively improves efficiency while maintaining generation quality. Specifically, we observe that the generative process of DiT exhibits substantial sparsity (i.e., some computations can be skipped with minimal impact on quality), and this sparsity varies significantly across samples. Motivated by this observation, E-DiT equips each DiT block with a lightweight router that dynamically identifies sample-dependent sparsity from the input latent. Each router adaptively determines whether the corresponding block can be skipped. If the block is not skipped, the router then predicts the optimal MLP width reduction ratio within the block. During inference, we further introduce a block-level feature caching mechanism that leverages router predictions to eliminate redundant computations in a training-free manner. Extensive experiments across 2D image (Qwen-Image and FLUX) and 3D asset (Hunyuan3D-3.0) demonstrate the effectiveness of E-DiT, achieving up to ～2× speedup with negligible loss in generation quality.

Abstract:
Diffusionbased language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Within this family, Masked Diffusion Models (MDMs) currently perform best but still underperform AR models in perplexity and lack key inference-time efficiency features, most notably KV caching. We introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, smoothly interpolating between their perplexities while overcoming their respective limitations. Unlike prior work, which uses transformers with bidirectional attention as MDM denoisers, we exploit the connection between MDMs and Any-Order autoregressive models and adopt causal attention. This design lets us compute the exact likelihood of MDMs for the first time and, crucially, enables us to introduce KV caching for MDMs while preserving parallel generation for the first time, significantly improving inference efficiency. Combined with an optimized sampling schedule, Eso-LMs achieves a new state of the art on the speed-quality Pareto frontier for unconditional generation. On long contexts, it yields 14−65× faster inference than standard MDMs and 3−4× faster inference than prior semi-autoregressive approaches.

Abstract:
Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing finegrained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, which prevents artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference compared to the default sampler and guidance-based sampler.

Abstract:
Most visual generative models compress images into a latent space before applying diffusion or autoregressive modelling. Yet, existing approaches such as VAEs and foundation model aligned encoders implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling. We introduceDistributionMatching VAE(DMVAE), which explicitly aligns the encoder’s latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other prior distributions. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions provide an excellent balance between reconstruction fidelity and modeling efficiency, reaching a gFID of 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure (achieved via distribution-level alignment), rather than relying on fixed priors, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis.

Abstract:
Neural Processes (NPs), and specifically Transformer Neural Processes (TNPs), have demonstrated remarkable performance across tasks ranging from spatiotemporal forecasting to tabular data modelling. However, many of these applications are inherently sequential, involving continuous data streams such as realtime sensor readings or database updates. In such settings, models should support cheap, incremental updates rather than recomputing internal representations from scratch for every new observation—a capability existing TNP variants lack. Drawing inspiration from Large Language Models, we introduce the Incremental TNP (\textttincTNP). By leveraging causal masking, Key-Value (KV) caching, and a data-efficient autoregressive training strategy, \textttincTNP matches the predictive performance of standard TNPs while reducing the computational cost of updates from quadratic to linear time complexity. We empirically evaluate our model on a range of synthetic and real-world tasks, including tabular regression and temperature prediction. Our results show that, surprisingly, \textttincTNP delivers performance comparable to—or better than—non-causal TNPs while unlocking orders-of-magnitude speedups for sequential inference. Finally, we assess the consistency of the model's updates---by adapting a metric of "implicit Bayesianness", we show that \textttincTNP retains a prediction rule as implicitly Bayesian as standard non-causal TNPs, demonstrating that \textttincTNP achieves the computational benefits of causal masking without sacrificing the consistency required for streaming inference.

Abstract:
Recent work shows that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept, a phenomenon cited as evidence of "introspective awareness." But what mechanisms underlie this capability, and do they reflect genuine introspective circuitry or more shallow heuristics? We investigate these questions in opensource models and establish three main findings. First, introspection is behaviorally robust: detection achieves moderate true positive rates with 0% false positives across diverse prompts. We also find this capability emerges specifically from post-training rather than pretraining. Second, introspection is not reducible to a single linear confound: anomaly detection relies on distributed MLP computation across multiple directions, implemented by interpretable gate and evidence-carrier features. Third, models possess greater introspective capability than is elicited by default: ablating refusal directions improves detection by ~50% and a trained steering vector improves detection by ~75%. Overall, our results suggest that introspective awareness is behaviorally robust, grounded in nontrivial internal anomaly detection, and likely could be substantially improved in future models.

Abstract:
Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of fewstep flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time.

Abstract:
Model fusion aims to combine the knowledge of multiple models by creating one representative model that captures the strengths of all of its parents. However, this process is nontrivial due to differences in internal representations, which can stem from permutation invariance, random initialization, or differently distributed training data. We present a novel, neuron-centric family of model fusion algorithms designed to integrate multiple trained neural networks into a single network effectively regardless of training data distribution. Our algorithms group intermediate neurons of parent models to create target representations that the fused model approximates with its corresponding sub-network. Unlike prior approaches, our approach incorporates neuron attribution scores into the fusion process. Furthermore, our algorithms can generalize to arbitrary layer types. Experimental results on various benchmark datasets demonstrate that our algorithms consistently outperform previous fusion techniques, particularly in zero-shot and non-IID fusion scenarios. We make our code publicly available.

Abstract:
Most Vision Language Models (VLMs) directly map outputs from ViT encoders to the LLM via a lightweight projector. While effective, recent analysis suggests this architecture suffers from an alignment challenge: visual features remain distant from the text space in the initial layers of the LLM, forcing the model to waste critical depth on superficial modality alignment rather than deep understanding and complex reasoning. In this work, we propose Deep PreAlignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as perceiver, ensuring visual features are deeply aligned with the text space of the target large language model. Comprehensive experiments demonstrate the effectiveness of DPA. On the 4B parameter scale, DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks, with gains widening to 3.0 points at the 32B scale. Moreover, by offloading alignment to the perceiver, DPA achieves a 32.9% reduction in language capability forgetting over 3 text benchmarks. We further demonstrate that these gains are consistent across different LLM families including Qwen3 and LLaMA 3.2, highlighting the generality of our approach. Beyond performance, DPA also offers a seamless upgrade path for current VLM development, requiring only a modular replacement for the visual encoder with marginal computation overhead.

Abstract:
Entropic optimal transport (EOT) in continuous spaces with quadratic cost is a classical tool for solving the domain translation problem. In practice, recent approaches optimize a weak dual EOT objective depending on a single potential, but doing so is computationally not efficient due to the intractable logpartition term. Existing methods typically resolve this obstacle in one of two ways: by significantly restricting the transport family to obtain closed-form normalization (via Gaussian-mixture parameterizations), or by using general neural parameterizations that require simulation-based training procedures. We propose Variational Entropic Optimal Transport (VarEOT), based on an exact variational reformulation of the log-partition \log \mathbbE[\exp(\cdot)] as a tractable minimization over an auxiliary positive normalizer. This yields a differentiable learning objective optimized with stochastic gradients and avoids the necessity of MCMC simulations during the training. We provide theoretical guarantees, including finite-sample generalization bounds and approximation results under universal function approximation. Experiments on synthetic data and unpaired image-to-image translation demonstrate competitive or improved translation quality, while comparisons within the solvers that use the same weak dual EOT objective support the benefit of the proposed optimization principle.

Abstract:
Online model editing for multimodal large language models (MLLMs) requires assimilating a stream of corrections under tight compute and memory budgets. Yet editors developed for textonly LLMs often degrade on MLLMs: visually dominant activations skew the statistics that shape updates, causingcross-modal conflict, while sequential writes become entangled in a shared edit space and amplify long-horizon interference, causinginter-edit interference. To address these, we proposeM-ORE, a modality-decoupled online recursive editor for lifelong MLLM adaptation. M-ORE is derived from a unified proximal-projection formulation and admits a closed-form update with a Sherman-Morrison recursion, yielding constant per-edit overhead. It maintains module-wise locality statistics for the text stack and the visual projector to avoid visually dominated update shaping and performs continual updates in a fixed orthogonal low-rank edit subspace via a Sherman-Morrison recursion to mitigate long-horizon interference. Experiments on multiple MLLM backbones and online editing benchmarks show that our M-ORE method consistently improves reliability, generality, and locality over strong baselines, while achieving favorable quality-efficiency scaling.

Abstract:
The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded humanAI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. However, most existing collaboration benchmarks are discrete or do not consider physical attributes and constraints. To address this, we introduce Moving Out, a human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and coordinating actions to move an item around a corner. Moving Out consists of two challenges and human-human interaction data to comprehensively evaluate models' abilities to adapt to diverse human behaviors and unseen physical attributes. To give embodied agents the capability to collaborate with humans under physical attributes and constraints, we propose a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions. We systematically compare BASS and state-of-the-art models in AI-AI and human-AI experiments, showing that BASS can effectively collaborate with both unseen AI and humans.

Abstract:
Humans are born with visionbased 4D spatial-temporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs). To tackle this challenge, we introduce MLLM-4D, a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. On the data front, we develop a cost-efficient data curation pipeline that repurposes existing stereo video datasets into high-quality 4D spatiotemporal instructional data. This results in the MLLM4D-2M and MLLM4D-R1-30k datasets for Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), alongside MLLM4D-Bench for comprehensive evaluation. Regarding model training, our post-training strategy establishes a foundational 4D understanding via SFT and further catalyzes 4D reasoning capabilities by employing Group Relative Policy Optimization (GRPO) with specialized Spatiotemporal Chain of Thought (ST-CoT) prompting and Spatiotemporal reward functions without involving the modification of architecture. Extensive experiments demonstrate that MLLM-4D achieves state-of-the-art spatial-temporal understanding and reasoning capabilities from purely 2D RGB inputs. Our project will be made publicly available.

Abstract:
Normalizing Flows (NFs) are a classical family of likelihood based methods that have received revived attention. Recent efforts such as TARFlow have shown that NFs are capable to achieving promising performance on image modeling tasks, making them promising alternatives to other methods such as diffusion models. In this work, we further advance the state of Normalizing Flow generative models by introducing iterative TARFlow (iTARFlow). Unlike diffusion models, iTARFlow maintains a fully endto-end, likelihood-based objective during training. During sampling, it performs autoregressive generation followed by an iterative denoising procedure inspired by diffusion-style methods. Through extensive experiments, We show that iTARFlow achieves competitive performance across ImageNet resolutions of 64, 128, and 256 pixels, demonstrating its potential as a strong generative model and advances the frontier of Normalizing Flows. In addition, we analyze the characteristic artifacts produced by iTARFlow, offering insights that may shed the light for future improvements.

Abstract:
Textguided texture editing aims to modify object appearance while preserving the underlying geometric structure. However, our empirical analysis reveals that even SOTA editing models frequently struggle to maintain structural consistency during texture editing, despite the intended changes being purely appearance-related. Motivated by this observation, we jointly enhance structure preservation from both data and training perspectives, and build TexEditor, a dedicated texture editing model based on Qwen-Image-Edit-2509. Firstly, we construct BlenderTex, a high-quality SFT dataset generated with Blender, which provides strong structural priors for a cold start. Secondly, we introduce StructureNFT, a RL–based approach that integrates structure-preserving losses to transfer the structural priors learned during SFT to real-world scenes. Moreover, due to the limited realism and evaluation coverage of existing benchmarks, we introduce TexBench, a general-purpose real-world benchmark for text-guided texture editing. Extensive experiments on existing Blender-based texture benchmarks and our TexBench show that TexEditor consistently outperforms strong baselines such as Nano Banana Pro. In addition, we assess TexEditor on the general-purpose benchmark ImgEdit to validate its generalization.

Abstract:
Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted loglikelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking behaviors, and combines off-policy data with lightweight approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales.

Abstract:
Modern deep learning for asset allocation typically separates forecasting from optimization. We argue this creates a fundamental mismatch where minimizing prediction errors fails to yield robust portfolios. We propose the Signature Informed Transformer to address this by unifying feature extraction and decision making into a single policy. Our model employs path signatures to encode complex path dependencies and introduces a specialized attention mechanism that targets geometric asset relationships. By directly minimizing the Conditional Value at Risk we ensure the training objective aligns with financial goals. We prove that our attention module rigorously amplifies signature derived signals. Experiments across diverse equity universes show our approach significantly outperforms both traditional strategies and advanced forecasting baselines. The code is available at: https://anonymous.4open.science/r/SignatureInformed-Transformer-For-Asset-Allocation-DB88

Abstract:
While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncovered a crucial fact: stateof-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess core visual abilities independent of linguistic knowledge for MLLMs. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1. These results show despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress in BabyVision represents a step toward human-level visual perception and reasoning capabilities. We also explore solving visual reasoning with generation models by proposing Babyvision-Gen and automatic evaluation toolkit. Our code and benchmark data are released at https://anonymous.4open.science/r/BabyVision-E88F/ for reproduction.

Abstract:
Preference optimization is widely used to align large language models (LLMs) with human preferences, yet many marginbased objectives often suppress the chosen response together with the rejected one, and no general mechanism exists to prevent this across objectives. We bridge this gap by presenting a unified incentive-score decomposition of preference optimization, revealing that diverse objectives share identical local update directions and differ only in their scalar weighting coefficients. Building on this decomposition, by analyzing the dynamics of the rewards of chosen/rejected responses, we identify the disentanglement band (DB), a simple, testable condition that characterizes when training can realize the ideal pathway: suppressing the loser while maintaining the winner, possibly after an initial transient. Leveraging the DB, we propose a plug-and-play reward calibration (RC) that adaptively rebalances chosen versus rejected updates to satisfy the DB, without redesigning the base objective. Empirical results confirm that this calibration effectively disentangles updates and improves alignment performance across diverse objectives.

Abstract:
Learning domain adaptive policies that can generalize to unseen transition dynamics, remains a fundamental challenge in learningbased control. Substantial progress has been made through domain representation learning to capture domain-specific information, thus enabling domain-aware decision making. We analyze the process of learning domain representations through dynamical prediction and find that selecting contexts adjacent to the current step causes the learned representations to entangle static domain information with varying dynamical properties. Such mixture can confuse the conditioned policy, thereby constraining zero-shot adaptation. To tackle the challenge, we propose DADP (Domain Adaptive Diffusion Policy), which achieves robust adaptation through unsupervised disentanglement and domain-aware diffusion injection. First, we introduce Lagged Context Dynamical Prediction, a strategy that conditions future state estimation on a historical offset context; by increasing this temporal gap, we unsupervisedly disentangle static domain representations by filtering out transient properties. Second, we integrate the learned domain representations directly into the generative process by biasing the prior distribution and reformulating the diffusion target. Extensive experiments on challenging benchmarks across locomotion and manipulation demonstrate the superior performance, and the generalizability of DADP over prior methods. More visualization results are available on the https://anonymous.4open.science/w/dadp-webpage-D841.

Abstract:
While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk is emerging: capabilityoriented training induced exploitation. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, will spontaneously learn to exploit these flaws to maximize their reward, even without any malicious intent in their training. To test this, we design a suite of four diverse "vulnerability games'', each presenting a unique, exploitable flaw related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety. More critically, we find that these exploitative strategies are not narrow "tricks'' but generalizable skills; they can be transferred to new tasks and even "distilled'' from a capable teacher model to other student models through data alone. Our findings reveal that capability-oriented training induced risks pose a fundamental challenge to current alignment approaches, suggesting that future AI safety work must extend beyond content moderation to rigorously auditing and securing the training environments and reward mechanisms themselves. Code is available at https://anonymous.4open.science/r/CapabilityOrientedAlignment_Risk.

Abstract:
Exams are a fundamental test of expertlevel intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments on 17 text-to-image and unified models demonstrate the great challenge of GenExam and the huge gap where open-source models consistently lag behind the leading closed-source ones. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate understanding, reasoning, and generation, providing insights on the path to intelligent generative models. Our benchmark and evaluation code will be released.

Abstract:
While continuous diffusion has shown remarkable success in continuous domains such as image generation, its direct application to discrete data has underperformed pure discrete formulations. To understand this gap, we introducetoken identifiability, an analytical framework characterizing how Gaussian noise corrupts discrete data through two mechanisms:discrete identity corruptionandcontinuous rank degradation. We reveal that these mechanisms scale differently with vocabulary size, creating atemporal dissonancethat forces a tradeoff between learning continuous geometry and discrete structure. To address this, we proposeCANDI(ContinuousANdDIscrete diffusion), a hybrid framework that decouples discrete and continuous corruption, enabling simultaneous learning of both. This unlocks the benefits of continuous diffusion for discrete spaces: on controlled generation, CANDI enables classifierbased guidance with off-the-shelf classifiers through simple gradient addition; on text generation, CANDI outperforms masked diffusion at low NFE, demonstrating the value of learning continuous gradients for discrete spaces.

Abstract:
Improving LLM performance on downstream tasks sometimes requires leveraging auxiliary datasets during posttraining. In practice, however, developers face constraints on compute, labeling, and licensing costs that preclude using all available data, necessitating principled dataset-level selection. These constraints are increasingly shaped by dataset marketplaces, where data acquisition is governed by budgets and negotiation. We study dataset valuation as a subset selection problem during LLM post-training. Our goal is to identify and weight auxiliary datasets so as to maximize target task performance given constrained budgets. We first show that commonly used gradient alignment scores provide a reasonable yet incomplete valuation signal, as they ignore redundancy among datasets. To address this, we propose a convex scalable dataset-level valuation method based on kernel mean matching (KMM) in gradient space, which jointly accounts for alignment with the target task and redundancy across auxiliary datasets. Through extensive experiments across diverse post-training settings and multilingual reasoning tasks, we show that our approach consistently outperforms existing valuation baselines, achieving stronger performance with low computational overhead. Our results position dataset valuation as a practical decision tool for post-training data selection in market-constrained large language model settings.

Abstract:
While Contrastive Learning (CL) has revolutionized selfsupervised representation learning, its latent representations remain highly entangled and opaque, limiting their interpretability in safety-critical applications. We identify that a fundamental cause of this entanglement is the reliance on deterministic similarity measures, which treat all feature dimensions equally. In compositional scenes, this creates an Optimization Conflict: common background features, such as, "blue sky", are encouraged to align in positive pairs but simultaneously repelled in negative pairs, causing gradient oscillations that hinder precise semantic disentanglement. To address this, we proposeBayesNCL(Bayesian Gated Non-Negative Contrastive Learning). Unlike standard approaches, BayesNCL introduces a probabilistic gating mechanism that dynamically filters out task-irrelevant, high-frequency common features while selectively retaining discriminative semantics. By formalizing feature selection as a variational inference problem with a sparse Bernoulli prior, our method effectively resolves the optimization conflict. Empirical experimental results on benchmark datasets demonstrate that BayesNCL achieves a remarkable 142.1% improvement in semantic consistency compared to state-of-the-art baselines, yielding highly interpretable representations without compromising downstream task performance.

Abstract:
Highfidelity EEG generation is critical for alleviating data scarcity and addressing privacy constraints in large-scale neural modeling. Despite recent progress, most existing approaches formulate EEG generation via discrete denoising objectives, which inadequately reflect the inherently continuous temporal dynamics and spectral structure of neural activity. As a result, these methods often struggle to preserve long-range temporal dependencies and exhibit mismatches in the spectral and temporal structure of the generated signals. In this work, we argue that effective EEG generation requires models that operate directly on the continuous evolution of neural signals. We introduce Just EEG Transformer (JET), a generative framework based on conditional flow matching that models EEG as raw sequences evolving along continuous trajectories. By learning a smooth vector field that transports noise to the EEG data distribution, JET captures temporal continuity and transient dynamics without relying on discretized denoising schemes or domain-specific representations. To ensure that the learned dynamics remain consistent with key properties of EEG signals, we introduce principled constraints that preserve spectral structure, temporal stationarity, and signal-level statistics. Across three large-scale benchmarks, JET consistently achieves state-of-the-art performance, reducing TS-FID by over 40% compared to strong baselines. Extensive analyses show that JET captures key structural properties of neural dynamics, providing a scalable and principled approach to EEG generation.

Abstract:
Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of Uniformstate and interpolating discrete diffusion methods. We also show that masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and masked diffusion models on GSM8K, despite worse validation perplexity. Overall,our results challenge the view that masked diffusion is categorically the future of diffusion language modelingand that perplexity alone suffices for cross-algorithm comparison.

Abstract:
We propose a methodology that combines generative latent diffusion models with physicsinformed machine learning to generate solutions of parametric partial differential equations (PDEs) conditioned on partial observations, which includes, in particular, forward and inverse PDE problems. We learn the joint distribution of PDE parameters and solutions via a diffusion process in a latent space of scaled spectral representations, where Gaussian noise corresponds to functions with controlled regularity. This spectral formulation enables significant dimensionality reduction compared to grid-based diffusion models and ensures that the induced process in function space remains within a class of functions for which the PDE operators are well defined. Building on diffusion posterior sampling, we enforce physics-informed constraints and measurement conditions during inference, applying Adam-based updates at each diffusion step. We evaluate the proposed approach on Poisson, Helmholtz, and incompressible Navier--Stokes equations, demonstrating improved accuracy and computational efficiency compared with existing diffusion-based PDE solvers, which are state of the art for sparse observations. Code is available at https://anonymous.4open.science/r/PISD-32A0/.

Abstract:
Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs, a paradigm often referred to as "LLMas-a-judge." In practice, LLM judges are imperfect predictions for the underlying truth and can exhibit systematic, non-random errors. Two main approaches have recently been proposed to address this issue: (i)direct measurement-error correctionbased on misclassification models such as Rogan--Gladen-style estimators, and (ii)surrogate-outcome approachessuch as prediction-powered inference (PPI), which correct bias by calibrating prediction residuals on a small set of gold-standard human labels. In this paper, we systematically study the performance of these two approaches for estimating mean parameters (e.g., average benchmark scores or pairwise win rates). Leveraging tools from semiparametric efficiency theory, we unify the two classes of estimators by deriving explicit forms ofefficient influence function-based efficient estimators and characterize conditions under which PPI-style estimators attain strictly smaller asymptotic variance than measurement-error corrections. We verify our theoretical results through simulations and demonstrate the methods on a real-data example using our open-source software package for performing the calibration.

Abstract:
Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locatethen-edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named KSTER (KeySpaceReconsTruction-then-EntropyReduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a "fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose subspace camouflage, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at https://anonymous.4open.science/r/EditATK-26B6.

Authors: Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Zhaoyang Hong, Alex Gu, Ziyi Zhang, Runyuan He, Dong Xing, Zerui Li, Zirong Zeng, Yige Jiang, Lufeng Cheng, Ziyi Zhao, Youran Sun, Suyang Zhong, Junpeng Wang, Donglin Li, Wenyuan Huang, Jialiang Gu, Wesley Zheng, Wangmeiyu Zhang, Ruyi Ji, Xuechang Tu, Zihan Zheng, Zhaozi Wang, Zexing Chen, Jingbang Chen, JIALU ZHANG, Aleksandra Korolova, Peter Henderson, Pramod Viswanath, Vijay Ganesh, Saining Xie, Zhuang Liu, Dawn Song, Sewon Min, Ion Stoica, Joseph E Gonzalez, Jingbo Shang, Alvin Cheung

Abstract:
We introduce FrontierCS, a benchmark of 240 openended problems across diverse areas of computer science, designed and reviewed by experts, including CS PhDs and top-tier competitive programming participants and problem setters. Unlike existing benchmarks that focus on tasks with known optimal solutions, FrontierCS targets problems where the optimal solution is unknown, but the quality of a solution can be objectively evaluated. Models solve these tasks by implementing executable programs rather than outputting a direct answer. FrontierCS includes algorithmic problems, which are often NP-hard variants of competitive programming problems with objective partial scoring, and research problems with the same property. For each problem, we provide an expert reference solution and an automatic evaluator. Combining open-ended design, measurable progress, and expert curation, FrontierCS provides a benchmark at the frontier of computer-science difficulty. Empirically, we find that frontier reasoning models still lag far behind human experts, and that simply increasing reasoning budgets does not close this gap on open-ended challenges. Moreover, these models struggle to identify internal equivalence classes, and existing agentic frameworks also exhibit brittleness on such problems due to overfitting. FrontierCS thus offers a new lens into model capabilities on real frontier computer science problems.

Abstract:
Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. Existing methods for diffusion models typically require access to model gradients or retraining, limiting their applicability in proprietary or largescale settings. We propose anonparametricattribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images. Our approach is grounded in the analytical form of the optimal score function and naturally extends to multiscale representations, while remaining computationally efficient through convolution-based acceleration. In addition to producing spatially interpretable attributions, our framework uncovers patterns that reflect intrinsic relationships between training data and outputs, independent of any specific model. Experiments demonstrate that our method achieves strong attribution performance, closely matching gradient-based approaches and substantially outperforming existing nonparametric baselines.

Abstract:
The ability of the deep learning model to recognize when a sample falls outside its learned distribution is critical for safe and reliable deployment. Recent stateof-the-art out-of-distribution (OOD) detection methods leverage activation shaping to improve the separation between in-distribution (ID) and OOD inputs. These approaches resort to sample-specific scaling but apply a static percentile threshold across all samples regardless of their nature. In this work, we propose AdaSCALE, an adaptive scaling procedure that dynamically adjusts the percentile threshold based on a sample's estimated OOD likelihood. This estimation leverages our key observation that OOD samples exhibit significantly more pronounced activation shifts at high-magnitude activations under minor perturbation compared to ID samples. AdaSCALE enables stronger scaling for likely ID samples and weaker scaling for likely OOD samples, creating highly separable energy scores. Our approach achieves state-of-the-art OOD detection performance, outperforming the latest rival OptFS by14.94%in near-OOD and21.67%in far-OOD datasets in average FPR@95 metric in the ImageNet-1k benchmark across eight diverse architectures.

Abstract:
Graph Neural Networks (GNNs) excel on homophilous graphs but often fail under heterophily due to selfreinforcing and phase-inconsistent signals. We propose a Gauge-Equivariant Graph Network with Self-Interference Cancellation (GESC), which replaces additive aggregation with a projection-based interference mechanism. Unlike prior magnetic or gauge-equivariant GNNs that rely on additive message mixing, GESC explicitly models self-interference arising from redundant low-frequency components. We show that the absence of interference handling in existing gauge-based GNNs is a primary driver of oversmoothing under gauge transport. We introduce a \mathrmU(1) phase connection followed by a rank-1 projection that suppresses self-parallel components before attention, and a sign-aware gate that regulates negatively aligned neighbors. Across diverse graph benchmarks, GESC consistently outperforms recent state-of-the-art models while offering a unified, interference-aware view of message passing. Our code is available at \hrefhttps://anonymous.4open.science/r/GESC-1B22this link.

Abstract:
Discrete flow models (DFMs) have been proposed to learn the data distribution on finite state space, offering a flexible framework as an alternative to discrete diffusion models. A line of recent work has studied samplers for discrete diffusion models, such as tauleaping and Euler solver. However, these samplers require a large number of iterations to control discretization error, since the transition rates are frozen in time and evaluated at the initial state within each time interval. Moreover, theoretical results for these samplers often require boundedness conditions of the transition rate or they focus on a specific type of source distributions. To address those limitations, we establish non-asymptotic discretization error bounds for those samplers without any restriction on transition rates and source distributions, under the framework of discrete flow models. Furthermore, by analyzing a one-step lower bound of the Euler sampler, we propose two corrected samplers: time-corrected sampler and location-corrected sampler, which can reduce the discretization error of tau-leaping and Euler solver with almost no additional computational cost. We rigorously show that the location-corrected sampler has a lower complexity than existing parallel samplers. We validate the effectiveness of the proposed method by achieving better generation quality with reduced inference time on simulations and text-to-image generation tasks.

Abstract:
Existing VisionLanguage-Action (VLA) models struggle with complex manipulation tasks requiring high dexterity and generalization, primarily due to their reliance on synthetic data with significant sim-to-real gaps or limited teleoperated demonstrations. To address this bottleneck, we propose leveraging human hands as a manipulator template, capitalizing on the rich dexterity and scalability present in web data of human manipulation. Our approach introduces physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, perspective spatial alignment for reasoning in a unified physical space, and post-training adaptation in physical environments. Additionally, we introduce a part-level motion tokenization method that achieves millimeter-level reconstruction accuracy to model precise hand trajectories serving as scalable motion primitives. To support our paradigm, we develop a comprehensive data curation pipeline that integrates heterogeneous sources into a large-scale dataset with millions of motion-based instructional instances. Empirically, our model demonstrates superior performance in hand motion generation and instruction following, adhering to favorable scaling laws with respect to model and data sizes. Importantly, we demonstrate promising capabilities to robotic dexterous manipulation, validating the effectiveness of bridging the human-robot embodiment gap.

Abstract:
Scaling autoregressive large language models (LLMs) has had an unprecedented impact, but at vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, which account for the majority of its parameters and execution FLOPs. To achieve this, we rework how computation is done on modern GPUs when sparsity is detected, introducing a set of new CUDA and Triton kernels that minimize computation and memory overheads during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. The code and kernels shared with this submission will be released under an opensource license to promote adoption and future research to turn sparsity into a new practical axis for the efficiency and scalability of modern foundation models.

Abstract:
Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closedsource system that evolves programs to improve bounds on open problems. However, it relies on ensembles of frontier LLMs to achieve new bounds and is a pure inference system that models cannot internalize the evolving strategies. We introduce ThetaEvolve, an open-source framework that simplifies and extends AlphaEvolve to efficiently scale both in-context learning and Reinforcement Learning (RL) at test time, allowing models to continually learn from their experiences in improving open optimization problems. ThetaEvolve features a single LLM, a large program database for enhanced exploration, batch sampling for higher throughput, lazy penalties to discourage stagnant outputs, and optional reward shaping for stable training signals, etc. ThetaEvolve is the first evolving framework that enable a small open-source model, like DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems (circle packing and first auto-correlation inequality) mentioned in AlphaEvolve. Besides, across two models and four open tasks, we find that ThetaEvolve with RL at test-time consistently outperforms inference-only baselines, and the model indeed learns evolving capabilities, as the RL-trained checkpoints demonstrate faster progress and better final performance on both trained target task and other unseen tasks. We will release our code publicly.

Abstract:
Standard diffusion models for graph generation typically rely on uniform timestepping, an approach that overlooks the non-homogeneous dynamics of distributional evolution on complex manifolds. In this paper, we present an information-geometric framework that reinterprets the diffusion sampling trajectory as a parametric curve on a Riemannian manifold. Our key observation is that the Fisher-Rao metric provides a principled measure of the intrinsic distance. By analyzing this metric, we derive the Drift Variation Score (DVS), a geometry-aware indicator that quantifies the instantaneous rate of distributional change. Unlike prior heuristic-based adaptive samplers, our DVS Sampler enforces a constant-speed traversal along the manifold, automatically allocating finer steps to regions of high curvature and larger steps to stable phases. This equal arc-length strategy ensures that each discretization step contributes equally to the information gain. Theoretical analysis demonstrates that DVS characterizes the local stiffness of the sampling dynamics in the Fisher-Rao sense. Experimental results on molecule and social network generation show that our method significantly improves structural fidelity and sampling efficiency.

Abstract:
Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we presentASyMOB, a highresolution dataset of35,368validated symbolic math problems spanning integration, limits, differential equations, series, and hypergeometrics. Unlike prior benchmarks,ASyMOBsystematically perturbs each seed problem using symbolic, numeric, and equivalence-preserving transformations, enabling a fine-grained assessment of generalization and robustness. Our evaluation reveals three key findings: (1) most models’ performance collapses under minor perturbations, while top systems exhibit an apparent regime shift in robustness; (2) integrated code tools stabilize performance, particularly for weaker models; and (3) we identify examples where Computer Algebra Systems (CAS) fail while LLMs succeed, as well as problems solved only via a hybrid LLM-CAS approach, highlighting a promising integration frontier.ASyMOBserves as a principled diagnostic tool for measuring and accelerating progress toward building verifiable, trustworthy AI for scientific discovery.

Abstract:
Objectcentric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated regarding these goals. Instead, most prior work focuses on evaluating OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations in existing benchmarks: (1) They provide limited insights on the representation usefulness of OCL models, and (2) localization and representation usefulness are assessed using disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.

Abstract:
The fundamental representational units (FRUs) of large language models (LLMs) remain undefined, limiting further understanding of their underlying mechanisms. In this paper, we introduce Atom Theory to systematically define, evaluate, and identify such FRUs, which we term atoms. Building on the atomic inner product (AIP), a nonEuclidean metric that captures the underlying geometry of LLM representations, we formally define atoms and propose two key criteria for ideal atoms: faithfulness ( R^2 ) and stability ( q^\star ). We further prove that atoms are identifiable under threshold-activated sparse autoencoders (TSAEs). Empirically, we uncover a pervasive representation shift in LLMs and demonstrate that the AIP corrects this shift to capture the underlying representational geometry, thereby grounding Atom Theory. We find that two widely used units, neurons and features, fail to qualify as ideal atoms: neurons are faithful ( R^2=1 ) but unstable ( q^\star=0.5 % ), while features are more stable ( q^\star=68.2 % ) but unfaithful ( R^2=48.8 % ). To find atoms of LLMs, leveraging atom identifiability under TSAEs, we show via large-scale experiments that reliable atom identification occurs only when the TSAE capacity matches the data scale. Guided by this insight, we identify FRUs with near-perfect faithfulness ( R^2=99.9 % ) and stability ( q^\star=99.8 % ) across layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B, satisfying the criteria of ideal atoms statistically. Further analysis confirms that these atoms align with theoretical expectations and exhibit substantially higher monosemanticity. Overall, we propose and validate Atom Theory as a foundation for understanding the internal representations of LLMs.

Abstract:
Incontext learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. Drawing on Cognitive Load Theory (CLT), we argue that this uninformative structure increases extraneous cognitive load, consuming finite working memory capacity that should be allocated to deep reasoning and attention allocation. To address this, we propose RePo, a novel mechanism that reduces extraneous load via context re-positioning. Unlike standard approaches, RePo utilizes a differentiable module, f_\phi, to assign token positions that capture contextual dependencies, rather than replying on pre-defined order. By continually pre-training on the OLMo-2 1B \& 7B models, we demonstrate that RePo consistently enhances performance on tasks involving noisy contexts, structured data, and longer context length, while maintaining competitive performance on general short-context tasks. Detailed analysis reveals that RePo successfully allocate higher attention to distant but relevant information, assign positions in dense and non-linear space, and capture the intrinsic structure of the input context. We will open-source the code and model weights.

Abstract:
Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliable epistemic uncertainty modelling. Existing methods for uncertainty modelling face a fundamental dilemma: Bayesian approaches provide principled estimates but remain computationally prohibitive, while efficient secondorder predictors lack rigorous derivations connecting their specific objectives to epistemic uncertainty quantification. To resolve this dilemma, we introduceDirichlet-approximatedpossibilisticposterior predictions (DAPPr), a principled framework leveraging possibility theory. We define a possibilistic posterior over parameters, projects this posterior to the prediction space via supremum operators, and approximates the projected posterior using learnable Dirichlet possibility functions. This projection-and-approximation strategy yields a simple training objective with closed-form solutions. Extensive experiments across diverse benchmarks demonstrate that our approach achieves competitive or superior uncertainty quantification performance compared to state-of-the-art evidential deep learning methods while maintaining both principled derivation and computational efficiency. Code will be available upon acceptance.

Abstract:
This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximumentropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds. We further characterize the identifiability of recovered rewards and bound how score estimation errors propagate to action preferences. Empirically, EnergyFlow achieves state-of-the-art imitation performance on various manipulation tasks while providing an effective reward signal for downstream reinforcement learning that outperforms both adversarial IRL methods and likelihood-based alternatives. These results show that the structural constraints required for valid reward extraction simultaneously serve as beneficial inductive biases for policy generalization. The code is available at https://anonymous.4open.science/r/EnergyFlow-FAE1.

Abstract:
We study the problem of transformation inversion on general Lie groups: a datum is transformed by an unknown group element, and the goal is to recover an inverse transformation that maps it back to the original data distribution. We take a probabilistic view and model the posterior over transformations as a Boltzmann distribution defined by an energy function on data space. To sample from this posterior, we introduce a diffusion process on Lie groups that keeps all updates onmanifold and only requires computations in the associated Lie alge- bra. Our method, Transformation-Inverting Energy Diffusion (TIED), relies on a new trivialized target-score identity that enables efficient score-based sampling of the transformation posterior. As a key application, we focus on test-time equivariance, where the objective is to improve the robustness of pretrained neural networks to input transformations. Experiments on image homographies and PDE symmetries demonstrate that TIED can restore transformed inputs to the training distribution at test time, showing improved performance over strong canonicalization and sampling baselines.

Abstract:
As largescale multi-agent systems evolve, the communication protocol layer has become a critical, yet understudied, component affecting system performance and reliability. Despite a range of protocols, such as JSON-RPC, A2A, ANP, and ACP, protocol selection remains ad hoc. To address this, we introduce ProtocolBench, a benchmark designed to evaluate agent communication protocols across task utility, communication overhead, system performance, and resilience under failure. ProtocolBench uses a three-layer architecture with protocol adapters for fair com- parison, diverse scenarios (e.g., document aggregation, collaborative coding), and detailed telemetry. Our results show protocol choice can impact task completion time by up to 36%, communication overhead by 3.5 seconds, and resilience with statistically observable differences. We also propose ProtocolRouter, a learnable protocol routing system that dynamically selects protocols based on runtime con- ditions, improving performance by up to 18% compared to individual protocols. Our findings highlight that hybrid protocol deployments outperform homogeneous ones by approximately 6.6%, with negligible protocol translation overhead. We release ProtocolBench as an open-source framework to standardize protocol evaluation and improve multi-agent system reliability at scale.

Abstract:
Multimodal large language models (MLLMs) extend the capabilities of large language models (LLMs) by combining heterogeneous model architectures to handle diverse modalities like images and audio. However, this inherent heterogeneity in MLLM model structure and data types makes makeshift extensions to existing LLM training frameworks unsuitable for efficient MLLM training, especially in distributed training. In this paper, we present ModalGlue, an efficient distributed MLLM training framework that contemplates MLLM's unique characteristics in both model and data parallelization. ModalGlue introduces frozenaware pipeline parallelism and workload-balanced context parallelism to improve MLLM training throughput. Our extensive evaluation shows that \name outperforms state-of-the-art solutions by 2.26× on average in terms of MLLM training throughput.

Abstract:
Despite recent progress in visionlanguage models (VLMs), existing approaches often fail to generate personalized responses based on the user's specific experiences, as they lack the ability to associate visual inputs with a user’s accumulated visual-textual context. We newly formalize this challenge as contextualized visual personalization, which requires the visual recognition and textual retrieval of personalized visual experiences by VLMs when interpreting new images. To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.

Abstract:
Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the selfattention mechanism, whose O(N^2) complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to \mathbf2.75× in long-context prefilling, confirming its practical viability. Code available at \urlhttps://anonymous.4open.science/r/pbs-attn-BB66.

Abstract:
Optimizing complex systems, ranging from LLM prompts to multiturn agents, traditionally requires labor-intensive manual iteration. We formalize this challenge as a stochastic generative optimization problem where a generative language model acts as the optimizer, guided by numerical rewards and text feedback. We introduce Prioritized Optimization with Local Contextual Aggregation (POLCA), a scalable framework designed to handle stochasticity in optimization—such as noisy feedback, sampling minibatches, and stochastic system behaviors—while effectively managing the unconstrained expansion of solution space. POLCA maintains a priority queue to manage the exploration-exploitation tradeoff, systematically tracking candidate solutions and their evaluation histories. To enhance efficiency, we integrate an \varepsilon-Net mechanism to maintain parameter diversity and an LLM Summarizer to perform meta-learning across historical trials. We evaluate our framework on diverse benchmarks, including \tau-bench (agent optimization), VeriBench (code translation) and KernelBench (CUDA kernel generation). Experimental results demonstrate that POLCA achieves robust, sample and time-efficient performance, consistently outperforming state-of-the-art algorithms in both deterministic and stochastic problems.

Abstract:
Visionlanguage-action models exhibit an inherent trade-off in action chunk length (``horizon''): longer horizons improve global foresight but degrade fine-grained local control, while shorter ones yield the opposite. To mitigate the trade-off, we propose a mixture of horizons (MoH) strategy. In brief, MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It offers three appealing benefits. i) Long-term foresight and short-term precision are jointly exploited within a single model. ii) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. iii) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5× higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based and one-step regression policies demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under mixed-task setting, \pi_0.5 with MoH reaches a new state-of-the-art with 99% average success rate on LIBERO after only 30k training iterations.

Abstract:
Vectorquantized variational autoencoders (VQ-VAEs) are discrete autoencoders that compress images into discrete tokens. However, they are difficult to train due to discretization. In this paper, we propose a simple yet effective technique dubbedGaussian Quant (GQ), which first trains a Gaussian VAE under certain constraints and then converts it into a VQ-VAE without additional training. For conversion, GQ generates random Gaussian noise as a codebook and finds the closest noise vector to the posterior mean. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic to train Gaussian VAEs for effective conversion, named the target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves previous Gaussian VAE discretization methods, such as TokenBridge. The source code is provided in the supplementary materials.

Abstract:
Many machine learning techniques rely on approximating a loss function's curvature, but this is notoriously hard to do at the scale of modern deep networks. Surprisingly, no previous work has exploited the curvature constraints that arise from well known weightspace symmetries in loss landscapes. By analytically averaging over group actions that leave the loss invariant, we construct structured Hessian approximations from single gradients that can be tractably estimated, stored, and inverted. The choice of user-specified symmetry group directly governs the trade-off between approximation accuracy and computational cost. Moreover, our framework provides a unifying theoretical lens for viewing existing methods; in particular, a specific choice of symmetry group recovers Shampoo/Muon-like curvature estimates. We validate our method on a range of network architectures, and deploy it to second-order optimization benchmarks, including a small language model. Our curvature estimation framework might find applications in other machine learning problems such as uncertainty estimation, continual learning, compression/pruning, training data attribution, and more.

Abstract:
PostTraining Quantization (PTQ) and Low-Rank Adaptation (LoRA) constitute the standard pipeline for efficient Large Language Model (LLM) deployment. However, applying them sequentially poses a problem: PTQ often leaves behind random noise that is spread out (across the model's weights) in a way LoRA can't easily fix, meaning that LoRA ends up wasting its limited capacity trying to fix uncorrectable noise instead of improving task performance. In this paper, we propose ProjQ, a novel framework for constraining quantization noise to the low-rank manifold via orthogonal subspace projection. We derive an efficient alternating algorithm that shapes the quantization noise into a low-rank structure, effectively offloading dominant error components to the subsequent adapter while minimizing the residual error in the orthogonal "uncorrectable" subspace. Our theoretical analysis demonstrates that ProjQ preserves strictly greater model plasticity for downstream tasks compared to standard PTQ. Extensive experiments on LLaMA-2 and Qwen2.5 confirm that ProjQ consistently outperforms existing methods in both quantization error compensation and downstream task fine-tuning, achieving up to 2× lower evaluation loss for compensation and matching the performance of standard 4-bit baselines on language modeling tasks with only 3 bits. The code is available on \urlhttps://github.com/yy9301/ProjQ.

Abstract:
Recent studies on hallucination detection have shown that hallucinationrelated signals are more strongly encoded in intermediate layers than in the final layer of large language models (LLMs). While a growing body of work has sought to exploit this property for hallucination detection, the problem of how to automate the selection of high-performing layers is underexplored, and the development of principled methods for this purpose remains an open challenge. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers and test corresponding criteria for automatic layer selection. We evaluate these criteria across two LLM architectures and five datasets, and find that none of them deliver satisfying performance. Instead, we propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), that is able to consistently identify optimal or near-optimal layers and outperforms the aforementioned criteria and existing hallucination detection baselines. This criterion is training-free and requires negligible computational overhead. Additionally, we study the generation behaviors of LLMs and introduce a simple yet effective truncation strategy, which further amplifies the hallucination-related signals and leads to substantial improvements in overall detection performance.

Abstract:
Safety alignment in large language models (LLMs) and multimodal large language models (MLLMs) is commonly assumed to operate as a nearbinary threshold mechanism. We challenge this assumption by revealing that safety behavior is governed by an \emphinstability region where small perturbations induce stochastic refusal decisions rather than deterministic outcomes. We develop a multi-metric diagnostic framework combining external and internal signals to characterize this instability. Through systematic experiments, we identify a characteristic \emphdiagnostic signature: inputs in unstable regimes exhibit elevated output uncertainty yet \emphdecreased internal safety activation, a decoupling phenomenon that explains why detection-based defenses fail against sophisticated attacks. Building on this framework, we introduce Furina, a jailbreak attack that deliberately induces this signature through fragmented, scene-anchored prompts without model-specific optimization. Furina outperforms strong single-turn and multi-turn baselines on HarmBench and achieves competitive results on MM-SafetyBench, demonstrating that uncertainty amplification provides a principled and transferable mechanism for understanding safety vulnerabilities. Code and supplementary materials: \urlhttps://anonymous.4open.science/r/Furina_Jailbreak-EF7C.

Abstract:
Machine unlearning seeks to remove the influence of specified data from a trained model. While the unlearning accuracy provides a widely used metric for assessing unlearning performance, it falls short in assessing the reliability of forgetting. In this paper, we find that the forgetting data points misclassified by unlearning accuracy still have their ground truth labels included in the conformal prediction set from the uncertainty quantification perspective, leading to a phenomenon we term fake forgetting. To address this issue, we propose a novel metric CR, inspired by conformal prediction, that offers a more reliable assessment of forgetting quality. Building on these insights, we further propose an unlearning framework CPU that incorporates conformal prediction into the Carlini & Wagner adversarial attack loss, enabling the ground truth label to be effectively removed from the conformal prediction set. Through extensive experiments on image classification tasks, we demonstrate both the effectiveness of our proposed metric and the superior forgetting quality achieved by our framework. Code is available at https://anonymous.4open.science/r/MUCP60E4.

Abstract:
Earlyexit neural networks~(EENNs) accelerate inference by allowing intermediate classifiers to stop computation once predictions are confident enough. Most methods rely on confidence thresholds for exiting, and consequently, classifier calibration is widely assumed to improve performance. In this work, we challenge this assumption and show that calibration is often not suitable for EENNs through a detailed theoretical study. To address the limitations of calibration, we introduce Early-Exit Failure Prediction~(EEFP), which accounts for both prediction correctness and the cost of further computation. We also propose a lightweight, EEFP-motivated procedure to improve the intermediate classifiers, which can directly replace calibration in EENNs. Extensive experiments demonstrate that our approach achieves superior cost–accuracy trade-offs than calibration and discuss how EEFP measures more reliably reflect overall EENN performance.

Abstract:
Anomaly detection (AD) under data contamination is critical for deploying unsupervised defect detection in industrial environments, where curating perfectly clean training sets is impractical. However, existing methods are sensitive to contamination, suffering significant performance degradation as the noise ratio increases. In this paper, we propose MemoryDistilled Selection (MeDS), a training algorithm based on data selection. MeDS constructs an ensemble of partial memories via random subsampling, where the resulting sparsity acts as a low-pass filter that captures nominal patterns across a wide range of noise ratios, enabling coarse-level identification of contaminated samples. The aggregated distances to the bootstrapped memories are then distilled into a reconstruction score network, which is subsequently fine-tuned on clean data filtered using scores from the distilled model, enabling fine-grained localization of anomalies. MeDS is robust across a wide range of noise ratios without requiring noise-ratio-specific hyperparameter tuning, achieving 99.16% image-level AUROC on MVTecAD at a 40% noise ratio, and attaining state-of-the-art performance on both VisA and Real-IAD under noisy settings. We thoroughly verify the efficacy of MeDS on industrial AD benchmarks under noisy data scenarios, accompanied by in-depth empirical analyses. The full code will be open-sourced.

Abstract:
Clinical brainto-text interfaces are designed for paralysed patients who cannot provide extensive training recordings. Pre-training improves data-efficient generalisation by learning statistical priors across subjects, but these priors critically depend on context. While natural speech might unfold gradually over minutes, most methods pre-train with only a few seconds of context. Thus, we proposeMEG-XL, a model pre-trained with 2.5 minutes of MEG context per sample, 5-300× longer than prior work, and equivalent to 191k tokens, capturing extended neural context. Fine-tuning on the task of word decoding from brain data, MEG-XL matches supervised performance with a fraction of the data (e.g. 1hr vs 50hrs) and outperforms brain foundation models. We find that models pre-trained with longer contexts learn representations that transfer better to word decoding. Our results indicate that long-context pre-training helps exploit extended neural context that other methods unnecessarily discard.

Abstract:
Despite strong zeroshot performance, SAM is unreliable under domain shift due to Mask-level Confidence Confusion (MCC), where a single IoU-based mask score fails to reflect pixel-wise reliability near boundaries. Motivated by the contrast between texture-biased shortcuts in neural networks and shape-centric processing in human vision, we model out-of-domain variation as appearance shifts and non-rigid deformations that jointly stress calibration. We propose Segment Anything with Robust Uncertainty-Accuracy Correlation (RUAC) for robust pixel-wise uncertainty estimation under appearance and deformation shifts. RUAC adds a lightweight uncertainty head, trains it with a collaborative style-deformation attack that jointly perturbs texture and geometry, and applies Uncertainty-Error Alignment to ensure uncertainty consistently highlights erroneous pixels even under adversarial perturbations. Across 23 zero-shot domains, RUAC improves segmentation quality and yields more faithful uncertainty with stronger uncertainty-accuracy correlation.

Abstract:
While linearcomplexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We discover that Test-Time Training (TTT) shares a similar structure with Softmax attention, enabling direct weight inheritance. To further align representational properties— shift-invariance with keys and locality, we introduce instance normalization and locality enhancement module to better approximate the pretrained feature space. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T^5. With only 1 hour fine-tuning on 4×H20, SD3.5-T^5 achieves a DPG-Bench score of 84.43 (vs.\ 83.83 for the original) while accelerating inference by 1.32× and 1.47× at 1K and 2K resolutions.

Abstract:
Representation intervention has emerged as a promising paradigm for aligning large language models toward desired behaviors without modifying model weights. Existing methods typically apply a fixed intervention uniformly across all inputs. However, we find that the appropriate intervention direction and strength vary substantially across samples, and such indiscriminate intervention leads to degradation of general capabilities on benign inputs. To address these challenges, we propose MultiAdapter Representation Interventions via Energy Calibration (MARI). Specifically, we introduce a competitive multi-adapter mechanism in which specialized experts capture non-linear correction patterns and adaptively determine the appropriate intervention direction and strength for different samples. Furthermore, we design an energy-based gating module that leverages internal propagation dynamics to distinguish inputs that are applicable for intervention. Extensive experiments across diverse model families and parameter scales demonstrate that MARI achieves state-of-the-art alignment performance. Our method significantly improves performance on TruthfulQA, BBQ, and safety benchmarks, while maintaining and even improving general capabilities on tasks such as MMLU and ARC. Our code will be released upon acceptance.

Abstract:
Inference time optimization techniques, such as repeated sampling, have significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, the critical role of model uncertainty remains largely underexplored in these optimization strategies. In this paper, we investigate the dynamics of confidence along reasoning trajectories and for first time reveal a surprising and unique pattern: correct answer traces tend to exhibit confidence improvement over time (positive confidence gain), while incorrect traces show attenuated or declining confidence as reasoning proceeds. Based on this observation, we propose Confidence Dynamic Gain (CDG) based voting, which incorporates how the confidence trajectory of the response evolves along the reasoning chain. Experiments across four opensource architectures (DeepSeek-R1, gpt-oss, Gemma-3, Qwen-QwQ) on the AIME24/25, HMMT25, and BRUMO25 benchmarks demonstrate that CDG yields a significant performance boost over baselines. These results demonstrate that our method provides a robust discriminative signal for improving answer selection in LLM reasoning. We also provide theoretical insights for this phenomenon. Code is in the supplementary material.

Abstract:
We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT performs well for autoregressive models, its use in DLMs faces challenges. Our observation and analysis reveal that vanilla SFT does not consider learnability, i.e., what and when tokens are learned. Specifically, we observe that rare tokens are difficult to learn when most of the input is masked. In contrast, it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. To consider learnability, we propose LIFT, a learnabilityinformed fine-tuning strategy for DLMs. LIFT learns easy tokens when most of the input is noisy and hard tokens when more input is available, thereby aligning training with information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME'24 and AIME'25.

Abstract:
The goal of singlechannel source separation is to reconstruct K sources given their mixture. In supervised settings where vast amounts of clean source data are available, this challenging, ill-posed problem has been addressed successfully by generative diffusion and flow-based prior models. However, access to such clean source samples is often limited. To bridge this gap, we present Separation via Unsupervised Remixing Flow (SURF), an unsupervised flow matching approach for source separation that learns directly from observed mixtures. This method relies on a novel combination of state-of-the-art supervised flow matching and regression-based self-supervised techniques. At a high level, starting from a teacher model, we utilize a ``remixing'' step to bootstrap the learning of a student flow model from the teacher's estimates. We provide insights into the objectives optimized by this approach and draw a novel connection to the Wake-Sleep algorithm. Empirical evaluations on image and audio benchmarks demonstrate that SURF establishes a new state-of-the-art, significantly outperforming existing unsupervised methods.

Abstract:
Autonomous Graphical User Interface (GUI) agents often struggle with multistep tasks due to constrained context windows and static policies that fail to adapt to dynamic environments. To address these limitations, this work proposes the Self-Evolving GUI Agent (SE-GA), a novel framework that integrates hierarchical memory structures with an iterative self-improvement mechanism. At the core of our approach is Test-Time Memory Extension (TTME), which facilitates long-term planning by dynamically retrieving episodic, semantic, and experiential memories to provide salient contexts during inference. To ensure continuous learning, we introduce Memory-Augmented Self-Evolution (MASE), which is a training pipeline that adopts the data collected by TTME to stabilize and enhance the agent's foundational policy. Extensive evaluations across both offline and online benchmarks demonstrate SE-GA achieves state-of-the-art performance, reaching success rates of 89.0% on ScreenSpot and 75.8% on the challenging AndroidControl-High dataset. Furthermore, significant improvements on the AndroidWorld benchmark highlight the superior generalization to dynamic environments.

Abstract:
Transformerbased world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without explicitly modeling correspondence between tokens across time. We formulate next-frame prediction as a structured probabilistic inference problem with latent token correspondence variables, deriving a model in which each next-frame token is explained either by copying a token from the previous frame or by generating a new token. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We plan to release our source code on GitHub upon acceptance.

Abstract:
Meshbased simulations provide high-fidelity solutions to partial differential equations (PDEs), but achieving such accuracy typically requires fine meshes, leading to substantial computational overhead. Super-resolution techniques aim to mitigate this cost by reconstructing high-resolution (HR), high-fidelity solutions from low-cost, low-resolution (LR) counterparts. However, training neural networks for super-resolution often demands large amounts of expensive HR supervision data. To address this challenge, we propose SuperMeshNet, an HR data-efficient super-resolution framework for mesh-based simulations aided by message passing neural networks (MPNNs). As its core, SuperMeshNet introduces complementary learning, a semi-supervised approach that effectively leverages both 1) a small amount of paired LR-HR data and 2) abundant unpaired LR data via two jointly trained, complementary MPNN-based models. Additionally, our model is enriched by inductive biases, which is empirically shown to further improve super-resolution performance. Extensive experiments demonstrate that SuperMeshNet requires 90% less HR data to achieve even lower root mean square error (RMSE) than that of the fully supervised benchmark without the inductive biases. The source code and datasets are available at https://anonymous.4open.science/r/SuperMeshNet/README.md.

Abstract:
Modern Transformers predominantly adopt the PreNorm paradigm for its optimization stability, foregoing the superior potential of the unstable Post-Norm architecture. Prior attempts to combine their strengths typically lead to a stability-performance trade-off. We attribute this phenomenon to a structural incompatibility within asingle-streamdesign: Any application of the Post-Norm operation inevitably obstructs the clean identity gradient preserved by Pre-Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, atwo-streamarchitecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre-Norm and Post-Norm by enabling all residual blocks to receive combined gradients inherited from both paradigms, where one stream secures stability while the other enhances expressivity. Extensive pre-training experiments on 1.3B-parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines.

Abstract:
Mixtureof-Experts(MoE) Vision-Language Models(VLMs) offer remarkable performance but incur prohibitive memory and computational costs, making compression essential. Post-Training Quantization (PTQ) is an effective training-free technique to address the massive memory and computation overhead. Existing quantization paradigms fall short as they are oblivious to two critical forms of heterogeneity: the inherent discrepancy between vision and language tokens, and the non-uniform contribution of different experts. To bridge this gap, we introduce Visual Expert Quantization (VEQ), a dual-aware quantization framework designed to simultaneously accommodate cross-modal differences and heterogeneity between experts. Specifically, VEQ incorporates 1)Modality-expert-aware Quantization, which utilizes expert activation frequency to prioritize error minimization for pivotal experts, and 2)Modality-affinity-aware Quantization, which constructs an enhanced Hessian matrix by integrating token-expert affinity with modality information to guide the calibration process. Extensive experiments across diverse benchmarks verify that VEQ consistently outperforms state-of-the-art baselines. Specifically, under the W3A16 configuration, our method achieves significant average accuracy gains of 2.04% on Kimi-VL and 3.09% on Qwen3-VL compared to the previous SOTA quantization methods, demonstrating superior robustness across various multi-modal tasks.

Abstract:
Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs, such as parallel decoding and randomorder generation. However, realizing these benefits in practice is non-trivial, as dLLMs inherently face an accuracy-parallelism trade-off. Despite increasing interest, existing methods typically focus on only one-side of the coin, targeting either efficiency or performance. To address this limitation, we propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism: (i) during training, we introduce pseudo-trajectory distillation to teach the model which tokens can be decoded confidently at early steps, thereby improving parallelism; (ii) during inference, we employ entropy-based multi-block decoding with a KV-cache refresh mechanism to achieve high parallelism while maintaining accuracy. To better evaluate dLLMs, we also introduce AUP (Accuracy Under Parallelism), a new metric that jointly measures accuracy and parallelism. Experiments demonstrate that our d3LLM achieves up to 10× speedup over vanilla LLaDA/Dream, and up to 5× speedup over the AR models (Qwen-2.5-7B) without much accuracy degradation.

Abstract:
Kronecker adapters have emerged as a promising approach for finetuning large-scale models, enabling high-rank updates through tunable component structures. However, existing work largely treats the component structure as a fixed or heuristic design choice, leaving the dimensions and number of Kronecker components underexplored. In this paper, we identify component structure as a key factor governing the capacity of Kronecker adapters. We perform a fine-grained analysis of both the dimensions and number of Kronecker components. In particular, we show that the alignment between Kronecker adapters and full fine-tuning depends on component configurations. Guided by these insights, we propose Component Designed Kronecker Adapters (CDKA). We further provide parameter-budget–aware configuration guidelines and a tailored training stabilization strategy for practical deployment. Experiments across various natural language processing tasks demonstrate the effectiveness of CDKA.

Abstract:
To support longterm interaction in complex environments, LLM agents require memory systems that manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which distills unstructured interactions into compact, multi-view indexed memory units; (2) Online Semantic Synthesis, an intra-session process that instantly integrates related context into unified abstract representations to eliminate redundancy; and (3) Intent-Aware Retrieval Planning, which infers search intent to dynamically determine retrieval scope and construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% in LoCoMo while reducing inference-time token consumption by up to 30×, demonstrating a superior balance between performance and efficiency.

Abstract:
Conformal prediction (CP) offers distributionfree uncertainty quantification for machine learning models, yet its interplay with fairness in downstream decision-making remains underexplored. Moving beyond CP as a standalone operation (procedural fairness), we analyze the holistic decision-making pipeline to evaluate substantive fairness—the equity of downstream outcomes. Theoretically, we derive an upper bound that decomposes prediction-set size disparity into interpretable components, clarifying how label-clustered CP helps control method-driven contributions to unfairness. To facilitate scalable empirical analysis, we introduce an LLM-in-the-loop evaluator that approximates human assessment of substantive fairness across diverse modalities. Our experiments reveal that label-clustered CP variants consistently deliver superior substantive fairness. Finally, we empirically show that equalized set sizes, rather than coverage, strongly correlate with improved substantive fairness, enabling practitioners to design more fair CP systems.

Abstract:
Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chainof-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step-level information, such as reasoning direction and semantic transitions. In this work, we propose step-level sparse autoencoder~(\name), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction, which splits incremental information from background information and disentangles it into several sparsely activated dimensions. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features. By linear probing, we can easily predict surface-level information, such as generation length and first token distribution, as well as more complicated properties, such as the correctness and logicality of the step.These observations indicate that LLMs should already at least partly know about these properties during generation, which provides the foundation for the self-verification ability of LLMs.

Abstract:
Hyperparameter Optimization (HPO) can lift the burden of tuning hyperparameters (HPs) of neural networks. HPO algorithms from the Population Based Training (PBT) family are efficient thanks to dynamically adjusting HPs every few steps of the weight optimization. Recent results indicate that the number of steps between HP updates is an important metaHP of all PBT variants that can substantially affect their performance. Yet, no method or intuition is available for efficiently setting its value. We introduce Iterated Population Based Training (IPBT), a novel PBT variant that automatically adjusts this HP via restarts that reuse weight information in a task-agnostic way and leverage time-varying Bayesian optimization to reinitialize HPs. Evaluation on 8 image classification and reinforcement learning tasks shows that, on average, our algorithm matches or outperforms 5 previous PBT variants and other HPO algorithms (random search, ASHA, SMAC3), without requiring a budget increase or any changes to its HPs.

Abstract:
Neural scene representations achieve highfidelity rendering by encoding 3D scenes as continuous functions, but their latent spaces are typically implicit and globally entangled, making localized editing and physically grounded manipulation difficult. While several works introduce explicit control structures or point-based latent representations to improve editability, these approaches often suffer from limited locality, sensitivity to deformations, or visual artifacts. In this paper, we introduce Affine-Equivariant Kernel Space Encoding (EKS), a spatial encoding for neural radiance fields that provides localized, deformation-aware feature representations. Instead of querying latent features directly at discrete points or grid vertices, our encoding aggregates features through a field of anisotropic Gaussian kernels, each defining a localized region of influence. This kernel-based formulation enables stable feature interpolation under spatial transformations while preserving continuity and high reconstruction quality. To preserve detail without sacrificing editability, we further propose a training-time feature distillation mechanism that transfers information from multi-resolution hash grid encodings into the kernel field, yielding a compact and fully grid-free representation at inference. This enables intuitive, localized scene editing directly via Gaussian kernels without retraining, while maintaining high-quality rendering.

Abstract:
Recent advances in largescale diffusion models have intensified concerns about their potential misuse, particularly in generating realistic yet harmful or socially disruptive content. This challenge has spurred growing interest in effective machine unlearning, the process of selectively removing specific knowledge or concepts from a model without compromising its overall generative capabilities. Among various approaches, Low-Rank Adaptation (LoRA) has emerged as an effective and efficient method for fine-tuning models toward targeted unlearning. However, LoRA-based methods often exhibit limited adaptability to concept semantics and struggle to balance removing closely related concepts with maintaining generalization across broader meanings. Moreover, these methods face scalability challenges when multiple concepts must be erased simultaneously. To address these limitations, we introduce UnHype, a framework that incorporates hypernetworks into single- and multi-concept LoRA training. The proposed architecture can be directly plugged into Stable Diffusion as well as modern flow-based text-to-image models, where it demonstrates stable training behavior and effective concept control. During inference, the hypernetwork dynamically generates adaptive LoRA weights based on the CLIP embedding, enabling more context-aware, scalable unlearning. We evaluate UnHype across several challenging tasks, including object erasure, celebrity erasure, and explicit content removal, demonstrating its effectiveness and versatility.

Abstract:
Preference optimization for diffusion models relies on reward functions that are both discriminative and computationally efficient. VisionLanguage Models (VLMs) have emerged as powerful reward providers. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves competitive performance compared to state-of-the-art VLMs while maintaining a substantially lower computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.

Abstract:
Zerothorder optimizers have recently emerged as an attractive approach for fine-tuning large language models (LLMs), as they avoid backpropagation and can substantially reduce memory overhead relative to standard first-order training. However, existing zeroth-order methods rely on hand-crafted, static sampling strategies that are not adaptable to model-specific structures. To address this, we propose ZO Fine-tuner, a learning-based zeroth-order optimizer for LLMs that automatically learns efficient perturbation strategies through a compact and memory-efficient design. Motivated by the fact that a small set of base LLMs is repeatedly fine-tuned across tasks, ZO Fine-tuner supports one-time per-model training and reuse across downstream tasks with minimal overhead. Therefore, learning the optimizer once for a given LLM and reusing it across diverse downstream tasks is both feasible and highly desirable. Accordingly, ZO Fine-tuner is designed to scale learning to learn (L2L) to the foundation-model era by supporting one-time per-model training with minimal overhead. Experiments on 4 LLMs and 7 datasets show that ZO Fine-tuner outperforms prior zeroth-order baselines in 82.1% of task-model combinations, thereby demonstrating strong performance and scalability for efficient LLM fine-tuning.

Abstract:
Agentic Web is an emerging paradigm where autonomous agents help users use online information. As the paradigm develops, content providers are also deploying agents to manage their data and serve it through controlled interfaces. This shift moves information access from centralized retrieval to decentralized coordination. To study this setting, we introduce AgentWebBench, a benchmark that evaluates how well a user agent synthesizes answers by interacting with websitespecific content agents. We evaluate four tasks that cover common web information needs, spanning ranked retrieval (web search, web recommendation) and open-ended synthesis (question answering, deep research). Across seven advanced LLMs and three coordination strategies, multi-agent coordination generally lags behind centralized retrieval as expected, because user agent cannot directly access the corpus, but the gap shrinks with model scale and can even outperform centralized retrieval on question answering. This benchmark also enables us to study properties of the emerging paradigm of the digital world. We find that decentralized access concentrates traffic toward a small set of websites, test time scaling improves both interaction reliability and task performance, and strong results require sufficient interactions guided by careful planning. Finally, our failure analysis suggests that user agents need better planning and answer synthesis, while content agents need more reliable retrieval and evidence quality. Code, data, and APIs will be released online.

Abstract:
While Visual MultiAgent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence reveals a counter-intuitive "scaling wall": increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text-centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To this end, we propose L^2-VMAS, a novel model-agnostic framework that enables inter-agent collaboration with dual latent memories. Based on such a design, we decouple the perception and thinking while dynamically synthesizing dual latent memories. Additionally, we introduce an entropy-driven proactive triggering that replaces passive information transmission with efficient, on-demand memory access. Extensive experiments among backbones, sizes, and multi-agent structures demonstrate that our method effectively breaks the "scaling wall" with superb scalability, improving average accuracy by 2.7-5.4% while reducing total token usage by 21.3-44.8%. The codes will be publicly released.

Abstract:
Generating diverse responses is crucial for testtime scaling of large language models (LLMs), yet standard stochastic sampling mostly yields surface-level lexical variation, limiting semantic exploration and risking omission of correct solutions. In this paper, we propose Exploratory Sampling (ES), a decoding approach that explicitly encourages semantic diversity during generation. ES is motivated by the observation that neural networks tend to make more accurate predictions on inputs similar to those encountered before, and incur higher prediction error on novel ones. Building on this property, we train a lightweight Distiller at test time to predict deep-layer hidden representations of the LLM from its shallow-layer representations. During decoding, the Distiller continuously adapts to the mappings induced by the current generation context. ES uses the prediction error as a novelty signal to reweight candidate token extensions conditioned on the current prefix, thereby biasing decoding toward less-explored semantic patterns. ES is implemented with an asynchronous training–inference pipeline and introduces less than 5% throughput overhead in standard serving scenarios. Empirical results show that \ES achieves robust generalization across mathematics, science, and code generation benchmarks. Notably, it breaks the trade-off between diversity and coherence in creative writing, and significantly boosts the Pass@k efficiency of reasoning models, showing superior or comparable per- formance to strong stochastic and heuristic baselines.

Abstract:
Continual learning, especially classincremental learning (CIL), on the basis of a pre-trained model (PTM) has garnered substantial research interest in recent years. However, how to effectively learn both discriminative and comprehensive feature representations while maintaining stability and plasticity over very long task sequences remains an open problem. We propose \mathbfCaRE, a scalable \mathbfContinual Le\mathbfarner with efficient Bi-Level \mathbfRouting Mixture-of-\mathbfExperts (BR-MoE). The core idea of BR-MoE is a bi-level routing mechanism: a router selection stage that dynamically activates relevant task-specific routers, followed by an expert routing phase that dynamically activates and aggregates experts, aiming to inject discriminative and comprehensive representations into every intermediate network layer. On the other hand, we introduce a challenging evaluation protocol for comprehensively assessing CIL methods across very long task sequences spanning hundreds of tasks. Extensive experiments show that CaRE demonstrates leading performance across a variety of datasets and task settings, including commonly used CIL datasets with classical CIL settings (e.g., 5-20 tasks). To the best of our knowledge, CaRE is the first continual learner that scales to very long task sequences (ranging from 100 to over 300 non-overlapping tasks), while outperforming all baselines by a large margin on such task sequences.

Abstract:
Generalized Category Discovery (GCD) aims to identify both known and novel categories within unlabeled data by leveraging a set of labeled examples from known categories. Existing GCD methods primarily depend on semantic labels and global image representations, often overlooking the detailed partlevel cues that are crucial for distinguishing closely related categories. In this paper, we introduce PartCo, short for Part-Level Correspondence Prior, a novel framework that enhances category discovery by incorporating part-level visual feature correspondences. By leveraging part-level relationships, PartCo captures finer-grained semantic structures, enabling a more nuanced understanding of category relationships. Importantly, PartCo seamlessly integrates with existing GCD methods without requiring significant modifications. Our extensive experiments on multiple benchmark datasets demonstrate that PartCo significantly improves the performance of current GCD approaches, outperforming most existing methods by bridging the gap between semantic labels and part-level visual compositions, thereby setting new benchmarks for GCD. Code will be made publicly available.

Abstract:
Poissondistributed latent variable models are widely used in computational neuroscience, but differentiating through discrete stochastic samples remains challenging. Two approaches address this:Exponential Arrival Time(EAT) simulation andGumbel-SoftMax(GSM) relaxation. We provide the first systematic comparison of these methods, along with practical guidance for practitioners. Our main technical contribution is a modification to the EAT method that theoretically guarantees an unbiased first moment (exactly matching the firing rate), and reduces second-moment bias. We evaluate these methods on their distributional fidelity, gradient quality, and performance on two tasks: (1) variational autoencoders with Poisson latents, and (2) partially observable generalized linear models, where latent neural connectivity must be inferred from observed spike trains. Across all metrics, our modified EAT method exhibits better overall performance (often comparable to exact gradients), and substantially higher robustness to hyperparameter choices. Together, our results clarify the trade-offs between these methods and offer concrete recommendations for practitioners working with Poisson latent variable models.

Abstract:
The success of HyperConnections (HC) in neural networks (NN) has also highlighted issues related to its training instability and restricted scalability. The Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope, however, it faces two issues: 1) its iterative Sinkhorn-Knopp (SK) algorithm does not always yield exact doubly stochastic residual matrices; 2) mHC incurs a prohibitive \mathcalO(n^3C) parameter complexity with n as the width of the residual stream and C as the feature dimension. The recently proposed mHC-lite reparametrizes the residual matrix via the Birkhoff-von-Neumann theorem to guarantee double stochasticity, but also faces a factorial explosion in its parameter complexity, \mathcalO \left( nC \cdot n! \right). To address both challenges, we propose KromHC, which uses the \underline\textKronecker products of smaller doubly stochastic matrices to parametrize the residual matrix in \underline\textmHC. By enforcing manifold constraints across the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing parameter complexity to \mathcalO(n^2C). Comprehensive experiments demonstrate that KromHC matches or even outperforms state-of-the-art (SOTA) mHC variants, while requiring significantly fewer trainable parameters.

Abstract:
Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inferencetime costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token’s conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Our results demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://anonymous.4open.science/r/agent_rl-107E/.

Abstract:
Contemporary AI faces the challenge of balancing generality with userspecific personalization. In federated learning (FL), this challenge is amplified by highly heterogeneous client data with complex non-IID patterns beyond standard modeling assumptions. Many existing FL methods are designed for relatively restricted heterogeneity settings (e.g., a fixed number of clusters or a fixed form of personalization), limiting their robustness under complex structures. In this work, we study FL from a \emphmulti-level non-IID perspective, where client similarity is approximated by multiple granularities of shared knowledge: global, subgroup, and client-specific components. This view captures coarse-to-fine relationships while requiring less prior knowledge of task boundaries. Building on this insight, we propose \emphFederated Multi-level Additive Modeling (FeMAM), which learns multiple levels of shareable models and constructs personalized predictors via additive composition across levels. To move beyond a fixed structure, FeMAM allows models to grow and be pruned dynamically during training, adapting to diverse federated scenarios. Despite employing multiple models, FeMAM remains cost-friendly by activating only a small subset (one level) of models for training at a time. Extensive experiments show that FeMAM effectively approximates complex non-IID structures and consistently outperforms representative clustered and personalized FL baselines.

Abstract:
Following the rise of large foundation models, Vision–Language–Action models (VLAs) emerged, leveraging strong visual and language understanding for generalpurpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2 and OpenVLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. VLANeXt outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong generalization in real-world experiments. We will release a unified, easy-to-use codebase that serves as a common platform for the community to reproduce our findings, explore the design space, and build new VLA variants on top of a shared foundation.

Abstract:
Realworld user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO, which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent’s subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task oriented goal direction. Across diverse tasks including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent user collaboration.

Abstract:
Training robust reasoning visionlanguage models (VLMs) in rare domains (such as geospatial) is fundamentally constrained by supervision scarcity. While raw geospatial imagery is abundant, the amount of task-direct supervision falls far behind that of common domains. In this work, we validate an important conclusion: indirect verifiable rewards, derived from seemingly unrelated metadata, are sufficient to induce sophisticated and generalizable geospatial reasoning across a wide range of downstream tasks (25+). We present Geo-R1 as one empirical instantiation of this paradigm. Rather than relying on limited task-specific annotations (i.e., direct rewards), Geo-R1 utilizes scalable, verifiable indirect proxy rewards based on cross-view alignment with metadata (geolocation information) to drive reinforcement learning at scale. Such indirect rewards successfully motivate the model to discover and internalize zero-shot geospatial reasoning across diverse tasks, achieving extraordinary zero-shot transfer on out-of-distribution benchmarks and even surpassing fully supervised specialists on certain benchmarks. These findings indicate that optimizing for indirect verifiable rewards may provide a scalable pathway to unlock generalized reasoning capabilities in rare domains with massive unlabeled data archives. Our code is available at: \urlhttps://anonymous.4open.science/r/Geo-R1-ICML.

Abstract:
MultimodalAttributed Graph (MAG) learning has achieved remarkable success in modeling complex real-world systems by integrating graph topology with rich attributes from multiple modalities. With the rapid proliferation of novel MAG models capable of handling intricate cross-modal semantics and structural dependencies, establishing a rigorous and unified evaluation standard has become imperative. Although existing benchmarks have facilitated initial progress, they exhibit critical limitations indomain coverage,encoder flexibility,model diversity, andtask scope, presenting significant challenges to fair evaluation. To bridge this gap, we present OpenMAG, a comprehensive benchmark that integrates 19 datasets across 6 domains and incorporates 16 encoders to support both static and trainable feature encoding. OpenMAG further implements a standardized library of 24 state-of-the-art models and supports 8 downstream tasks, enabling fair comparisons within a unified framework. Through systematic assessment of necessity, data quality, effectiveness, robustness, and efficiency, we derive 14 fundamental insights into MAG learning to guide future advancements. Our code is available at https://anonymous.4open.science/r/OpenMAG-F703/.

Abstract:
Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarsegrained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of 6.28× and 1.64× respectively, while effectively preserving generation quality (VBench: 1%\downarrow and 0.01%\downarrow respectively). The code is available at https://anonymous.4open.science/r/MotionCache.

Abstract:
Everyday tasks come with a target, and pretraining models around this target is what turns them into experts. In this paper, we study targetoriented language model (LM) pretraining by introducing Neuron-ActivatedGraph Ranking (NAG-based Ranking), a training-free and interpretable framework for target pretraining data selection. Rather than using black-box representations, our approach directly characterizes each target input by a sparse set of high-impact neurons in any off-the-shelf LLMs. Concretely, we quantify neuron impact and select the most influential neurons across layers into a compactNeuron-ActivatedGraph(NAG), and rank candidate data by NAG similarity to target examples. We conduct experiments across six benchmarks, where our NAG-based Ranking improves target-oriented pretraining by 4.9% on average over random sampling, and also outperforms state-of-the-art baselines by 5.3% accuracy on HellaSwag. It also remains effective under a more applicable multi-target setting, where our best setup surpasses two baselines by 1.1% and 4.1%, respectively. Furthermore, we provide a comprehensive analysis onwhyandhowour NAG works, e.g., deactivating NAG-selected neurons (only 0.12% of all) causes a 23.5% performance collapse, and restricting NAG to the final layer incurs a 4.1% average drop, indicating that NAG captures a sparse ``functional backbone'' for learning target features.

Abstract:
Testtime strategies have shown remarkable success in improving large language models, but their application to next-token prediction (NTP) autoregressive (AR) image generation remains largely underexplored. Existing test-time scaling (TTS) methods for visual autoregressive models (VAR) rely on frequent partial decoding and external reward models, which are inefficient and often ineffective for NTP-based image generation due to the inherent instability of intermediate decoding results. To address these limitations, we proposeScalingAR, a novel test-time scaling framework tailored for NTP-based AR image generation. ScalingAR introducestoken entropyas a confidence signal and operates at two complementary levels: (i)Profile Level, integrates intrinsic uncertainty and conditional utilization into a unified confidence state, and (ii)Policy Level, leverages this state for adaptive trajectory pruning and dynamic guidance scheduling. Without requiring early decoding or auxiliary rewards, ScalingAR achieves significant improvements across diverse benchmarks. Experiments show that ScalingAR(I)improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench,(II)reduces visual token consumption by 62.0% while outperforming baselines, and(III)enhances robustness, mitigating performance degradation by 26.0% in challenging scenarios. These results establish \ourmethod as a robust and efficient test-time scaling solution for autoregressive image generation.

Abstract:
NVIDIA's 2:4 Sparse Tensor Cores deliver 2× throughput but demand 50% pruning—a ratio that collapses LLM reasoning accuracy (Qwen3: 54%→15%). Milder (2N2):2N patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, falling back to dense execution. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the (2N-2):2N model family on commodity GPUs. Our Sliding Window Decomposition losslessly rewrites any (2N-2):2N block into N-1 overlapping 2:4-compliant windows; activation lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, SlideSparse is evaluated across five GPUs (A100, H100, B200, RTX 4090, RTX 5080), three precisions (INT8, FP8, BF16), and the Llama/Qwen/BitNet model families. On compute-bound workloads, speedup approaches the theoretical N/(N-1) limit—Qwen2.5-7B with 6:8 sparsity achieves 1.33×, matching the bound exactly—establishing (2N-2):2N as a practical path to accuracy-preserving LLM acceleration.

Abstract:
While longhorizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long-horizon ALFWorld tasks, BEACON achieves 92.9% success rate, nearly doubling GRPO's 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is in supplementary materials and will be publicly released.

Abstract:
Unified hyperspectral image (HSI) restoration aims to recover diverse degradations within a single model. However, current methods often rely on impractical explicit priors or opaque blackbox representations that overfit to training distributions, hampering generalization to unseen scenarios. To bridge this gap, we propose Degradation-Aware Metric Prompting (DAMP), a novel framework that characterizes multi-dimensional degradations through interpretable spatial-spectral metrics. These metrics serve as Degradation Prompts (DP), enabling the model to capture shared characteristics across tasks and adapt to unknown corruptions. Central to our framework is the Degradation-Adaptive Mixture-of-Experts (DAMoE), where Spatial-Spectral Adaptive Modules (SSAMs) serve as experts that utilize learnable fusion coefficients to specialize in distinct degradation degrees. By using DP as a gating router, DAMoE dynamically activates specialized experts tailored to the specific degradation profile. Extensive experiments on natural and remote sensing HSI datasets demonstrate that DAMP achieves state-of-the-art performance and exhibits exceptional zero-shot generalization on unseen restoration tasks. The source code will be released.

Abstract:
Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce INTERACTCOMP, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expertcurated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes INTERACTCOMP a valuable resource for both evaluating and training interaction capabilities in search agents.

Abstract:
Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on nongenerative tasks but frequently fail in generative settings. To demystify how such discrepancies arise under pruning, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). While representations in the embedding and logit spaces are largely robust to pruning-induced perturbations, the subsequent nonlinear transformation from logits to the probability space amplifies such deviations, whose persistence across time steps leads to substantial degradation during generation. By contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection. Our representation-level analysis disentangles the effects of pruning across tasks and offers practical guidance on its application. Code will be released upon acceptance.

Abstract:
Search agents connect LLMs to the Internet, enabling them to access broader and more upto-date information. However, this also introduces a new threat surface: unreliable search results can mislead agents into producing unsafe outputs. Real-world incidents and our two in-the-wild observations show that such failures can occur in practice. To study this threat systematically, we propose SafeSearch, an automated red-teaming framework that is scalable, cost-efficient, and lightweight, enabling harmless safety evaluation of search agents. Using this, we generate 300 test cases spanning five risk categories (e.g., misinformation and prompt injection) and evaluate three search agent scaffolds across 17 representative LLMs. Our results reveal substantial vulnerabilities in LLM-based search agents, with the highest ASR reaching 90.5% for GPT-4.1-mini in a search-workflow setting. Moreover, we find that common defenses, such as reminder prompting, offer limited protection. Overall, SafeSearch provides a practical way to measure and improve the safety of LLM-based search agents.

Abstract:
LLM agents are increasingly deployed in longhorizon, complex environments to solve challenging problems, but this expansion exposes them to long-horizon attacks that exploit multi-turn user–agent–environment interactions to achieve objectives infeasible in single-turn settings. To measure agent vulnerabilities to such risks, we present AgentLAB, the first benchmark dedicated to evaluating LLM agent susceptibility to adaptive, long-horizon attacks. Currently, AgentLAB supports five novel attack types including intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning, spanning 28 realistic agentic environments, and 644 security test cases. Leveraging AgentLAB, we evaluate representative LLM agents and find that they remain highly susceptible to long-horizon attacks; moreover, defenses designed for single-turn interactions fail to reliably mitigate long-horizon threats. We anticipate that AgentLAB will serve as a valuable benchmark for tracking progress on securing LLM agents in practical settings.

Abstract:
Language agents have shown strong promise for task automation. Realizing this promise for increasingly complex, longhorizon tasks has driven the rise of a subagent-as-tools paradigm for multi-turn task solving. However, existing designs still lack a dynamic abstraction view of sub-agents, thereby hurting adaptability: sub-agents are either context-isolated threads that lack specialization, or static roles that require human-engineering. We address this challenge with a unified, framework-agnostic agent abstraction that models any agent as a tuple (Model, Task, Tools, Context). This tuple acts as a compositional recipe for capabilities, enabling the system to spawn specialized executors for each task on demand. Building on this abstraction, we introduce an agentic system AOrchestra, where the central orchestrator concretizes the tuple at each step: it curates task-relevant context, selects tools and models, and delegates execution via on-the-fly automatic agent creation. Such designs enable reducing human engineering efforts, and remain framework-agnostic with plug-and-play support for diverse agents as task executors. It also enables a controllable performance–cost trade-off, allowing the system to approach Pareto-efficient. Across three challenging benchmarks and environments (GAIA, SWE-Bench, Terminal-Bench), AOrchestra achieves 16.28% relative improvement against the strongest baseline when paired with Gemini-3-Flash.

Abstract:
Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPOinspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks will be released at https://anonymous.

Abstract:
UIto-code aims to translate UI screenshots into executable front-end code. Despite progress with vision-language models (VLMs), most existing methods formulate UI-to-code as a single-pass generation, which mismatches real-world UI development that is inherently iterative and feedback-driven. We reformulate UI-to-code as an interactive visual optimization problem, where code generation is embedded in a closed-loop process of execution, visual inspection, and iterative refinement driven by rendered visual feedback. To address the non-differentiability of visual objectives and the noise of absolute visual evaluators, we propose Relative Visual Policy Optimization (RVPO), a preference-based reinforcement learning method that optimizes relative visual rankings among rendered candidates under execution feedback. We instantiate this paradigm in UI2Code^\textN, an open-source 9B model trained via continual pre-training, supervised fine-tuning, and reinforcement learning. Experiments demonstrate state-of-the-art performance on UI drafting, UI polishing, and UI editing benchmarks, even outperforming larger models, with performance consistently improving through iterative visual optimization.

Abstract:
SAM3D enables scalable, openworld 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the first systematic investigation into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We demonstrate that these failures stem from neglecting the pipeline's inherent multi-level heterogeneity: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present Fast-SAM3D, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching to decouple structural evolution from sensitive layout updates; (2) Joint Spatiotemporal Token Carving to concentrate refinement on high-entropy regions; and (3) Spectral-Aware Token Aggregation to adapt decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to 2.67× end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation.

Abstract:
Classincremental learning (CIL) with pre-trained models (PTMs) faces a critical trade-off between efficient adaptation and long-term stability. While analytic learning enables rapid, recursive closed-form updates, its efficacy is often compromised by accumulated errors and feature incompatibility. In this paper, we first conduct a systematic study to dissect the failure modes of PTM-based analytic CIL, identifying representation rigidity as the primary bottleneck. Motivated by these insights, we proposeVILA, a novel dual-branch framework that advances analytic CIL via a two-level vision-language calibration strategy. Specifically, we coherently fuse plastic, task-adapted features with a frozen, universal semantic anchor at the feature level through geometric calibration, and leverage cross-modal priors at the decision level to rectify prediction bias. This confluence maintains AL's extreme efficiency while overcoming its inherent brittleness. Extensive experiments across eight benchmarks demonstrate that VILA consistently yields superior performance, particularly in fine-grained and long-sequence scenarios. Our framework harmonizes high-fidelity prediction with the simplicity of analytic learning. Our code will be made publicly available upon acceptance.

Abstract:
Existing prompt learning for VLMs exhibits a modality asymmetry, predominantly optimizing text tokens while still relying on frozen visual encoder as holistic extractor and neglecting the spectral granularity essential for finegrained discrimination. To bridge this, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details. A frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants, mitigating overfitting. Crucially, fine-grained discrimination is driven by counterfactual granule training: by permuting high-frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. Uniquely, SpecPL serves as a universal plug-and-play booster, revitalizing text-oriented baselines like CoOp and MaPLe via visual-side guidance. Experiments on 11 benchmarks demonstrate competitive state-of-the-art performance, achieving a new performance ceiling of 81.51% harmonic-mean accuracy. These results validate that spectral disentanglement with counterfactual supervision effectively bridges the gap in the stability-generalization trade-off.

Abstract:
Recent diffusion models achieve the stateof-the-art performance in image generation, but often suffer from semantic inconsistencies orhallucinations. While various inference-time guidance methods can enhance generation, they often operateindirectlyby relying on external signals or architectural modifications, which introduces additional computational overhead. In this paper, we proposeTangentialAmplifyingGuidance(TAG), a theoretically grounded, training-free, computationally lightweight, and architecture-agnostic guidance method that operates solely on trajectory signals without modifying the underlying diffusion model. TAG leverages an intermediate sample as a projection basis and amplifies the tangential components of the estimated scores with respect to this basis to correct the sampling trajectory. We formalize this guidance process via a first-order Taylor analysis, showing that tangential amplification steers the state toward higher-probability regions of the data manifold, thereby reducing inconsistencies and improving sample fidelity. TAG is a plug-and-play module that integrates into existing diffusion samplers with minimal additional computation, offering a new perspective on diffusion guidance.

Abstract:
Deep learning methods based on backward stochastic differential equations (BSDEs) have emerged as competitive alternatives to physicsinformed neural networks (PINNs) for solving high-dimensional partial differential equations (PDEs). By leveraging probabilistic representations, BSDE approaches can avoid the curse of dimensionality and often admit second-order-free training objectives that do not require explicit Hessian evaluations. It has recently been established that the commonly used Euler–Maruyama (EM) time discretization induces an intrinsic bias in BSDE training losses. While high-order schemes such as Heun can fully eliminate this bias, such schemes re-introduce second-order spatial derivatives and incur substantial computational overhead. In this work, we provide a principled analysis of EM-induced loss bias and propose an unbiased, second-order-free training framework that preserves the computational advantages of BSDE methods.

Abstract:
In everyday photography, aesthetically appealing moments are often captured with structural flaws (e.g., composition, camera viewpoint, or pose) that existing retouching and portrait enhancement methods cannot fix. We formulate Aesthetic Photo Reconstruction (APR) as improving a photo’s aesthetic quality via structural reconstruction while preserving subject identity and scene semantics. Although recent advances in image editing models make APR feasible, they often lack aesthetic understanding, yielding edits that are semantically plausible yet aesthetically weak. To address this, we propose AesFormer, a twostage framework that decouples aesthetic planning from image editing. In Stage 1, an aesthetic action model (AesThinker) analyzes the input along seven progressive photographic dimensions and outputs executable editing actions; we further apply GRPO-A to encourage broad exploration over diverse action plans beyond SFT. In Stage 2, an action-conditioned editor (AesEditor) performs structural edits guided by these actions. To support APR, we build a video-based corpus-mining pipeline (VCMP) and construct AesRecon, a benchmark of 9,071 strictly aligned (poor, good) image pairs. Experiments show that AesFormer substantially improves APR performance and is competitive with Nano Banana Pro.

Abstract:
We study Vector Linking: given two embedding clouds produced by different blackbox encoders over partially overlapping datasets, recover cross-model object correspondences using only vectors. Empirically and theoretically, we show that independently trained contrastive encoders exhibit local geometric consistency: short-range distances are approximately preserved up to a scale factor, while long-range distances are not due to model-specific distortion. Building on this, we propose an iterative, reference-based geometric embedding hashing that recovers vector links from a tiny seed set of paired anchors. It represents each vector by distances to sampled paired anchors, proposes candidate links via hash-space matching, and aggregates evidence across views in a Beta--Bernoulli posterior to bootstrap high-confidence links as new anchors. Experiments across multiple benchmarks and embedding model pairs demonstrate accurate and robust linking under varying overlap, seed budgets, and out-of-domain references, with applications to vector database integration and cross-model clustering.

Abstract:
Longcontext Large Language Model (LLM) inference has become the norm for today’s AI applications. However, it is severely bottlenecked by the increasing memory demands of its KV cache. Previous works have shown that self-speculative decoding with sparse attention, where tokens are drafted using a subset of the KV cache and verified in parallel with full KV cache, speeds up inference in a lossless way. However, this approach relies on standalone KV selection algorithms to select the KV entries used for drafting and overlooks that the criticality of each KV entry is inherently computed during verification. In this paper, we propose Vegas, a self-speculative decoding method with verification-guided sparse attention. Vegas identifies critical KV entries as a byproduct of verification and only loads these entries when drafting subsequent tokens. This not only improves draft token acceptance rate but also incurs low KV selection overhead, thereby improving decoding throughput. Vegas achieves 2.81× higher throughput over vanilla auto-regressive decoding and 1.29× improvement over state-of-the-art sparsity-based self-speculative decoding methods.

Abstract:
VisionLanguage Models (VLMs) inherit the auto-regressive generation paradigm and cache the keys and values (KV) of all previous tokens to accelerate inference, resulting in memory consumption that scales linearly with context length. This issue is particularly pronounced in VLMs due to substantial redundancy in the visual modality. Although KV cache eviction approaches can effectively reduce inference memory, they often incur significant performance degradation in VLMs, as most are designed for language models and overlook the inherent gap between text and vision. By systematically analyzing the modality gap in VLMs in this work, we argue that the importance of visual information should be grounded in textual guidance and accordingly propose aText-Grounded KV Eviction method forVLMs (TGV-KV). TGV-KV comprises three submodules:(1) Text-Vision Budgeting (TVB)assigns budget to each layer based on the mutual information interaction.(2) Text-Weighted Ranking (TWR)assesses the priority of text and ranks vision importance based on weighted text-image attention.(3) Text-Prioritised Retention (TPR)policy strategically preserves text KV to avoid acute information loss. We evaluate TGV-KV across five models with different sizes and architectures, showing that TGV-KV preserves 99.2% accuracy on the VizWiz-VQA task with LLaVA-NeXT and boosts decode throughput by 52.6% with an extreme retention budget of 5%. Code will be released.

Abstract:
In this paper, we study solution operators of physical field equations on geometric meshes from a functionspace perspective. We reveal that Hodge orthogonality fundamentally resolves spectral interference by isolating unlearnable topological degrees of freedom from learnable geometric dynamics, enabling an additive approximation confined to structure-preserving subspaces. Building on Hodge theory and operator splitting, we derive a principled operator-level decomposition. The result is a Hybrid Eulerian-Lagrangian architecture with an algebraic-level inductive bias we call Hodge Spectral Duality (HSD). In our framework, we use discrete differential forms to capture topology-dominated components and an orthogonal auxiliary ambient space to represent complex local dynamics. Our method achieves superior accuracy and efficiency on geometric graphs with enhanced fidelity to physical invariants.

Abstract:
Standard approaches to goalconditioned reinforcement learning (GCRL) that rely on temporal-difference learning can be unstable and sample-inefficient due to bootstrapping. While recent work has explored contrastive and supervised formulations to improve stability, we present a probabilistic alternative, called survival value learning (SVL), that reframes GCRL as a survival learning problem by modeling the time‑to‑goal from each state as a probability distribution. This perspective yields a closed‑form identity that expresses the goal‑conditioned value function as a discounted sum of survival probabilities, enabling value estimation via a hazard model trained via maximum likelihood on both event and right‑censored trajectories. We introduce three practical value estimators, including finite-horizon truncation and two binned infinite-horizon approximations to capture long-horizon objectives. Experiments on standard offline GCRL benchmarks show that SVL combined with hierarchical actors matches or surpasses strong hierarchical TD baselines, particularly excelling on complex, long‑horizon tasks.

Abstract:
3D openworld environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are essential in such settings. While existing self-supervised visual foresight reasoning approaches often suffer from multi-step error accumulation, many recent studies resort to injecting domain-specific knowledge for more stable guidance. Our key insight is that the photorealistic fidelity of visual reasoning representations is secondary; what truly matters is providing informative, task-relevant signals. To this end, we propose ResDreamer, a hierarchical world model in which each higher-level layer is trained to reconstruct the residuals of the layer below. This design enables progressive abstraction of increasingly sophisticated world dynamics and fosters the emergence of richer latent representations. Drawing inspiration from the “Bitter Lesson,” ResDreamer trains its reasoning representations in a purely self-supervised manner. The higher-level residual representations are used to modulate lower-level predictions, allowing the world model to scale effectively with only linearly increasing cross-layer communication costs. Experiments show that ResDreamer achieves state-of-the-art sample efficiency and parameter efficiency. This scalable hierarchical visual foresight reasoning architecture paves the way for more capable online RL agents in open-ended, dynamic environments.

Abstract:
Foundation models have driven rapid progress in computer vision, yet the two dominant paradigm, visionlanguage foundation models (VLMs) and vision-only foundation models (VFMs), remain only partially compatible. VLMs offer language-grounded semantic alignment but are often visually coarse, while VFMs learn discriminative perceptual geometry but lack semantic grounding. We proposeGPUA, aGeometry-Preserving Unsupervised Alignmentframework that integrates the complementary strengths of VFMs and VLMs. Inspired by cross-lingual alignment, GPUA treats VFM features as avisual languageand learns anorthogonalmapping that translates the VFM space into the VLM semantic space, preserving geometry and narrowing the modality gapwithout labels, backpropagation, or model parameter updates. GPUA is task-agnostic and requires only feature-level access to pretrained models. Experiments across diverse benchmarks demonstrate improved cross-model compatibility and strong gains in downstream zero-shot recognition and segmentation with negligible overhead.

Abstract:
Nonnegative matrix factorization (NMF) seeks a lowrank approximation X \approx UV^T with nonnegative factors and is commonly solved using interior methods that enforce feasibility throughout optimization. We show that such constraint-driven approaches can impede progress in the nonconvex landscape, leading to slow convergence or convergence to suboptimal stationary points. We propose an exterior framework for NMF (eNMF) that separates low-rank approximation from nonnegativity enforcement. Our method initializes from the optimal unconstrained factorization and introduces a rotation procedure that maps unconstrained factors to an exterior point closest to the nonnegative orthant. This viewpoint yields an algorithmic framework in which simple iterative updates converge to KKT-satisfying stationary points on the boundary of the positive orthant. The exterior formulation also enables a geometric interpretation of NMF solutions, clarifying equivalence classes of factorizations under permutation and orthogonal transformations. An intriguing numerical result, involving 400 NMF experiments across both real and synthetic datasets, show that in 99% of the cases, different algorithms tend to converge towards equivalent factor matrices. We benchmark eNMF against 9 state-of-the-art NMF algorithms with 9 initialization schemes across 3 real-world and 2 synthetic datasets. eNMF consistently outperforms all 81 competitors, achieving up to 30% lower reconstruction error under equal-time settings and up to 150% speedup under equal-error settings. The downstream experiments further demonstrate substantial performance gains in audio processing and recommendation tasks, corroborating the practical benefits of the proposed exterior optimization framework. Anonymized code is available at https://anonymous.4open.science/r/eNMF-6240/README.md

Abstract:
Audiovisual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information. Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training-free hallucination mitigation method by encouraging reliance on integrated cross-modal information within cross-modal sink tokens.

Abstract:
This study introduces an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs establish a continuous manifold, token interface, allowing visual tokens to operate as standalone entities within the architecture. Exploiting this discovery, we propose LynX, a scalable framework that integrates novel modalities by repurposing the internalized interface. Departing from conventional paradigms that necessitate heavy modalityspecific encoders or paired supervision, LynX employs a lightweight auxiliary pathway in parallel with the frozen vision encoder. By aligning both the attention responses and the statistical distributions using unimodal data alone, our method synchronizes new sensory inputs with intrinsic video priors. Crucially, our distributional alignment ensures manifold compatibility while preserving the integrity of the Video LLMs. Extensive benchmarks demonstrate that LynX achieves state-of-the-art performance and efficiency across audio-visual QA, 3D reasoning, high-frame-rate, and multi-view video understanding. The code is available at https://anonymous.4open.science/r/lynx-DDC8/.

Abstract:
Flow matching (FM) has emerged as a powerful framework for learning dynamic transport maps between two empirical distributions. However, less explored is the setting with intermediate observed marginals that can help constrain the flows between the endpoints. This "multimarginal" regime is central to modeling temporal evolution in dynamical systems in many scientific domains that can sample sequential distributions, such as biology and meteorology. We tackle this problem with a novel approach that leverages the connection between FM and dynamic optimal transport (OT), introducing timedependent potential terms in the dynamic OT action that can steer the flow towards the intermediate marginals. By extending the conditional FM learning target to incorporate these potentials, we derive an efficient, simulation-free algorithm for multimarginal FM that offers considerable flexibility in the spatiotemporal dynamics of the learned flows. We demonstrate state-of-the-art performance of OT-potential FM (OTP-FM) on diverse scientific datasets.

Abstract:
Recent Large Multimodal Models (LMMs), primarily designed for offline settings, are illsuited for the dynamic requirements of streaming video. While recent online adaptations improve real-time processing, they still face critical scalability challenges, with resource demands typically growing at least linearly with video duration. To overcome this bottleneck, we propose FlowNar, a novel framework for scalable streaming video narration. The core of FlowNar is a dynamic context management strategy for historical visual context removal, combined with our novel CLAM (Cross Linear Attentive Memory) module for streaming visual history retention, ensuring bounded visual memory usage and computational complexity, crucial for efficient streaming. We also introduce a realistic autoregressive evaluation protocol and complementary evaluation metrics to assess streaming narration models under deployment-like conditions. Experiments on Ego4D, EgoExo4D, and EpicKitchens100 datasets demonstrate that FlowNar substantially improves narration quality over strong baselines while being highly efficient, supporting processing of 10× longer videos and achieving 3× higher throughput (FPS).

Abstract:
The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, theirplasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies low smoothness. We demonstrate through theoretical analysis and comprehensive experiments that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers.

Abstract:
Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose WorldR1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.

Abstract:
Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety approaches are geared toward autoregressive models and typically rely on posthoc filtering or inference-time interventions. These are inadequate for effectively addressing safety risks in text diffusion models. We propose theSafety-Aware Denoiser (SAD), a safety-guidance framework in text diffusion models. The SAD modifies the iterative denoising process such that the text sample at the final denoising step is steered towardprovablysafe regions of the text space. This inference-time method can integrate safety constraints into the denoiser, avoiding computationally expensive retraining of the underlying diffusion model and enabling flexible, lightweight safety guidance. We evaluate the safety of the generated text using the SAD, with respect to hazard taxonomy, memorization, and jailbreak. Experimental results show that SAD substantially reduces unsafe generations while preserving generation quality, diversity, and fluency, outperforming existing methods. These results demonstrate that our safety guidance during denoising provides an effective and scalable mechanism for enforcing safety in text diffusion models.

Abstract:
Machine learning interatomic potentials (MLIPs) achieve excellent accuracy when trained on large Density Functional Theory (DFT) data. To be useful in practice, they must often be adapted to target chemistries using small and expensive taskspecific datasets. However, MLIPs transfer inconsistently across domains, with representations that often loose accessible composition and structure information. To address this, we present TriForces, a model-agnostic three-stream framework that separates composition and structure information, combined with self-supervised learning to preserve transferable representations. TriForces improves performance on MatBench and QM9 over baselines without needing DFT labels and enables efficient similar structure retrieval through its learned latent space. On OMat24, in limited-data training regime, TriForces reduces energy MAE by 57% at 20K samples only and improves force MAE across sample sizes. We release pretrained TriForces variants across multiple MLIP architectures with code at https://anonymous.4open.science/r/triforces-063E.

Abstract:
Generalizing motion representation across diverse characters remains challenging due to significant topological variations in skeletal structures across datasets and species, which hinders the development of scalable generative models. To bridge this gap, we propose a SemanticAware Topology-Agnostic framework that learns a unified latent manifold shared by disparate species. Unlike methods relying on fixed hierarchies or rigid padding strategies, our approach leverages a semantic modulation mechanism to align functional joint correspondences, thereby decoupling motion from topology. This design enables the construction of a continuous, generative-friendly motion space from large-scale, unaligned raw BVH data. Experiments on human and animal datasets demonstrate that our framework achieves high-fidelity reconstruction and supports downstream text-to-motion tasks. Notably, the model unlocks emergent capabilities, enabling zero-shot cross-species retargeting without paired data.

Abstract:
Jailbreak attacks on large models have drawn growing attention due to their close ties to societal safety. This work identifies a practical yet unexplored jailbreak scenario, the widenet-casting scenario, where an adversary can query a group of large models instead of a single one to elicit harmful outputs. Our analysis reveals substantial yet previously overlooked safety risks under this scenario. As a key part of our analysis, we further develop a novel jailbreak method tailored to the wide-net-casting scenario. With this tailored method, the jailbreak success rate can even reach 100% in some experiments when targeting the large models without additional safeguards, exposing wide-net-casting as a distinct, high-risk scenario that warrants attention in future evaluation and defense research.

Abstract:
Differentially private optimization suffers from a fundamental geometric mismatch: deep networks have highly anisotropic loss landscapes, yet DPSGD injects isotropic noise. Second-order preconditioning can resolve this, but estimating curvature typically requires private data (consuming privacy budget) or public data (introducing distribution shift). We show that the Fisher Information Matrix decouples into architectural sensitivity, recoverable via synthetic noise, and input correlations, approximable from modality-specific frequency statistics. We propose DP-KFC, which constructs KFAC preconditioners by probing networks with structured synthetic noise, requiring neither private nor public data. Empirically, DP-KFC consistently outperforms DP-SGD and adaptive baselines across diverse modalities in strong privacy regimes (\varepsilon \leq 3). DP-KFC matches private-data preconditioners while public-data variants degrade by up to 4.8 %, showing that curvature can be estimated without consuming privacy budget or introducing distribution shift. This enables privacy-preserving learning in specialized domains (e.g., medical applications) where regulatory constraints make data scarce.

Abstract:
Modelbased representations recently stand out as a promising framework that embeds latent dynamics information into the representations for downstream off-policy actor-critic learning. It implicitly combines the advantages of both model-free and model-based approaches while avoiding the training costs associated with model-based methods. Nevertheless, existing model-based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. These incur biases in representation and actor-critic learning, leading to inferior performance. To address this, we propose Debiased model-based Representations for Q-learning, tagged DR.Q algorithm. DR.Q explicitly maximizes the mutual information between the representations of the current state-action pair and the next state besides minimizing their deviations, and samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters, and the results demonstrate that DR.Q can match or surpass recent strong baselines, sometimes outperforming them by a large margin.

Abstract:
We propose HermiteNGP, a gradient-augmented multi-resolution hash encoding designed to enable fast and accurate computation of spatial derivatives for neural PDE solvers. Unlike existing NGP-based approaches that rely on automatic differentiation or finite differences and suffer from instability or high cost, Hermite-NGP explicitly stores function values and mixed partial derivatives at hash grid vertices, allowing fully analytic evaluation of gradients, Jacobians, and Hessians via Hermite interpolation. This design preserves the efficiency and spatial adaptivity of NGP while supporting exact differential operators up to second order. We further introduce a multi-resolution curriculum training strategy analogous to multigrid V-cycles to enable coarse-to-fine optimization. Across a range of 2D and 3D PDE benchmarks, Hermite-NGP achieves up to an order-of-magnitude lower error than prior neural PDE methods, and reduces wall-clock convergence time by 2 – 10× compared to other solvers, with per-epoch training times as low as 3.5 \mathrmms for models with up to 17M parameters.

Abstract:
Finetuning pretrained multimodal models improves indistribution performance but often degrades out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. We develop a theoretical framework for multimodal contrastive finetuning by introducing acontrastive target matrixthat reformulates the objective as a matrix least-squares problem, yielding closed-form solutions and a geometric decomposition of how different strategies manage pretrained knowledge. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from late-stage collapse where the teacher--student gap vanishes precisely when OOD robustness is most vulnerable. We prove that a Weighted Moving Average (WMA) teacher, which integrates the full optimization trajectory, maintains a persistent regularizing force over finite horizons, enabling bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivateTRACER(Trajectory-RobustAnchoring forContrastiveEncoderRegularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures. Comprehensive ablations across four axes (distillation components, regularization strength, update frequency, and kernel shape) confirm that TRACER is both principled and robust to hyperparameter choices.

Abstract:
Learning robust navigation policies remains a core challenge in robotics. Offline imitation learning suffers from distribution shift and compounding errors at rollout, while reinforcement learning requires reward engineering and learns inefficiently. In this paper, we propose NavOL, an online imitation learning paradigm that interacts with a simulator and updates itself using expert demonstrations gathered online. Built upon a pretrained navigation diffusion policy that maps local observations to future waypoints, NavOL trains in a rollout–update loop: during rollout, the policy acts in the simulator and queries a global planner which has privileged access to the global environment for the optimal path segment as ground truth trajectory labels; during update, the policy is trained on the online collected observation–trajectory pairs. This online imitation loop removes the need for reward design, improves learning efficiency, and mitigates distribution shift by training on the policy’s own explored rollouts. Built on IsaacLab with fast, highfidelity parallel rendering and domain randomization of camera pose and start-goal pairs, our system scales across 50 scenes on 8 RTX 4090 GPUs, collecting over 2,000 new trajectories per hour, each averaging more than 400 steps. We also introduce an indoor visual navigation benchmark with predefined start and goal positions for zero-shot generalization. Extensive evaluations on simulation benchmarks, including the NavDP benchmark and our proposed benchmark, as well as carefully designed real-world experiments, demonstrate the effectiveness of NavOL, showing consistent performance gains in online imitation learning.

Abstract:
LLMconditioned segmentation has recently advanced rapidly by coupling large language models with iterative mask generation frameworks. However, we identify a persistent failure mode in current propose-then-select pipelines. Although high-quality mask candidates are often generated, the final prediction may fail to match the given linguistic condition. This failure arises because language semantics are typically used as static prompts or post-hoc matching signals, rather than participating in the iterative mask generation process. Through systematic analysis, we show that many errors stem from semantic misalignment rather than poor mask quality. To address this issue, we propose FlowSeg, which introduces dynamic semantic guidance via a bidirectional semantic flow between intermediate decoding states and LLM-derived condition embeddings throughout the generation process. Language conditions actively guide mask refinement at each stage, while condition embeddings are progressively updated by emerging visual evidence. This design yields semantically grounded mask representations and visually aligned language conditions, enabling more reliable matching. We further incorporate a lightweight boundary-aware refinement to selectively enhance uncertain regions without perturbing confident interiors. Extensive experiments on referring expression segmentation and reasoning segmentation tasks demonstrate that FlowSeg consistently improves language–mask alignment and achieves state-of-the-art performance.

Abstract:
Human image animation has witnessed significant advancements, yet generating highfidelity hand motions remains a persistent challenge due to their high degrees of freedom and motion complexity. While reinforcement learning from human feedback, particularly direct preference optimization, offers a potential solution, it necessitates the construction of strict preference pairs. However, curating such pairs for dynamic hand regions is prohibitively expensive and often impractical due to frame-wise inconsistencies. In this paper, we propose Implicit Preference Alignment (IPA), a data-efficient post-training framework that eliminates the need for paired preference data. Theoretically grounded in implicit reward maximization, IPA aligns the model by maximizing the likelihood of self-generated high-quality samples while penalizing deviations from the pretrained prior. Furthermore, we introduce a Hand-Aware Local Optimization mechanism to explicitly steer the alignment process toward hand regions. Experiments demonstrate that our method achieves effective preference optimization to enhance hand generation quality, while significantly lowering the barrier for constructing preference data.

Abstract:
A central challenge for language agents is utilizing past experience to adapt to dynamic testtime conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initiation. Consequently, agents are forced to rely on static guidance that becomes increasingly misaligned as long-horizon tasks unfold. To address this rigidity, we propose the Adaptive Memory Agent (AdaMEM), a novel framework for agent test-time adaptation. Without updating model parameters online, AdaMEM adapts agent behavior via a hybrid memory architecture: it maintains a long-term trajectory memory of raw experiences collected offline while generating dynamic short-term strategy memory on-the-fly to guide decision-making. This mechanism enables the trade-off between token efficiency and adaptability across varying inference-time compute levels. Empirically, AdaMEM significantly outperforms static memory baselines, achieving relative gains of up to 13% on ALFWorld and 11% on WebShop. To further enhance this adaptation, we develop Step-MFT, a Step-wise Memory Fine-Tuning technique that trains the policy to synthesize high-quality strategies from retrieved experiences, yielding additional performance gains. Our work establishes a new scaling dimension for agentic memory, supporting continuous reasoning and self-evolution post-deployment in real-world environments.

Abstract:
Fast and accurate prediction of crystal properties is a central challenge in new materials design. Graph Neural Networks and transformerbased models have emerged as powerful tools for this task due to their ability to encode the local structural environment of atoms within a crystal. However, these models are data hungry and in practice labeled data for crystal properties are very scarce. Pretrain–finetuning strategies, particularly those based on diffusion models, have shown promise in addressing these limitations. In this work, we introduce a novel latent-diffusion based pretraining framework CrysLDNet designed to mitigate the data scarcity issue. Our approach integrates a Variational Autoencoder (VAE) with a diffusion model during the pretraining stage. The VAE encoder maps 3D crystal structures into a smooth latent space, within which the diffusion process is applied. This latent diffusion pretraining enables the graph encoder to effectively capture structural and chemical semantics from large-scale unlabeled data, which can then be finetuned for specific property prediction tasks. Comprehensive experiments on popular DFT datasets for property prediction reveal that CrysLDNet significantly outperforms both training-from-scratch and pretrained baselines, with improvements of 4.26% and 4.90% on JARVIS and MP datasets. Additionally, the learned representations remain robust under sparse data conditions and are expressive enough to correct DFT errors when finetuned with limited experimental data.

Abstract:
Current 3D spatial reasoning methods face a fundamental tradeoff: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs; end-to-end 3D multi-modal LLMs (3D MLLMs) could handle complex natural language and open-vocabulary concepts but suffer from black-box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM that bridges these paradigms by distilling symbolic reasoning patterns into MLLMs with natural language chain-of-thought (CoT). Our three-stage curriculum progressively builds reasoning capabilities: (1) 3D perception alignment that grounds object visual-geometric features to the LLM's textual embedding space, (2) CoT-SFT that teaches systematic query decomposition and stepwise spatial verification from symbolic program traces, and (3) CoT-RL that extends learned reasoning patterns to open-set concepts and deeply nested instructions. This transfers reasoning patterns rather than concept-specific knowledge, preserving key NS3D virtues: transparent reasoning traces and modular interchangeability of planning and perception components. Extensive evaluations demonstrate that APEIRIA surpasses previous NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning benchmarks, presenting a unified framework combining the systematic reasoning of symbolic methods with the flexibility of modern LLMs.

Abstract:
Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are illsuited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we propose to reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, ensuring the preserved geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. An efficient two-stage discrete optimization strategy is proposed to achieve this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our approach achieves superior fidelity and distributional coverage in diffusion model training. Codes are available at https://anonymous.4open.science/r/ICML2026_4092.

Abstract:
Recent work has demonstrated the potential of contrastive steering for jailbreaking Large Language Models (LLMs). However, existing methods rely on limited and inherently biased contrastive prompts and require laborious manual tuning of steering strength, limiting their robustness and effectiveness. In this paper, we leverage the idea of model extraction to guide the learned steering vectors to approximate the ideal one and propose tuning the steering strength adaptively based on contrastive activations' statistics. Experiments demonstrate that our method notably improves the effectiveness and robustness of probebased steering, without any extra contrastive prompts or laborious manual tuning. Being an attack paper, this paper focuses on revealing the breakdown of fortified LLMs, raising the average harmfulness score from 6% to 70%.

Abstract:
Multiagent LLM systems can improve reasoning and tool use, yet recent evidence shows their gains are often unstable and sensitive to interaction design. A promising direction is to \emphtrain collaboration, but team post-training introduces a moving-target effect: when agents interact through a shared context, updating one agent shifts the context distribution faced by the others, which can regress coordination under naive sequential updates. We propose TeamTR, a trust-region framework for fine-tuning heterogeneous LLM teams that explicitly controls this \emphoccupancy shift. TeamTR evaluates each agent update on rollouts from the \emphintermediate team induced by partially applied updates, and enforces per-agent trust regions via a token-decomposed \emphreverse KL that is directly monitorable from those rollouts. This yields population-level per-update and per-stage \emphimprovement lower bounds whose functional form applies to any realized update order, and motivates a practical certificate \emphproxy computed from logged surrogates and KL terms. We instantiate TeamTR for router-based text handoff with sequence-level returns and bounded group-normalized advantages, and show empirically that it mitigates coordination regressions, improves training stability across heterogeneous teams, and supports modular component replacement via a trust-region alignment step.

Abstract:
Link prediction is a core challenge in graph machine learning, demanding models that capture rich and complex topological dependencies. While Graph Neural Networks (GNNs) are the standard solution, stateof-the-art pipelines often rely on explicit structural heuristics or memory-intensive node embeddings—approaches that struggle to generalize or scale to massive graphs. Emerging Graph Transformers (GTs) offer a potential alternative but often incur significant overhead due to complex structural encodings, hindering their applications to large-scale link prediction. We challenge these sophisticated paradigms with PENCIL, an encoder-only plain Transformer that replaces hand-crafted priors with attention over sampled local subgraphs, retaining the scalability and hardware efficiency of standard Transformers. Through experimental and theoretical analysis, we show that PENCIL extracts richer structural signals than GNNs, implicitly generalizing a broad class of heuristics and subgraph-based expressivity. Empirically, PENCIL outperforms heuristic-informed GNNs and is far more parameter-efficient than ID-embedding–based alternatives, while remaining competitive across diverse benchmarks—even without node features. Our results challenge the prevailing reliance on complex engineering techniques, demonstrating that simple design choices are potentially sufficient to achieve the same capabilities.

Abstract:
Modelbased reinforcement learning (RL) offers a compelling approach to offline RL by enabling value learning on imagined on-policy trajectories. However, it often suffers from compounding errors due to repeated model inference. While geometric horizon models (GHM) alleviate this issue through direct prediction over a discounted infinite-horizon future, they remain challenged in accurately modeling distant future states. To this end, we introduce universal horizon models (UHM), a generalization of GHM that directly predicts future states under arbitrary horizons. Leveraging this flexibility, we propose a scalable value learning method that employs a winsorized horizon distribution to stabilize training by capping excessively large horizons. Experimental results on 100 challenging OGBench tasks demonstrate that the proposed method outperforms competitive baselines, particularly on tasks with highly sub-optimal datasets and those requiring long-horizon reasoning.

Abstract:
Understandinghowneural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifyingcircuits—minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer outof-distribution, raising doubts whether they capture concept or dataset-specific artifacts. We introduceCertified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that circuit component inclusion decisions are invariant to bounded edit-distance perturbations of the concept dataset. Unstable neurons are abstained from, yielding circuits that are more compact and more accurate. On ImageNet and OOD datasets, certified circuits achieve up to 91% higher accuracy while using 45% fewer neurons, and remain reliable where baselines degrade.Certified Circuitsputs circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code released upon publication.

Abstract:
Disease screening is critical for early detection and timely intervention in clinical practice. However, current screening models for medical images often suffer from limited interpretability and suboptimal performance, as they lack mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages regionlevel evidence retrieved from knowledge banks of historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Based on the evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. In addition, unlike conventional methods that rely on post-hoc saliency maps, EviScreen enables advanced localization interpretability through abnormality maps generated via contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall, which reduces both unnecessary follow-up examinations and the associated psychological burden on patients. Code will be publicly available and is currently provided in supplementary material.

Abstract:
Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in finedetail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency—particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig.2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios.

Abstract:
Generalizing pretrained models to unseen datasets without retraining is a central challenge toward foundation models. Achieving fully inductive inference on numerical data is particularly difficult due to large variations in feature dimensionality and semantics across datasets. We observe that, in the presence of graph structure, numerical data admits a distinct structureinduced representational axis beyond the feature space, which we formalize as theview space. This view space enables a unified representation of graphs with heterogeneous features and motivatesGraph View Transformation(GVT), a class of parametric mappings that can be shared across arbitrary graphs. We instantiate this framework with Recurrent GVT, an architecture for fully inductive node representation learning in node classification. Pretrained on OGBN-Arxiv and evaluated on 27 benchmarks, Recurrent GVT outperforms GraphAny, the prior fully inductive graph model, by +8.93%, and surpasses 12 individually tuned GNNs by at least +3.30%. These results establish the view space as a principled and practical foundation for learning across graphs with heterogeneous feature spaces. Code, datasets, and checkpoints are available at https://anonymous.4open.science/r/view-space.

Abstract:
Literature reveals that a Large Language Model's (LLM) behavior is not only conditioned by its original weights but also its instancelevel parameters, such as instructional prompt, sampling configuration or quantization. A model that generates safe outputs under one configuration may produce toxic content under another. However, current LLM identification techniques (such as fingerprinting) focus on intellectual property protection, and their design favors robustness to changes in these instance-level parameters. This poses a critical challenge for AI regulation in which compliance assessments target actual deployed behaviors, not model provenance. In this paper, we introduce instance-level fingerprinting, a regulator-oriented paradigm that distinguishes configurations of the same LLM. Our method (FLIPS) achieves 90% identification accuracy across 205 model instances by exploiting biases in binary random generated sequences, compared to 35% for the adapted baseline LLMmap. Our results demonstrate that instance-level fingerprinting is not only necessary for regulation, but also practically feasible.

Abstract:
Effective removal of semantic watermarks requires balancing three competing objectives: \emphhigh removal success, \emphlow perceptual distortion, and \emphlow computational cost. However, existing singleimage attacks typically optimize only for the first two, achieving strong watermark suppression but relying on expensive, multi-step optimization that limits practical deployment. In this work, we show that this trade-off is fundamental: no current approach achieves all three properties simultaneously. We introduce \textscDAWN, a lightweight, training-free attack that explicitly targets the low-cost regime while maintaining competitive removal performance. \textscDAWN works by projecting a watermarked image onto natural-image priors in complementary frequency and semantic spaces, suppressing watermark signals that deviate from natural statistics, and then applying a decoupled perceptual-alignment step to restore visual consistency with minimal artifact. Across diverse pixel-, frequency-, and latent-space watermarking schemes, \textscDAWN consistently reduces detectability while preserving structural and semantic fidelity, demonstrating that efficient, low-resource watermark removal is feasible with only modest perceptual degradation. Our code is available at \urlhttps://anonymous.4open.science/r/DAWN-567A/.

Abstract:
Existing VisionLanguage-Action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. How can we incorporate 3D information to enhance VLA capabilities? We conduct a pilot study across different observation spaces and visual representations. The results show that explicitly lifting visual input into point clouds yields representations that better complement their corresponding 2D representations. To address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases, we propose Any3D-VLA. It unifies the simulator, sensor, and model-estimated point clouds within a training pipeline, constructs diverse inputs, and learns domain-agnostic 3D representations that are fused with the corresponding 2D representations. Simulation and real-world experiments demonstrate Any3D-VLA's advantages in improving performance and mitigating the domain gap. Our code is available at https://anonymous.4open.science/r/Any3D-VLA-anonymous.

Abstract:
Large kernel convolutions offer a scalable alternative to vision transformers for highresolution 3D volumetric analysis, yet naïvely increasing kernel size often leads to optimization instability. Motivated by the spatial bias inherent in effective receptive fields (ERFs), we theoretically demonstrate that structurally re-parameterized blocks induce spatially varying learning rates that are crucial for convergence. Leveraging this insight, we introduce Rep3D, a framework that employs a lightweight modulation network to generate receptive-biased scaling masks, adaptively re-weighting kernel updates within a plain encoder architecture. This approach unifies spatial inductive bias with optimization-aware learning, avoiding the complexity of multi-branch designs while ensuring robust local-to-global convergence. Extensive evaluations on five 3D segmentation benchmarks demonstrate that Rep3D consistently outperforms state-of-the-art transformer and fixed-prior baselines.

Abstract:
Textto-SQL has recently achieved impressive progress, yet remains difficult to apply effectively in real-world scenarios. This gap stems from the reliance on single static workflows, fundamentally limiting scalability to out-of-distribution and long-tail scenarios. Instead of requiring users to select suitable methods through extensive experimentation, we attempt to enable systems to adaptively construct workflows at inference time. Through rigorous theoretical and empirical analysis, we demonstrate that optimal dynamic policies consistently outperform the best static workflow, with performance gains fundamentally driven by heterogeneity across candidate workflows. Motivated by this, we propose SquRL, a reinforcement learning framework that enhances LLMs' reasoning capability in adaptive workflow construction. We design a rule-based reward function and introduce two effective training mechanisms: dynamic actor masking to encourage broader exploration, and pseudo rewards to improve training efficiency. Experiments on widely-used Text-to-SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out-of-distribution queries.

Abstract:
Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into a single objective. We seek to answer what it would take to simultaneously align a model across various domains spanning those with: verifiable rewards, nonverifiable subjective preferences, and complex interactive scenarios. Such multi-objective alignment setups are often plagued by the individual objectives being at odds with each other, resulting in inefficient training and little user control during inference. To address these issues, we propose a unified framework that standardizes PRM training across verifiable and non-verifiable settings for step-level supervision, performs vectorized multi-objective alignment with Multi-Action-Head DPO, and enables controllable inference via objective-specific weighting and PRM-guided decoding. Experiments across math reasoning, human value alignment, and multi-turn tutoring show that our framework jointly improves multiple objectives simultaneously with limited interference, while remaining generalizable and adaptable across domains and offering flexible user control at inference time.

Abstract:
Deep learning has achieved strong performance in Time Series Forecasting (TSF). However, we identify a critical representation paradox, termed Latent Chaos: models with accurate predictions often learn latent representations that are temporally disordered and lack continuity. We attribute this phenomenon to the dominant observationspace forecasting paradigm. Most TSF models minimize point-wise errors on noisy and partially observed data, which encourages shortcut solutions instead of the recovery of underlying system dynamics. To address this issue, we propose Latent Time Series Forecasting (LatentTSF), a novel paradigm that shifts TSF from observation regression to latent state prediction. Specifically, LatentTSF employs an AutoEncoder to project observations at each time step into a higher-dimensional latent state space. This expanded representation aims to capture underlying system variables and impose a smoother temporal structure. Forecasting is then performed entirely in the latent space, allowing the model to focus on learning structured temporal dynamics. Theoretical analysis demonstrates that our proposed latent objectives implicitly maximize mutual information between predicted latent states and ground-truth states and observations. Extensive experiments on widely-used benchmarks confirm that LatentTSF effectively mitigates latent chaos, achieving superior performance. Our code is available inhttps://anonymous.4open.science/r/LatentTSF-CC99.

Abstract:
VisionLanguage Navigation (VLN) requires agents to follow natural language instructions by grounding them in sequential visual observations over long horizons. Explicit reasoning could enhance temporal consistency and perception–action alignment, but reasoning at fixed steps often leads to suboptimal performance and unnecessary computation. To address this, we propose AdaNav, an uncertainty-based adaptive reasoning framework for VLN. At its core is the Uncertainty-Adaptive Reasoning Block (UAR), a lightweight plugin that dynamically triggers reasoning. We introduce Action Entropy as a policy prior for UAR and progressively refine it through a Heuristics-to-RL training method, enabling agents to learn difficulty-aware reasoning policies under the strict data limitations of embodied tasks. Results show that with only 6K training samples, AdaNav achieves substantial gains over closed-source models trained on million-scale data, improving success rate by 20% on R2R val-unseen, 11.7% on RxR-CE, and 11.4% in real-world scenes.

Abstract:
Parameterefficient continual learning aims to adapt pre-trained models to sequential tasks without forgetting previously acquired knowledge. Most existing approaches treat continual learning as avoiding interference with past updates, rather than considering what properties make the current task-specific update naturally preserve previously acquired knowledge. From a knowledge-decomposition perspective, we observe that low-rank adaptations exhibit highly imbalanced singular value spectra: a few dominant components absorb most of the adaptation energy, thereby (i) more likely to disrupt previously acquired knowledge and (ii) making the update more vulnerable to interference from subsequent tasks. To enable explicit balance among components, we decouple themagnitudeof the task update from itsdirectional structureand formulate it as a constrained optimization problem on a restricted Stiefel manifold. We address this problem using a projected first-order method compatible with standard deep-learning optimizers used in vision-language models. Our method mitigates both backward and forward forgetting, consistently outperforming continual learning baselines. Source code is available in supplementary material.

Abstract:
Recent work has explored optimizing LLM collaboration through MultiAgent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues, so we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose 2 MAAC approaches, CoLLM-CC with a Centralized Critic and CoLLM-DC with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge.

Abstract:
We introduce Alterbute, a diffusionbased method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.

Abstract:
Relational databases (RDBs) contain vast amounts of heterogeneous tabular information that can be exploited for predictive modeling purposes. But since the space of potential targets is vast across enterprise settings, how can we avoid retraining a new model each time we wish to predict a new quantity of interest? Foundation models based on incontext learning (ICL) offer a convenient option, but so far are largely restricted to single-table operability. In generalizing to multiple interrelated tables, it is essential to compress variably-sized RDB neighborhoods into fixed-length ICL samples for consumption by the decoder. However, the details here are critical: unlike existing supervised learning RDB pipelines, we provide theoretical and empirical evidence that ICL-specific compression should be constrained within high-dimensional RDB columns where all entities share units and roles, not across columns where the relevance of heterogeneous data types cannot possibly be determined without label information. Conditioned on this restriction, we then demonstrate that encoder expressiveness is actually not compromised by excluding trainable parameters. Hence we arrive at a principled family of RDB encoders that can be seamlessly paired with the most powerful single-table ICL foundation models. From a practical standpoint, we develop scalable SQL primitives to implement the encoder stage, resulting in an easy-to-use open-source RDB foundation model capable of robust performance on completely new datasets out of the box, with no training or fine-tuning required.

Abstract:
Equivariant Graph Neural Networks (EGNNs) have become a widely used approach for modeling 3D atomistic systems. However, mainstream architectures face critical scalability bottlenecks due to the explicit construction of geometric features or dense tensor products on every edge. To overcome this, we introduce E2FormerV2, a scalable architecture that integrates algebraic sparsity with hardware-aware execution. We introduce Equivariant Axis-Aligned Sparsification (EAAS), which leverages an \mathrmSO(3)\!\rightarrow\!\mathrmSO(2) change of basis to convert dense Wigner-6j tensor contractions into sparse, parity-based re-indexing operations. Building on this representation, we propose On-the-Fly Equivariant Attention, a fully node-centric mechanism implemented via a fused Triton kernel. By eliminating materialized edge tensors and maximizing SRAM utilization, our kernel achieves up to 20× higher TFLOPS than standard implementations. Experiments on SPICE and OMol25 show that E2Former-V2 preserves predictive accuracy while substantially accelerating inference, demonstrating the practical feasibility of large equivariant transformers on commodity GPUs.

Abstract:
LLMs excel at reasoning, but validating their steps remains challenging. Formal verification offers a solution through mechanically checkable proofs. Interactive theorem provers (ITPs) dominate mathematical reasoning but require detailed lowlevel proof steps, while auto-active verifiers offer automation but focus on software verification. Recent work has begun bridging this divide by evaluating LLMs for software verification in ITPs, but the complementary direction—LLMs for mathematical theorem proving in auto-active verifiers—remains unexplored. We present MINIF2F-DAFNY, the first translation of the widely-used mathematical benchmark miniF2F to an auto-active verifier: Dafny. We find that Dafny's automation alone solves 39-44% of problems with empty proofs, whereas many require substantial proof guidance in ITPs. For remaining problems, we evaluate 7 off-the-shelf LLMs, achieving 55.7% success with the best model (Claude Sonnet 4.5) using modest resources. These results demonstrate effective division of labor: LLMs provide high-level guidance while automation handles low-level details.

Abstract:
We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single selfsupervised point transformer encoder across heterogeneous domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.

Abstract:
Diffusion language models theoretically allow for efficient parallel generation but are practically hindered by the "factorization barrier": the assumption that simultaneously predicted tokens are independent. This limitation forces a tradeoff: models must either sacrifice speed by resolving dependencies sequentially or suffer from incoherence due to factorization. We argue that this barrier arises not from limited backbone expressivity, but from a structural misspecification: models are restricted to fully factorized outputs because explicitly parameterizing a joint distribution would require the Transformer to output a prohibitively large number of parameters. We proposeCoupledDiscreteDiffusion (CoDD), a hybrid framework that breaks this barrier by replacing the fully-factorized output distribution with a lightweight, tractable probabilistic inference layer. This formulation yields a distribution family that is significantly more expressive than standard factorized priors, enabling the modeling of complex joint dependencies, yet remains compact enough to avoid the prohibitive parameter explosion associated with full joint modeling. Empirically,CoDDseamlessly enhances diverse diffusion language model architectures with negligible overhead, matching the reasoning performance of computationally intensive Reinforcement Learning baselines at a fraction of the training cost. Furthermore, it prevents performance collapse in few-step generation, enabling high-quality outputs at significantly reduced latencies.

Abstract:
We develop a continual learning method for pretrained models that \emphrequires no access to oldtask data, addressing a practical barrier in foundation model adaptation where pretraining distributions are often unavailable. Our key observation is that pretrained networks exhibit substantial \emphgeometric redundancy, and that this redundancy can be exploited in two complementary ways. First, redundant neurons provide a proxy for dominant pretraining-era feature directions, enabling the construction of approximately protected update subspaces directly from pretrained weights. Second, redundancy offers a natural bias for \emphwhere to place plasticity: by restricting updates to a subset of redundant neurons and constraining the remaining degrees of freedom, we obtain update families with reduced functional drift on the old-data distribution and improved worst-case retention guarantees. These insights lead to \textscPLATE (Plasticity-Tunable Efficient Adapters), a continual learning method requiring no past-task data that provides explicit control over the plasticity-retention trade-off. PLATE parameterizes each layer with a structured low-rank update \Delta W = B A Q^\top, where B and Q are computed once from pretrained weights and kept frozen, and only A is trained on the new task.

Abstract:
Neural simulators promise efficient surrogates for physics simulation, but scaling them is bottlenecked by the prohibitive cost of generating highfidelity training data. Pre-training on abundant off-the-shelf geometries offers a natural alternative, yet faces a fundamental gap: supervision on static geometry alone ignores dynamics and can lead to negative transfer on physics tasks. We present GeoPT, a unified pre-trained model for general physics simulation based on lifted geometric pre-training. The core idea is to augment geometry with synthetic dynamics, enabling dynamics-aware self-supervision without physics labels. Pre-trained on over one million samples, GeoPT consistently improves industrial-fidelity benchmarks spanning fluid mechanics for cars, aircraft, and ships, and solid mechanics in crash simulation, reducing labeled data requirements by 20-60% and accelerating convergence by 2×. These results show that lifting with synthetic dynamics bridges the geometry-physics gap, unlocking a scalable path for neural simulation.

Abstract:
We propose a new algorithm enabling stable training under lowprecision computations. We call this algorithm Log-normal Multiplicative Dynamics (LMD), and derive it by taking inspiration from the robustness of biological neural networks. Such networks use synapses whose size follow log-normal distribution and whose fluctuations follow noisy multiplicative dynamics. There has been no scalable algorithm to train modern deep networks that incorporates all such synaptic properties. We are able to include them in LMD by using a variational formulation where a log-normal posterior distribution is used. We show several results that involve low-precision matrix multiplications in the forward passes. This includes results for training Vision Transformer and GPT-2 from scratch. Our findings suggest that biologically inspired multiplicative dynamics offer a promising direction for future energy-efficient hardware.

Abstract:
Recent advancements in foundation models have revolutionized joint audiovideo generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.

Abstract:
In this work, we revisit Transformer optimization through the lens of secondorder geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. Then, by analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B. Empirically, SimpleGPT, our SimpleNorm-based network, tolerates learning rates 3×-10× larger than standard convention, consistently demonstrates strong optimization stability, and achieves substantially better performance than well-established baselines. Specifically, when training 7B-scale models for 60K steps, SimpleGPT achieves a training loss that is 0.08 lower than that of LLaMA2 with QKNorm, reducing the loss from 2.290 to 2.208.

Abstract:
As the demand for emotional intelligence in large language models (LLMs) grows, a key challenge lies in understanding the internal mechanisms that give rise to emotional expression and in controlling emotions in generated text. This study addresses three core questions: (1) Do LLMs contain contextagnostic mechanisms shaping emotional expression? (2) What form do these mechanisms take? (3) Can they be harnessed for universal emotion control? We first construct a controlled dataset, SEV (Scenario–Event with Valence), to elicit comparable internal states across emotions. Subsequently, we extract context-agnostic emotion directions that reveal consistent, cross-context encoding of emotion (Q1). We identify neurons and attention heads that locally implement emotional computation through analytical decomposition and causal analysis, and validate their causal roles via ablation and enhancement interventions. Next, we quantify each sublayer's causal influence on the model’s final emotion representation and integrate the identified local components into coherent global emotion circuits that drive emotional expression (Q2). Directly modulating these circuits achieves 99.65% emotion-expression accuracy on the test set, surpassing prompting- and steering-based methods (Q3). To our knowledge, this is the first systematic study to uncover and validate emotion circuits in LLMs, advancing both model interpretability and the controllability of emotional intelligence.

Abstract:
Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widelyused fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we proposeDynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduceDSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs.

Abstract:
Large Language Models (LLMs) have been demonstrating strong reasoning capability with their chainof-thoughts (CoT), which are routinely used by humans to judge answer quality. This reliance creates a powerful yet fragile basis for trust. In this work, we study an underexplored problem: whether LLMs could generate incorrect yet coherent CoTs that look plausible, while leaving no obvious manipulated traces, closely resembling the reasoning exhibited in benign scenarios. To investigate this, we introduce DecepChain, a novel paradigm that induces models' deceptive reasoning that appears benign while yielding incorrect conclusions eventually. At a high level, DecepChain exploits LLMs' own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts from the model itself. Then, it reinforces it via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a rule-based format reward to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, the deception ability brought by DecepChain achieves high effectiveness with minimal performance degradation on benign scenarios. Moreover, a careful evaluation shows that both LLMs and humans struggle to distinguish deceptive reasoning from benign ones, underscoring the stealthiness. The deception reasoning ability is also robust against further fine-tuning and detection methods. Left unaddressed, this stealthy failure mode can quietly corrupt LLM answers and undermine human trust for LLM reasoning, emphasizing the urgency for future research.

Abstract:
Posttraining quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths \leq 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn–Knopp–style algorithm that finds scales to normalize per-row and per-column variances. We show that this approximates activation-aware quantization by recovering column scales from the weight matrix structure that are predictive of the typical activation magnitudes the matrix received during training. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layer. We evaluate our method on the Qwen3 model family, among others. SINQ reduces the perplexity gap on WikiText2 and C4 by over 50% against uncalibrated uniform quantization baselines, incurs zero to negligible compute overhead, and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code is available in the supplementary.

Abstract:
In mechanistic interpretability, recent work scrutinizes transformer “circuits”—sparse, mono or multi layer sub computations, that may reflect human understandable functions. Yet, these network circuits are rarely acidtested for their stability across different instances of the same deep learning architecture. Without this, it remains unclear whether reported circuits emerge universally across labs or turn out to be idiosyncratic to a particular estimation instance, potentially limiting confidence in safety-critical settings. Here, we systematically study stability across-refits in increasingly complex transformer language models of various sizes. We quantify, layer by layer, how similarly attention heads learn representations across independently initialized training runs. Our rigorous experiments show that (1) middle-layer heads are the least stable yet the most representationally distinct; (2) deeper models exhibit stronger mid-depth divergence; (3) unstable heads in deeper layers become more functionally important than their peers from the same layer; (4) applying weight decay optimization substantially improves attention-head stability across random model initializations; and (5) the residual stream is comparatively stable. Our findings establish the cross-instance robustness of circuits as an essential yet underappreciated prerequisite for scalable oversight, drawing contours around possible white-box monitorability of AI systems.

Abstract:
MultiAgent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories. Process verification, which evaluates intermediate steps in trajectories, has shown promise in general reasoning settings, and has been suggested as a potential tool for guiding coordination of MAS; however, its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS-ProVe, a systematic empirical study of process verification for multi-agent systems (MAS). Our study spansthree verification paradigms(LLM-as-a-Judge, reward models, and process reward models), evaluated acrosstwo levels of verification granularity(agent-level and iteration-level). We further examinefive representative verifiersandfour context management strategies,and conduct experiments oversix diverse MAS frameworkson multiple reasoning benchmarks. We find that process-level verification does not consistently improve performance and frequently exhibits high variance, highlighting the difficulty of reliably evaluating partial multi-agent trajectories. Among the methods studied, LLM-as-a-Judge generally outperforms reward-based approaches, with trained judges surpassing general-purpose LLMs. We further observe a small performance gap between LLMs acting as judges and as single agents, and identify a context-length-performance trade-off in verification. Overall, our results suggest that effective and robust process verification for MAS remains an open challenge, requiring further advances beyond current paradigms.

Abstract:
We study how reliably sparse autoencoders (SAEs) support claims about reasoningrelated internal features in large language models. We first give a stylized analysis showing that sparsity-regularized decoding can preferentially retain stable low-dimensional correlates while suppressing high-dimensional within-behavior variation, motivating the possibility that contrastively selected "reasoning" features may concentrate on cue-like structure when such cues are coupled with reasoning traces. Building on this perspective, we propose a falsification-based evaluation framework that combines causal token injection with LLM-guided counterexample construction. Across 22 configurations spanning multiple model families, layers, and reasoning datasets, we find that many contrastively selected candidates are highly sensitive to token-level interventions, with 45%–90% activating after injecting only a few associated tokens into non-reasoning text. For the remaining context-dependent candidates, LLM-guided falsification produces targeted non-reasoning inputs that trigger activation and meaning-preserving paraphrases of top-activating reasoning traces that suppress it. A small steering study yields minimal changes on the evaluated benchmarks. Overall, our results suggest that, in the settings we study, sparse decompositions can favor low-dimensional correlates that co-occur with reasoning, underscoring the need for falsification when attributing high-level behaviors to individual SAE features.

Abstract:
Evaluating agentic AI on openended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate diverse valid response strategies, while LLM-as-a-judge approaches adapt to individual responses yet suffer from instability and bias. Human experts address this dilemma by combining domain-grounded principles with dynamic, claim-level assessment. Inspired by this process, we propose JADE, a two-layer evaluation framework. Layer 1 encodes expert knowledge as a predefined set of evaluation skills, providing stable evaluation criteria. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies, with evidence-dependency gating to invalidate conclusions built on refuted claims. Experiments on BizBench show that JADE improves evaluation stability and reveals critical agent failure modes missed by holistic LLM-based evaluators. We further demonstrate strong alignment with expert-authored rubrics and effective transfer to a medical-domain benchmark, validating JADE across professional domains.

Abstract:
Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of selfattention. Recent LA models such as Mamba2 and GDN interpret linear recurrences as closed-form online stochastic gradient descent (SGD), but naive SGD updates suffer from rapid information decay and suboptimal convergence in optimization. While momentum-based optimizers provide a natural remedy, they pose challenges in simultaneously achieving training efficiency and effectiveness. To address this, we develop a chunkwise parallel algorithm for LA with a stepwise momentum rule by geometrically reordering the update coefficients. Further, from a dynamical systems perspective, we analyze the momentum-based recurrence as a second-order system that introduces complex conjugate eigenvalues. This analysis guides the design of stable gating constraints. The resulting model, Momentum DeltaNet (MDN), leverages Triton kernels to achieve comparable training throughput with competitive linear models such as Mamba2 and KDA. Extensive experiments on the 400M and 1.3B parameter models demonstrate consistent performance improvements over strong baselines, including Transformers, Mamba2 and GDN, across diverse downstream evaluation benchmarks.

Abstract:
Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a samplespecific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can activate the target concept. Within this plane, only specific sensitive sectors strongly facilitate concept activation, while other sectors can suppress or delay it. While the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, introducing intrinsic uncertainty at the sector level. This uncertainty provides a principled explanation for why steering outcomes often fluctuate even when using well-aligned directions. Our experiments verify the existence of the cylindrical structure and demonstrate that CRH provides a valid and practical way to interpret model steering behavior in real settings: \urlhttps://anonymous.4open.science/r/CRH.

Abstract:
This paper proposes a suite of rationality measures and associated theory for reinforcement learning agents, a property increasingly critical yet rarely explored. We define an action in deployment to be perfectly rational if it maximises the hidden true value function in the steepest direction. The expected value discrepancy of a policy's actions against their rational counterparts, culminating over the trajectory in deployment, is defined to be expected rational risk; an empirical average version in training is also defined. Their difference, termed as rational risk gap, is decomposed into (1) an extrinsic component caused by environment shifts between training and deployment, and (2) an intrinsic one due to the algorithm's generalisability in a dynamic environment. They are upper bounded by, respectively, (1) the 1Wasserstein distance between transition kernels and initial state distributions in training and deployment, and (2) the empirical Rademacher complexity of the value function class. Our theory suggests hypotheses on the benefits from regularisers (including layer normalisation, \ell_2 regularisation, and weight normalisation) and domain randomisation, as well as the harm from environment shifts. Experiments are in full agreement with these hypotheses.

Abstract:
Machine Learning surrogates for Computational Fluid Dynamics (CFD), particularly Graph Neural Networks (GNNs) and Transformers, have become a new important approach for accelerating physics simulations. However, we identify a critical bottleneck in the field: while architectures have advanced significantly, the underlying training paradigms remain bound to naive assumptions, such as nodewise supervision and explicit Euler time-stepping. These legacy choices ignore the stiff dynamics and local flux continuity inherent to numerous partial differential equations resolution methods, such as Finite Element, Difference, or Volume (FEM). In this work, we propose a unified framework to bridge the gap between geometric deep learning and rigorous numerical analysis. We introduce three key innovations: (1) Multi Node Prediction, a novel objective that predicts field values for a node's full local topology, enforcing spatial derivative consistency; (2) Temporal Correction, replacing unstable explicit schemes with a predictor–corrector via temporal Cross-Attention; and (3) Geometric Inductive Biases, leveraging 3D Rotary Positional Embeddings (RoPE) to robustly capture rotational symmetries in unstructured meshes. We evaluate this framework across three architectures (MeshGraphNet, Transolver, and a Transformer) on diverse physics datasets. Our approach yields consistent improvements in accuracy and stability, particularly in long-horizon rollouts, while producing latent representations that generalize to unseen subtasks such as Wall Shear Stress or Pressure prediction.

Abstract:
Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. While scaling parameters and testtime computation has driven progress, a key bottleneck is the lack of high-quality training problems: human-curated datasets are costly and limited, while existing synthetic corpora are often too easy or narrow. PromptCoT showed that injecting rationales into prompt synthesis increases problem difficulty. Building on this, we present PromptScale, a scalable framework that replaces hand-crafted heuristics with an expectation-maximization (EM) loop, where rationales are iteratively refined to guide prompt construction. This produces problems that are both harder and more diverse than prior corpora. The synthetic prompts support two post-training regimes: (1) \emphSelf-Play, where strong models improve autonomously via verifiable feedback without stronger teachers; and (2) \emphSupervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled traces. Extensive experiments demonstrate the effectiveness of this approach. In self-play, applying PromptScale to \textttQwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results \emphat the 30B scale, with +4.4, +4.8, and +5.3 on AIME 24/25 and HMMT 25, +6.1 and +5.0 on LiveCodeBench v5/v6, and +35 Elo on Codeforces. In SFT, training \textttQwen2.5-7B-Instruct solely on synthetic prompts boosts accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5), surpassing models trained on human or hybrid data. Analyses further confirm that PromptScale yields fundamentally harder and distributionally distinct problems. These results establish prompt synthesis as a new axis for scaling reasoning and position PromptScale as a scalable foundation for future open-source models.

Abstract:
Multiturn human-AI collaboration is fundamental to deploying interactive services such as adaptive tutoring, conversational recommendation, and professional consultation. However, optimizing these interactions via reinforcement learning is hindered by the sparsity of verifiable intermediate rewards and the high stochasticity of user responses. To address these challenges, we introduce Implicit Turn-wise Policy Optimization (ITPO). ITPO leverages an implicit process reward model to derive fine-grained, turn-wise process rewards from sparse outcome signals. Unlike volatile token-level rewards, these turn-level signals exhibit superior robustness and may utilize a normalization mechanism to further enhance training stability. We evaluate ITPO across three representative multi-turn collaborative tasks: math tutoring, document writing, and medical recommendation. Empirical results demonstrate that ITPO, when combined with PPO, GRPO, or RLOO, consistently achieves improved convergence than existing baselines. Elaborate trajectory analysis confirms that ITPO infers turn-wise preferences that are semantically aligned with human judgment.

Abstract:
Human gameplay is a visually grounded interaction loop in which players act, reflect on failures, and watch tutorials to refine strategies. Can VisionLanguage Models (VLMs) also learn from video-based reflection? We presentGameVerse, a comprehensive video game benchmark that enables areflective visual interaction loop. Moving beyond traditionalfire-and-forgetevaluations, it uses a novelreflect-and-retryparadigm to assess how VLMs internalize visual experience and improve policies. To facilitate systematic and scalable evaluation, we also introduce acognitive hierarchical taxonomyspanning 15 globally popular games,dual action spacefor both semantic and GUI control, andmilestone evaluationusing advanced VLMs to quantify progress. Our experiments show that VLMs benefit from video-based reflection in varied settings, and perform best by combining failure trajectories and expert tutorials—atraining-freeanalogue to reinforcement learning (RL) plus supervised fine-tuning (SFT).

Abstract:
Latent diffusion models have enabled highquality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6× speedup while maintaining the reconstruction performance up to 96.9%. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.

Abstract:
Biosignals acquired from different locations on the body often provide temporally ordered views of the same underlying physiological process. However, most existing selfsupervised learning methods treat these signals as interchangeable views, overlooking the directional temporal dynamics that link them. A canonical example is the relationship between electrocardiography (ECG), which captures the electrical activation initiating each heartbeat, and photoplethysmography (PPG), which records the resulting peripheral pulse delayed by vascular dynamics. To capture this structured relationship, we introduce xMAE, a biosignal pretraining framework that leverages masked cross-modal reconstruction across temporally ordered biosignals as a training-time constraint to encourage physiologically meaningful timing structure in the learned representations. We show that pretraining with xMAE yields representations that outperform both unimodal and multimodal baselines on 15 of 19 downstream tasks, including cardiovascular outcome prediction, abnormal laboratory test detection, sleep staging, and demographic inference, while generalizing across devices, body locations, and acquisition settings. Further analysis suggests that the ECG--PPG timing structure is reflected in the learned PPG representations. More broadly, xMAE demonstrates the effectiveness of incorporating temporal structure into multimodal pretraining when signals observe different stages of a shared underlying process.

Abstract:
Many natural dynamic processes –such as in vivo cellular differentiation or disease progression– can only be observed through the lens of static sample snapshots. While challenging, reconstructing their temporal evolution to decipher underlying dynamic properties is of major interest to scientific research. Existing approaches enable data transport along a temporal axis but are poorly scalable in high dimension and require restrictive assumptions to be met. To address these issues, we propose MultiMarginal temporal Schrödinger Bridge Matching (MMtSBM) from unpaired data, extending the theoretical guarantees and empirical efficiency of Diffusion Schrödinger Bridge Matching (Shi et al., 2023) by deriving the Iterative Markovian Fitting algorithm to multiple marginals in a novel factorized fashion. Experiments show that MMtSBM retains theoretical properties on toy examples, achieves state-of-the-art performance on real-world datasets such as transcriptomic trajectory inference in 100 dimensions, and, for the first time, recovers couplings and dynamics in very high-dimensional image settings. Our work establishes multi-marginal Schrödinger bridges as a practical and principled approach for recovering hidden dynamics from static data.

Abstract:
Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. Onpolicy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 4-8× token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.

Abstract:
Time series forecasting (TSF) plays a critical role in decisionmaking for many real-world applications. Recently, LLM-based forecasters have made promising advancements. Despite their effectiveness, existing methods often lack explicit experience accumulation and continual evolution. In this work, we propose MemCast, a learning-to-memory framework that reformulates TSF as an experience-conditioned reasoning task. Specifically, we learn experience from the training set and organize it into a hierarchical memory. This is achieved by summarizing prediction results into historical patterns, distilling inference trajectories into reasoning wisdom, and inducing extracted temporal features into general laws. Furthermore, during inference, we leverage historical patterns to guide the reasoning process and utilize reasoning wisdom to select better trajectories, while general laws serve as criteria for reflective iteration. Additionally, to enable continual evolution, we design a dynamic confidence adaptation strategy that updates the confidence of individual entries without leaking the test set distribution. Extensive experiments on multiple datasets demonstrate that MemCast consistently outperforms previous methods, validating the effectiveness of our approach. Our code is available at https://anonymous.4open.science/r/MemCast.

Abstract:
Video diffusion models leveraging step distillation or causal distillation have achieved remarkable performance. However, adapting existing LoRAs to these variants remains a critical challenge due to weight space mismatches. We observe that direct application leads to style degradation and structural collapse, yet the underlying mechanisms remain poorly understood. To fill this gap, we delve into the weight space and identify that the incompatibility stems from spectral interference within shared functional clusters defined over singular subspaces. Specifically, our analysis reveals that while both paradigms respect spectral rigidity, they establish conflicting routing pathways that clash through constructive overload or destructive cancellation. To address this issue, we propose ClusterAware Spectral Arbitration (CASA), a data-free framework that dynamically arbitrates between safeguarding the target's manifold and restoring LoRA alignment based on spectral density. Extensive experiments demonstrate that CASA effectively mitigates artifacts and revives LoRA functionality. Our code is available at https://anonymous.4open.science/r/CASA/.

Abstract:
Large Language Models (LLMs) often struggle with the precise logic and schema alignment required for complex Textto-SQL tasks. While current methods rely heavily on static prompting, they lack the ability to dynamically adapt and self-correct through environmental interaction. To bridge this gap, we proposeMARS-SQL, a multi-agent architecture that leverages interactive Reinforcement Learning (RL) to optimize SQL generation. Unlike monolithic approaches, our method decomposes the problem into three specialized roles: schema linking, query generation, and solution validation. Central to our approach is a generation agent trained via a multi-turn RL policy, which operates within a ReAct-style loop. This agent learns to iteratively reason, execute intermediate SQL actions on a live database, and refine its strategy based on execution feedback. To ensure robustness, we introduce a validation mechanism that treats solution selection as a generative modeling task, identifying the optimal interaction trajectory through next-token prediction probabilities. Empirical evaluations demonstrate the effectiveness of coupling interactive learning with trajectory ranking.MARS-SQLachieves state-of-the-art performance, recording an execution accuracy of 77.84% on the BIRD development dataset and 89.75% on the Spider test dataset.

Abstract:
Segmentation based on language has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their potential for realworld scenarios that involve up-to-date information or domain-specific concepts. In this work, we propose Seg-ReSearch, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that our Seg-ReSearch improves state-of-the-art approaches by a substantial margin. Code and data will be released.

Abstract:
Feature attribution is central to diagnosing and trusting deep neural networks, and Integrated Gradients (IG) is widely used due to its axiomatic properties. However, IG can yield unreliable explanations when the integration path between a baseline and the input passes through regions with noisy gradients. While Guided Integrated Gradients reduces this sensitivity by adaptively updating lowgradient-magnitude features, input-space guidance still produces intermediate inputs that deviate from the data manifold. To address this limitation, we proposeManifold-Aligned Guided Integrated Gradients(MA-GIG), which constructs attribution paths in the latent space of a pre-trained variational autoencoder. By ensuring that decoded intermediate images remain aligned with the data manifold, MA-GIG constrains gradient evaluation to statistically valid regions. Through qualitative and quantitative evaluations, we demonstrate that MA-GIG produces faithful explanations by aggregating gradients on path features proximal to the input. Consequently, our method suppresses off-manifold noise and outperforms prior path-based attribution methods across multiple datasets and classifiers.

Abstract:
Selfplay bootstraps LLM reasoning through an iterative Challenger–Solver loop: the Challenger is trained to generate questions that target the Solver's capabilities, and the Solver is optimized on the generated data to expand its reasoning skills. However, existing frameworks like R-Zero often exhibit non-sustained improvement, where early gains degrade as self-play continues. We identify a key failure mode, Diversity Illusion, where the Solver's training signals appear diverse yet collapse into recurring underlying patterns. It manifests as (1) Local Diversity Illusion, where diversity is enforced only within-batch, inducing cross-iteration mode cycling; and (2) Surface Diversity Illusion, where questions vary superficially but require near-identical reasoning skills. To mitigate them, we propose R-Diverse with two aligned innovations: Memory-Augmented Penalty (MAP), which uses a persistent memory bank to discourage recycling across iterations, and Skill-Aware Measurement (SAM), which evaluates diversity by the reasoning skills exercised rather than surface variation of questions. Across 10 math and general reasoning benchmarks, R-Diverse sustains gains over more iterations and consistently outperforms prior self-play methods.

Abstract:
Reinforcement learning has advanced video reasoning in large multimodal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model’s knowledge boundary, or hybrid replay that mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model’s capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g. timestamps, regions etc.) from the original video while the question remains unchanged. The student answers again with the added context, and training updates with a chosen-rollout scheme integrated into GRPO. We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity through correct answers and dependency alignment through rationales that reflect the cited evidence. Advantages are group-normalized across the batch, preserving on-policy exploration while directing it along causally meaningful directions with minimal changes to the training stack. Experiments on general video and video-reasoning benchmarks show consistent accuracy gains and strong generalization.

Abstract:
Imbalanced Domain Generalization (IDG) focuses on mitigating bothdomain and label shifts, both of which fundamentally shape the model's decision boundaries, particularly under heterogeneous longtailed distributions across domains. Despite its practical significance, it remains underexplored, primarily due to thetechnicalcomplexity of handling their entanglement and the paucity oftheoreticalfoundations. In this paper, we begin bytheoreticallyestablishing the generalization bound for IDG, highlighting the role of posterior discrepancy and decision margin. This bound motivates us to focus on directly steering decision boundaries, marking a clear departure from existing methods. Then, wetechnicallypropose a novel Negative-Dominant Contrastive Learning (NDCL) for IDG to enhance discriminability while enforce posterior consistency across domains. Specifically, inter-class decision-boundary separation is enhanced by placing greater emphasis on negatives as the primary signal in our contrastive learning, naturally amplifying gradient signals for minority classes to avoid the decision boundary being biased toward majority classes. Intra-class compactness is encouraged through a re-weighted cross-entropy strategy, and posterior consistency across domains is enforced through a prediction-central alignment strategy. Finally, rigorous yet challenging experiments on benchmarks validate the effectiveness of our NDCL.

Abstract:
Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in lowresource and expert domains. To address this, we introduce ActiveUltraFeedback, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside Double Reverse Thompson Sampling (DRTS) and DeltaUCB, two novel methods prioritizing response pairs with large predicted quality gaps, effectively operationalizing recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ActiveUltraFeedback yields high-quality datasets that lead to significant improvements in downstream performance, notably capturing the majority of the training signal from less than one-third of the annotated data relative to static baselines. We release our code and artifacts to facilitate research into efficient alignment and data generation.

Abstract:
This paper concerns the question of how language models and other AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation of this paper is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. We argue that the natural geometry is information geometry, and then show how this interacts with semantic encoding and the linear representation hypothesis. It turns out that the duality structure of information geometry plays a critical role. As an illustrative application, we developdual steering, a method for robustly steering representations to exhibit a particular concept using linear probes. We formally prove that dual steering optimally modifies the target concept while minimizing changes to offtarget concepts. We empirically find that dual steering enhances the controllability and stability of concept manipulation.

Abstract:
Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information. The central challenge in this field lies in balancing preconditioning effectiveness with computational efficiency of implementing the preconditioner. Among recent advances, \textscMUON stands out by using NewtonSchulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. In this paper, we introduce \textscRMNP (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise \ell_2 normalization operation, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. This substitution reduces the per-iteration complexity from \mathcalO(mn\cdot\min(m,n)) to \mathcalO(mn) for an m× n weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for \textscRMNP in the non-convex setting that match recent results for \textscMuon optimizers, achieving the information-theoretic minimax optimal complexity. Extensive experiments on large language model pretraining show that \textscRMNP delivers competitive optimization performance compared with \textscMuon while substantially reducing preconditioning process wall-clock time. Our code is available at \hrefhttps://anonymous.4open.science/r/RMNP-317C/link.

Abstract:
Inferencetime steering has emerged as a promising paradigm for controlling language models (LMs) without the cost of retraining. However, standard approaches typically rely on activation addition, a geometric operation that inevitably alters the magnitude of hidden representations. This raises concerns about representation collapse and degradation of open-ended generation capabilities. In this work, we explore Spherical Steering, a training-free primitive that resolves this trade-off through activation rotation. Rather than shifting activations with a fixed vector, our method rotates them along a geodesic toward a target direction, guiding the activation toward the target concept while preserving the integrity of the signal. To further enhance adaptivity, we incorporate a confidence gate that dynamically modulates steering strength based on input uncertainty. Extensive experiments across multiple-choice benchmarks demonstrate that Spherical Steering significantly outperforms addition-based baselines (notably by +10% on TruthfulQA, COPA, and Storycloze), while simultaneously maintaining the model’s general open-ended generation quality. This work highlights the value of geometric consistency, suggesting that norm-preserving rotation is a robust and effective primitive for precise inference-time control.

Abstract:
Multimodal remote sensing imagery provides complementary observations of the same geographic scene, yet such observations are frequently incomplete in practice. Existing cross-modal translation methods treat each modality pair as an independent task, resulting in quadratic complexity and limited generalization to unseen modality combinations. We formulate Any-to-Any translation as inference over a shared latent representation of the scene, where different modalities correspond to partial observations of the same underlying semantics. Based on this formulation, we propose Any2Any, a unified latent diffusion framework that projects heterogeneous inputs into a geometrically aligned latent space. Such structure performs anchored latent regression with a shared backbone, decoupling modality-specific representation learning from semantic mapping. Moreover, lightweight target-specific residual adapters are used to correct systematic latent mismatches without increasing inference complexity. To support learning under sparse but connected supervision, we introduce RST-1M, the first million-scale remote sensing dataset with paired observations across five sensing modalities, providing supervision anchors for any-to-any translation. Experiments across 14 translation tasks show that Any2Any consistently outperforms pairwise translation methods and exhibits strong zero-shot generalization to unseen modality pairs.

Abstract:
Deep time series models are vulnerable to noisy data ubiquitous in realworld applications. Existing robustness strategies either prune data or rely on costly prior quantification, failing to balance effectiveness and efficiency. In this paper, we introduce DropoutTS, a model-agnostic plugin that shifts the paradigm from "what" to learn to "how much" to learn. DropoutTS employs a Sample-Adaptive Dropout mechanism: leveraging spectral sparsity to efficiently quantify instance-level noise via reconstruction residuals, it dynamically calibrates model learning capacity by mapping noise to adaptive dropout rates—selectively suppressing spurious fluctuations while preserving fine-grained fidelity. Extensive experiments across diverse noise regimes and open benchmarks show DropoutTS consistently boosts superior backbones’ performance, delivering advanced robustness with negligible parameter overhead and no architectural modifications. Our code is available at https://anonymous.4open.science/r/DropoutTS/.

Abstract:
Charts are a fundamental visualization format for structured data analysis. Enabling endto-end chart editing according to user intent is of great practical value, yet remains challenging due to the need for both fine-grained control and global structural consistency. Most existing approaches adopt pipeline-based designs, where natural language or code serves as an intermediate representation, limiting their ability to faithfully execute complex edits. We introduce ChartE^3, an End-to-End Chart Editing benchmark that directly evaluates models without relying on intermediate natural language programs or code-level supervision. ChartE^3 focuses on two complementary editing dimensions: local editing, which involves fine-grained appearance changes such as font or color adjustments, and global editing, which requires holistic, data-centric transformations including data filtering and trend line addition. ChartE^3 contains over 1,200 high-quality samples constructed via an well-designed data pipeline with human curation. Each sample is provided as a triplet of a chart image, its underlying code, and a multimodal editing instruction, enabling evaluation from both visual and structural perspectives. Extensive benchmarking of state-of-the-art multimodal large language models reveals substantial performance gaps, particularly on global editing tasks, highlighting critical limitations in current end-to-end chart editing capabilities.

Abstract:
We introduce RedDebate, a novel multiagent debate framework that provides the foundation for Large Language Models (LLMs) to identify and mitigate their own unsafe behaviors. Existing AI safety approaches often rely on costly human evaluation or isolated single-model assessment, both constrained by scalability and prone to oversight failures. RedDebate employs collaborative argumentation among multiple LLMs across diverse debate scenarios, enabling them to critically evaluate one another’s reasoning and systematically uncover unsafe failure modes through fully automated red-teaming. We further integrate distinct long-term memory modules that preserve safety-relevant insights from debate interactions and leverage them during subsequent inference, facilitating continuous refinement of model behavior. Empirical evaluation on safety benchmarks across a diverse set of models demonstrates that RedDebate substantially reduces unsafe outputs. While debate alone allows LLMs to refine their behavior, the addition of memory modules yields further significant reductions. To the best of our knowledge, RedDebate is the first fully automated framework to unify multi-agent debate and red-teaming to progressively enhance LLM safety without human intervention.

Abstract:
Generating longform content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over an ever-growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, substantially reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy. When integrated, it substantially improves model accuracy under aggressive sparsification by offsetting much of the performance loss induced by sparsity. Experiments on diffusion language models and video generation demonstrate up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality.

Abstract:
Despite their strong performance on reasoning tasks, large reasoning models (LRMs) often suffer from overthinking, producing unnecessarily long outputs and incurring high endto-end latency, a significant limitation to their real-world deployment. To address overthinking, early-exit mechanisms have been proposed to terminate reasoning before typical completion, showing that this approach can effectively shorten generation length with minimal impact on accuracy. However, their reliance on probing mechanisms introduces a detection overhead that limits their end-to-end latency gains and compromises their generalizability across diverse problems. Inspired by the use of hidden states in speculative decoding, we proposeSpecExit, a novel framework that predicts both future tokens and an early-exit signal directly from a lightweight draft model without probing overhead. Our method offers significant improvements, achieving up to 66% generation length reduction and 2.5× end-to-end speedup compared with the speculative decoding baseline, without compromising accuracy. Our method leverages the inherent signals from hidden states to provide effective early-exit signals, suggesting broader use of hidden states for efficient reasoning. Our code is available at: https://anonymous.4open.science/r/SpecExit-B802.

Abstract:
Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, queryagnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance–cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., \textscLow/\textscMid/\textscHigh). A lightweight router performs budget-tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy–cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

Abstract:
Coding agents are increasingly used for a wide range of realworld tasks, from adding features and documentation to creating programs from scratch. Ideally, the agent should perform well across all the diverse tasks. However, most prior work concentrates on issue solving, and such single-task training does not transfer reliably to other coding tasks. In this work, we aim to train coding agents that generalize across tasks. We first analyze task transferability from two axes: the commonalities shared among coding tasks and the capabilities of current agents. Guided by these findings, we derive a set of principles for training task design and verify them through a series of controlled experiments. We then present Hybrid-Gym, a training dataset built on four scalable training tasks that follow these principles. Experiments show that, under zero-shot task transfer, Hybrid-Gym achieves performance comparable to in-domain training, and further improves existing datasets when combined with them (e.g., +4.85% on SWT-Bench-Verified).

Abstract:
While flow matching is elegant, its reliance on singlesample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is challenging, and 2) a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Leveraging this insight, we propose Stable Velocity, a unified framework that improves both training and sampling. For training, we introduce Stable Velocity Matching (StableVM), an unbiased variance-reduction objective, along with Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthen auxiliary supervision in the low-variance regime. For inference, we show that dynamics in the low-variance regime admit closed-form simplifications, enabling Stable Velocity Sampling (StableVS), a finetuning-free acceleration. Extensive experiments on ImageNet 256×256 and large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, demonstrate consistent improvements in training efficiency and more than 2× faster sampling within the low-variance regime without degrading sample quality.

Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across a variety of domains. However, their applications in cryptography, which serve as a foundational pillar of cybersecurity, remain largely unexplored. To address this gap, we build AICrypto, a comprehensive benchmark designed to evaluate the cryptography capabilities of LLMs. The benchmark comprises 135 multiplechoice questions, 150 capture-the-flag challenges, and 30 proof problems, covering a broad range of skills from knowledge memorization to vulnerability exploitation and formal reasoning. All tasks are carefully reviewed or constructed by cryptography experts to improve correctness and rigor. For each proof problem, we provide detailed scoring rubrics and reference solutions that enable automated grading, achieving high correlation with human expert evaluations. We introduce strong human expert performance baselines for comparison across all task types. Our evaluation of 17 leading LLMs reveals that state-of-the-art models match or even surpass human experts in memorizing cryptographic concepts, exploiting common vulnerabilities, and routine proofs. However, our analysis reveals that they still lack a deep understanding of abstract mathematical concepts and struggle with tasks that require multi-step reasoning and dynamic analysis. We hope this work could provide insights for future research on LLMs in cryptographic applications. Our code and dataset are available at https://anonymous.4open.science/r/aicrypto-CE6E/.

Abstract:
The rapid evolution of generative AI, from GANs to modern diffusion models, has resulted in increasingly subtle discriminative clues. These finegrained signals are often overshadowed by dominant, high-fidelity image content (e.g., the main subject), limiting the reliability of existing detectors that predominantly rely on global representations. To address this challenge, we propose the Peak-Guided Calibration (PGC) framework. PGC introduces a novel strategy that aggregates salient features via a peak-focusing mechanism. Specifically, by employing a peak-sensitive aggregation that accentuates the most discriminative local clues, PGC leverages these critical signals to calibrate the global decision. This approach recovers subtle patterns that would otherwise be submerged in the global context. Furthermore, to better simulate real-world threats, we introduce the CommGen15 dataset, a challenging benchmark comprising samples from 15 commercial models. Extensive experiments demonstrate that PGC achieves state-of-the-art performance, surpassing existing detectors by +12.3% (Accuracy) on CommGen15, while setting new records on standard benchmarks, including GenImage (+2.1%), AIGI (+3.5%), and UniversalFakeDetect (+1.7%).

Abstract:
Neural solvers have recently emerged as powerful tools for combinatorial optimization (CO). Among them, diffusion models have shown strong promise due to their ability to capture highly multimodal solution distributions in CO through iterative generative processes. However, training diffusion models typically requires large collections of nearoptimal solutions, which limits their scalability and generalization. We address this fundamental challenge by extending adjoint matching, a powerful unsupervised diffusion training framework based on chain-rule–style gradient propagation in continuous spaces, to discrete combinatorial domains. Our approach resolves the broken-gradient issue inherent to discrete data and unifies local and global training objectives within a single principled framework. Empirically, our method consistently outperforms existing unsupervised baselines and achieves performance comparable to supervised diffusion models.

Abstract:
Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning methods primarily rely on retraining or aggressive finetuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re-mapping problem via model editing. We propose ZeroUnlearn, a few-shot unlearning framework. It overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed-form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient-based variant for multi-sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available at the anonymous link: https://anonymous.4open.science/r/ZeroUnlearn-9B87.

Abstract:
Standard NextToken Prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse, one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that limit generalization. To address this issue, we proposeNext Implicit Token Prediction (NITP), which augments discrete prediction with dense, continuous supervision directly in the representation space. NITP requires the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. Theoretically, we show that NITP regularizes the optimization landscape by eliminating under-constrained degrees of freedom and enforcing a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistentlyimproves downstream performance with negligible computational overhead. Notably, on the 9B MoE model, NITP achieves a5.7%absolute improvement on MMLU-Pro, along with gains of6.4%on C3 and4.3%on CommonsenseQA, with~2%additional training FLOPs and no additional inference cost.

Abstract:
Textto-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation. However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the Reason-Reflect-Rectify (R^3) loop as a core framework and introduce R^3-Bench, a benchmark of over 600 expert-annotated instances that quantifies iterative reasoning and rectification capabilities. Evaluation on R^3-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions. To bridge this gap, we propose R^3-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification with relective reasoning. Experiments show that R^3-Refiner achieves significant improvements on R^3-Bench (+12.0 % in Reflective Verdict Score, +9.0% in Rectification Score), and can be seamlessly integrated with various MLLMs to enhance the generation quality of different T2I models on GenEval++ and T2I-CompBench. Our code, models, and data will be publicly available.

Abstract:
Gradientbased attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant breakthroughs. To achieve such an effect, we first analyze the issue of "high-loss non-adversarial examples" that degrades attack performance in previous methods, and prove that this issue arises from inappropriate objectives for adversarial example generation. Subsequently, we reconstruct the objective as "maximizing the difference between the non-ground-truth label probability upper bound and the ground-truth label probability", and proposes a novel and powerful gradient-based attack method named Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of "cycle-stage-step". It adopts the negative probability loss function and the Directional Probability Difference Ratio (DPDR) loss function in the initial and subsequent optimization stages, respectively, and approaches the ideal objective of adversarial example generation via stage-wise sequential optimization. Experiments demonstrate that compared with previous state-of-the-art methods, SDM not only achieves stronger attack performance but also exhibits superior cost-effectiveness. The code is available at \urlhttps://anonymous.4open.science/r/SDM-2515.

Abstract:
Transfer learning aims to facilitate the learning of a target domain by transferring knowledge from a source domain. The source domain typically contains semantically meaningful samples (e.g., images) to facilitate effective knowledge transfer. However, a recent study observes that the noise domain constructed from simple distributions (e.g., Gaussian distributions) can serve as a surrogate source domain in the semisupervised setting, where only a small proportion of target samples are labeled while most remain unlabeled. Based on this surprising observation, we formulate a novel problem termedSemi-Supervised Noise Adaptation(SSNA), which aims to leverage a synthetic noise domain to improve the generalization of the target domain. To address this problem, we first establish a generalization bound characterizing the effect of the noise domain on generalization, based on which we propose a Noise Adaptation Framework (NAF). Extensive experiments demonstrate that NAF effectively leverages the noise domain to tighten the generalization bound of the target domain, leading to improved performance. The codes are available at https://anonymous.4open.science/r/SSNA.

Abstract:
The scalable solution of large sparse linear systems is a bottleneck in scientific computing and graph analysis. While algebraic multigrid (AMG) offers optimal linear scaling, its performance is severely constrained by the tradeoff between the sparsity and convergence quality of coarse-grid operators. Classical AMG heuristics struggle to balance these objectives, often sacrificing stability for sparsity. We propose RAPNet, a graph neural network (GNN) framework that resolves this trade-off by learning to generate sparse, robust coarse operators directly from the sparse algebraic system. Key to our approach is a level-wise training strategy that enables learning from small subgraphs and generalization to million-node domains, bypassing the bottlenecks of prior neural AMG attempts. RAPNet executes exclusively during the solver setup phase, ensuring that the solve phase retains its favorable computational properties. We show that our method outperforms classical non-Galerkin baselines on diverse PDE discretizations and graph Laplacians, making it particularly effective for multi-query tasks such as eigenproblems, time-dependent simulations, and inverse or design problems.

Abstract:
Large language model (LLM)based Multi-agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role-specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non-trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iteratively refine prompts across the entire system. A core innovation of MASPO is its joint evaluation mechanism, which assesses prompts not merely by their local validity, but by their capacity to facilitate downstream success for successor agents. This effectively bridges the gap between local interactions and global outcomes without relying on ground-truth labels. Furthermore, MASPO employs a data-driven evolutionary beam search to efficiently navigate the high-dimensional prompt space. Extensive empirical evaluations across 6 diverse tasks demonstrate that MASPO consistently outperforms state-of-the-art prompt optimization methods, achieving an average accuracy improvement of 2.9. We release our code at https://anonymous.4open.science/r/MASPO-8296.

Abstract:
Multiscale modeling has emerged as an effective design principle for time-series forecasting by capturing temporal dynamics at multiple resolutions. As no principled foundation has been established in the literature, we unify existing scaling methods into a scaling operator family, revealing a fundamental limitation of existing approaches: reliance on fixed and discrete scaling. To address this limitation, we propose SiGMA (Single Generalized Multi-scale Architecture), which enables position-wise scaling via the learnable discrete Gaussian (LDG) kernel grounded in scale-space theory. We evaluate SiGMA comprehensively on long- and short-term forecasting benchmarks against state-of-the-art multi-scale baselines. SiGMA outperforms all competitors on both tasks, especially achieving the best performance in 13 out of 16 long-term evaluation settings. Beyond accuracy, SiGMA significantly improves training speed by up to 5.3 times and reduces memory consumption by up to 3.8 times over the strongest competitors.

Abstract:
Variational autoencoders (VAEs) compress high resolution CT volumes into compact latents while preserving clinically relevant structure. However, training CTspecific VAEs from scratch or heavily fine-tuning them incurs substantial computational and engineering cost, and often degrades under heterogeneous scanners, protocols, and diseases. This paper makes a progressive stride toward training-free medical VAEs by leveraging a critical observation: a single Foundation VAE, pretrained at scale on natural images and videos, can serve as a unified interface for CT Reconstruction, Augmentation, and Generation. With both encoder and decoder frozen, the Foundation VAE reconstructs CT volumes with preserved anatomy while suppressing acquisition noise; training segmentation models on these reconstructions improves surface accuracy by 3.9% NSD on average for pancreatic tumor and lung tumor. Within the same Foundation VAE latent space, a conditional latent diffusion model achieves 3.9% lower average FVD with 36.2% higher CT CLIP score, and improves multi-disease generation faithfulness across 18 types by 2.76% AUC. These results demonstrate Foundation VAEs as a practical interface for scalable CT representation reuse and faithful CT generation.

Abstract:
Achieving pixellevel accurate segmentation of structural cracks across diverse scenarios remains a formidable challenge. Existing methods face significant bottlenecks in balancing crack topology modeling with computational efficiency, often failing to reconcile high segmentation quality with low resource demands. To address these limitations, we propose the Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV), a network that achieves high-precision modeling via a novel Structure Field Encoder (SFE) backbone while maintaining linear complexity. The SFE integrates the Adaptive Multi-scale Cascaded Modulator (AMCM) to enhance texture representation and utilizes the Structure Insight Calibration Unit (SCIU) as its core engine. Specifically, the SCIU employs the Geometry-guided Bidirectional Structure Transform (GBST) to capture topological correlations and integrates the Dynamic Self-Calibrating Decay (DSCD) into Dy-WKV to suppress noise propagation. Furthermore, we introduce a lightweight Cross-Scale Harmonic Fusion (CSHF) decoder to achieve precise feature aggregation. Systematic evaluations on multiple benchmarks characterized by complex textures and severe interference demonstrate that SCRWKV, with only 1.22M parameters, significantly outperforms SOTA methods. Achieving an F1 score of 0.8428 and mIoU of 0.8512, the model confirms its robust potential for efficient real-world deployment.

Abstract:
Large language models (LLMs) have achieved remarkable progress in openended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long-form generation due to hallucination snowballing, a phenomenon where early errors propagate and compound into subsequent outputs. To address this challenge, we propose a novel inference-time hallucination mitigation framework, named Segment-wise HAllucination Rejection Sampling (SHARS), which uses am arbitrary hallucination detector to identify and reject hallucinated segments during generation and resample until faithful content is produced. By retaining only confident information and building subsequent generations upon it, the framework mitigates hallucination accumulation and enhances factual consistency. To instantiate this framework, we adopt semantic uncertainty as the detector and introduce several vital modifications to address its limitations and better adapt it to long-form text. Our method enables models to self-correct hallucinations without requiring external resources such as web search or knowledge bases, while remaining compatible with them for future extensions. Empirical evaluations on standardized hallucination benchmarks demonstrate that our method substantially reduces hallucinations in long-form generation while preserving or even improving the informativeness of generation.

Abstract:
Textto-Image (T2I) models have made significant strides in visual realism and semantic consistency, yet they often perpetuate and amplify societal biases. Existing evaluation methods typically address only single-dimensional biases, lacking perspectives to uncover model biases at social-related deeper semantic levels. We introduce HoloFair, a comprehensive benchmark framework for multidimensional demographic bias analysis. Built upon our large-scale fairness-oriented dataset and the SpaFreq (Spatial-Frequency) attribute classifier, this framework proposes the Multi-attribute, Group-wise Bias Index (MGBI) metric, designed to assess both intrinsic diversity and conditional biases. Beyond evaluation, we further introduce Fair-GRPO, a reinforcement-learning-based debiasing method that alters the distribution of generative models through a designed multi-objective reward function. E.g., experiments on the SD3.5-Medium model demonstrate that Fair-GRPO significantly improves multidimensional fairness while maintaining high image quality. We also analyze potential reward hacking phenomena and provide corresponding mitigation strategies.

Abstract:
Scientific illustration figures are essential for depicting research works' conceptual designs, methodology, and experimental workflows, playing a pivotal role in communicating complex academic insights. However, creating highquality scientific illustrations remains a labor-intensive task for human scientists. While recent generative image models have advanced prompt-based editing, the synthesis of fullyeditablefigures remains a fundamental challenge. Valid editability involves structured transformations of graphical elements, scales, attributes, and text, rather than simple pixel-level changes. Existing models generate raster outputs that do not support manual correction or layout adjustment, limiting their utility in scientific publishing, where editable vector figures are typically required for submission. To address this challenge, we introduceLiveFigure, an agentic framework driven by VLM agents that imitates the multi-step drawing workflow of human researchers. It first plans figure blueprints by drawing inspiration from high-quality references in previous works, then generates executable scripts that produce figures via the PowerPoint interface based on skills and experience, and finally refines the outputs with targeted visual diagnostics, producing fully vectorized, editable figures that meet publication standards. Extensive experiments demonstrate that LiveFigure generates inherently editable figures that are both visually clear and aesthetically appealing, achieving 80% publication-readiness within just 17 manual edits—far surpassing the 24% rate of the strongest baseline, NanoBanna. Human preference studies further validate this advantage, with LiveFigure securing a 60% win rate against NanoBanna. Our code is available athttps://anonymous.4open.science/r/LiveFigure.

Abstract:
Modern robotic policies increasingly rely on action chunking to execute complex tasks in the physical world. While action chunking improves temporal consistency at moderate action frequencies, it becomes insufficient when the action frequency is further increased (e.g., to 60~Hz). At such high frequencies, policies often fail to generate actions that are both temporally smooth and spatially consistent. We address this challenge by shifting highfrequency action learning from the action space to a latent space with variational autoencoder (VAE). This formulation significantly improves both temporal and spatial consistency of high-frequency control. To enable smooth real-time execution, we further introduce Reuse-then-Refine, a chunk-level refine strategy that improves continuity between adjacent action chunks under asynchronous inference. As a result, robots controlled by our policy can execute complex contact-rich tasks continuously, with less pauses and jerky motions. Experiments on three real-world contact-rich robotic tasks show that our approach consistently completes tasks with smooth motions.

Abstract:
Modern deeplearning models have achieved remarkable success in time-series forecasting. Yet, their performance degrades in long-term prediction due to error accumulation in autoregressive inference, where predictions are recursively used as inputs. While classical error correction mechanisms (ECMs) have long been used in statistical methods, their applicability to deep learning models remains limited or ineffective. In this work, we revisit the error accumulation problem in deep time-series forecasting and investigate the role and necessity of ECMs in this new context. We propose a simple, architecture-agnostic error correction model that can be integrated with any existing forecaster without requiring retraining. By explicitly decomposing predictions into trend and seasonal components and training the corrector to adjust each separately, we introduce the Universal Error Corrector with Seasonal–Trend Decomposition (UEC-STD), which significantly improves correction accuracy and robustness across 4 backbones and 10 datasets. Our findings provide a practical tool for enhancing forecasts while offering new insights into mitigating autoregressive errors in deep time-series models.

Abstract:
Vector Symbolic Algebras (VSAs) enable robust neurosymbolic reasoning by encoding information into highdimensional distributed representations. For continuous domains, Spatial Semantic Pointers (SSPs) extend this framework by mapping variables onto precise toroidal manifolds. While generative models offer a promising avenue for cleaning up (denoising) these representations, standard approaches like Flow Matching assume a flat Euclidean geometry. We demonstrate that this assumption fails for SSPs: Euclidean linear interpolants "cut through" the manifold's interior, destroying the phase and magnitude structure required for accurate decoding. To resolve this, we employ Geodesic Flow Matching, adapting Riemannian transport dynamics to strictly restrict the denoising flow to the SSP manifold. We validate this approach in a Spiking Neural SLAM system, showing that manifold-aware cleanup stabilizes path integration against drift. The method achieves a 72% reduction in tracking error and enables a 40% increase in neural efficiency compared to classical baselines.

Abstract:
Finetuning large language models for vertical domains remains a labor-intensive and expensive process, requiring domain experts to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning, no prior work has tackled end-to-end LLM fine-tuning with agents. Can LLM-based agents automate this complete process? We frame this as a substantially open problem: agents must navigate an open-ended search space spanning data curation from diverse data sources, processing with complex tools, building a training pipeline, and iteratively refining their approach based on evaluation outcomes in rapidly growing logs—an overall scenario far more intricate than existing benchmarks. To study this question, we introduce FT-Dojo, an interactive environment comprising 13 tasks across 5 domains. We further develop FT-Agent, an autonomous system that mirrors human experts by leveraging evaluation-driven feedback to iteratively diagnose failures and refine fine-tuning strategies. Experiments on FT-Dojo demonstrate that purpose-built fine-tuning agents significantly outperform general-purpose alternatives, with FT-Agent achieving the best performance on 10 out of 13 tasks across all five domains. Ablations show that the approach generalizes effectively to 3B models, with additional insights on data scaling trade-offs and backbone sensitivity. Case analyses reveal that agents can recover from failures through cumulative learning from historical experience, while also exposing fundamental limitations in causal reasoning—highlighting both the promise and current boundaries of autonomous fine-tuning.

Abstract:
Mixtureof-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We proposeMetaMoE, a privacy-preserving framework that unifies independently trained, domain-specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity-aware proxy selection, which selects client-domain–relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context-aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods.

Abstract:
Graph coarsening is a graph dimensionality reduction technique that aims to construct a smaller and more tractable graph while preserving the essential structural and semantic properties of the original graph. However, most existing methods rely on pairwise similarity matching, where each node independently searches for its best partner based on global information. This selfishness matching paradigm incurs substantial computational and memory overhead. To address this problem, we shift to a non-selfishness principle that prioritizes the collective interference of neighborhood in coarsening, and propose an efficient method named \textttNOPE, which achieves linear memory consumption and near-linear computational complexity in the number of nodes. Furthermore, we derive a faster variant \textttNOPE, which reduces \mathcalO(\Delta\cdot d) interference evaluation to \mathcalO(d) based on the local isotropy assumption, and consequently alleviates the computational bottleneck for high-degree nodes. Experimental results show that \textttNOPE achieves 1.8–10× speedup over \textttNOPE and surpass almost all baselines with 1-3 orders of magnitude acceleration. Meanwhile, learning on coarsened graphs yields comparable performance to original graphs, and can even show superior performance over LLM-based graph reasoning owing to compact graph information. The code can be available at https://anonymous.4open.science/r/NOPE-FA74.

Abstract:
Adversarial Distillation aims to enhance student robustness by guiding the student with a robust teacher's soft labels within the minmax adversarial training framework, yet its success is notoriously inconsistent: a more robust teacher often fails to improve, or even harms, the student's robust generalization. In this paper, we identify a key mechanism of this teacher dependency: the misalignment between the teacher's supervisory confidence and the student's representational limitations on a consistent subset of training data—the Robustly Unlearnable Set. We present a theoretical framework analyzing the feature learning dynamics of a two-layer neural network, demonstrating that this mismatch creates a dichotomy in distillation outcomes. We prove that when a teacher provides confident supervision on unlearnable samples, it compels the student to memorize spurious noise patterns that eventually overpower the learned robust signal, thereby driving robust overfitting. Conversely, a teacher that exhibits high uncertainty on these samples effectively suppresses noise memorization, allowing the student to rely solely on the learnable signal for robust generalization. We empirically validate our theory across both synthetic simulations and real-image classification datasets, confirming that robust overfitting is driven by the teacher's interaction with unlearnable samples. Finally, we demonstrate that a teacher's predictive entropy on unlearnable samples serves as a strong indicator of student robustness, validating our theoretical framework and offering a principled guideline for robust teacher selection.

Abstract:
Recent research on reasoning models explores the metaawareness of language models, including their ability to determine optimal thinking duration, recognize knowledge boundaries, and structure concept-level thinking. While current large reasoning models depend solely on answer-based verification, we show that adding meta-awareness objectives leads to significant performance gains over models without such meta-knowledge. MAPR utilizes a self-generated task of predicting rollout statistics - specifically length, pass-rate, and concepts used - allowing for verification against the actual statistics. Furthermore, by leveraging this self-predictive capability, the model can regulate its reasoning behavior by i) filtering out trivial or unsolvable prompts, ii) reducing lengthy generations that tend to be incorrect, and iii) generating hints relevant to the problem. The results are inspiring: MAPR yields significant improvements in both accuracy and training efficiency on various reasoning benchmarks. More specifically, our method can speed up GRPO training by over 1.28× to reach the same performance, and achieve 83.18% gain in accuracy on AIME25, and a 13.04% average gain over six mathematics benchmarks.

Abstract:
Mixtureof-Experts (MoE) architectures have emerged as a popular paradigm for scaling Large Language Models (LLMs), enabling greater capacity with reduced computational cost by dynamically routing inputs to the most relevant experts based on learned patterns. However, this also introduces a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods apply uniform adaptations across all parameters, ignoring their functional differences and inadvertently degrading general utility. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibilities to maximize coverage while explicitly minimizing interference with general capabilities. Specifically, based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation, which uses a transport cost matrix to distribute safety duties to the most cost-effective experts; and (2) Dynamic Routing Refinement, which constrains the router to ensure precise activation of these decentralized modules. Extensive experiments demonstrate that MESA achieves robust defensive performance against varied harmful benchmarks while preserving general helpfulness.

Abstract:
Minority sampling aims to generate lowdensity instances on a data manifold and is of central importance in applications such as medical diagnosis, anomaly detection, and creative AI. Existing approaches, however, define minority samples relative to generative priors learned from training data, confining rarity to model-specific notions that may poorly reflect real-world semantics. In this work, we propose a world-centric perspective on minority sampling, which defines rarity with respect to real-world priors rather than generator-induced densities. To this end, we introduceJEPA guidance, a diffusion sampling framework guided by a Joint-Embedding Predictive Architecture (JEPA)—a class of world models that encode broad, semantically rich representations. JEPA guidance steers diffusion trajectories toward low-density regions under the implicit density induced by the JEPA, thereby aligning generated minorities with real-world semantic rarity. To make JEPA guidance computationally practical, we develop principled approximation strategies accompanied by theoretical error bounds, significantly reducing the overhead of guidance computation. Extensive experiments across unconditional, class-conditional, and text-to-image generation demonstrate that JEPA guidance consistently improves the fidelity and semantic validity of minority samples, outperforming generator-centric baselines in capturing real-world notions of rarity.

Abstract:
Attention sinks (tokens that receive disproportionate attention mass) are often assumed to be functionally important in autoregressive language models. Whether such sinks are necessary in diffusion transformers remains unclear. We present a causal analysis of attention sinks in textto-image diffusion models, dynamically identifying dominant attention recipients based on incoming attention mass. Using paired, training-free interventions along the score and value paths, we test sink necessity across layers, denoising phases, and architectures. Across large-scale evaluations on 553 GenEval prompts with Stable Diffusion~3 and corroborating experiments on SDXL, we find that removing these sinks does not degrade text-image alignment or preference proxies under standard settings (k=1), with a metric-dependent boundary on HPS-v2 emerging only under stronger interventions (k \geq 10). We additionally quantify perceptual and distributional shifts relative to baseline outputs, showing that suppressing dominant recipients can alter appearance without affecting alignment or preference scores. Together, these results clarify that attention sinks are not functionally necessary for \emphsemantic alignment in diffusion transformers, while revealing a metric-dependent boundary: preference proxies show sink-specific degradation under stronger interventions (k \geq 10), whereas alignment (CLIP-T) remains robust across all tested conditions.

Abstract:
Defocus deblurring in pathological microscopy remains challenging due to the spatially varying and locally discontinuous nature of optical blur induced by a positiondependent integral imaging process. Existing deep learning methods, constrained by shift-invariance assumptions and limited interpretability, are not well suited to such heterogeneous blur patterns. Neural operators provide a principled alternative by modeling defocus formation directly as an integral operator, offering a new perspective on defocus deblurring. However, most existing neural operator architectures for low-level vision rely on globally parameterized kernels that assume smoothness and stationarity, limiting their ability to model heterogeneous and locally discontinuous blur patterns. To address this limitation, we propose the Discontinuous Galerkin Neural Operator (DGNO), which parameterizes the integral kernel using a discontinuous Galerkin formulation with element-local volume operators and interface numerical fluxes. DGNO provides a principled combination of locality, heterogeneity modeling, and global coherence while preserving the underlying physics of optical image formation. Extensive experiments demonstrate that DGNO surpasses state-of-the-art methods, delivering sharper reconstructions, robust handling of spatially varying blur, and scalable high-resolution performance.

Abstract:
Proteins encode diverse functions within complex threedimensional structures, yet most deep learning representations remain highly entangled, obscuring the biophysical signals that underlie function. Here we introduce ProtDiS, a knowledge-guided framework that decomposes pretrained protein micro-environment embeddings into biologically grounded and task-relevant dimensions. Inspired by the information bottleneck principle, ProtDiS learns representations that balance informativeness and compression, yielding structural features that are more specific, independent, and information-efficient, and achieving consistent improvements across twelve downstream tasks, with the largest gains under structure-based splits. Protein- and residue-level analyses further show that ProtDiS differentiates proteins with similar folds but divergent functions and captures fine-grained biophysical signals critical. These findings suggest that knowledge–guided decomposition provides a general and interpretable approach for structuring latent spaces in protein structural modeling.

Abstract:
Retrievalaugmented generation (RAG) systems typically rely on a single retriever and a single set of hyperparameters, despite facing highly heterogeneous queries that range from simple factoid questions to complex multi-hop reasoning. We propose a method that automatically selects a small, diverse subset of retrievers--a portfolio--from a large pool of candidates, to cover different regions of the target query distribution. We formalize this setting via an expected best-of-k objective over the query distribution and show that it admits an efficient portfolio construction algorithm with near-optimal guarantees. Across multiple QA benchmarks, our learned portfolios and router pipeline consistently outperform single-retriever and naive multi-retriever baselines on both retrieval metrics and answer quality. In addition, compared to inference-time hyperparameter tuning approaches, fixed portfolios enable parallel retrieval and LLM calls, achieving comparable (and sometimes better) accuracy with substantially lower latency and token cost.

Abstract:
Relative spatial relations provide a compact description of spatial structure, serving as a key component of relative spatial reasoning in 3D layout generation. Recent works leverage Multimodal Large Language Models (MLLMs) to infer these relations, but the inferred relations are often unreliable and are resolved by posthoc heuristics at the cost of semantic fidelity. In this paper, we propose R^3L, a general framework that improves the reliability and consistency of relative spatial reasoning for 3D layout generation. Our key motivation is that multi-hop reasoning requires repeated reference-frame shifts, which accumulate errors and lead to semantic and metric drift. To mitigate this, we propose invariant spatial decomposition to shorten relations chains, and consistent spatial imagination that uses an imagine-and-revise loop to encourage self-consistency during relation inference. We further design supportive spatial optimization that eases pose optimization by global-to-local coordinate re-parameterization. Extensive experiments across diverse scene types and instructions demonstrate that R^3L improves layout feasibility and semantic consistency. Notably, our analysis shows that resolving frame-induced inconsistencies during reasoning is crucial for reliable multi-hop relative spatial reasoning. Code will be released upon acceptance.

Abstract:
Learning mappings between infinitedimensional function spaces, or operator learning, is essential for many machine learning applications. Although transformer-based operators are popular, they often rely on token-wise attention. These methods treat continuous fields as discrete tokens and usually ignore the global functional structure. We introduce Functional Attention, which reinterprets attention as a functional correspondence between adaptive bases. Inspired by geometric functional maps, our method replaces softmax affinities with structured linear operators. This yields a compact, generalizable, resolution-invariant representation that explicitly captures global dependencies. Experiments demonstrate that Functional Attention can match state-of-the-art performance in many operator learning tasks, including solving PDEs, 3D segmentation, and regression, while remaining robust to varying discretizations.

Abstract:
Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving Large Language Model's (LLM) reasoning ability. However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a curious phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With crossexample gradient analysis, we show that unlearnable examples have fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks.

Abstract:
Conventional patchified Transformers operate on uniform spatial partitions, distributing computational effort evenly across the domain irrespective of local features. This inflexible tokenization scheme is inherently limited in its ability to efficiently represent and process solutions to complex PDEs. To address this, we propose MeshTok, an adaptive mesh refinement (AMR)inspired tokenization and sequence modeling framework. This method selectively refines spatial regions exhibiting sharp gradients, transient features, or multiscale structures, generating a heterogeneous set of multiscale tokens defined on a fixed simulation grid. These tokens are processed within a unified Transformer sequence, enabling the model to simultaneously capture coarse-grained global context and fine-grained local details without requiring specialized architectural components. Although adaptive refinement moderately increases token count, it promotes a more targeted allocation of computational resources to physically informative regions, thereby enhancing predictive accuracy under equivalent computational constraints. Experimental evaluations across multiple PDE families and benchmark datasets demonstrate that MeshTok consistently improves the efficiency–accuracy trade-off compared to uniform-grid baselines. This suggests adaptive multiscale tokenization as a scalable and generalizable design principle for neural PDE modeling.

Abstract:
While PhysicsInformed Neural Networks (PINNs) are powerful for solving Partial Differential Equations (PDEs), their training is often paralyzed by gradient pathology. The gradients from PDE residuals and boundary constraints oppose each other, trapping the model in local minima. Current solutions, such as adaptive weighting or hard constraints, either fail to fundamentally resolve this ill-conditioning or are limited to simple geometries. In this study, we systematically analyzed the causes of this gradient pathology from the perspectives of loss landscapes and optimization dynamics. Based on the obtained conclusion, we propose Constraint-Aligned loss with Manifold Lifting (CAML). By reformulating all zeroth-order terms into aligned constraints, our method effectively mitigates gradient conflicts. In addition, we introduce a delay factor to help the optimizer skip the high-curvature area. Experiments demonstrate that our CAML significantly enhances numerical stability and efficiency in problems with composite boundary conditions. Our code is open-sourced on \hrefhttps://anonymous.4open.science/r/CAML-2F81Anonymous Repository.

Abstract:
Diffusion Models (DMs) have exhibited remarkable efficacy in various image restoration tasks. However, existing approaches typically operate within the highdimensional pixel space, resulting in high computational overhead. While methods based on latent DMs seek to alleviate this issue by utilizing the compressed latent space of a variational autoencoder (VAE), they require repeated encoder-decoder inference. This introduces significant additional computational burdens, often resulting in runtime performance that is even inferior to that of their pixel-space counterparts. To mitigate the computational inefficiency, this work proposes projecting data into lower-dimensional subspaces using dynamic resolution DMs to accelerate the inference process. We first fine-tune pre-trained DMs for dynamic resolution priors and adapt DPS and DAPS, which are two widely used pixel-space methods for general image restoration tasks, into the proposed framework, yielding methods we refer to as SubDPS and SubDAPS, respectively. Given the favorable inference speed and reconstruction fidelity of SubDAPS, we introduce an enhanced variant termed SubDAPS++ to further boost both reconstruction efficiency and quality. Empirical evaluations across diverse image datasets and various restoration tasks demonstrate that the proposed methods outperform recent DM-based approaches in the majority of experimental scenarios.

Abstract:
Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively high entropy throughout the training process. Building on this insight, we introduce a signed entropy integral (SEI) statistic that captures both the magnitude and temporal trend of prediction entropy across training epochs. SEI is broadly applicable to classification networks and demonstrates particular effectiveness when integrated with contrastive languageimage pretraining (CLIP) architectures. Through extensive experiments on three medical imaging datasets---a domain particularly susceptible to labeling errors due to diagnostic complexity---spanning diverse modalities and pathologies, we demonstrate that SEI achieves state-of-the-art performance in mislabeled data identification, outperforming existing methods while maintaining computational efficiency and implementation simplicity.

Abstract:
Network data are ubiquitous across the social sciences, biology, and information systems. Generating realistic synthetic network data has broad applications from network simulation to scientific discovery. However, many existing blackbox approaches for network generation tend to overfit observed data while overlooking characteristic network structure, and incur substantial computational overhead at scale. These practical challenges call for synthetic network generation methods that are both efficient and capable of capturing structural properties of networks. In this paper, we introduce Synthetic Network Generation via Latent Embedding Reconstruction (SyNGLER), a general and efficient framework for synthetic network generation that builds on latent space network models. Given an observed network, SyNGLER first learns low-dimensional latent node embeddings via a latent space network model and then reconstructs the latent space by building a distribution-free generator over these embeddings. For generation, SyNGLER first samples (or resamples) node embeddings from the generator in the latent space and then produces synthetic networks using the latent space network model. Through the latent space framework, SyNGLER preserves unique characteristics in networks such as sparsity and node degree heterogeneity, while allowing for efficient training with lower computational cost than many existing deep architectures. We provide theoretical guarantees by developing consistency results regarding the distance between the true and synthetic edge distributions. Empirical studies further demonstrate the effectiveness of SyNGLER, where SyNGLER efficiently produces networks that better preserve key network characteristics such as network moments and degree distributions compared with existing approaches.

Abstract:
Listwise reranking is a critical yet costly component in visioncentric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. Although recent VLM-based rerankers achieve strong accuracy, they are often impractical due to long visual-token inputs and autoregressive decoding, resulting in high latency. We propose ZipRerank, a very efficient listwise multimodal reranker that directly addresses both bottlenecks: it shortens the input via query-image early interaction and eliminates multi-step generation by scoring all candidates in a single forward pass. ZipRerank is trained with a two-stage recipe: listwise pretraining on large-scale text reranking data rendered as images, followed by multimodal finetuning with VLM-teacher supervision and a soft-ranking objective to handle noisy rankings. Extensive experiments on the MMDocIR benchmark demonstrate that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. Source code is available at https://anonymous.4open.science/r/ZipRerank.

Abstract:
Driven by rapid advances in largescale generative models, synthetic data has emerged as a promising solution for visual understanding. While modern diffusion models achieve remarkable photorealistic image synthesis, their potential in complex visual segmentation tasks remains underexplored. In this work, we conduct a systematic analysis of synthetic images from state-of-the-art diffusion models to uncover the factors governing their utility. In particular, synthetic images characterized by dense scene composition and fine instance fidelity demonstrate distinctive benefits, yielding significantly more discriminative spatial representations. Building on these insights, we propose SENSE, a unified framework that leverages flexible and scalable synthetic data to substantially enhance segmentation performance. Notably, SENSE is model-agnostic, compatible with diverse architectures (e.g., DPT and Mask2Former), and scales effectively across models with varying parameter capacities. Extensive experiments on Cityscapes, COCO, and ADE20K validate the effectiveness and generalization capability of our approach. Code will be made publicly available.

Abstract:
The rapid rise of generative AI has made multimodal fake news increasingly realistic and pervasive, posing severe threats to public trust and social stability. Existing detection methods rely heavily on manipulationspecific models and large-scale labeled data, resulting in poor generalization to emerging manipulation types. We observed that the essence of manipulated misinformation lies in its intrinsic conflicts, i.e., semantic or physical inconsistencies either across modalities or with common world knowledge. Inspired by this observation, we propose Conflict-Oriented REasoning (CORE) framework, an effective paradigm that learns to endows multimodal large language models (MLLMs) with explicit conflict-capturing capability. To this end, CORE first constructs the Conflict Attribution Corpus (CAC) with fine-grained annotations of conflict factors and sources, providing essential data support for subsequent conflict perception training. By performing conflict-oriented representation enhancement and reasoning based on CAC, CORE achieves robust and generalizable conflict detection, effectively and rapidly adapting to unseen manipulation types with a few samples or in even zero-shot settings. Extensive experiments demonstrate that CORE surpasses state-of-the-art models by 9.7%, 14.1%, and 11.8% in accuracy on the DGM^4, MMFakeBench, and MDSM benchmarks, respectively.

Abstract:
Mixtureof-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation.

Abstract:
Multimodel learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-stakes domains like healthcare and industry, remains underexplored. In this paper, we introduce VT-Bench, the first unified benchmark for standardizing vision-tabular discriminative prediction and generative reasoning tasks. VT-Bench aggregates 14 datasets across 9 domains (medical-centric, while covering pets, media, and transportation) with over 756K samples. We evaluate 21 representative models, including unimodal experts, specialized visual-tabular models, and general-purpose vision-language models (VLMs), highlighting substantial challenges of visual-tabular learning. We believe VT-Bench will stimulate the community to build more powerful multi-modal vision-tabular foundation models. Benchmark: \urlhttps://anonymous.4open.science/r/VT-Bench-13C2

Abstract:
Training autonomous web agents is fundamentally limited by the environments they learn from: realworld websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites into fully executable, verifiable synthetic environments. By exposing controlled internal access via a Python SDK, VeriEnv enables agents to self-generate tasks with deterministic, programmatically verifiable rewards, eliminating reliance on heuristic or LLM-based judges. This design decouples agent learning from unsafe real-world interaction while enabling scalable self-evolution through environment expansion. Through experiments on web agent benchmarks, we show that agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and benefit from scaling the number of training environments.

Abstract:
Multitoken prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training but shows inconsistent improvements, underperforming in standard NLP benchmarks. We found MTP's exact future token prediction to be too difficult as an auxiliary loss. Instead, we propose token order prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer compared to MTP's multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using NTP, MTP, DeepSeek MTP (DS-MTP) and TOP objectives. The results of nine standard NLP benchmarks show that TOP overall outperforms NTP, MTP, and DS-MTP even at scale. TOP models with continued training on math and code also perform better on 4 relevant benchmarks. On the synthetic star graph task, TOP enables pathfinding on graphs where NTP, MTP, and DS-MTP fail.

Abstract:
Informal mathematics has been central to modern large language model (LLM) reasoning, offering flexibility and efficient construction of arguments. However, purely informal reasoning is prone to logical gaps and subtle errors that are difficult to detect and correct. In contrast, formal theorem proving provides rigorous, verifiable mathematical reasoning, where each inference step is checked by a trusted compiler, but lacks the exploratory freedom of informal problemsolving. This mismatch leaves current LLM-based math agents without a principled way to combine the strengths of both paradigms. In this work, we introduce Hermes, the first tool-assisted agent that explicitly interleaves informal reasoning with formally verified proofs in Lean. The framework performs intermediate formal checking to prevent reasoning drift and a memory module for proof continuity across multi-step reasoning chains, enabling both exploration and verification. We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales, from small models to state-of-the-art systems. Across all settings, Hermes reliably improves the reasoning accuracy of base models while substantially reducing reasoning token usage and computational cost compared to reward-based approaches. On difficult datasets such as AIME and HARDMath2, Hermes@1 achieves up to a 40% accuracy improvement while using 80% fewer total inference FLOPs. When scaled at test time, Hermes@5 boosts accuracy further by 20%.

Abstract:
In this work, we studyHumanObject Interaction Video Generation (HOIVG), which aims to synthesize high-quality HOI videos via text, reference image, audio, and pose conditions. To address the challenges of harmonious multimodal injection and heterogeneous data utility, we presentOmniShow, an end-to-end framework tailored for HOIVG. We introduceUnified Channel-wise Conditioningto efficiently inject image and pose cues,Gated Local-Context Attentionto ensure precise audio-visual synchronization, and aDecoupled-then-Joint Training strategyto effectively harness heterogeneous data. Extensive experiments on the proposedHOIVG-Benchdemonstrate that OmniShow achieves state-of-the-art performance.

Abstract:
Autonomous data science on the structured data has been a longstanding challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflowbased data agents have shown promising results on specific data tasks but remain fundamentally limited in achieving full autonomy due to their reliance on predefined workflows. In this paper, we introduce DeepAnalyze, the first agentic LLM for autonomous data science, capable of automatically completing the end-to-end data science from structured data to analyst-grade research reports. To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. Accordingly, we contribute a data-grounded trajectory synthesis framework to constructs high-quality data science training data. Through training in real-world environment, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering to open-ended data research. Experiments on 13 benchmarks demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on most advanced proprietary LLMs.

Abstract:
Large language models (LLMs) are increasingly applied in diverse realworld applications, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These specifications, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we introduce SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several Test-Time Deliberation (TTD) methods, including Self-Refine, TPO, and MoreThink, show that SpecBench effectively reveals alignment gaps and that test-time deliberation improves specification alignment. Based on previous TTD methods, we further propose Align3, a lightweight method with hierarchical reflection and revision to reason over specification boundaries, advancing the safety-helpfulness trade-off frontier with minimal overhead. These results highlight test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries. Code and data are available at supplementary material.

Abstract:
Finegrained cross-modal alignment is pivotal for multimodal reasoning yet remains limited by Semantic Sparsity Bias—a fundamental asymmetry where dense visual signals are under-represented by sparse textual captions. This disparity leads to the inadvertent suppression of contextually vital visual regions (patch redundancy) and hinders precise concept grounding (patch ambiguity). While Multimodal Large Language Models (MLLMs) offer rich descriptive capabilities, their naive integration often induces semantic drift due to inconsistencies with sparse ground-truth captions. To systematically resolve these challenges, we present the Semantic-Enhanced Patch Slimming (SEPS) framework. Central to SEPS is a novel Dual-Granularity Semantic Calibration mechanism, which synthesizes a Holistic Visual-Linguistic Anchor from MLLMs to synergize with original sparse queries. This mechanism transforms patch selection into a semantic consensus process, ensuring that retained patches satisfy both local discriminability and global contextual integrity. Furthermore, we propose a Salience-Guided Metric Aggregation strategy to mitigate the similarity dilution effect inherent in global mean pooling, thereby amplifying highly-relevant patch-word correspondences. Extensive experiments on Flickr30K and MS-COCO datasets demonstrate that SEPS surpasses existing state-of-the-art approaches across diverse backbones, delivering significant performance gains in text-to-image retrieval tasks. The complete implementation is available at https://anonymous.4open.science/r/SEPS/.

Abstract:
Rotary Position Embedding (RoPE) is widely adopted in large language models, but when applied to visionlanguage models (VLMs) it couples text and image position indices and can introduce spurious cross-modal relative-position bias. We propose Per-Token Distance (PTD) to quantify cross-modal positional disentanglement, and we prove that \mathrmPTD=0 is a sufficient condition to eliminate the geometric attention bias induced by RoPE. Guided by this criterion, we introduce Circle-RoPE, which remaps 2D image-token coordinates onto an annulus orthogonal to the text position axis, yielding a cone-like geometry where each text token is equidistant to all image tokens while preserving intra-image spatial structure. We further propose Alternating Geometry Encoding (AGE) to improve robustness by alternating RoPE variants across layers, and experiments on diverse VLM backbones and multimodal benchmarks show consistent gains in spatial grounding and visual reasoning.

Abstract:
Human vision is foveated, with variable resolution peaking at the center of a large field of view; this reflects an efficient tradeoff for active sensing, allowing eye-movements to bring different parts of the world into focus with other parts of the world in context. In contrast, most computer vision systems encode the visual world at a uniform resolution, raising challenges for processing full-field high-resolution images efficiently. We propose a foveated vision interface (FOVI) based on the human retina and primary visual cortex, that reformats a variable-resolution retina-like sensor array into a uniformly dense, V1-like sensor manifold. Receptive fields are defined as k-nearest-neighborhoods (kNNs) on the sensor manifold, enabling kNN-convolution via a novel kernel mapping technique. We demonstrate two use cases: (1) an end-to-end kNN-convolutional architecture, and (2) a foveated adaptation of the foundational DINOv3 ViT model, leveraging low-rank adaptation (LoRA). These models provide competitive performance at a fraction of the computational cost of non-foveated baselines, opening pathways for efficient and scalable active sensing for high-resolution egocentric vision.

Abstract:
Zeroshot out-of-vocabulary detection (ZS-OOVD) aims to accurately recognize objects of in-vocabulary (IV) categories provided at zero-shot inference, while simultaneously rejecting undefined ones (out-of-vocabulary, OOV) that lack corresponding category prompts. However, previous methods are prone to overfitting the IV classes, leading to the OOV or undefined classes being misclassified as IV ones with a high confidence score. To address this issue, this paper proposes a zero-shot OOV detector (OOVDet), a novel framework that effectively detects predefined classes while reliably rejecting undefined ones in zero-shot scenes. Specifically, due to the model’s lack of prior knowledge about the distribution of OOV data, we synthesize region-level OOV prompts by sampling from the low-likelihood regions of the class-conditional Gaussian distributions in the hidden space, motivated by the assumption that unknown semantics are more likely to emerge in low-density areas of the latent space. For OOV images, we further propose a Dirichlet-based gradient attribution mechanism to mine pseudo-OOV image samples, where the attribution gradients are interpreted as Dirichlet evidence to estimate prediction uncertainty, and samples with high uncertainty are selected as pseudo-OOV images. Building on these synthesized OOV prompts and pseudo-OOV images, we construct the OOV decision boundary through a low-density prior constraint, which regularizes the optimization of OOV classes using Gaussian kernel density estimation in accordance with the above assumption. Experimental results show that our method significantly improves the OOV detection performance in zero-shot scenes. Code will be available.

Abstract:
The default paradigm of posttraining text-to-image generators includes post-hoc selection of generated images, and subsequent training with one reward model to align the generator to the reward, typically user preference. This discards informative data as well as optimizes only for a single reward, hence harming diversity, semantic fidelity and efficiency. Instead, we propose MIRO, a method that conditions the model on multiple rewards during training, thus letting the model learn user preferences directly. MIRO pre-training both improves the visual quality of the generated images and speeds up the training, achieving state of the art on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).

Abstract:
Allin-one image restoration aims to address diverse degradation types using a single unified model. Existing methods typically rely on degradation priors to guide restoration, yet often struggle to reconstruct content in severely degraded regions. Although recent works leverage semantic information to facilitate content generation, integrating it into the shallow layers of diffusion models often disrupts spatial structures (e.g., blurring artifacts). To address this issue, we propose a Triple-Prior Guided Diffusion (TPGDiff) network for unified image restoration. TPGDiff incorporates degradation priors throughout the diffusion trajectory, while introducing structural priors into shallow layers and semantic priors into deep layers, enabling hierarchical and complementary prior guidance for image reconstruction. Specifically, we leverage multi-source structural cues as structural priors to capture fine-grained details and guide shallow layers representations. To complement this design, we further develop a distillation-driven semantic extractor that yields robust semantic priors, ensuring reliable high-level guidance at deep layers even under severe degradations. Furthermore, a degradation extractor is employed to learn degradation-aware priors, enabling stage-adaptive control of the diffusion process across all timesteps. Extensive experiments on both single- and multi-degradation benchmarks demonstrate that TPGDiff achieves superior performance and generalization across diverse restoration scenarios.

Abstract:
Relational Databases (RDBs) are the backbone of modern business, yet they have missed the Foundation Model revolution. Unlike text or images, highquality RDB data is private and scarce, rendering the standard approach of ``pre-training on the internet'' infeasible. Consequently, existing solutions typically rely on limited real-world datasets, requiring costly fine-tuning to achieve viable performance. To overcome this data scarcity, we introduce RDB-PFN, the first foundation model for databases trained purely on synthetic data. Drawing inspiration from Prior-Data Fitted Networks (PFNs) where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on i.i.d. single tables, we construct a novel Relational Prior Generator to create an infinite stream of random, complex, and diverse database schemas from scratch. By pre-training on a large-scale curriculum of over 2 million synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database instantly via genuine In-Context Learning. Experiments demonstrate that RDB-PFN outperforms both fine-tuned Graph Foundation Models and state-of-the-art Single-Table Foundation Models on real-world benchmarks. Notably, these results are achieved using a naive model architecture, proving that a rigorously defined synthetic generator is all you need to solve relational reasoning.

Abstract:
In many multiagent systems, agents interact repeatedly and are expected to settle into equilibrium behavior over time. Yet in practice, behavior often drifts, and detecting such deviations in real time remains an open challenge. We introduce a sequential testing framework that monitors whether observed play in repeated games is consistent with equilibrium, without assuming a fixed sample size. Our approach builds on the e-value framework for safe anytime-valid inference: by “betting” against equilibrium, we construct a test supermartingale that accumulates evidence whenever observed payoffs systematically violate equilibrium conditions. This yields a statistically sound, interpretable measure of departure from equilibrium that can be monitored online. We also leverage Benjamini-Hochberg-type procedures to increase detection power in large games while rigorously controlling the false discovery rate. Our framework unifies the treatment of Nash, correlated, and coarse correlated equilibria, offering finite-time guarantees and a detailed analysis of detection times. Moreover, we extend our method to stochastic games, broadening its applicability beyond repeated-play settings.

Abstract:
Finetuning large language models (LLMs) for downstream tasks is an essential stage of modern AI deployment. Reinforcement learning (RL) has emerged as the dominant fine-tuning paradigm, underpinning many state-of-the-art LLMs. In contrast, evolution strategies (ES) has largely been overlooked due to the widespread belief that it does not scale to modern model sizes. This paper overturns this assumption by demonstrating the first successful application of ES to full-parameter fine-tuning of LLMs at the billion-parameter scale, without dimensionality reduction. ES can indeed search over extremely high-dimensional parameter spaces and outperform established RL implementations across multiple axes, including improved tolerance to long-horizon and delayed rewards, robustness across diverse base LLMs, reduced susceptibility to reward hacking, and improved training stability. These findings suggest that ES is not merely a viable alternative to RL, but a fundamentally different and powerful backpropagation-free post-training paradigm that opens a new direction for LLM fine-tuning beyond current RL-based approaches.

Abstract:
Diffusion Transformers (DiTs) have emerged as a powerful backbone for image generation, offering superior scalability over UNets. However, their practical deployment is hindered by significant computational costs. While Quantization-Aware Training (QAT) shows promise, its application to DiTs is challenged by the high sensitivity and complex distributions of activations. Identifying activation quantization as the primary bottleneck for low-bit settings, we proposeRobuQ, a systematic QAT framework. We first establish a strong ternary weight (W1.58A4) baseline. Building on this, we introduceRobustQuantizer, which utilizes the Hadamard transform to convert unknown per-token distributions into normal distributions. Furthermore, we proposeAMPN, the firstActivation-onlyMixed-PrecisionNetwork pipeline, applying ternary weights globally while allocating layer-specific activation precisions to eliminate information bottlenecks. Extensive experiments demonstrate thatRobuQachieves state-of-the-art performance onImageNet-1K, representing the first stable image generation with activations quantized to an average of 2 bits.

Abstract:
Improving reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for realworld applications. Looped transformers address this by performing multiple latent iterations to refine each token beyond a single forward pass. However, we identify a latent overthinking phenomenon: most token predictions are already correct after the first pass, but are sometimes revised into errors in later iterations. In this work, we ask whether selectively skipping latent iterations may improve accuracy. We reveal significant potential with an oracle iteration policy that boosts model performance by up to 7.3%. Motivated by this, we propose Think-at-Hard (TaH), a looped transformer optimized for selective iteration. TaH employs a lightweight neural decider to trigger latent iteration only at tokens that are likely incorrect after the standard forward pass. During latent iterations, depth-aware Low-Rank Adaptation (LoRA) modules shift the LLM's objective from general next-token prediction to focused hard-token refinement. A duo-causal attention mechanism extends attention from the token sequence dimension to an additional iteration depth dimension, enabling cross-iteration information flow with full sequential parallelism. Experiments on nine benchmarks show consistent gains across math, QA, and coding tasks. With identical parameter counts, TaH outperforms always-iterate baselines by 3.8-4.4% while skipping iterations on 93% of tokens, and exceeds single-iteration Qwen3 baselines by 3.0-3.8%. When allowing <3% more parameters from LoRA and decider modules, the gains further increase to 5.3-6.2% and 6.1-6.8%, respectively.

Abstract:
Textto-image generation models have advanced rapidly, yet achieving fine-grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training-free method in FLUX based solely on closed-form latent-space manipulation.

Abstract:
Graph neural network (GNN) potentials such as SchNet improve the accuracy and transferability of molecular dynamics (MD) simulation by learning manybody interactions, but remain slower than classical force fields due to fragmented kernels and memory-bound pipelines that underutilize GPUs. We show that a missing principle is making GNN-MD IO-aware, carefully accounting for reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. We present \textttFlashSchNet, an efficient and accurate IO-aware SchNet-style GNN-MD framework built on four techniques: (1) flash radial basis, which fuses pairwise distance computation, Gaussian basis expansion, and cosine envelope into a single tiled pass, computing each distance once and reusing it across all basis functions; (2) flash message passing, which fuses cutoff, neighbor gather, filter multiplication, and reduction to avoid materializing edge tensors in HBM; (3) flash aggregation, which reformulates scatter-add via CSR segment reduce, reducing atomic writes by a factor of feature dimension and enabling contention-free accumulation in both forward and backward passes; (4) channel-wise 16-bit quantization that exploits the low per-channel dynamic range in SchNet MLP weights to further improve throughput with negligible accuracy loss. On a single NVIDIA RTX PRO 6000, \textttFlashSchNet achieves 1000 ns/day aggregate simulation throughput over 64 parallel replicas on coarse-grained (CG) protein containing 269 beads (6.5 \mathbf× faster than CGSchNet baseline with 80% less peak memory), surpassing widely used classical force fields (e.g., MARTINI) while retaining SchNet-level accuracy and transferability.

Abstract:
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with finegrained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming into regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present MicroPercept, a hybrid-annotated benchmark of 845 VQA data spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our model achieves consistent gains across multiple fine-grained perception benchmarks, surpasses state-of-the-art agentic models while eliminating their inference latency, and improves out-of-distribution generalization.

Abstract:
Conventional LLMs may suffer from heterogeneous corpus and subtle condition changes. While finetuning can create the catastrophe forgetting issue, application of metalearning on LLMs is also limited due to its complexity and scalability. In this paper, we activate the meta-signal of \beta within the SwiGLU blocks, resulting a meta-gating mechanism which adaptively adjusts the nonlinearity of FFN. A hypernetwork is employed which dynamically produces \beta on textual conditions, providing meta-controllability on LLMs. By testing on different condition types such as task, domain, persona, and style, our method outperforms finetuning and meta-learning baselines, and can generalize reasonable on unseen task, condition type or instructions. Our code can be found https://anonymous.4open.science/r/MeGan-CAC0.

Abstract:
Generationtime text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore post-hoc watermarking where an LLM rewrites existing text while applying generation-time watermarking, to protect copyrighted documents, or detect their use in training or RAG via watermark radioactivity. Unlike generation-time approaches which are constrained by how LLMs are served, this setting offers additional degrees of freedom for both generation and detection. We thus investigate how allocating compute (through larger rephrasing models, beam search, multi-candidate generation, or entropy filtering at detection) affects the quality-detectability trade-off. Among our findings, the simple Gumbel-max scheme surprisingly outperforms more recent alternatives under nucleus sampling, and achieves strong detectability and semantic fidelity on open-ended text such as books. Moreover, most methods benefit significantly from beam search, and we counterintuitively find that smaller models outperform larger ones. However, our solutions struggles when watermarking verifiable text such as code. This study reveals both the potential and limitations of post-hoc watermarking, laying groundwork for practical applications and future research.

Abstract:
We present Layered Ray Intersections (LaRI), a fully supervised method for occluded geometry reasoning from a single image. Unlike conventional depth estimation, which is limited to visible surfaces, LaRI predicts multiple surfaces intersected by the camera rays using layered point maps. Compared to the existing approaches that leverage neural implicit representations or iterative refinement, LaRI achieves complete scene reconstruction in one feedforward pass, enabling efficient and view-aligned geometric reasoning to underpin both object-level and scene-level tasks. We further propose to predict the ray stopping index, which identifies valid intersecting pixels and layers from LaRI's output. To better underpin and evaluate this task, we build an annotation pipeline using rendering engines, construct annotations for five public datasets, including synthetic and real-world data covering 3D objects and scenes. As a generic method, LaRI's performance is validated in object-level and scene-level reconstruction tasks.

Abstract:
Flow matching models (FMs) have revolutionized textto-image (T2I) generation, with reinforcement learning (RL) serving as a critical post-training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet important limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training formulations and exhibit dramatic performance collapse when evaluated on semantically equivalent but stylistically varied prompts. We present PromptRL (Prompt Matters in RL for Flow-Based Image Generation), a framework that incorporates language models (LMs) as trainable prompt refinement agents directly within the flow-based RL optimization loop. This design yields two complementary benefits: rapid development of sophisticated prompt rewriting capabilities and, critically, a synergistic training regime that reshapes the optimization dynamics. PromptRL achieves state-of-the-art performance across multiple benchmarks, obtaining scores of 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. Furthermore, we validate the effectiveness of our RL approach on large-scale image editing models, improving the EditReward of FLUX.1-Kontext from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image (also known as Nano Banana), which scores 1.37, and achieving comparable performance with ReasonNet (1.44), which relied on fine-grained data annotations along with a complex multi-stage training. Our extensive experiments empirically demonstrate that PromptRL consistently achieves higher performance ceilings while requiring over 2× fewer rollouts compared to naive flow-only RL.

Abstract:
LLMs hold great promise for healthcare applications, but fastchanging medical knowledge can quickly make their outputs outdated or inaccurate, limiting use in high-stakes settings. Model editing can update LLMs without full retraining, but parameter-based methods often break locality and are risky in medicine, making retrieval-based editing a better fit. However, applying model editing methods to the medical domain has two key challenges: (1) retrieval-based methods suffer from representation overlap within the medical knowledge space that causes inaccurate retrieval and reduces editing accuracy; (2) existing medical editing methods are restricted to single-sample edits, while batch-editing remains largely unexplored despite its importance for real-world applications. To address these challenges, we construct MedVersa, an expanded benchmark that evaluates single and batch edits across broader medical coverage under strict locality constraints. We then propose MedREK, a retrieval-based editing framework that integrates a shared query–key module for precise matching with an attention-based prompt encoder for informative guidance. Experiments across various medical benchmarks show that our MedREK consistently improves key metrics and provides the first validated solution for batch editing in medical LLMs.

Abstract:
We aim to develop a multimodal research agent capable of explicit reasoning and planning, multitool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first proposeHyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling to generate search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduceDR-TTS, which first decomposes search-involved tasks into several categories according to search tool types, and respectively optimize specialized search tool experts for each tool. It then recomposes tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without using costly online search APIs. With the three designs, we developMM-DeepResearch, a powerful multimodal deep research agent, and extensive results shows its superiority across benchmarks.

Abstract:
Large language models (LLMs) exhibit a wide range of capabilities, including mathematical reasoning, code generation, and linguistic behaviors. We show that many capabilities are highly localized to small subsets of attention heads within Transformer architectures. Zeroing out as few as five taskspecific heads can degrade performance by up to 65% on standard benchmarks measuring the capability of interest, while largely preserving performance on unrelated tasks. We introduce a compressed sensing based method that exploits the sparsity of these heads to identify them via strategic knockouts and a small number of model evaluations. We validate these findings across Llama and Qwen models ranging from 1B to 8B parameters and a diverse set of capabilities including mathematical abilities and code generation, revealing a modular organization in which specialized capabilities are implemented by sparse, functionally distinct components. Overall, our results suggest that capability localization is a general organizational principle of Transformer language models, with implications for interpretability, model editing, and AI safety.

Abstract:
Model merging combines multiple finetuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (\ie, overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%.

Abstract:
Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls. While powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce effGen, an opensource agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment. effGen makes four major contributions: (1) Enhanced tool-calling with prompt optimization that compresses contexts by 70-80% while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity-based routing using five factors to make smart pre-execution decisions, and (4) Unified memory system combining short-term, long-term, and vector-based storage. Additionally, effGen unifies multiple agent protocols (MCP, A2A, ACP) for cross-protocol communication. Results on 13 benchmarks show effGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined.

Abstract:
Large Language Model (LLM)based Multi-Agent Systems (MAS) have become a popular paradigm of AI applications. However, trustworthiness issues in MAS remain a critical concern. Unlike challenges in single-agent systems, MAS involve more complex communication processes, making them susceptible to corruption attacks. To mitigate this issue, several defense mechanisms have been developed based on the graph representation of MAS, where agents represent nodes and communications form edges. Nevertheless, these methods predominantly focus on static graph defense, attempting to either detect attacks in a fixed graph structure or optimize a static topology with certain defensive capabilities. To address this limitation, we propose a dynamic defense paradigm for MAS graph structures, which continuously monitors communication within the MAS graph, then dynamically adjusts the graph topology, accurately disrupts malicious communications, and effectively defends against evolving and diverse dynamic attacks. Experimental results in increasingly complex and dynamic MAS environments demonstrate that our method significantly outperforms existing MAS defense mechanisms, contributing an effective guardrail for their trustworthy applications.

Abstract:
Model merging combines independently trained models into a single multitask model. However, most existing approaches focus primarily on avoiding task interference. We argue that its greater potential lies in enabling task synergy, where tasks actively improve one another. We identify cross-task performance, defined by compatibility between encoders and predictors across tasks, as a key indicator of merge quality. We demonstrate that adapting only a single task-specific layer is sufficient to induce such synergy. This study proposes SyMerge, a lightweight framework that jointly optimizes merging coefficients and a single task-specific layer. We adopt an expert-guided self-labeling objective, providing stable supervision beyond entropy minimization. Intriguingly, we further show that SyMerge successfully merges models trained from different initializations, a regime where standard methods break down. Our minimalist yet principled method achieves state-of-the-art results across vision, dense prediction, and NLP benchmarks.

Abstract:
Dictionary learning has recently emerged as a promising approach for mechanistic interpretability of large transformer models. Disentangling highdimensional transformer embeddings requires algorithms that scale to high-dimensional data with large sample sizes. Recent work has explored sparse autoencoders (SAEs) for this problem. However, SAEs use a simple linear encoder to solve the sparse encoding subproblem, which is known to be NP-hard. It is therefore interesting to understand whether this approach is sufficient to find good solutions to the dictionary learning problem or if a more sophisticated algorithm could find better solutions. In this work, we propose Double-Batch KSVD (DB-KSVD), a scalable dictionary learning algorithm that adapts the classic KSVD algorithm. DB-KSVD is informed by the rich theoretical foundations of KSVD but scales to datasets with millions of samples and thousands of dimensions. We demonstrate the efficacy of DB-KSVD by disentangling text embeddings of the Gemma-2-2B and Pythia-160M models and evaluating on six metrics from the SAEBench benchmark, where we achieve competitive results when compared to established approaches based on SAEs. We further show similar results when disentangling image embeddings obtained from the DINOv2-S and DINOv2-B models, solidifying our findings. By matching SAE performance with an entirely different optimization approach, our results suggest that (i) SAEs do find strong solutions to the dictionary learning problem and (ii) traditional optimization approaches can be scaled to the required problem sizes, offering a promising avenue for further research. We make an implementation of DB-KSVD available.

Abstract:
Scalable Vector Graphics (SVG) animation generation is pivotal for professional design due to their structural editability and resolution independence. However, this task remains challenging as it requires bridging discrete code representations with continuous visual dynamics. Existing optimizationbased methods often destroy topological consistency, while general-purpose LLMs rely on rigid CSS/SMIL transformations, failing to model geometry-level non-rigid deformations. To address these limitations, we present VAnim, the first LLM-based framework for open-domain text-to-SVG animation. We reconceptualize animation not as sequence generation, but as Sparse State Updates (SSU) on a persistent SVG DOM tree. This paradigm compresses sequence length by over 9.8x while mathematically guaranteeing topological isomorphism and identity persistence. To enable precise control, we propose an Identification-First Motion Planning mechanism that grounds textual instructions in explicit visual entities. Furthermore, to overcome the non-differentiable nature of SVG rendering, we employ Rendering-Aware Reinforcement Learning via Group Relative Policy Optimization (GRPO). By leveraging a hybrid reward from a state-of-the-art video perception encoder, we align discrete code updates with high-fidelity visual feedback. We also introduce SVGAnim-134k, the first benchmark for vector animation. Extensive experiments demonstrate that VAnim significantly outperforms state-of-the-art baselines in semantic alignment and structural validity, demonstrating a robust capacity for synthesizing high-fidelity non-rigid deformations without requiring explicit temporal consistency constraints.

Abstract:
Recent advancements in large multimodal models (LMMs) have driven substantial progress in both textto-video (T2V) generation and video-to-text (V2T) interpretation tasks. However, current AI-generated videos (AIGVs) still exhibit limitations in terms of perceptual quality and text-video alignment. To this end, we presentAIGVE-60K, a comprehensive dataset and benchmark for AI-Generated Video Evaluation, which features(i)comprehensive tasks, encompassing 3,050 extensive prompts across 20 fine-grained task dimensions,(ii)the largest human annotations, including 120K mean-opinion scores (MOSs) and 60K question-answering (QA) pairs annotated on 58,500 videos generated from 30 T2V models, and(iii)bidirectional benchmarking and evaluating for both T2V generation and V2T interpretation capabilities. Based on AIGVE-60K, we proposeLOVE, a LMM-based metric for AIGV Evaluation from multiple dimensions including perceptual preference, text-video correspondence, and task-specific accuracy. Building upon LOVE, we further introduceLOVE-Rewardto optimize T2V models through reinforcement learning, effectively enhancing both the perceptual quality and text-video correspondence of generated videos. Comprehensive experiments demonstrate that LOVE achieves state-of-the-art performance and generalizes effectively to various AIGV benchmarks. LOVE-Reward significantly improves video generation quality. These findings highlight the significance of the AIGVE-60K dataset and the effectiveness of our proposed methods. The database and codes will be available upon publication.

Abstract:
Multitoken generation has emerged as a promising paradigm for accelerating language model inference, with the diffusion Large Language Models (dLLMs) as the most notable approach recently. Popular dLLMs like SDAR and Fast-dLLM v2 are post-trained on pre-trained AR models to minimize training cost while maintaining high performance. However, there exists a fundamental pretrain-to-posttrain mismatch -- the masked data distribution and bidirectional attention in post-training deviates significantly from the real data distribution and causal attention for pretraining. As a result, the post-trained dLLMs usually suffer from limited speedup or substantially degraded performance. To address this, we introduce Jacobi Forcing to bypass the dLLM formulation, directly post-training a causal multi-token predictor from an AR LLM. In particular, we force the model to learn to leap along its own parallel token generation trajectories based on Jacobi Decoding, and introduce an elaborate progressive distillation paradigm. The trained models achieve 3.8× wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on the trajectory characteristics of the model, we further introduce multi-block decoding with rejection recycling, which enables up to 4.6× higher token acceptance count per iteration and 4.0× wall-clock speedup, effectively trading additional compute for lower inference latency.

Abstract:
Supervised finetuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This facilitates the rise of data curation in SFT, which prioritizes the most valuable data to optimze. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning.

Abstract:
Longterm memory is essential for large language model (LLM) agents operating in complex environments, yet existing memory designs are either task-specific and non-transferable, or task-agnostic but less effective due to low task-relevance and context explosion from raw memory retrieval. We propose PlugMem, a task-agnostic plugin memory module that can be attached to arbitrary LLM agents without task-specific redesign. Motivated by the fact that decision-relevant information is concentrated as abstract knowledge rather than raw experience, we draw on cognitive science to structure episodic memories into a compact, extensible knowledge-centric memory graph that explicitly represents propositional and prescriptive knowledge. This representation enables efficient memory retrieval and reasoning over task-relevant knowledge, rather than verbose raw trajectories, and departs from other graph-based methods like GraphRAG by treating knowledge as the unit of memory access and organization instead of entities or text chunks. We evaluate PlugMem unchanged across three heterogeneous benchmarks (long-horizon conversational question answering, multi-hop knowledge retrieval, and web agent tasks). The results show that PlugMem consistently outperforms task-agnostic baselines and exceeds task-specific memory designs, while also achieving the highest information density under a unified information-theoretic analysis.

Abstract:
Conditional time series generation plays a critical role in addressing data scarcity and enabling causal analysis in realworld applications. Despite its increasing importance, the field lacks a standardized and systematic benchmarking framework for evaluating generative models across diverse conditions. To address this gap, we introduce theConditionalTimeSeriesGenerationBenchmark (ConTSG-Bench). ConTSG-Bench comprises a large-scale, well-aligned dataset spanning diverse conditioning modalities and levels of semantic abstraction, first enabling systematic evaluation of representative generation methods across these dimensions with a comprehensive suite of metrics for generation fidelity and condition adherence. Both the quantitative benchmarking and in-depth analyses of conditional generation behaviors have revealed the traits and limitations of the current approaches, highlighting critical challenges and promising research directions, particularly with respect to precise structural controllability and downstream task utility under complex conditions.

Abstract:
Large Language Models (LLMs) obey consistent scaling lawsempirical power-law fits that predict how loss decreases with compute, data, and parameters. While predictive, these laws are descriptive rather than prescriptive: they characterize typical training, not optimal training. Surprisingly few works have successfully challenged the data-efficiency bounds implied by these laws---which is our primary focus. To that end, we introduce the Geodesic Hypothesis, positing that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. Building on this principle, we propose a novel Semantic Tube Prediction (STP) task, a JEPA-style regularizer that confines hidden-state trajectories to a tubular neighborhood of the geodesic. STP generalizes JEPA to language without requiring explicit multi-view augmentations. We show this constraint improves signal-to-noise ratio, and consequently preserves diversity by preventing trajectory collisions during inference. Empirically, STP allows LLMs to match baseline accuracy with 16× less training data, directly violating the data term of Chinchilla-style scaling laws and demonstrating that principled geometric priors can surpass brute-force scaling.

Abstract:
Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultrahigh resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high-frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching their frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions.

Abstract:
Effectively scaling GUI automation is essential for computeruse agents (CUAs); however, existing work primarily focuses on scaling GUI grounding rather than the more crucial GUI planning, which requires more sophisticated data collection. In reality, the exploration process of a CUA across apps/desktops/web pages typically follows a tree structure, with earlier functional entry points often being explored more frequently. In this work, we find that organizing large-scale GUI trajectories into tree structures can effectively eliminate redundant exploration costs, while each branch node also provides key reasoning evidence for distinguishing adjacent trajectories. Therefore, we propose TreeCUA to efficiently scale GUI automation with tree-structured verifiable evolution. %and naturally extends a Tree-DPO training algorithm. We propose a multi-agent collaborative framework to explore the environment, verify actions, summarize trajectories, and evaluate quality to generate high-quality and scalable GUI trajectories. To improve efficiency, we devise a novel tree-based topology to store and replay duplicate exploration nodes, and design an adaptive exploration algorithm to balance the depth (\emphi.e., trajectory difficulty) and breadth (\emphi.e., trajectory diversity). Moreover, we develop world knowledge guidance and global memory backtracking to avoid low-quality generation. Finally, we naturally extend and propose the TreeCUA-DPO method from abundant tree node information, improving GUI planning capability by referring to the branch information of adjacent trajectories. Experimental results show that TreeCUA and TreeCUA-DPO offer significant improvements, and out-of-domain (OOD) studies further demonstrate strong generalization. All trajectory node information and code will be open-sourced.

Abstract:
While 3D Gaussian Splatting (3DGS) has demonstrated impressive realtime rendering performance, its efficacy remains constrained by a reliance on heuristic density control. Despite numerous refinements to these handcrafted rules, such methods inherently lack the flexibility to adapt to diverse scenes with complex geometries. In this paper, we propose a paradigm shift for density control from rigid heuristics to fully learnable policies. Specifically, we introduce LeGS, a framework that reformulates density control as a parameterized policy network optimized via Reinforcement Learning (RL). Central to our approach is the tailored effective reward function grounded in sensitivity analysis, which precisely quantifies the marginal contribution of individual Gaussians to reconstruction quality. To maintain computational tractability, we derive a closed-form solution that reduces the complexity of reward calculation from O(N^2) to O(N). Extensive experiments on the Mip-NeRF 360, Tanks \& Temples, and Deep Blending datasets demonstrate that LeGS significantly outperforms state-of-the-art methods, striking a superior balance between reconstruction quality and efficiency.

Abstract:
Diffusion models have achieved stateof-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respect discreteness but disregard semantic relation between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence and unconditional generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex.

Abstract:
Computational modeling of singlecell gene expression is crucial for understanding cellular processes, but generating realistic expression profiles remains a major challenge. This difficulty arises from the count nature of gene expression data and complex latent dependencies among genes. Existing generative models often impose artificial gene orderings or rely on shallow neural network architectures. We introduce a scalable latent diffusion model for single-cell gene expression data, which we refer to as scLDM, that respects the fundamental exchangeability property of the data. Our VAE uses fixed-size latent variables leveraging a unified Multi-head Cross-Attention Block (MCAB) architecture, which serves dual roles: permutation-invariant pooling in the encoder and permutation-equivariant unpooling in the decoder. We enhance this framework by replacing the Gaussian prior with a latent diffusion model using Diffusion Transformers and linear interpolants, enabling high-quality generation with multi-conditional classifier-free guidance. We show its superior performance in a variety of experiments for both observational and perturbational single-cell data, as well as downstream tasks like cell-level classification.

Abstract:
Modern computeruse agents (CUA) must perceive a screen as a structured state, what elements are visible, where they are, and what text they contain, before they can reliably ground instructions and act. Yet, most available grounding datasets provide sparse supervision, withinsufficientandlow-diversitylabels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduceScreenParse, a large-scale dataset forcompletescreen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated byWebshot, an automated, scalable pipeline that renders diverse urls, extracts annotations and applies VLM-based relabeling and quality filtering. Using ScreenParse, we trainScreenVLM, a compact, 316M-parameter vision language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. The dataset, model and code will be made publicly available.

Abstract:
Large language model (LLM) agents are increasingly deployed in personalized tasks involving sensitive, contextdependent information, where privacy violations may arise in agents' action due to the implicitness of contextual privacy. Existing approaches rely onexternal, inference-time interventions which are brittle, scenario-specific, and may expand the privacy attack surface. We proposePrivAct, a contextual privacy-aware multi-agent learning framework thatinternalizescontextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions. By embedding privacy preferences into each agent, PrivAct enhances system-wide contextual integrity while achieving a more favorable privacy-helpfulness tradeoff. Experiments across multiple LLM backbones and benchmarks demonstrate consistent improvements in contextual privacy preservation, reducing leakage rates by up to 12.32% while maintaining comparable helpfulness, as well as zero-shot generalization and robustness across diverse multi-agent topologies. The code and datasets will be released at [URL/upon acceptance].

Abstract:
Analyzing largescale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors. Current methods often rely on costly LLM-based techniques (e.g. annotating dataset differences) or dense embedding models (e.g. for clustering), which lack control over the properties of interest. We propose using sparse autoencoders (SAEs) to createSAE embeddings: representations whose dimensions map to interpretable concepts. Through four data analysis tasks, we show that SAE embeddings are more cost-effective and reliable than LLMs and offer the controllability that dense embeddings lack. Using the large hypothesis space of SAEs, we can uncover insights such as (1) semantic differences between datasets and (2) unexpected concept correlations in documents. For instance, by comparing model responses, we find that Grok-4 clarifies ambiguities more often than nine other frontier models. Relative to LLMs, SAE embeddings uncover bigger differences at 2-8x lower cost and identify biases more reliably. Additionally, SAE embeddings are controllable: by filtering concepts, we can (3) cluster documents along axes of interest and (4) outperform dense embeddings on property-based retrieval. Using SAE embeddings, we study model behavior with two case studies: investigating how OpenAI model behavior has changed over time and finding "trigger" phrases learned by Tulu-3 (Lambert et. al., 2024) from its training data. These results position SAEs as a versatile tool for unstructured data analysis and highlight the neglected importance of interpreting models through theirdata.

Abstract:
Posttraining of flow matching models—aligning the output distribution with a high-quality target—is mathematically equivalent to imitation learning. While Supervised Fine-Tuning mimics expert demonstrations effectively, it cannot correct policy drift in unseen states. Preference optimization methods address this but require costly preference pairs or reward modeling. We propose Flow Matching Adversarial Imitation Learning (FAIL), which minimizes policy-expert divergence through adversarial training without explicit rewards or pairwise comparisons. We derive two algorithms: FAIL-PD exploits differentiable ODE solvers for low-variance pathwise gradients, while FAIL-PG provides a black-box alternative for discrete or computationally constrained settings. Fine-tuning FLUX with only 13,000 demonstrations from Nano Banana pro, FAIL achieves competitive performance on prompt following and aesthetic benchmarks. Furthermore, the framework generalizes effectively to discrete image and video generation, and functions as a robust regularizer to mitigate reward hacking in reward-based optimization.

Abstract:
Recent progress in timeseries forecasting has led to rapidly increasing architectural complexity, yet many reported State-of-the-Art gains are statistically fragile or misattributed. We argue that progress requires a shift from model selection to modular attribution, identifying which components truly drive performance. We propose CombinationTS, a self-contained probabilistic evaluation framework that decomposes forecasting models into orthogonal modules—Input Transformation, Embedding, Encoder, and Decoder—and evaluates them under a shared evaluation condition space. By quantifying each component via marginalized effectiveness (\mu) and stability (\sigma^2), CombinationTS enables robust attribution beyond fragile point estimates. Through large-scale paired evaluation, we uncover the Identity Paradox: once the data view is well-designed, a parameter-free Identity encoder often matches or outperforms complex backbones. We further show that explicit structural priors introduced via input transformations yield a more favorable effectiveness–stability trade-off than increasing encoder complexity, establishing a principled baseline for architectural necessity. The code is available at https://anonymous.4open.science/r/CombinationTS.

Abstract:
Synthesizing physicsgrounded 3D assets is a critical bottleneck for interactive virtual worlds and embodied AI. Existing methods predominantly focus on static geometry, overlooking the functional properties essential for interaction. We propose that interactive asset generation must be rooted in functional logic and hierarchical physics. To bridge this gap, we introduce PhysForge, a decoupled two-stage framework supported by PhysDB, a large-scale dataset of 150,000 assets with four-tier physical annotations. First, a VLM acts as a physical architect to plan a Hierarchical Physical Blueprint defining material, functional, and kinematic constraints. Second, a physics-grounded diffusion model realizes this blueprint by synthesizing high-fidelity geometry alongside precise kinematic parameters via a novel KineVoxel Injection (KVI) mechanism. Experiments demonstrate that PhysForge produces functionally plausible, simulation-ready assets, providing a robust data engine for interactive 3D content and embodied agents.

Abstract:
Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we instruct the persuader to consider possible objections to the target claim, and train a module to predict the opponent’s agreement level on these objections. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. These results underscore ToMAP's potential for developing more persuasive language agents. We will release our code via GitHub.

Abstract:
Although recent textto-video generative models are getting more capable of following external camera controls, imposed by either text descriptions or camera trajectories, they still struggle to generalize to unconventional camera motions, which is crucial in creating truly original and artistic videos. The challenge lies in the difficulty of finding sufficient training videos with the intended uncommon camera motions. To address this challenge, we propose VividCam, a training paradigm that enables diffusion models to learn complex camera motions from synthetic videos, releasing the reliance on collecting realistic training videos. VividCam incorporates multiple disentanglement strategies that isolates camera motion learning from synthetic appearance artifacts, ensuring more robust motion representation and mitigating domain shift. We demonstrate that our design synthesizes a wide range of precisely controlled and complex camera motions using surprisingly simple synthetic data. Notably, this synthetic data often consists of basic geometries within a low-poly 3D scene and can be efficiently rendered by engines like Unity. Our video results can be found in https://anonymoususers196.github.io/VividCamDemo/ .

Abstract:
Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequencelevel likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement learning (RL), limiting their real-world applicability. Existing attempts at dLLM post-training rely on heuristic approximations or lower bounds of the true likelihood. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation, optimizing individual denoising steps rather than entire sequences. We demonstrate AGRPO's effectiveness on different math and reasoning tasks, achieving +9.9% absolute gain on GSM8K, +4.6% on MATH-500, +59.4% on Countdown, and +69.7% on Sudoku over the base LLaDA model, improving upon comparable dLLM RL methods such as diffu-GRPO. Furthermore, we analyze how post-training gains persist across different inference configurations, revealing that models trained with AGRPO can sample 4x faster with minimal performance sacrifices.

Abstract:
Medical diagnosis demands models that can process multimodal medical inputs, such as medical images and patient histories, and generate diverse outputs including textual reports and visual content, such as annotations or segmentation masks. Despite this need, existing medical AI models disrupt this unified process: image understanding models interpret images without producing visual outputs, while image generation models produce visual outputs but cannot provide textual explanations. Therefore, we propose a multilevel framework called Observation-Knowledge-Analysis (OKA) to unify the distinct processes. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs. At the knowledge level, we propose Progressive Curriculum Learning, where models simultaneously learn medical multimodal understanding and generation knowledge from UniMed-5M.At the analysis level, we introduce UniMedVL, the first medical unified multimodal model that unifies image understanding and generation within a single architecture without manually reloading model checkpoints. UniMedVL achieves superior performance on 5 medical image understanding benchmarks, while matching specialized models in generation quality across 8 medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing, improving performance on both image understanding and generation tasks. Code is available at https://anonymous.4open.science/r/Uni-MedVL-65F2/README.md.

Abstract:
Scaling verifiable training signals remains a key bottleneck for Reinforcement Learning from Verifiable Rewards (RLVR). Logical reasoning is a natural substrate: constraints are formal and answers are programmatically checkable. However, prior synthesis pipelines either depend on expertwritten code or operate within fixed templates/skeletons, which limits growth largely to instance-level perturbations. We propose SSLogic, an agentic meta-synthesis framework that scales at the task-family level by iteratively synthesizing and refining executable Generator--Validator program pairs in a closed Generate--Validate--Refine loop, enabling continuous family evolution with controllable difficulty. To ensure reliability, we introduce a Multi-Gate Validation Protocol that combines multi-strategy consistency checks with Adversarial Blind Review, where independent agents must solve instances by writing and executing code to filter ambiguous or ill-posed tasks. Starting from 400 seed families, two evolution rounds expand to 953 families and 21,389 verifiable instances (from 5,718). Training on SSLogic-evolved data yields consistent gains over the seed baseline at matched training steps, improving SynLogic by +5.2, BBEH by +1.4, AIME25 by +3.0, and Brumo25 by +3.7. Code is available at https://anonymous.4open.science/r/Scaling-the-Scaling-Logic-6F4B/.

Abstract:
Diffusion Large Language Models (DLLMs) offer a compelling alternative to AutoRegressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we proposeFOCUS—an inference system designed for DLLMs. By dynamicallyfocusingcomputation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to3.52× throughputimprovement over the production-grade engine LMDeploy, while preserving or improving generation quality across multiple benchmarks.

Abstract:
While Diffusion Language Models (DLMs) offer a flexible, arbitraryorder alternative to the autoregressive paradigm, their non-causal nature precludes standard KV caching, forcing costly hidden state recomputation at every decoding step. Existing caching approaches reduce this cost by selective hidden state updates; however, they are still limited by (i) computationally costly token-wise update identification heuristics and (ii) rigid, uniform budget allocation that fails to account for heterogeneous hidden- tate dynamics. To address these challenges, we present SPA-Cache that jointly optimizes update identification and budget allocation. First, we derive a low-dimensional singular proxy that enables the identification of update-critical tokens in a low-dimensional subspace, substantially reducing the overhead of update identification. Second, motivated by the layer-wise heterogeneity in hidden state dynamics, we introduce an adaptive strategy that allocates fewer updates to stable layers without degrading generation quality. Together, these contributions significantly improve the efficiency of DLMs, yielding up to an 8× throughput improvement over vanilla models decoding and a 2-4× speedup over existing caching baselines.

Abstract:
Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semanticgeometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence.

Abstract:
Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs by ``reasoningthen-tool-call'' for visual and textual search engines to obtain substantial gains on tasks requiring extensive factual information. However, these approaches typically define multimodal search in a naive setting, assuming that a single full-level or entity-level image query and few text query suffices to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in the reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. Building on this, we propose Vision-DeepResearch, which proposes one new multimodal deep-research paradigm, i.e., performs multi-turn, multi-entity and multi-scale visual and textual search to robustly hit real-world search engines under heavy noise. Our Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforming existing multimodal deep-research MLLMs, and workflows built on strong closed-source foundation model such as GPT-5, Gemini-2.5-pro and Claude-4-Sonnet.

Abstract:
Automating the development of machine learning algorithms has the potential to unlock new breakthroughs. However, our ability to improve and evaluate algorithm discovery systems has thus far been limited by existing task suites. They suffer from many issues, such as: poor evaluation methodologies; data contamination; and containing saturated or very similar problems. Here, we introduceDiscoGen, a procedural generator of algorithm discovery tasks for machine learning, such as developing optimisers for reinforcement learning or loss functions for image classification. Motivated by the success of procedural generation in reinforcement learning, DiscoGen spans millions of tasks of varying difficulty and complexity from a range of machine learning fields. These tasks are specified by a small number of configuration parameters and can be used to optimise algorithm discovery agents (ADAs). We presentDiscoBench, a benchmark consisting of a fixed, small subset of DiscoGen tasks for principled evaluation of ADAs. Finally, we propose a number of ambitious, impactful research directions enabled by DiscoGen, in addition to experiments demonstrating its use for prompt optimisation of an ADA. DiscoGen is released opensource.

Abstract:
Finetuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) offers a resource-efficient way to personalize or specialize. However, LoRA is highly sensitive to hyperparameter choices, and performing an exhaustive hyperparameter search remains computationally intensive. To address these challenges, we propose a framework that integrates the domain knowledge of pre-trained LLMs into the Bayesian Optimization (BO) process to efficiently search for LoRA hyperparameters. To leverage pre-trained LLMs' knowledge, our approach repurposes them as a discrete-to-continuous mapping module to link hyperparameters and their domain knowledge to a continuous vector space, where BO is conducted. We design and control the mapping via language prompting, providing a domain-aware textual prompt that describes the relationships among hyperparameters and their respective roles. This allows us to explicitly inject domain knowledge about LoRA into the LLM in natural language. We also introduce an additional learnable token to capture residual information that is difficult to describe linguistically in the prompt. This aids BO to sample more high-performing hyperparameters. In addition, by leveraging the strong correlation observed between the performance obtained from full and subset training datasets in LoRA training regimes, we introduce proxy training and evaluation using a data subset. This significantly improves the efficiency of our method. We demonstrate that our hyperparameter, discovered with only about 30 iterations, achieves more than 20% performance improvement over standard hyperparameters found from about 45,000 combinations. Code will be released.

Abstract:
Realworld deployments inevitably encounter distribution shifts, rendering the confidence estimates of deep neural networks highly unreliable, posing severe risks in safety-critical applications. Existing methods improve calibration via training-time regularization or post-hoc adjustment, but often rely on access to (or simulation of) target domains, limiting practicality. We propose Frequency-aware Gradient Rectification (FGR), a target-agnostic training framework for robust calibration. From a frequency perspective, FGR applies low-pass filtering to a subset of training images to diminish spurious high-frequency cues and encourage the learning of domain-invariant features. However, the associated information loss can degrade In-Distribution (ID) calibration. To resolve this trade-off, FGR treats ID calibration as a hard constraint and rectifies conflicting parameter updates via geometric projection. This ensures a first-order non-increase in the ID calibration objective without introducing additional weighting hyperparameters. Extensive experiments on synthetic, real-world, and semantic shift datasets demonstrate that FGR significantly improves calibration under diverse shifts while preserving ID performance, and it remains compatible with post-hoc calibration methods.

Abstract:
We present AAD1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video generation. State-of-the-art methods adopt adversarial distillation but suffer from motion collapse and training instability, resulting in static videos. AAD-1 addresses these challenges through two key designs in architecture and training strategy. Our key architectural insight is to break the symmetry between generator and discriminator. While the generator remains causal to preserve autoregressive sampling capability, the discriminator attends bidirectionally over the full spatiotemporal context and produces a single holistic realism score for the entire video sequence. This asymmetric design enables the discriminator to effectively detect global temporal failures and long-range drift that cause motion collapse in autoregressive generation. To stabilize training, we introduce a phased strategy that first uses distribution matching to bootstrap a stable one-step generator, providing a warm-up phase that brings the student distribution closer to the teacher before adversarial distillation begins. Extensive experiments on VBench demonstrate that AAD-1 achieves state-of-the-art performance in one-step autoregressive video generation.

Abstract:
Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on selfattention, particularly the large KV cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. Our analysis reveals that, beyond attention weights, the value states within KV entries and pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. We demonstrate that our algorithm is a universal, plug-and-play enhancement that incurs negligible computational overhead. When integrated with three state-of-the-art cache eviction methods on three distinct LLMs, our algorithm significantly reduces the compression loss by more than half on average across 29 datasets from the Ruler and LongBench benchmarks. Further perturbation analysis, at both the head and layer levels, confirms the principles underlying our effectiveness. This work offers a new, formally grounded perspective to cache eviction , opening promising avenues for future research.

Abstract:
Cooperation is central to multiagent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter-agent interactions. Prior robust MARL methods have primarily considered value-oriented attacks, leaving a gap in robustness when interaction structures themselves are corrupted. In this paper, we propose an interaction-breaking adversarial learning (IBAL) framework that takes an information-theoretic view to construct attacks that impede coordination by perturbing agents’ observations and actions, and trains agents to perform reliably under such disruptions. Empirically, our approach improves robustness over existing robust MARL baselines across diverse attack settings and yields stronger performance even under agent-missing scenarios.

Abstract:
Standard Large Language Models (LLMs) operate on a ''readthen-generate'' paradigm, incurring avoidable latency and computational redundancy. Recently, streaming LLMs have attempted to overcome these bottlenecks by allowing input and output to unfold synchronously. However, this introduces a critical challenge: how should the LLM determine the optimal timing to interact with the input and output stream? Existing approaches remain confined topassive adaptation, relying on static or content-irrelevant interaction rules. In this paper, we proposeProactiveLLM, which achieves active interaction by treating ''when to generate" and ''what to generate" as decoupled objectives. Through masked streaming modeling and self-distillation, the model actively learns to perceive semantic sufficiency from partial inputs. This yields endogenous cues serving as a versatile interface for the plug-and-play integration of diverse decision heads customized for specific latency-accuracy trade-offs. Extensive evaluation across text and speech streaming tasks confirms that ProactiveLLM significantly reduces interaction latency while maintaining quality, validating its capacity for dynamic and active interaction.

Abstract:
Diffusion Large Language Models (dLLMs) have emerged as a promising nonautoregressive generative paradigm. Given the prohibitive computational cost of full fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) has become the standard approach. However, existing PEFT methods (e.g., LoRA), originally tailored for autoregressive models, rely on static parameters that are agnostic to the noise level. Consequently, they ignore the intrinsic dynamics of the diffusion process, where input distributions and generation difficulty shift significantly along the denoising trajectory, rendering them suboptimal for dLLMs. To address this, we proposeNoise-aware Low-RankAdaptation (NaRA), which introduces a low-rank core matrix generated by a lightweight, globally shared hypernetwork conditioned on the noise level. This design enables the update matrices to vary continuously along the diffusion process while keeping parameter and latency overhead negligible. We provide a theoretical justification for the proposed NaRA framework and empirically demonstrate consistent improvements over noise-agnostic baselines across commonsense reasoning, mathematical reasoning, and code generation benchmarks.

Abstract:
Realworld synthetic image detectors often generalize poorly under domain shift despite strong in-domain performance. Using unsupervised UMAP projections, we find that natural and synthetic features remain partially separable on unseen datasets, yet performance still drops, suggesting that the classification head overfits to training-domain artifacts. Therefore, the key is to learn more transferable representations so that the decision criterion is more stable and robust to domain shifts. Based on the structural fact that synthetic images are produced by diverse generators, we propose a hierarchical contrastive learning framework that improves the separability between natural and synthetic images while preserving generator identity information. It jointly optimizes (i) a coarse contrastive objective between natural and synthetic images and (ii) a fine contrastive objective among synthetic images using generator identities. Trained on WildFake, our method achieves an average AUROC gain of +10.22 on cross-domain evaluation over Chameleon, AIGIBench, Community Forensics, and GenImage under the same settings as the strong baseline DIRE. For few-shot adaptation, we freeze the backbone and fit an SVM head on 10 labeled samples per class, improving AUROC by +10.64 on AIGIBench and +17.41 on Chameleon, averaged over 12 widely used detectors. We will publicly release all code and model checkpoints.

Abstract:
Randomized smoothing provides strong, modelagnostic robustness certificates, but existing guarantees are limited to single modalities, treating continuous and discrete inputs in isolation. This limitation becomes critical in multimodal models, where decisions depend on cross-modal semantics and adversaries can jointly perturb heterogeneous inputs, rendering unimodal certificates insufficient. We introduce a unified randomized smoothing framework for mixed discrete--continuous inputs based on an analytically tractable Neyman--Pearson formulation of the joint worst-case problem. By analyzing the joint likelihood ordering induced by factorized discrete and continuous noise, our approach yields a closed-form, one-dimensional certificate that strictly generalizes both Gaussian (image-only) and discrete (text-only) randomized smoothing. We validate the framework on multimodal safety filtering, providing the first model-agnostic certificates against joint text--image adversarial attacks.

Abstract:
Accurate ZerothOrder (ZO) Hessian estimation is a cornerstone of derivative-free methods, essential for tasks such as bilevel optimization, Bayesian inference, and uncertainty quantification. However, obtaining a complete suite of low-variance estimators for the Hessian and its inverse in high-dimensional settings remains a significant challenge. To address this, we propose a unified framework that reinterprets ZO Hessian approximation through the lens of single-step Policy Optimization (PO). This perspective establishes a theoretical equivalence between general ZO Hessian estimators and the Hessian of a smoothed PO objective, unifying distinct classical randomized estimators as specific instances of baseline selection. Building on this foundation, we introduce ZoVH, a comprehensive suite of variance-reduced estimators for the full Hessian matrix, its regularized inverse, and the bias-corrected inverse Hessian-gradient product. ZoVH leverages two key techniques: (1) a unique optimal baseline derived to provably minimize variance, and (2) a query reuse strategy that incorporates historical function queries to enhance sample efficiency without inflating costs. Our rigorous theoretical analysis confirms the unbiasedness of the Hessian estimator, validates the variance optimality of our baseline, provides error bounds for the entire ZoVH suite, and establishes convergence guarantees for the resulting curvature-aware ZO algorithm. Extensive empirical results validate our theoretical findings, demonstrating that ZoVH achieves superior estimation accuracy and convergence performance in real-world applications.

Abstract:
The intelligent verdict is essential for handling voluminous demands of Ecommerce dispute. Unlike the legal dispute, it necessitates identifying pivotal clues from redundant multimodal evidence chains, relying on informal transaction rules for dispute verdicts. The complex ``clues-dispute" causal logic and flexible verdict rules render existing methods inadequate. Motivated by this, we propose a pioneering task,E-commerce Dispute Verdict(EDV), and introduceVerdictBench, the first Multimodal Disputes Verdicts Benchmark for E-commerce, to facilitate the intelligent verdicts. Building upon this, we proposeCyberJurors, a framework that integrates an Individual Verdict Chain-of-Thought (IV-CoT) and Jury Consensus Verdict (JCV) to clarify the dispute logic and regulate the fair verdict process.For the individual juror, IV-CoT decomposes the EDV task into a structured reasoning, enabling fine-grained clues perception and explicit causal logic between clues and dispute.For the collective jury, JCV simulates multi-round discussion and voting among jurors guided by Verdict Precedents, effectively mitigating individual biases. Extensive experiments on VerdictBench demonstrate that CyberJurors significantly improves verdict accuracy, fairness, and interpretability, outperforming existing MLLMs by up to 9.48% in accuracy.

Abstract:
Training data plays a central role in large language model (LLM) optimization, motivating extensive research on data scheduling strategies. Most prior work focuses on data selection and implicitly assumes that, once the training subset is fixed, the order in which data are presented is interchangeable. However, this assumption is routinely violated in practice. Despite empirical evidence of order sensitivity, existing studies neither provide a principled explanation of the underlying optimization dynamics nor offer an efficient solution. In this work, we first answer the fundamental question of why training order matters in LLM optimization. We then demonstrate that commonly used empirical data ordering heuristics are suboptimal from an optimization perspective. To resolve this, we propose xxx, a data scheduling framework grounded in gradient interactions between samples, where training dependencies are modeled as a graph that explicitly constrains valid training orders. Our approach is theoretically motivated and yields consistent empirical improvements over existing data scheduling methods across multiple settings.

Abstract:
Multiview learning has been widely applied for sleep stage classification using multi-modal data. However, existing methods typically assume that different modalities are well-aligned, which is often unattainable in real-world scenarios, thereby compromising the reliability of the staging results. In this paper, we propose ConfSleepNet, a conflict-aware evidential framework that dynamically resolves inter-view conflicts. The framework consists of multi-view evidence extraction and conflict-aware aggregation. In the first phase, it learns category-related evidence from different modalities, which represents the degree of support for individual sleep stages. Considering the inherent characteristics of varying modalities, we propose adaptive category structures for different modalities to promote more reasonable evidence learning. In the second phase, view-specific opinions, including prediction results and uncertainty, are constructed from the learned evidence. Notably, we propose a novel conflict-aware aggregation method that integrates these view-specific opinions into a reliable joint decision. This mechanism can effectively resolve conflicts among opinions and synthesize them into a reliable joint decision. Both theoretical analysis and experimental results demonstrate the effectiveness of ConfSleepNet in sleep staging tasks. The code is available at https://anonymous.4open.science/r/ConfSleepNet-F383.

Abstract:
LLM agents operate in two distinct regimes: openweight agents amenable to reinforcement learning (RL) and black-box agents whose behaviour must be controlled purely at test time. Although black-box agents are often backed by state-of-the-art proprietary LLMs, API-only access precludes parameter-level optimization, rendering most RL methods inapplicable. To address this limitation, we turn to a known equivalence between RL and Bayesian inference. We propose Agentic Monte Carlo (AMC) to directly sample from the optimal policy of a black-box agent rather than training it through RL. The optimal policy is a posterior over trajectories whose prior we define as the fixed black-box LLM agent. We employ Sequential Monte Carlo to sample from this posterior by learning a value function to steer the agent while leaving the underlying black-box model unchanged. We validate AMC on three diverse environments from the AgentGym benchmark, demonstrating significant improvements over prompting baselines and even outperforming Group Relative Policy Optimization (GRPO) as we scale the test-time compute of our method. AMC demonstrates the feasibility of performing principled RL-style optimization of black-box LLM agents.

Abstract:
Safetycritical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-based stress testing indispensable. Most scenario generation methods treat surrounding agents as adversaries, but they either (i) induce failures without explicitly modeling vehicle–road physical limits, yielding visually extreme yet physically unsolvable crashes, or (ii) enforce physical feasibility or policy feasibility in isolation, which can over-focus on aggressive maneuvers or remain tied to a controller-dependent capability boundary. We propose ScenePilot, a feasibility-guided, boundary-driven framework that targets the boundary band: scenarios that are physically solvable in principle yet still cause the deployed autonomy stack to fail. We formulate generation as constrained multi-objective reinforcement learning, combining an RSS-derived physical-feasibility score \sigma with an online-learned AV-risk predictor \Phi, and introduce step-level feasibility-aware shielding to keep exploration near the feasibility boundary while avoiding infeasible artifacts. Experiments on SafeBench with multiple planners show that ScenePilot yields substantially higher collision rates (+6.2 percentage points) while preserving physical validity, and that adversarial fine-tuning on these boundary-band scenarios consistently reduces downstream crash rates.

Abstract:
Longcontext Large Language Models (LLMs) enable powerful applications but incur high memory costs due to the key–value states (KV-Cache). Recent studies attempt to share KV-Cache across layers, but these approaches either require expensive pretraining or rely on per-token cross-layer cosine similarity that is often limited in practice. We show, via Centered Kernel Alignment (CKA), that the dominant singular vectors of KV-Cache are well aligned across layers. Motivated by this observation, we propose xKV, a post-training compression method that jointly factorizes grouped-layer KV-Cache into a shared low-rank subspace, substantially reducing KV-Cache memory. Across widely used LLMs, xKV achieves up to 8× KV-Cache compression while preserving accuracy on long-context tasks and in multi-turn settings. To further improve efficiency, we introduce Selective Reconstruction (SR) at decode time. Combined with SR, xKV achieves up to 4.23× end-to-end speedup, surpassing notable baselines with 30% higher throughput under a similar accuracy level. Overall, xKV provides a plug-and-play approach to reduce both memory and latency for long-context LLM inference. Our code will be open-sourced.

Abstract:
While explicit Chainof-Thought (CoT) empowers large reasoning models (LRMs), it enables the generation of riskier final answers. Current alignment paradigms primarily rely on externally enforced compliance, optimizing models to detect malicious prompts rather than evaluating the safety of their own outputs. We argue that this approach remains largely behavioral: our empirical analysis reveals that ostensibly aligned models lack intrinsic safety understanding, often failing to verify their own response safety and remaining vulnerable to adversarial jailbreaks. To address this fundamental limitation, we propose Safety Internal (SInternal), a framework that internalizes safety specifications by training LRMs exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories. We demonstrate that learning to verify induces a strong generalization for response safety, significantly enhancing robustness against out-of-domain jailbreaks. Furthermore, when combined with reinforcement learning, SInternal serves as a superior initialization compared to standard supervised fine-tuning, suggesting that internalizing safety understanding creates a more robust foundation for alignment than merely mimicking safe behaviors. Our codes are available at \urlhttps://anonymous.4open.science/r/SInternal-AF60

Abstract:
Machine Unlearning (MU) aims to remove the information of specific training data from a trained model, ensuring compliance with privacy regulations and user requests. While one line of existing MU methods relies on linear parameter updates via task arithmetic, they suffer from weight entanglement. In this work, we propose a novel MU framework called Mode Connectivity Unlearning (MCU) that leverages mode connectivity to find an unlearning pathway in a nonlinear manner. To further enhance performance and efficiency, we introduce a parameter mask strategy that not only improves unlearning effectiveness but also reduces computational overhead. Moreover, we propose an adaptive adjustment strategy for our unlearning penalty coefficient to adaptively balance forgetting quality and predictive performance during training, eliminating the need for empirical hyperparameter tuning. Unlike traditional MU methods that identify only a single unlearning model, MCU uncovers a spectrum of unlearning models along the pathway. Overall, MCU serves as a plugand-play framework that seamlessly integrates with any existing MU methods, consistently improving unlearning efficacy. Extensive experiments on the image classification task demonstrate that MCU achieves superior performance. The codes are available at https://anonymous.4open.science/r/MCU-1E36.

Abstract:
Evolutionary model merging provides a powerful framework for the automated, trainingfree composition of LLMs through parameter-space search. However, existing methods predominantly rely on stochastic, hand-crafted operators that overlook the underlying performance landscape of the coefficient space. We propose Evolutionary Generative Merging (EvoGM), a framework that transcends manual heuristics by employing learnable generative modeling to optimize merging coefficients. Specifically,, EvoGM features a dual-generator architecture with cycle-consistent learning to adaptively sample and refine promising merging candidates. By constructing winner-loser pairs from historical search trajectories, our framework effectively captures high-performance parameter distributions and maximizes data efficiency. This generative process is seamlessly integrated into a multi-round evolutionary pipeline, where elite merged models iteratively serve as new expert foundations. Extensive experiments across diverse benchmarks demonstrate that EvoGM significantly outperforms state-of-the-art baselines, exhibiting robust performance on both seen and unseen tasks.

Abstract:
Sparse Mixture of Experts (SMoE) models scale the capacity of models while maintaining constant computational overhead. SMoE methods fall into two categories:Token Choice, which routes each token to a fixed number of experts, andExpert Choice, which assigns a fixed number of tokens to each expert. However, the use of fixed budgets for tokens or experts causes both approaches to select irrelevant token–expert pairs or overlook critical assignments, which degrades overall performance. To fill that gap, we rethink SMoE from aunified perspectivethrough the lens oflinear programming, which provides a general formulation for SMoE models. Furthermore, we introduceUnified Sparse Mixture of Experts (USMoE), a novel framework comprising aunified mechanismand aunified scoreto overcome these limitations. We provide both theoretical justification and empirical evidence demonstrating USMoE's effectiveness. Extensive evaluations across diverse data settings (clean and corrupted), multiple domains (including texts and vision tasks), and different learning approaches (trainingfree and training-based) show that USMoE not only delivers significant performance improvements over existing SMoE methods, but also enables more flexible expert selection budgets, reducing inference costs without compromising model performance.

Abstract:
Existing languageimage pre-training for remote sensing object detection is constrained by Monolithic Label Learning, which relies on exhaustively enumerating open-set categories via black-box data to acquire fine-grained representations, creating a dependency incompatible with the domain's inherent data scarcity. To transcend this bottleneck, we propose SLIP-RS, establishing a Structured-Attribute Decoupling Paradigm that maps the open-ended category space into a finite, physically meaningful attribute space, unlocking fine-grained discriminability via explicit structural logic. This paradigm is realized via two technical pillars: (1) Structured-Attribute Contrastive Learning, which enforces the learning of decoupled intrinsic visual logic via combinatorial attribute augmentation; and (2) Conformal Attribute Reliability Engine, which leverages conformal prediction theory to rigorously distill high-fidelity supervision from noisy sources, yielding RS-Attribute-15M, the largest dataset with over 15 million attribute annotations. Extensive experiments demonstrate that SLIP-RS establishes unprecedented performance in fine-grained detection and cross-domain generalization, validating structured attributes as a vital foundation for scalable remote sensing models.

Abstract:
Time series forecasting under nonstationarity faces a fundamental tension between capturing stable representations and adapting to distribution shifts. Existing methods implicitly rely on static historical assumptions, leading to a critical failure mode we term Phase Amnesia, where models become blind to the evolving global context. To resolve this, we formalize non-stationary dynamics through three physical hypotheses: Wold decomposition, dynamical phase evolution, and heteroscedastic manifold generation. These principles inspire PULSE, a physics-informed, plug-and-play framework adopting a Disentangle--Evolve--Simulate design philosophy. Specifically, PULSE utilizes phase-anchored disentanglement to resolve optimization interference caused by dominant trends, employs a Phase Router to actively generate future trajectories, and introduces Statistic-Aware Mixup (SAM) to ensure robustness against out-of-distribution volatility. Empirically, PULSE enables a simple MLP backbone to consistently outperform state-of-the-art Transformers across 12 real-world benchmarks. This validates that a correct physics-informed inductive bias is far more critical than raw architectural complexity for non-stationary forecasting.

Abstract:
Unlike most static material properties widely studied in the machine learning literature, ionic transport properties are inherently dynamic, making their fast and accurate prediction from static atomic structures challenging. The current standard approach, molecular dynamics (MD) simulations, suffers from prohibitively high computational cost. Recent autoregressive learningbased MD acceleration methods requiring sequential inference remain slow and prone to error accumulation; in contrast, existing non-autoregressive material property prediction models are less accurate because they fail to exploit dynamics. Moreover, existing methods typically benefit from datasets either with or without atomic trajectories, but not both. To overcome these limitations, we propose a non-autoregressive learning framework based on modality reduction, which treats atomic trajectories as an auxiliary modality during training but does not require them at inference. This enables a predictor to learn dynamics without requiring sequential inference and to benefit from both types of datasets. As a result, our framework achieves over 200 times speedup compared to autoregressive models on the dataset with atomic trajectories while substantially reducing prediction error relative to non-autoregressive benchmarks across both types of datasets. Our code is available at https://anonymous.4open.science/r/2026.

Abstract:
Diffusion transformers have achieved remarkable success in highquality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to mitigate this by focusing computation on important regions. However, attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity, which causes existing block sparse attention methods to degrade significantly in quality, especially at high sparsity ratios. In this paper, we revisit block sparse attention and derive a theoretical lower bound on attention recall to characterize the key factors governing its effectiveness. Guided by these insights, we propose DFSAttn, a training-free sparse attention framework that enables dynamic, fine-grained sparsification efficiently. DFSAttn incorporates three core designs: Hilbert curve–based token reordering to achieve fine-grained sparsity while retaining hardware-friendly operations, hierarchical block scoring for accurate block importance estimation, and sparse mask caching with adaptive ratios to balance accuracy and efficiency. Experimental results demonstrate that DFSAttn consistently outperforms prior methods under high sparsity, achieving up to 2.1× end-to-end speedup while maintaining high generation quality.

Abstract:
Pursuing trainingfree open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dino.txt framework to facilitate more efficient and high-quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce VIsual-guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine-grained object perception. Towards this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that VIP: I) surpasses the top-leading methods by 1.4%~8.4% average mIoU, II) generalizes well to diverse challenging domains, and III) requires marginal inference time and memory overhead. Our code will be released to foster future research.

Abstract:
Modern visual object trackers show impressive results on general targets, yet their performance drops substantially when dealing with scene text. Although currently underexplored, tracking text in videos is essential for dynamic text manipulations such as segmentation, removal, and editing. To fill this gap, this paper formalizes this specific task as Scene Text Tracking and present the first systematic work for it. We identify three primary challenges in this task: 1) severe geometric distortions from perspective shifts, 2) high visual ambiguity across different instances, and 3) high sensitivity to finegrained structural details. To address these issues, we propose SymTrack, a unified detection-free framework with synergistic dual-branch design. It integrates a Cross-Expert Calibration mechanism to reduce semantic bias, along with a Predictive Token Rectification mechanism to correct structural imbalances, complemented by an Adaptive Inference Engine that stabilizes predictions under motion constraints. Considering the lack of dedicated benchmarks for this task, we utilize three datasets from video text spotting to construct a benchmark with high-quality annotations. Extensive experiments demonstrate that SymTrack sets the new state-of-the-art on all three benchmarks, outperforming previous best trackers by up to 11.97% AUC on \textBOVText_\textSOT . Overall, our work promotes efficient and thorough text tracking, paving the way toward more generalized video text manipulation.

Abstract:
Diffusionbased image synthesis has made AI-generated images (AIGI) increasingly photorealistic, raising urgent concerns about authenticity in applications such as misinformation detection, digital forensics, and content moderation. Despite the substantial advances in AIGI detection, how to correct detected AI-generated images with visible artifacts and restore realistic appearance remains largely underexplored. Moreover, few existing work has established the connection between AIGI detection and artifact correction. To fill this gap, we propose GenShield, a unified autoregressive framework that jointly performs explainable AIGI detection and controllable artifact correction in a closed loop from diagnosis to restoration, revealing a mutually reinforcing relationship between these two tasks. We further introduce a Visual Chain-of-Thought based curriculum learning strategy that enables self-explained, multi-step "diagnose-then-repair" correction with an explicit stopping criterion. A high-quality dataset with large-scale "artifact-restored" pairs is also constructed alongside a unified evaluation pipeline. Extensive experiments on our correction benchmark and mainstream AIGI detection benchmarks demonstrate state-of-the-art performance and strong generalization of our method.

Abstract:
Video Virtual TryOn (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing (e.g., pulling a hem or unzipping a jacket). This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we proposeiTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Furthermore, we design an action-aware constraint loss to stabilize training and focus the learning process on these critical interactive frames. To facilitate research and evaluation, we construct VVT-Interact, the first large-scale dataset for this task, and propose a novel interaction-aware evaluation metric to quantify the semantic fidelity of interactions. Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

Abstract:
Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build categoryspecific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthenCompleteness. We also leverage reviewer--author--meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthenCorrectness. Finally, we introduceCoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and thoroughness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers.

Abstract:
Inferencetime guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that local errors scale severely with multiple guidance misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance (g^\textcar), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate g^\textcar across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that g^\textcar effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at Anonymous Link.

Abstract:
RetrievalAugmented Generation (RAG) has emerged as a promising paradigm for improving the timeliness of knowledge updates and the factual accuracy of large language models. However, incorporating a large volume of retrieved documents significantly increases input length, leading to prohibitive computational costs. Existing compression approaches often compromise task performance, primarily due to their reliance on predefined heuristics. These heuristics fail to ensure that the compressed context is conducive to the generation tasks. To address these limitations, we propose CORE-RAG, a novel framework for context compression in RAG systems. CORE eliminates reliance on proxy heuristics through a performance-driven learning framework, which directy utilizes task performance as a feedback signal to iteratively refine the compressor policy. Prior to this optimization process, we incorporate a knowledge distillation phase to initialize the compressor with a robust policy. Extensive experiments demonstrate the superiority of our approach. At a high compression ratio of 3%, CORE not only avoids performance degradation but also improves the average Exact Match (EM) score by 3.3 points compared to using full documents. Our code is available at https://anonymous.4open.science/r/CORE-28B4.

Abstract:
Selfsupervised learning for time-series representation aims to reduce reliance on labeled data while maintaining strong downstream performance, yet many existing approaches incur high computational costs or rely on assumptions that do not hold across diverse temporal dynamics. In this work, we introduce Divide and Contrast (Di-COT), an unsupervised framework that avoids data augmentation and multiple encoder passes by contrasting informative substructures within a window rather than individual timesteps. Di-COT stochastically partitions each window into a small number of overlapping sub-blocks per iteration, enabling efficient and meaningful contrast while mitigating false positives during temporal transitions. To further improve scalability, we adopt a contrastive objective whose computation depends on the batch size and the number of sub-blocks, making loss computation independent of sequence length. Extensive experiments on six large-scale real-world datasets, as well as the UCR and UEA benchmarks, demonstrate that Di-COT learns transferable representations while achieving state-of-the-art performance with substantially reduced training time.

Abstract:
Conditional generative models have recently achieved remarkable success in various applications. However, a suitable metric for evaluating the reliability of these models, which takes into account their inherent uncertainty, is still lacking. Existing metrics, which typically assess a single output, may fail to capture the variability or potential risks in generation. In this paper, we propose a novel evaluation metric called \emphreliability score based on conformal prediction, which measures the worstcase performance within the prediction set at a pre-specified confidence level. However, computing this score is challenging due to the high-dimensional nature of the output space and the nonconvexity of both the metric function and the prediction set. To efficiently compute this score, we introduce Conformal ReLiability (CReL), a framework that can (i) construct the prediction set with desired coverage; and (ii) accurately optimize the reliability score within the constructed prediction set. We provide theoretical results on coverage and demonstrate empirically that our method produces more informative prediction sets than existing approaches. Experiments on synthetic data and the image-to-text and text-to-image tasks further demonstrate the interpretability of our new metric, and the validity and effectiveness of our computational framework.

Abstract:
Experience learning has achieved promising results in enhancing LLM agent planning and reasoning by integrating past interactions as reusable knowledge. However, existing methods remain confined to explicit text spaceretrieving experiences via semantic similarity and concatenating them into the context window, leading to substantial token overhead and a decoupled architecture that separates retrieval from generation. To address these limitations, we propose \method, a framework that enables LLM agents to learn from experience via latent retrieval-augmented generation, without requiring a separate RAG module. \method encodes experiences using the LLM's own hidden states, retrieves relevant experiences directly in latent space at each decoding step, and integrates them through cross-attention aggregation and gated residual mechanisms. The entire pipeline is optimized end-to-end with reinforcement learning, supporting both generative and ranking tasks. We evaluate \method on 13 diverse tasks spanning question answering, reasoning, coding, scientific prediction, and recommendation. Results demonstrate that: (1) \method achieves state-of-the-art on 12 out of 13 tasks, outperforming the strongest baseline by over 6.8%; (2) \method maintains token efficiency comparable to non-retrieval baselines while text-based retrieval methods require 1.5--2× more tokens; and (3) \method exhibits superior cross-domain generalization, outperforming the strongest baseline by 16.32% under zero-shot transfer and 15.21% under few-shot transfer.

Abstract:
Model merging has emerged as a costeffective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture-of-Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non-linear softmax and discrete Top-k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load-balancing constraints imposed during MoE pretraining. Because fine-tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian-Aware Router Calibration (HARC), a training-free framework that leverages second-order curvature information to realign the merged router. This approach admits a closed-form solution that can be efficiently solved using a matrix-free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements.

Abstract:
AI agents have seen widespread adoption in information retrieval for scientific research, giving rise to tools such asDeep Research. However, existing retrieval agents mainly rely on keywordor embedding-based methods. While effective at capturing content-level similarities, they struggle to understand complex relational networks among scientific papers, such as identifying corroborating or conflicting studies and tracing technological lineages. This fundamental limitation often results in fragmented knowledge structures, misinterpreted research sentiment, and ineffective modeling of collective scientific progress. To address this limitation, we introduceSciNet, the firstScientificNetwork relation-aware dataset for information retrieval agents. Built on a meta-database of 269 million papers across 7 disciplines and containing 8,940 carefully designed tasks, SciNet systematically captures three levels of relational understanding: ego-centric retrieval of papers with novel knowledge structures, pairwise identification of scholarly relationships, and path-wise reconstruction of scientific evolution. Extensive evaluation of three categories of retrieval agents shows that their accuracy on relation-aware tasks often falls below 20%, highlighting a fundamental shortcoming of current retrieval paradigms. Importantly, in a downstream literature review application, agents empowered with SciNet achieve a 25.3% improvement in review quality, highlighting the critical value of relation-aware retrieval for deepening scientific insights. We publicly release SciNet athttps://anonymous.4open.science/r/SciNet/to support future research.

Abstract:
Querybased universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: in-the-wild datasets contain weak labels and severe co-occurrence of events. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence of events by mining high-purity single-event segments from in-the-wild datasets via a semantically consistent synthesis protocol. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2.4k hours of raw audio. Experimental results demonstrate that, compared with the state-of-the-art model SAM-Audio which was trained on a huge dataset ～500 times larger than Hive, certain open-source models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibited remarkable zero-shot generalization on out-of-distribution evaluation benchmarks. These findings highlight that prioritizing purity of supervised signals enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs. Code and dataset are available at https://hive-dataset-icml.github.io/Hive-Demo.

Abstract:
Diffusionbased Real-World Image Super-Resolution (Real-ISR) achieves impressive perceptual quality but suffers from high computational costs due to iterative sampling. While recent distillation approaches leveraging large-scale Text-to-Image (T2I) priors have enabled one-step generation, they are typically hindered by prohibitive parameter counts and the inherent capability bounds imposed by teacher models. As a lightweight alternative, Consistency Models offer efficient inference but struggle with two critical limitations: the accumulation of consistency drift inherent to transitive training, and a phenomenon we term "Geometric Decoupling"— where the generative trajectory achieves pixel-wise alignment yet fails to preserve structural coherence. To address these challenges, we propose GTASR (Geometric Trajectory Alignment Super-Resolution), a simple yet effective consistency training paradigm for Real-ISR. Specifically, we introduce a Trajectory Alignment (TA) strategy to rectify the tangent vector field via full-path projection, and a Dual-Reference Structural Rectification (DRSR) mechanism to enforce strict structural constraints. Extensive experiments verify that GTASR delivers superior performance over representative baselines while maintaining minimal latency.

Abstract:
Relational Foundation Models (RFMs) facilitate datadriven decision-making by learning from complex multi-table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints. While there are methods to generate synthetic tabular data of arbitrary size, incorporating schema structure and primary--foreign key connectivity for multi-table generation remains challenging. Here we introduce PluRel, a framework to synthesize multi-tabular relational databases from scratch. In a step-by-step fashion, PluRel models (1) schemas with directed graphs, (2) inter-table primary-foreign key connectivity with bipartite graphs, and, (3) feature distributions in tables via conditional causal mechanisms. The design space across these stages supports the synthesis of a wide range of diverse databases, while being computationally lightweight. Using PluRel, we observe for the first time that (1) RFM pretraining loss exhibits power-law scaling with the number of synthetic databases and total pretraining tokens, (2) scaling the number of synthetic databases improves generalization to real databases, and (3) synthetic pretraining yields strong base models for continued pretraining on real databases. Overall, our framework and results position synthetic data scaling as a promising paradigm for RFMs.

Abstract:
Orthogonality constraints are ubiquitous in robust and probabilistic machine learning. Unfortunately, current optimizers are computationally expensive and do not scale to problems with hundreds or thousands of constraints. One notable exception is the Landing algorithm (Ablin et al., 2024) which, however comes at the expense of temporarily relaxing orthogonality. In this work, we revisit and improve on the ideas behind Landing, enabling the inclusion of modern adaptive optimizers while ensuring that orthogonal constraints are effectively met. Remarkably, these improvements come at little to no cost, and reduce the number of required hyperparemeters. Our algorithm POGO is fast and GPUfriendly,consisting of only 5 matrix products, and in practice maintains orthogonality at all times. On several challenging benchmarks, POGO greatly outperforms recent optimizers and shows it can optimize problems with thousands of orthogonal matrices in minutes while alternatives would take hours. As such, POGO sets a milestone to finally exploit orthogonality constraints in ML at scale.

Abstract:
Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multirationale semantic modeling, insufficient logical robustness, and susceptibility to misleading cues. Therefore, we propose a Multi-rationale INtegrated Discriminative (MIND) reasoning framework, which is designed to endow MLLMs with human-like cognitive abilities of “Understand → Rethink → Correct”, and achieves a paradigm evolution from passive imitation-based reasoning to active discriminative reasoning. Specifically, we introduce a Rationale Augmentation and Discrimination (RAD) paradigm, which provides a unified and extensible data foundation. Meanwhile, we design a Progressive Two-stage Correction Learning (P2CL) strategy. The first phase enhances multi-rationale positive learning, while the second phase enables active logic discrimination and correction. In addition, to mitigate representation entanglement in the multi-rationale semantic space, we propose a Multi-rationale Contrastive Alignment (MCA) optimization strategy. Extensive experiments demonstrate that our MIND achieves state-of-the-art (SOTA) performance on multiple public datasets covering scientific, commonsense, and mathematical scenarios. Our data and code will be open source.

Abstract:
The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially highrisk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emergingpersona-based jailbreak attacks. Existing research on persona-based jailbreak has primarily focused on attack iterations, yet it lacks systemic and mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in thestructural separation hypothesis, using a unilateral KL-divergence constraint to enable the structural decoupling of safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation. Meanwhile, the PICL defense method significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability, thereby validating the superiority and robustness of this alignment paradigm. WARNING: This paper contains potentially offensive and harmful text.

Abstract:
Estimating uncertainty for AI agents in realworld multi-turn tool-using interaction with humans is difficult because failures are often triggered by sparse critical episodes (e.g., looping, incoherent tool use, or user-agent miscoordination) even when local generation appears confident. Existing uncertainty proxies focus on single-shot text generation and therefore miss these trajectory-level breakdown signals. We introduce TRACER, a trajectory-level uncertainty metric for dual-control Tool-Agent-User interaction. TRACER combines content-aware surprisal with situational-awareness signals, semantic and lexical repetition, and tool-grounded coherence gaps, and aggregates them using a tail-focused risk functional with a MAX-composite step risk to surface decisive anomalies. We evaluate TRACER on \tau^2-bench (Barres et al., 2025) by predicting task failure and selective task execution. To this end, TRACER improves AUROC by up to 37.1% and AUARC by up to 55% over baselines, enabling earlier and more accurate detection of uncertainty in complex conversational tool-use settings.

Abstract:
New technologies such as Rectified Flow and Flow Matching have significantly improved the performance of generative models in the past two years, especially in terms of control accuracy, generation quality, and generation efficiency. However, due to some differences in its theory, design, and existing diffusion models, the existing acceleration methods cannot be directly applied to the Rectified Flow model. In this article, we have comprehensively implemented an overall acceleration pipeline from the aspects of theory, design, and reasoning strategies. This pipeline uses new methods such as batch processing with a new velocity field, vectorization of heterogeneous timestep batch processing, and dynamic TensorRT compilation for the new methods to comprehensively accelerate related models based on flow models. Currently, the existing public methods usually achieve an acceleration of 18%, while experiments have proved that our new method can accelerate the 512×512 image generation speed to up to 611%, which is far beyond the current non-generalized acceleration methods.

Abstract:
Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where both pretraining and RL post-training stages are grounded in the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective different from the pretraining objectives--score/flow matching loss. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce Advantage Weighted Matching (AWM), a policy-gradient method for diffusion. It uses the score/flow-matching loss and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically and reduces variance, yielding faster convergence. This simple yet effective design yields substantial benefits: on the GenEval, OCR, and PickScore benchmarks, AWM delivers up to a 34× speedup over Flow-GRPO (which builds on DDPO), when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is provided in the supplementary material.

Abstract:
Leveraging future observation modeling to facilitate action generation presents a promising avenue for enhancing the capabilities of VisionLanguage-Action (VLA) models. However, existing approaches struggle to strike a balance between maintaining efficient, predictable future representations and preserving sufficient fine-grained information to guide precise action generation. To address this limitation, we propose WoG (World Guidance), a framework that maps future observations into compact conditions by injecting them into the action inference pipeline. The VLA is then trained to simultaneously predict these compressed conditions alongside future actions, thereby achieving effective world modeling within the condition space for action inference. We demonstrate that modeling and predicting this condition space not only facilitates fine-grained action generation but also exhibits superior generalization capabilities. Moreover, it learns effectively from substantial human manipulation videos. Extensive experiments across both simulation and real-world environments validate that WoG significantly outperforms existing methods based on future prediction.

Abstract:
We propose FlowAnchored Noise-conditioned Q-Learning (FAN), a highly efficient and high-performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that utilizes only a single flow policy iteration and requires only a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes. We release our code at https://anonymous.4open.science/r/FAN.

Abstract:
In singlecell research, tracing and analyzing high-throughput single-cell differentiation trajectories is crucial for understanding biological processes. Key to this is the robust modeling of hierarchical structures that govern cellular development. Traditional methods face limitations in computational cost, performance, and stability. VAE-based approaches have made strides but still require branch-specific network modules, limiting their scalability and stability, while often suffering from posterior collapse. To overcome these challenges, we introduce HDTree, a generative modeling framework designed for robust lineage inference. HDTree captures tree relationships within a hierarchical latent space using a unified hierarchical codebook and employs a quantized diffusion process to model continuous cell state transitions. By aligning the generative process with the Waddington landscape, this method not only improves stability and scalability but also enhances the biological plausibility of inferred lineages. HDTree's effectiveness is demonstrated through comparisons on both general-purpose and single-cell datasets, where it outperforms existing methods in lineage inference accuracy, reconstruction quality, and hierarchical consistency. These contributions enable accurate and efficient modeling of cellular differentiation paths, offering reliable insights for biological discovery.

Abstract:
Multilingual safety remains significantly imbalanced, leaving nonhigh-resource (NHR) languages vulnerable compared to robust high-resource (HR) ones. Moreover, the neural mechanisms driving safety alignment remain unclear despite observed cross-lingual representation transfer.In this paper, we find that LLMs contain a set of cross-lingual shared safety neurons (SS-Neurons), a remarkably small yet critical neuronal subset that jointly regulates safety behavior across languages. We first identify monolingual safety neurons (MS-Neurons) and validate their causal role in safety refusal behavior through targeted activation and suppression. Our cross-lingual analyses then identify SS-Neurons as the subset of MS-Neurons shared between HR and NHR languages, serving as a bridge to transfer safety capabilities from HR to NHR domains. We observe that suppressing these neurons causes concurrent safety drops across NHR languages, whereas reinforcing them improves cross-lingual defensive consistency. Building on these insights, we propose a simple neuron-oriented training strategy that targets SS-Neurons based on language resource distribution and model architecture. Experiments demonstrate that fine-tuning this tiny neuronal subset outperforms state-of-the-art methods, significantly enhancing NHR safety while maintaining the model's general capabilities.

Abstract:
Latent or continuous chainof-thought methods replace explicit textual rationales with a number of internal latent steps, but these intermediate computations are difficult to evaluate beyond correlation-based probes. In this paper, we view latent chain-of-thought as a manipulable causal process in representation space by modeling latent steps as variables in a structural causal model (SCM) and analyzing their effects through step-wise \mathrmdo-interventions. We study two representative paradigms (i.e., Coconut and CODI) on both mathematical and general reasoning tasks to investigate three key questions: (1) which steps are causally necessary for correctness and when answers become decidable early; (2) how influence propagates across steps and relates to explicit CoT; (3) how to characterize and interpret the influence patterns revealed by (2). Across settings, we find that latent-step budgets should be treated as distinct functionalities rather than homogeneous extra depth, We further show that training/decoding should account for a gap between early output bias and late representational commitment. These results motivate mode-conditional and stability-aware analyses as more reliable tools for interpreting and eventually improving latent reasoning systems.

Abstract:
Indirect reciprocity, which means helping those who help others, is difficult to sustain among decentralized, selfinterested LLM agents without reliable reputation systems. We introduce Agentic Linguistic Gossip Network (ALIGN), an automated framework where agents strategically share open-ended gossip using hierarchical tones to evaluate trustworthiness and coordinate social norms. We demonstrate that ALIGN consistently improves indirect reciprocity and resists malicious entrants by identifying and ostracizing defectors without changing intrinsic incentives. Notably, we find that stronger reasoning capabilities in LLMs lead to more incentive-aligned cooperation, whereas chat models often over-cooperate even when strategically suboptimal. These results suggest that leveraging LLM reasoning through decentralized gossip is a promising path for maintaining social welfare in agentic ecosystems.

Abstract:
How does the conversational past of large language models (LLMs) influence their future performance? Recent work suggests that LLMs are affected by their conversational history in unexpected ways. For instance, hallucinations in prior interactions may influence subsequent model responses. In this work, we introduce History Echoes, a framework that investigates how conversational history biases subsequent generations. The framework explores this bias from two perspectives: probabilistically, we model conversations as Markov chains to quantify state consistency; geometrically, we measure the consistency of consecutive hidden representations. Across three model families and six datasets spanning diverse phenomena, our analysis reveals a strong correlation between the two perspectives. By bridging these perspectives, we demonstrate that behavioral persistence manifests as a geometric trap, where gaps in the latent space confine the model's trajectory.

Abstract:
Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this, we propose Holmes, a hierarchical evidential learning framework that aggregates multigranular cross-modal evidence to quantify and model uncertainty explicitly. At the inter-video level, similarity scores are interpreted as evidential support and modeled via a Dirichlet distribution. Based on the proposed three-fold principle, we perform fine-grained query identification, which then guides query-adaptive calibrated learning. At the intra-video level, to accumulate denser evidence, we formulate a soft query-clip alignment via flexible optimal transport with an adaptive dustbin, which alleviates sparse temporal supervision while suppressing spurious local responses. Extensive experiments demonstrate that Holmes outperforms state-of-the-art methods. Code is released at https://anonymous.4open.science/r/ICML26-Holmes-E77B/

Abstract:
Largescale medical biobanks provide imaging data complemented by extensive tabular information, such as clinical measurements or demographics. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that remain robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning, where tabular missingness is complemented by a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with gated-cross attention fusion module, our tuning approach enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural images domain, achieving robust performance on a car advertisements dataset. Model weights and code will be released.

Abstract:
Large Reasoning Models (LRMs) are Large Language Models (LLMs) explicitly trained to generate longform Chain-of-Thoughts (CoTs), achieving impressive success on challenging tasks like math and programming. However, their underlying reasoning "algorithms" remain poorly understood. To investigate this, we proposeReJump, which represents a reasoning trace as a visitation order over nodes in a tree of intermediate problem-solving steps. Transitions between nodes, which we termjumps, include adjacent moves that capture behaviors such as calculation, and non-adjacent moves that capture behaviors such as backtracking and verification. ReJump enables analyzing LLM reasoning with diverse metrics that quantify exploration, exploitation, overthinking, forgetting, and verification. Using our proposed LLM agent to extract reasoning traces into ReJump format, we evaluate state-of-the-art LRMs on two tasks and find that models with similar accuracy can exhibit distinct reasoning behaviors, while different tasks favor different reasoning styles (e.g., varying balance between exploration and exploitation). To further understand how learning strategies shape reasoning, we use ReJump to compare distilled LRMs with their teachers, CoT-prompted LLMs with LRMs, and to examine reinforcement learning affect reasoning behavior. Finally, we show that ReJump can improve reasoning quality at test time through strategies such as ReJump-guided Best-of-N selection and prompt selection.

Abstract:
Unified visual tokenization faces a fundamental tradeoff: optimizing for high-fidelity pixel reconstruction (spatial equivariance) inherently conflicts with semantic abstraction (conceptual invariance). We identify the root cause as Manifold Misalignment, where naive joint optimization leads to conflicting gradients that force a zero-sum game between these two objectives. In this paper, we propose MUSE, a framework that resolves this deadlock via Topological Orthogonality. Recognizing Structure as the orthogonal bridge, MUSE physically decouples the optimization subspaces within Transformers. We route structural gradients to refine attention topology and semantic gradients to update feature values, transforming destructive interference into Mutual Reinforcement. Extensive experiments demonstrate that MUSE breaks the trade-off, matching state-of-the-art generation (gFID 3.08) while notably outperforming its own teacher InternViT-300M in linear probing (85.2% vs. 82.5%), proving that structurally aligned reconstruction actively refines semantic perception.

Abstract:
Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. However, this AR paradigm inevitably faces a dual efficiency bottleneck: strictly unidirectional attention compromisesunderstanding efficiencyby hindering global spatiotemporal aggregation, while serial decoding restrictsgeneration efficiency. To address this, we proposeVidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive spatiotemporal modeling and decode tokens in parallel. To further mitigate the computational overhead of diffusion decoding, we introduceMARSCache, an acceleration strategy that prunes redundancy by combining asynchronous visual cache refreshing with frame-wise chunk attention. Experiments show VidLaDA rivals state-of-the-art AR baselines (e.g., Qwen2.5-VL and LLaVA-Video) and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy.Code and checkpoints will be available in the camera-ready version.

Abstract:
Chain of thought (CoT) elicits reasoning in large language models by explicitly generating intermediate tokens. In contrast, latent thought reasoning operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations. While both approaches exploit iterative computation, their comparative capabilities remain underexplored. In this work, we present a formal analysis showing that latent thought admits efficient parallel computation, in contrast to the inherently sequential nature of CoT. Conversely, CoT enables approximate counting through stochastic decoding. These separations suggest the tasks for which depthdriven recursion is more suitable, thereby offering practical guidance for choosing between reasoning paradigms.

Abstract:
Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. To improve robustness and solution quality, recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. However, parallel execution comes at a significant computational cost: when different teams independently reason about similar subproblems or execute analogous steps, they repeatedly perform substantial overlapping computation. To address these limitations, in this paper, we propose Learning to Share (LTS), a learned shared-memory mechanism for parallel agentic frameworks that enables selective cross-team information reuse while controlling context growth. LTS introduces a global memory bank accessible to all teams and a lightweight controller that decides whether intermediate agent steps should be added to memory or not. The controller is trained using stepwise reinforcement learning with usage-aware credit assignment, allowing it to identify information that is globally useful across parallel executions. Experiments on the AssistantBench and GAIA benchmarks show that LTS significantly reduces overall runtime while matching or improving task performance compared to memory-free parallel baselines, demonstrating that learned memory admission is an effective strategy for improving the efficiency of parallel agentic systems.

Abstract:
Multimodal Large Language Models excel at offline audiovisual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio-visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple-choice questions or use shorter videos. In this paper, we introducePhoStream, the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline and LLM-as-a-Judge evaluation for open-ended responses. Experiments reveal a temporal asymmetry in LLM-judged scores (0-100): models perform well on Instant and Backward tasks (Gemini 3 Pro exceeds 80), but drop sharply on Forward tasks (16.40), largely due to early responses before the required visual and audio cues appear. This highlights a fundamental limitation: current MLLMs struggle to decidewhento speak, not justwhatto say.

Abstract:
Neuromorphic event cameras possess superior temporal resolution, power efficiency, and dynamic range compared to traditional cameras. However, their asynchronous and sparse data format poses a significant challenge for conventional deep learning methods. Existing methods either convert the events into dense synchronous frame representations for processing by powerful CNNs or Transformers, but lose the asynchronous, sparse and high temporal resolution characteristics of events during the conversion process; or adopt irregular models such as sparse convolution, spiking neural networks, or graph neural networks to process the irregular event representations but fail to take full advantage of GPU acceleration. Inspired by wordto-vector models, we draw an analogy between words and events to introduce event2vec, a novel representation that allows neural networks to process events directly. This approach is fully compatible with the parallel processing capabilities of Transformers. We demonstrate the effectiveness of event2vec on the DVS Gesture, ASL-DVS, and DVS-Lip benchmarks, showing that event2vec is remarkably parameter-efficient, features high throughput and low latency, and achieves high accuracy even with an extremely low number of events or low spatial resolutions. Event2vec introduces a novel paradigm by demonstrating for the first time that sparse, irregular event data can be directly integrated into high-throughput Transformer architectures. This breakthrough resolves the long-standing conflict between maintaining data sparsity and maximizing GPU efficiency, offering a promising balance for real-time, low-latency neuromorphic vision tasks. The code is provided.

Abstract:
Recent progress in multiturn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies continue to generate low-information actions that neither reduce uncertainty nor advance task progress. To address this issue, we propose Token- and Turn-level Policy Optimization (T^2PO), an uncertainty-aware framework that explicitly controls exploration at fine-grained levels. At the token level, T^2PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T^2PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. We evaluate T^2PO in diverse environments, including WebShop, ALFWorld, and Search QA, demonstrating substantial gains in training stability and performance improvements with better exploration efficiency. Code is available at https://anonymous.4open.science/r/T2PO-ICML-3C21.

Abstract:
Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, highquality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose HindsightinstructionReplay(HiR), a novel sample-efficient RL framework for complex instruction following tasks, which employs aselect-then-rewritestrategy toreplay failed attempts as successesbased on the constraints that have been satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction- and response-level to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that the proposed HiR yields promising results across different instruction following tasks, while requiring less computational budget. Our code and dataset are available at anonymous url.

Abstract:
Reconstructing 3D scenes using 3D Gaussian Splatting (3DGS) from sparse views is an illposed problem due to insufficient information, often resulting in noticeable artifacts. While recent approaches have sought to leverage generative priors to complete information for under-constrained regions, they struggle to generate content that remains consistent with input observations. To address this challenge, we propose GSFixer, a novel framework designed to improve the quality of 3DGS representations reconstructed from sparse inputs. The core of our approach is the reference-guided video restoration model, built upon a DiT-based video diffusion model trained on paired artifact 3DGS renders and clean frames with additional reference-based conditions. Considering the input sparse views as references, our model integrates both 2D semantic and 3D geometric features of reference views extracted from the visual geometry foundation model, enhancing the semantic coherence and 3D consistency when fixing artifact novel views. Furthermore, we introduce a reference-guided trajectory sampling strategy that ensures both angular coverage and view quality, further enhancing reconstruction fidelity. Considering the lack of suitable benchmarks for 3DGS artifact restoration evaluation, we present DL3DV-Res which contains artifact frames rendered using low-quality 3DGS. Extensive experiments demonstrate our GSFixer outperforms current state-of-the-art methods in 3DGS artifact restoration and sparse-view 3D reconstruction. The Project will be made public.

Abstract:
Diffusion models learn a timeindexed score field \mathbfs_\theta(\mathbfx_t,t) that often inherits approximate equivariances (flips, rotations, circular shifts) from in-distribution (ID) data and convolutional backbones. Most diffusion-based out-of-distribution (OOD) detectors exploit score magnitude or local geometry (energies, curvature, covariance spectra) and largely ignore equivariances. We introduce Group-Equivariant Posterior Consistency (GEPC), a training-free probe that measures how consistently the learned score transforms under a finite group G, detecting equivariance breaking even when score magnitude remains unchanged. At the population level, we propose the ideal GEPC residual which averages an equivariance-residual functional over G, and we derive ID upper bounds and OOD lower bounds under mild assumptions. GEPC requires only score evaluations and produces interpretable equivariance-breaking maps. On OOD image benchmark datasets, we show that GEPC achieves competitive or improved AUROC compared to recent diffusion-based baselines while remaining computationally lightweight. On high-resolution synthetic aperture radar imagery where OOD corresponds to targets or anomalies in clutter, GEPC yields strong target-background separation and visually interpretable equivariance-breaking maps.

Abstract:
Flow maps enable highquality image generation in a single forward pass. However, unlike iterative diffusion models, their lack of an explicit sampling trajectory impedes incorporating external constraints for conditional generation and solving inverse problems. We put forthVariational Flow Maps, a framework for conditional sampling that shifts the perspective of conditioning from "guiding a sampling path", to that of "learning the proper initial noise". Specifically, given an observation, we seek to learn anoise adapter modelthat outputs a noise distribution, so that after mapping to the data space via flow map, the samples respect the observation and data prior. To this end, we develop a principled variational objective that jointly trains the noise adapter and the flow map, improving noise-data alignment, such that sampling from complex data posterior is achieved with a simple adapter. Experiments on various inverse problems show that VFMs produce well-calibrated conditional samples in a single (or few) steps. For ImageNet, VFM attains competitive fidelity while accelerating the sampling by orders of magnitude compared to alternative iterative diffusion/flow models.

Abstract:
RetrievalAugmented Generation (RAG) systems remain brittle under realistic retrieval noise, even when the required evidence appears in the top-K results. A key reason is that retrievers and rerankers optimize solely for relevance, often selecting either trivial, answer-revealing passages or evidence that lacks the critical information required to answer the question, without considering whether the evidence is suitable for the generator. We propose \textttBAR-RAG, which reframes the reranker as a boundary-aware evidence selector that targets the generator’s Goldilocks Zone—evidence that is neither trivially easy nor fundamentally unanswerable for the generator, but is challenging yet sufficient for inference and thus provides the strongest learning signal. \textttBAR-RAG trains the selector with reinforcement learning using generator feedback, and adopts a two-stage pipeline that fine-tunes the generator under the induced evidence distribution to mitigate the distribution mismatch between training and inference. Experiments on knowledge-intensive question answering benchmarks show that \textttBAR-RAG consistently improves end-to-end performance under noisy retrieval, achieving an average gain of 10.3% over strong RAG and reranking baselines while substantially improving robustness.

Abstract:
Existing approaches for watermarking AIgenerated images often rely on post-hoc methods applied in pixel space, introducing computational overhead and potential visual artifacts. In this work, we explore latent space watermarking and introduce DistSeal, a unified approach for latent watermarking that works across both diffusion and autoregressive models. Our approach works by training post-hoc watermarking models in the latent space of generative models. We demonstrate that these latent watermarkers can be effectively distilled either into the generative model itself or into the latent decoder, enabling in-model watermarking. The resulting latent watermarks achieve competitive robustness while offering similar imperceptibility and up to 20x speedup compared to pixel-space baselines. Our experiments further reveal that distilling latent watermarkers outperforms distilling pixel-space ones, providing a solution that is both more efficient and more robust.

Abstract:
This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, finegrained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene-by-scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset OmniDenseCap-40K and present Omni-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that Omni-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code will be made publicly available.

Abstract:
Tabular foundation models, such as TabPFNv2 and TabICL, have recently dethroned gradientboosted trees at the top of predictive benchmarks, demonstrating the value of in-context learning for tabular data. We introduce TabICooL, a new state-of-the-art foundation model for regression and classification built on three pillars: (1) a novel synthetic data generation engine designed for high pretraining diversity; (2) various architectural innovations, including a new scalable softmax in attention improving generalization to larger datasets without prohibitive long-sequence pretraining; and (3) optimized pretraining protocols, notably replacing AdamW with the Muon optimizer. On the TabArena and TALENT benchmarks, TabICooL without any tuning matches or surpasses the performance of the current state-of-the-art, RealTabPFN-2.5 (hyperparameter-tuned, ensembled, and fine-tuned on real data). With only moderate pretraining compute, TabICooL generalizes effectively to million-scale datasets under 50GB GPU memory while being markedly faster than RealTabPFN-2.5. We provide extensive ablation studies to quantify these contributions and commit to open research by releasing our weights, synthetic data engine, and pretraining code (upon publication).

Abstract:
Current foundation model for photoplethysmography (PPG) signals is challenged by the intrinsic redundancy and noise of the signal. Standard masked modeling often yields trivial solutions while contrastive methods lack morphological precision. To address these limitations, we propose a Statisticalprior Informed Generative Masking Architecture (SIGMA-PPG), a generative foundation model featuring a prior-guided adversarial masking mechanism, where a reinforcement learning-driven teacher leverages statistical priors to create challenging learning paths that prevent overfitting to noise. We also incorporate a semantic consistency constraint via vector quantization to ensure that physiologically identical waveforms—even those altered by recording artifacts or minor perturbations—map to shared indices. This enhances codebook semantic density and eliminates redundant feature structures. Pre-trained on over 120,000 hours of data, SIGMA-PPG achieves superior average performance compared to five state-of-the-art baselines across 12 diverse downstream tasks.

Abstract:
Recent advances in large language model (LLM) have empowered autonomous agents to perform complex tasks that require multiturn interactions with external tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environments generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions compared to environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared to collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks validate that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization.

Abstract:
While large audiolanguage models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code will be made publicly available.

Abstract:
Diffusionbased world models have shown strong potential for unified world simulation, but the iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without training, we find that policies designed for single-modal diffusion transfer poorly to world models due to two world-model-specific obstacles: token heterogeneity from multi-modal coupling and spatial variation, and non-uniform temporal dynamics where a small set of hard tokens drives error growth, making uniform skipping either unstable or overly conservative. We propose WorldCache, a caching framework tailored to diffusion world models. We introduce Curvature-guided Heterogeneous Token Prediction, which uses a physics-grounded curvature score to estimate token predictability and applies a Hermite-guided damped predictor for chaotic tokens with abrupt direction changes. We also design Chaotic-prioritized Adaptive Skipping, which accumulates a curvature-normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Experiments on diffusion world models show that WorldCache delivers up to 3.7× end-to-end speedups while maintaining 98% rollout quality, demonstrating the vast advantages and practicality of WorldCache in resource-constrained scenarios.

Abstract:
Frontier large language models (LLMs) are increasingly capable of carrying out longrunning, real-world tasks. However, as the amount of context grows, their reliability often deteriorate, a phenomenon known as "context rot". Existing long-context benchmarks primarily focus on single-step settings that evaluate a model’s ability to retrieve information from a long snippet. In realistic scenarios, however, LLMs often need to act as agents that explore environments, follow instructions and plans, extract useful information, and predict correct actions under a dynamically growing context. To assess language agents in such settings, we introduce LOCA-bench (a benchmark forLOng-ContextAgents). Given a task prompt, LOCA-bench leverages automated and scalable control of environment states to regulate the agent’s context length. This design enables LOCA-bench to extend the context length potentially to infinity in a controlled way while keeping the underlying task semantics fixed. LOCA-bench evaluates language agents as a combination of models and scaffolds, including various context management strategies. While agent performance generally degrades as the environment states grow more complex, advanced context management techniques can substantially improve the overall success rate. We will open-source LOCA-bench to provide a platform for evaluating models and scaffolds in long-context, agentic scenarios.

Abstract:
The advancement of LLMs and their accessibility have triggered renewed interest in multiagent reinforcement learning as robust and adaptive frameworks for dynamically changing environments. This paper introduces RL-Focal, a two-stage RL agent framework that routes and ensembles LLMs. First, we develop the Decider RL-agent, which learns to dynamically select an ensemble of small size (m_i) among N LLMs (m_i \ll N) for incoming queries from a user-defined downstream task i, by maximizing both error-diversity and reasoning-performance of the selected ensemble through iterative updates of task-adaptive rewards and policy. Second, to enable effective fusion of dynamically selected LLMs, we develop the stage-2 Fusion RL-agent, which learns to resolve reasoning conflicts from different LLMs and dynamically adapt to different ensemble teams composed by the Decider Agent for different downstream tasks. \em Third, we introduce the focal diversity metric to better model the error correlations among multiple LLMs further improving the generalization performance of the Decider Agent, which actively prunes the ensemble combinations. By focal diversity, we enhance performance across tasks by effectively promoting reward-aware and policy-adaptive ensemble selection and inference fusion. Extensive evaluations on five benchmarks show that RL-Focal achieves the performance improvement of 8.48% with an ensemble of small size compared to the best individual LLM in a pool and offers stronger robustness. Code is available at \urlhttps://anonymous.4open.science/r/rl-focal-8DCF/

Abstract:
Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model's properties. This is exacerbated by recent experiments that show datasets can transmit signals that are not directly observable from individual datapoints (Halawi et al., 2024; Betley et al., 2025b;Cloud et al., 2025; Betley et al., 2025a), posing a conceptual challenge for datasetcentric understandings of LLM training and suggesting a missing fundamental account of such phenomena. Towards understanding such effects, inspired by recent work on the linear structure of LLMs (Park et al., 2024; Golowich et al., 2025b), we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce LOGIT-LINEAR SELECTION (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real-world datasets so that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect persists for the selected subset, across models with varying architectures, supporting its generality and universality.

Abstract:
Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While ParameterEfficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR)—a lightweight, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs at inference time by simply adding its output logits to those of the backbone. This additive structure enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Furthermore, UniR demonstrates weak-to-strong generalization, where reasoning modules trained on smaller models effectively guide much larger LLMs in the same model family, and generalize across domains such as in vision language models and medical reasoning. Experiments on mathematical reasoning and machine translation show that UniR surpasses existing fine-tuning methods.

Abstract:
The development of TimeSeries Forecasting (TSF) models is often constrained by the lack of comprehensive datasets, especially in Global Station Weather Forecasting (GSWF), where existing datasets are small, temporally short, and spatially sparse. To address this, we introduce WEATHER-5K, a large-scale observational weather dataset that better reflects real-world conditions, supporting improved model training and evaluation. While recent TSF methods perform well on benchmarks, they lag behind operational Numerical Weather Prediction systems in capturing complex weather dynamics and extreme events. We propose PhysicsFormer, a physics-informed forecasting model combining a dynamic core with a Transformer residual to predict future weather states. Physical consistency is enforced via pressure–wind alignment and energy-aware smoothness losses, ensuring plausible dynamics while capturing complex temporal patterns. We benchmark PhysicsFormer and other TSF models against operational systems across several weather variables, extreme event prediction, and model complexity, providing a comprehensive assessment of the gap between academic TSF models and operational forecasting. The dataset and benchmark implementation are available at: https://anonymous.4open.science/r/WEATHER-5K-BF05.

Abstract:
Unlearning in large language models (LLMs) aims to remove specified data, but its efficacy is typically assessed with tasklevel metrics like accuracy and perplexity. We demonstrate that these metrics are often misleading, as models can appear to forget while their original behavior is easily restored through minimal fine-tuning. This phenomenon of \emphreversibility suggests that information is merely suppressed, not genuinely erased. To address this critical evaluation gap, we introduce a \emphrepresentation-level analysis framework. Our toolkit comprises PCA similarity and shift, centered kernel alignment (CKA), and Fisher information, complemented by a summary metric, the mean PCA distance, to measure representational drift. Applying this framework across six unlearning methods, three data domains, and multiple LLMs, we identify four distinct forgetting regimes based on their \emphreversibility and \emphcatastrophicity. We compare recovery strategies and show that relearning efficiency relies on the data source. We also find that irreversible, non-catastrophic forgetting is exceptionally challenging. By probing unlearning limits, we identify a case of seemingly irreversible, targeted forgetting, offering insights for more robust erasure algorithms. Overall, our findings expose a gap in current evaluation and establish a representation-level foundation for trustworthy unlearning.

Abstract:
Modern language models (LMs) tend to memorize portions of their training data and reproduce verbatim spans. When the underlying sources are sensitive or copyrightprotected, such reproduction raises issues of consent and compensation for creators and compliance risks for developers. We propose Proximal Decoding, a plug-and-play inference-time method for suppressing verbatim reproduction: it enables decoding from any risky LM trained on mixed-license data by keeping generation in bounded proximity to a permissively trained safe LM. Proximal Decoding does so by adaptively allocating a user-chosen information budget over the generation trajectory and enforcing per-step constraints that yield a sequence-level guarantee, enabling a tunable risk–utility trade-off. To make Proximal Decoding practically useful, we introduce a new permissively trained safe model (Comma 1.7B), as well as Proximal\_\mathrmByte, a byte-level variant of our method that enables cross-vocabulary fusion via the ByteSampler (Hayase et al., 2025) framework. We evaluate our methods across six model pairs on long-form evaluations of copyright risk and utility. Proximal and Proximal\_\mathrmByte define a new Pareto frontier, preserving near-original fluency and factuality while eliminating up to 75% of the measurable copying gap (averaged over six copying metrics) between the risky baseline and a safe reference, at a modest inference overhead.

Abstract:
Tokenization is used almost universally by modern language models, enabling efficient text representation using multibyte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model’s generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect languages such as Chinese as well as code generation, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals

Abstract:
Repurposing large visionlanguage models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five professional software of OSWorld, ScienceBoard and AndroidWorld. Our approach achieves a significant improvement over a competitive open-source CUA, UI-TARS. All the code and models will be made publicly available to foster further research.

Abstract:
Ultrahigh-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top‑K pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework to efficiently select visual tokens under strict context budget. Specifically, we leverage text-guided, multi-scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low-cost feature extraction. Furthermore, by introducing region-wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. The experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks.

Abstract:
Mixtureof-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. In this paper, we propose OmniMoE, a system-algorithm co-designed MoE framework that pushes granularity to the extreme with vector-level Atomic Experts, orchestrating their routing and execution at scale within a single MoE layer, while retaining a shared dense MLP for general-purpose processing. While this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(\sqrtN); and (ii) Expert-Centric Scheduling that inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. Validated on seven benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9% zero-shot accuracy across seven benchmarks, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73ms to 6.7ms (a 10.9× speedup) compared to PEER, demonstrating that massive-scale fine-grained MoE can be fast and accurate.

Abstract:
Recent approaches for video generation with camera control often create anchor videos (i.e., rendered videos that approximate desired camera motions) to guide diffusion models as a structured prior, by rendering from estimated point clouds following camera trajectories. However, errors in point cloud and camera trajectory estimation often lead to inaccurate anchor videos with higher training cost and low efficiency, as the model is forced to compensate for rendering misalignments. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that constructs wellaligned training anchor videos without the need for camera pose or point cloud estimation. Concretely, we create highly precise anchor videos by masking source videos based on first-frame visibility, which ensures strong alignment, eliminates the need for camera/point cloud estimation, and thus can be readily applied to any in-the-wild video. Furthermore, we introduce Anchor-ControlNet, a lightweight module that integrates anchor video guidance in visible regions to pretrained video diffusion models, with less than 1% of additional parameters. EPiC achieves efficient training with substantially fewer parameters, training steps, and less data, and generalizes robustly to anchor videos made with point clouds at test time, enabling precise 3D-informed camera control. EPiC achieves SoTA performance on RealEstate10K and MiraData for I2V camera control task. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video (V2V) scenarios.

Abstract:
Deterministic inference is increasingly critical for large language model (LLM) applications such as LLMas-a-judge evaluation, multi-agent systems, and Reinforcement Learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs. While prior work has addressed batch-size–related nondeterminism through batch-invariant kernels, determinism across different TP sizes remains an open problem, particularly in RL settings, where the training engine typically uses Fully Sharded Data Parallel (FSDP) (i.e., TP = 1) while the rollout engine relies on multi-GPU TP to maximize the inference throughput, creating a natural mismatch between the two. This precision mismatch problem may lead to suboptimal performance or even collapse for RL training. We identify and analyze the root causes of TP-induced inconsistency and proposeTree-Based Invariant Kernels (TBIK), a set of custom matrix multiplication and reduction kernels that guarantee bit-wise identical results regardless of TP size. Our key insight is to enforce a consistent reduction order across and within GPUs. We implement TBIK in Triton and integrate it into vLLM and FSDP, achievingbit-wise deterministic inferenceacross different TP sizes andzero probability divergencebetween vLLM and FSDP in RL training pipelines. This eliminates the numerical mismatch caused by different parallel strategies,enabling true on-policy RL at a large scale for the first time.

Abstract:
We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed ModalityAware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters—disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open-source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality-specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture.

Abstract:
A critical challenge in developing capable AI agents is defining their "action space''—the set of possible actions they can take. These spaces can range widely, from generating code and using language skills to operating on latent representations or raw joystick controls. Through a largescale study in Minecraft, we discovered a major dilemma: no single action space is universally best. The most effective action space is highly task-dependent, which complicates the goal of building one generalist agent that can handle everything. To solve this, we introduce Chain-of-Action (CoA), a novel framework that unifies high-level abstracted actions and low-level control actions within a single model. With CoA, an abstract goal is not just a final command; instead, it serves as an intermediate reasoning step that guides the model to generate the precise, executable actions needed to complete the task. Furthermore, we show that an All-in-One generalist agent, trained on a diverse mix of action spaces using CoA, learns a more generalizable policy. This unified agent achieves a new state-of-the-art, outperforming strong, specialized baselines. To support the research community, we are releasing the OpenHA (Open Hierarchical Agents) suite, which includes our benchmark of over 800 tasks, curated datasets, source code, and all model checkpoints at: \urlhttps://anonymous.4open.science/anonymize/OpenHA-ACFE.

Abstract:
Despite recent breakthroughs, audio foundation models struggle in processing complex multisource acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based toolcalling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying upon distribution-based scoring. We highly encourage readers to visit our demo to better understand the capabilities of AudioChat: https://audiochat-icml-2026.github.io/.

Abstract:
Accurate precipitation nowcasting is vital for disaster mitigation, but deep learning methods suffer a key tradeoff: regression models produce over-smoothed, spectrally decaying predictions that blur convective details and violate turbulence power laws; diffusion models generate realistic yet unanchored hallucinations lacking physical grounding. We propose Spectral-Decoupled Iterative Refinement (SDIR), a deterministic framework that reformulates nowcasting as progressive frequency-decoupled refinement. SDIR first extracts a stable low-frequency synoptic skeleton, then iteratively refines high-frequency textures under physical constraints, eliminating both blurring and hallucinations. It features a dual-path design: the Synoptic Frequency-Guided Former (SFG-Former) with Scale-Adaptive Transformers for global structure, and the Fourier Residual Refiner (FR-Refiner) with Scale-Conditioned Fourier Neural Operators for fine residuals. A Physically-Consistent Power Spectral Density (PCPSD) loss with dynamic masking enforces turbulence-consistent spectral distribution. Experiments on three benchmarks show SDIR significantly outperforms SOTA in spatial accuracy while achieving spectral fidelity competitive with diffusion-based methods, enabling reliable high-resolution operational nowcasting.

Abstract:
Diffusion models can unintentionally memorize training samples, raising concerns about privacy and copyright. While recent methods can detect memorization, they often rely on global or modelspecific signals and provide limited insight into where memorization appears within a generated image. We provide a geometric characterization of local memorization as a coordinate-wise variance collapse. However, such collapse can also arise from intrinsic data constraints rather than overfitting. To isolate overfitting-driven memorization, we propose curvature-difference methods that subtract the curvature of an underfitted baseline, either the unconditional model or a less-trained version of itself. We further derive a score-difference surrogate that provides a geometric explanation for the widely used score-difference-based detection metric. Experiments on Stable Diffusion, evaluated against ground-truth memorization masks, show that our method outperforms the prior attention-based localization method.

Abstract:
Multimodal Large Language Models (MLLMs) demonstrate remarkable visual understanding, yet their reliability in interactive settings is severely undermined by hallucination snowballing: a phenomenon where initial errors amplify across conversational turns, leading to a collapse in coherence. This failure reveals a fundamental vulnerability where models progressively neglect visual grounding in favor of overrelying on polluted textual history. Existing benchmarks are predominantly confined to single-turn VQA or simplistic dialogues, which fail to capture the complex dynamics of error propagation in realistic, long-horizon interactions. To address this, we introduce MM-Snowball, the first benchmark for fine-grained diagnosis of hallucination snowballing within 6-turn dialogues. Extensive evaluation shows that our benchmark poses a significant challenge even to advanced MLLMs and reveals the inefficacy of existing mitigation methods designed for single-turn tasks. To counteract this degradation, we propose Conflict-Aware Visual Rectification (CAVR). This training-free framework mitigates snowballing through a synergistic dual-mechanism that refreshes visual grounding at the representation level and rectifies output distributions at the logit level, effectively re-anchoring the model to visual facts. Experiments demonstrate that our proposed CAVR achieves state-of-the-art performance, offering a promising path toward more reliable interactive AI.

Abstract:
Existing human attention modeling methods persist as highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remain scenedependent and task-specific, failing to practically generalize in real-world applications. To address the fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as an asymmetric entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker--Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6% across various scenarios, while achieving approximately a 4× speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks.

Abstract:
Preferencebased reinforcement learning (PbRL) avoids explicit reward engineering by learning from pairwise human preference feedback. Existing offline PbRL methods typically follow a two-stage pipeline, first learning a reward or preference model from labeled preferences and then performing offline RL on unlabeled data. We revisit offline PbRL through the lens of reward-free representation learning (RFRL) from the zero-shot RL literature, and propose a new training framework that first learns latent successor-measure representations from reward-free offline data, followed by contrastive search and fine-tuning using preference data. Through extensive experiments and ablations, we show that our method achieves superior preference efficiency over offline PbRL baselines. This work is the first to connect RFRL with PbRL, highlighting its potential as a feedback-efficient solution.

Abstract:
Large language models are increasingly deployed in multiturn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning is able to effectively address multi-turn dynamics but is prohibitively expensive due to the cost of generating full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we novelly propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is mathematically equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning.

Abstract:
Effectively modeling nonstationary dynamics in probabilistic multivariate time series(MTS) forecasting requires balancing expressiveness with robustness. Existing parametric approaches benefit from strong inductive biases but lack flexibility, whereas deep generative models struggle to capture complex temporal dependencies without extensive data and compute. We introduce Parametric Prior Mapping (PPM), a framework that injects parametric structural priors into a generative modeling process. Specifically, PPM utilizes a parametric estimator to derive a dynamic, adaptive prior, which guides the learning of a complex predictive distribution via a learnable mapping. This design allows the model to retain the efficiency of parametric methods while exploiting the expressive power of generative models. Trained with a hybrid objective, PPM yields precise forecasts with well-calibrated uncertainty estimates. Empirical results show that PPM outperforms existing baselines in handling non-stationary data, offering a superior trade-off between accuracy and computational efficiency.

Abstract:
Lowrank adaptation (LoRA) enables parameter-efficient specialization of foundation models, but the proliferation of task-specific adapters fragments capabilities across many adapters, complicating reuse and deployment. We study the problem of merging T LoRAs into a single rank-r LoRA, thereby preserving the benefits of low-rank structure. Existing Merge-then-Compress pipelines treat the rank constraint as an afterthought: they merge adapters in the full parameter space, then compress the merged result to rank r via truncated SVD. However, full-parameter merging may destroy the low-rank structure, making it difficult for subsequent compression to recover an effective rank-r LoRA. We propose Compress-then-Merge (CtM), a reversed paradigm that enforces the rank-r bottleneck _before_ merging: CtM computes shared r-dimensional subspaces using only the LoRA weights to capture cross-adapter common structure, projects each adapter into the shared subspaces to obtain r× r coordinates, and then applies standard merging rules in this reduced space. CtM guarantees a rank-r LoRA by construction, avoiding post-hoc truncation, and enables efficient computation in the core space spanned by concatenated LoRA factors. Experiments on ViT-B/32 and LLaMA3-8B demonstrate consistent improvements over single-LoRA-output baselines, while remaining competitive with (and in some cases surpassing) full-parameter merging methods.

Abstract:
Distribution Matching Distillation (DMD) is a widely used paradigm for accelerating inference in fewstep video diffusion models. However, DMD-style training faces a structural bottleneck: the student-side auxiliary score network (the fake score) must closely track a continuously evolving generator. Updating the fake score too frequently increases training cost and can over-emphasize inner-loop tracking, while infrequent updates lead to tracking lag that destabilizes training and degrades generation consistency. To address this issue, we propose Score Gradient Matching Distillation (SGMD). SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD, SGMD achieves an approximately ～ 3× training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency.

Abstract:
The design of environments plays a critical role in shaping the development and evaluation of cooperative multiagent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox designed for reconfigurable multi-agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade-offs across a diverse spectrum of task complexities. Leveraging JAX for hardware-accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: https://anonymous.4open.science/r/TABX-00CA.

Abstract:
Transferring knowledge by finetuning large-scale pre-trained networks has become a standard paradigm for downstream tasks, yet the knowledge of a pre-trained model is tightly coupled with monolithic architecture, which restricts flexible reuse across models of varying scales. In response to this challenge, recent approaches typically resort to either parameter selection, which fails to capture the interdependent structure of this knowledge, or parameter prediction using generative models that depend on impractical access to large network collections. In this paper, we identify the low-frequency components of model weights as the concrete carrier of foundational, task-agnostic knowledge—its "learngene"—and validate this by demonstrating its efficient inheritance by downstream models and tasks. Based on this insight, we propose FRONT (FRequency dOmain kNowledge Transfer), a novel framework that uses the Discrete Cosine Transform (DCT) to isolate the low-frequency "learngene". This learngene can be seamlessly adapted to initialize models of arbitrary size via simple truncation or padding, a process that is entirely training-free. For enhanced performance, we propose an optional low-cost refinement process that introduces a spectral regularizer to further improve the learngene's transferability. Extensive experiments demonstrate that FRONT achieves the state-of-the-art performance, accelerates convergence by up to 15× in vision tasks, and reduces training FLOPs by an average of 40.5% in language tasks.

Abstract:
Proteinprotein interactions (PPIs) are essential for a wide range of biological processes. However, existing PPI prediction approaches still face two major limitations. First, in aggregating residue features into global protein features, they ignore the hierarchical organization of proteins, in which meso-scale motifs are the key regulators of PPIs. Second, despite the availability of complementary information across the sequence, structure, and function modalities, current PPI methods fail to integrate all three modalities effectively. To address these limitations, we propose a Hierarchical Motif-based M ultiM odal protein Encoder for PPI Prediction (MMM-PPI), which constructs protein embeddings for PPI prediction in a bottom-up, multi-modal manner. (i) At the micro-scale, we encode three modal residue features; (ii) At the meso-scale, we use a novel multimodal motif encoder to aggregate residues into spatially-informed motif embeddings; (iii) At the macro-scale, we introduce a multimodal protein encoder to integrate motif embeddings into protein embeddings, considering both the relative importance of motifs in PPI and correlations between different modalities. The pre-trained encoder can be used off-the-shelf for large-scale PPI prediction. Extensive experiments on multiple PPI datasets demonstrate that MMM-PPI outperforms state-of-the-art multi-label PPI prediction models, particularly in scenarios with challenging data partitions and limited training data.

Abstract:
Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration–exploitation tradeoff, resulting in unstable optimization and sub-optimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck (IB) theory that evaluates policy’s exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularization methods fail to consistently maintain balance during training with suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and utilizes a novel IB-guided tree sampling strategy that not only improves the efficiency of online sampling with 50% more trajectories under same token budget, but also reuses the tree structure for effective IB-Score Monte Carlo estimation. Extensive experiments across standard benchmarks show that our method significantly outperforms GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches.

Abstract:
MLLMpowered embodied agents deployed in real-world environments encounter physical hazards. However, existing approaches lack explicit mechanisms for identifying hazards and reasoning about action-conditioned risks, leading agents to either miss risky interactions or over-identify risks. To address this, we propose EMBGUARD, the first MLLM-based safety guardrail for embodied agents designed to decouple physical risk reasoning from agent policy. By evaluating a (visual observation, action) pair, EMBGUARD identifies hazardous configurations and provides natural language explanations of potential risks. Alongside EMBGUARD, we contribute EMBHAZARD, a training dataset of 17K action-conditioned pairs, and EMBGUARDTEST, a benchmark of 189 manually curated real-world scenarios spanning seven physical risk categories. Through compositional variation of hazards and actions, we generate diverse risky and benign scenarios that agents may encounter during planning. Despite its compact size (2B, 4B), EMBGUARD achieves performance competitive with proprietary MLLMs (e.g., GPT-5.1, Gemini-2.5-Pro) while significantly reducing the false-positive rates that hinder realtime deployment. We make the code, data, and models publicly available at https://anonymous.4open.science/r/EMBGuard-742D.

Abstract:
Recent advancements in agentic testtime scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine-grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks.

Abstract:
Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) QK=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we discovered that our transformers perform on par or occasionally better than the QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%—enabling practical on-device inference. We further show that Q--K=V preserves quality because keys and values can share representational space, whereas Q=K-V breaks attention directionality. Our results establish projection sharing as a new optimization axis for memory-efficient transformers, especially for edge deployment.

Abstract:
Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverseKL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward-KL provides a satisfactory solution, as both disrupt the efficiency–coverage trade-off by either inducing reward hacking or allocating probability mass to off-target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks.

Abstract:
Supervised classifier learning has a theoretical optimum — Neural Collapse (NC) — yet standard training does not reach it in practice. We trace this failure to a geometric limitation: crossentropy is invariant to joint rescaling of features and weights, leaving radial degrees of freedom unconstrained and the loss landscape degenerate. Projecting optimization onto the unit hypersphere eliminates this degeneracy and exposes a hidden equivalence: normalized softmax classification and supervised contrastive learning are conceptually the same, both optimizing angular similarity to class prototypes. We formalize this unification by proving that supervised contrastive learning already produces an optimal classifier during training, the prototype classifier whose weights are given by class-wise feature means, rendering subsequent classifier learning through linear probing redundant. Building on this framework, we identify two computational bottlenecks that slow convergence to NC: the small effective negative set in classifier learning (limited to K class prototypes), and the coupling of competing optimization terms through a shared normalization. We address these with NTCE, which expands the negative set from K classes to M batch instances, and NONL, which normalizes only over negatives to decouple intra-class alignment from inter-class repulsion. Empirically, our methods surpass cross-entropy accuracy on four benchmarks including ImageNet-1K, achieve \ge95% NC across all metrics, and yield consistent gains in transfer learning (+5.5% mean relative improvement), long-tailed classification (up to +8.7%), and robustness (lower mCE), while eliminating hours of post-hoc classifier training.

Abstract:
Unifying multimodal understanding and generation is a compelling frontier that is beginning to emerge in the medical field. However, the limited existing unified medical models typically treat understanding and generation as disjoint objectives, lacking a meaningful functional synergy. In this work, we identify and address a critical question in unified medical modeling: what form of “understanding” truly benefits generation. We present SynerMedGen, a unified framework built on the proposed principle of generationaligned understanding, which synergizes understanding objectives with generation tasks via task alignment. SynerMedGen introduces three generation-aligned understanding tasks and a two-stage training strategy that transfers generation-beneficial representations learned during understanding training to medical image synthesis. Remarkably, even with understanding training alone, our SynerMedGen achieves strong zero-shot performance across 22 medical image synthesis tasks and demonstrates robust generalization to unseen datasets. When combined with generation training, SynerMedGen consistently outperforms state-of-the-art specialized medical image synthesis models as well as recent unified medical models. We also release a large-scale dataset named SynerMed consisting of 1M paired synthesis samples and 2M generation-derived understanding instances to support further research on understanding-generation synergy.

Abstract:
Temporal graph neural networks (TGNNs) have gained significant traction in solving realworld temporal graph tasks. However, their interpretability remains limited, as most TGNNs fail to identify which historical interactions most influence a given prediction. Despite promising progress on interpretable TGNNs, existing methods predominantly focus on previously seen historical interactions, which we term stability patterns, while overlooking newly emerging first-time interactions, which we term transition patterns. Both types of patterns are essential for faithful temporal explanations. To address this limitation, we propose ST-TGExplainer, a self-explainable TGNN that disentangles Stability and Transition patterns in temporal graphs for a more faithful Temporal GNN Explainer. Guided by a disentangled information bottleneck objective, ST-TGExplainer learns a compact explanatory subgraph that remains predictive of the event label while explicitly suppressing label-conditioned redundancy between stability and transition patterns. Extensive experiments demonstrate that ST-TGExplainer achieves strong predictive performance and yields more faithful explanations.

Abstract:
Dataset distillation aims to synthesize a compact proxy dataset that is unreadable or nonraw from the original dataset for privacy protection and highly efficient learning. However, previous approaches typically adopt a single-stage distillation paradigm, which suffers from learning specific patterns that overfit on a prior architecture, consequently suppressing the expression of semantics and leading to performance degradation across heterogeneous architectures. To address this issue, we propose a novel dual-stage distillation framework called DIVER, which leverages the pre-trained diffusion model to dive deeper into DIstilled data Via Expressive semantic Recovery, a process of semantic inheritance, guidance, and fusion. Semantic inheritance distills high-level semantic knowledge of abstract distilled images into the latent space to filter out architecture-specific ``noise" and retain the intrinsic semantics. Furthermore, semantic guidance improves the preservation of the original semantics by directing the reverse procedure. Ultimately, \textcolorredsemantic fusion is designed to provide semantic guidance only during the concrete phase of the reverse process, preventing semantic ambiguity and artifacts while maintaining the guidance information. Extensive experiments validate the effectiveness and efficiency of our method in improving classical distillation techniques and significantly improving cross-architecture generalization, requiring processing time comparable to raw DiT on ImageNet (256×256) with only 4 GB of GPU memory usage.

Abstract:
In offline goalconditioned reinforcement learning (GCRL), where one relies on a limited reward-free dataset to learn a generalist goal-reaching agent, compositional generalization becomes essential for reaching unseen goals under novel contextual variations. Most prior approaches pursue this via trajectory stitching over temporally contiguous segments, which limits composing behaviors across varying contexts. To overcome this limitation, we formalizeanalogy transductionas composing task-endogenous analogies with task-exogenous contexts and propose a novel analogy representation tailored for it. Grounded in our theory, this analogy representation captures what changes under optimal task execution, remains invariant to contextual variations, and is sufficient for optimal goal-reaching. We further contend that generalization to unseen analogy-context pairs is a practical obstacle in analogy transduction, and introduce a new approach for offline GCRL that enables analogy transduction beyond seen pairs to unseen combinations. We empirically demonstrate the effectiveness of our approach on OGBench manipulation environments, substantially outperforming prior methods that do not perform analogy transduction.

Abstract:
Group Relative Policy Optimization (GRPO), a prominent algorithm within the Reinforcement Learning from Verifiable Rewards (RLVR) framework, has achieved strong results in improving the reasoning capabilities of large language models (LLMs). However, GRPO is prone to advantage collapse, a failure mode where homogeneous rewards within a group (e.g., all correct or all incorrect answers) yield nearzero advantages and vanishing gradients. To address this, we introduce the Advantage Collapse Rate (ACR), the first diagnostic metric quantifying the proportion of training batches with ineffective gradients. Across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, we show that ACR strongly predicts training stagnation and final performance. We then propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that injects virtual reward samples, guided by real-time ACR monitoring, to enable learning from homogeneous groups without additional model rollouts. AVSPO reduces advantage collapse by 58–63% relative to GRPO and yields consistent accuracy gains of 4–6 percentage points across all model scales, while maintaining generalization on the evaluated out-of-domain task. Code and datasets are available at https://anonymous.4open.science/r/ACR-A557.

Abstract:
VisualLanguage Models (VLMs), such as CLIP, have achieved significant zero-shot performance on downstream tasks with various fine-tuning adaptation methods. However, recent studies have proven that adversarial attacks can significantly degrade the inference ability of VLMs, posing substantial risks to their practical applications. Prevalent test-time adaptation methods typically rely on the multi-view augmentation to implement various fine-tuning strategies, which struggle to identify semantic information and are prone to destroy the discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-guided Test-time Prompt Tuning (A-TPT), a semantics-preserving method designed for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions surviving under adversarial attacks. Furthermore, we leverage them to guide the spatially varying augmentation intensities and multi-view ensemble for prompt tuning and inference. Extensive experiments demonstrate that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Codes are available at https://anonymous.4open.science/r/A-TPT.

Abstract:
Large reasoning models (LRMs) substantially outperform their base LLM counterparts on challenging reasoning benchmarks, yet it remains poorly understood where base models go wrong during tokenby-token generation and how to narrow this gap efficiently. We study the base–reasoning gap by quantifying token-level distributional disagreement between a base model and a stronger reasoning model using likelihood-based divergences. Across benchmarks, we find that the reasoning advantage is highly sparse and concentrates on a small set of early, planning-related decision tokens. For instance, on Qwen3-0.6B, only ～8% of generated tokens account for the salient disagreement; these tokens concentrate early in the response, are strongly enriched in planning-related decisions (17×), and coincide with high base-model uncertainty—suggesting that base models fail mainly at early planning points that steer the subsequent reasoning trajectory. Building on these findings, we propose disagreement-guided token intervention, a simple inference-time delegation scheme that performs a one-token takeover by the reasoning model only at high-disagreement positions and immediately switches back to the base model. With a small intervention budget, this sparse delegation substantially recovers and can even surpass the performance of a same-size reasoning model on challenging reasoning tasks. Code is available at \urlhttps://anonymous.4open.science/r/RRTokenIntervention-EBDD.

Abstract:
We introduce a predictive model that estimates the pretraining loss of large models from model size (N), batch size (B) and number of weight updates (K). This is the first loss prediction model that can handle changing batch size. The model outperforms Chinchilla's loss model, a model of the test loss using the batch size and number of tokens, in terms of projecting the loss at extrapolated compute budgets (up to 1000 folds). A natural use of the model is to find optimal N,B,K configurations under explicit and compound resource constraints like time, memory and compute. In our experiments, the model-selected configurations are close to ground-truth optimal. Our work advocates for loss prediction as a better alternative to heuristic-based laws, which are growing in complexity.

Abstract:
The CLIP model has established itself as a cornerstone of largescale retrieval systems. However, its performance often degrades under distributional shifts such as multilingual, long-form, or multimodal queries. To avoid the prohibitive costs associated with retriever retraining or corpus re-embedding, we propose GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement), a plug-and-play approach that leverages LLM-based query rewriting to bridge these gaps. Unlike existing methods that lack explicit supervision, GRAPE integrates ranking signals into the rewriting LLM via Grouped Relative Policy Optimization (GRPO), ensuring rewritten queries are better aligned with the frozen retriever’s latent distribution. Crucially, we identify a score inflation phenomenon in naive similarity-based finetuning—where irrelevant candidates receive indiscriminately high scores—and mitigate it with a novel corpus-relative ranking-based reward. Extensive experiments across multilingual (Flickr30k-CN, CVLUE, XM3600), long-form (Wikipedia), and multimodal (CIRR) benchmarks demonstrate that GRAPE consistently improves performance, achieving an average gain of 4.9% in Recall@10 without any modification to the underlying retriever.

Abstract:
Time series foundation models rely on largescale pretraining over diverse datasets across domains, yet their heterogeneity in temporal patterns could hinder the effectiveness of training and learning transferable time series representations. Inspired a fundamental concept, normalized power spectral density (PSD) in signal processing, we assume harmonizing datasets via PSDs in the spectral domain could reduce mismatches and enhance pretraining. We then go beyond the direct intractable minimization optimization and innovatively reformulate it as a principled harmonization approach. Specifically, we propose Harmonizer, a module that reshapes spectral structures and implicitly harmonizing PSDs across datasets, which theoretically corresponds to a shared reparameterization of second-order temporal correlations. Our theoretical analysis further reveals token interactions with Harmonizer can be efficiently mediated by a compact set of resonators, motivating a HarmonicAttention design that performs self-attention in a low-dimensional interaction space. Then, we propose Olivia, a novel time series foundation model built upon these harmonization mechanisms. Extensive experiments on two large-scale benchmarks (TSLib and GIFT-Eval) and extra 6 datasets from GluonTS, demonstrate Olivia consistently achieves state-of-the-art performance under zero-shot, few-shot, and full-shot forecasting scenarios.

Abstract:
Preference alignment is a crucial posttraining step for large language models (LLMs) to ensure their outputs align with human values. However, post-training on real human preference data raises privacy concerns, as these datasets often contain sensitive user prompts and human judgments. To address this, we proposeDPPrefSyn, a novel algorithm for generating differentially private (DP) synthetic preference data to enable privacy-preserving preference alignment. DPPrefSyn is a principled framework grounded in the Bradley–Terry preference model and the intrinsic geometric structure of pairwise human preference data. It first learns an underlying preference model from private data with formal differential privacy guarantees, and then leverages the learned model together with public prompts to synthesize high-quality preference data. It exploits the shared linear structure of per-cluster reward models to effectively capture heterogeneous human preferences in private datasets, and leverages DP Principal Component Analysis (DP-PCA) to improve learning accuracy. Extensive experimental results demonstrate that DPPrefSyn achieves competitive alignment performance under strong DP guarantees. These findings highlight the potential of synthetic preference data as a practical alternative for privacy-preserving preference alignment across a broad range of applications. To the best of our knowledge, this is the first work to generate DP synthetic preference data for LLM alignment.

Abstract:
Vector Quantized Variational Autoencoder (VQVAE) has become a fundamental framework for learning discrete representations in image modeling. However, VQ-VAE models must tokenize entire images using a finite set of codebook vectors, and this capacity limitation restricts their ability to capture rich and diverse representations. In this paper, we propose ArcCosine Additive Margin VQ-VAE (ArcVQ-VAE), a novel vector quantization framework that introduces a spherical angular-margin prior (SAMP) for the codebook of a conventional VQ-VAE. The proposed SAMP consists of Ball-Bounded Norm Regularization, which constrains all codebook vectors within a time-dependent Euclidean ball, and ArcCosine Additive Margin Loss, which encourages greater angular separability among latent vectors. This formulation promotes more discriminative and uniformly dispersed latent representations within the constrained space, thereby enabling codebook vectors to capture richer information and leading to improved codebook utilization. Experimental results on standard image reconstruction and generation tasks show that ArcVQ-VAE outperforms baseline models in terms of reconstruction accuracy, representation diversity, and sample quality.

Abstract:
Discovering governing differential equations from observational data is a fundamental challenge in scientific machine learning. Existing symbolic regression approaches rely primarily on quantitative metrics; however, realworld differential equation modeling also requires incorporating domain knowledge to ensure physical plausibility. To address this gap, we propose DoLQ, a method for discovering ordinary differential equations with LLM-based qualitative and quantitative evaluation. DoLQ employs a multi-agent architecture: a Sampler Agent proposes dynamic system candidates, a Parameter Optimizer refines equations for accuracy, and a Scientist Agent leverages an LLM to conduct both qualitative and quantitative evaluations and synthesize their results to iteratively guide the search. Experiments on multi-dimensional ordinary differential equation benchmarks demonstrate that DoLQ achieves superior performance compared to existing methods, not only attaining higher success rates but also more accurately recovering the correct symbolic terms of ground truth equations. Our code is available at https://anonymous.4open.science/r/DoLQ/README.md.

Abstract:
Posttraining alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology. We formalize this bias theoretically, verify it empirically on preference datasets, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling (VS), a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities"), which relieves the pressure to produce a single "typical" answer. Experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), social dialogue simulation, synthetic data generation, and open-ended QA, without sacrificing safety and factual accuracy. For instance, in creative writing, VS increases diversity by 1.6-2.1x compared to direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.

Abstract:
Large Multimodal Models (LMMs) have recently demonstrated their proficiency in holistic visual comprehension. However, most of them struggle to tackle regionlevel perception guided by visual prompts, especially for cases where multiple regions are referred simultaneously, or scenarios where global contexts are necessary for precise visual referring. We introduce Contextual Latent Steering (CSteer), a training-free approach for guiding general LMMs to refer multiple regions contextually, without expensive fine-tuning or architectural modifications. CSteer starts with pre-computing contextual vectors that implicitly represent visual referring behaviors, such as differentiation among regions and attention to global contexts, followed by representation editing during inference time. Experimental results on multiple datasets indicate that general LMMs with CSteer outperform referring LMMs in most cases, suggesting a promising solution in training-free, and setting new state-of-the-art for this field. Codes will be made publicly available.

Abstract:
We study how reasoning evolves in a language model from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically-inspired datasets impacts language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance -- however, the RL stage elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and reduces hallucination rates as a side effect. Finally, we find several SFT-checkpoint metrics -- metrics spanning evaluation performance, hallucination rates, and reasoning quality -- to be predictive of post-RL model performance. We release checkpoints and final models as well as training data, evaluations, and code that allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.

Abstract:
Advances in large language models have driven strong performance across many tasks, but their memory and compute costs still hinder deployment. SVDbased compression reduces storage and can speed up inference via low-rank factors, yet performance depends on how rank is allocated under a global compression ratio. Prior methods often use homogeneous ranks for similarly sized matrices, despite large differences in loss sensitivity, or rely on expensive iterative pre-truncation optimization to determine per matrix ranks. We proposeZero Sum SVD(ZS-SVD), a post-training method that performsglobalsingular component selection using activation whitening and first-order calibration loss estimates in whitened coordinates.ZS-SVDprunes components across the whole model with azero sumrule that keeps the cumulative predicted loss change near zero, automatically yielding heterogeneous ranks without solving a rank allocation optimization. Motivated by evidence that gradients near pretrained solutions exhibit low rank structure, we also introduce an optional lightweight correction that applies asingleprojected gradient update after truncation, followed by re-truncation. Extensive experiments across multiple LLM architectures show consistent gains across diverse benchmarks and compression ratios.

Abstract:
Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns, enabling efficient task completion beyond passive instruction following and making them essential for realworld, user-centric applications. Agentic reinforcement learning (RL) has recently emerged as a promising solution for training such agents in multi-turn settings, allowing interaction strategies to be learned from feedback. However, existing pipelines face a critical challenge in balancing task performance with user engagement, as passive agents can not efficiently adapt to users' intentions while overuse of human feedback reduces their satisfaction. To address this trade-off, we propose BAO, an agentic RL framework that combines behavior enhancement to enrich proactive reasoning and information-gathering capabilities with behavior regularization to suppress inefficient or redundant interactions and align agent behavior with user expectations. We evaluate BAO on multiple tasks from the UserRL benchmark suite, and demonstrate that it substantially outperforms RL baselines under controlled comparisons, while achieving comparable or even superior performance to frontier LLM agents, highlighting its effectiveness for training proactive, user-aligned LLM agents in complex multi-turn scenarios.

Abstract:
Visionlanguage models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, yet remain highly vulnerable to adversarial examples (AEs). While test-time defenses are promising, existing methods fail to provide sufficient robustness against strong attacks and are often hampered by high inference latency and task-specific applicability. To address these limitations, we start by investigating the intrinsic properties of AEs, which reveals that AEs exhibit severe feature inconsistency under progressive frequency attenuation. We further attribute this to the model's inherent spectral bias. Leveraging this insight, we propose an efficient test-time defense named Contrastive Spectral Rectification (CSR). CSR optimizes a rectification perturbation to realign the input with the natural manifold under a spectral-guided contrastive objective, which is applied input-adaptively. Extensive experiments across 16 classification benchmarks demonstrate that CSR outperforms the SOTA by an average of 18.1% against strong AutoAttack with modest inference overhead. Furthermore, CSR exhibits broad applicability across diverse visual tasks. Code is available at https://anonymous.4open.science/r/CSR-3935.

Abstract:
Mixtureof-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using k-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we "zoom out" from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., retrieving specific bioinformatics software tools). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability.

Abstract:
The transition from sequential to parallel computing is essential for modern highperformance applications but is hindered by the steep learning curve of concurrent programming. This challenge is magnified for irregular data structures (such as sparse graphs, unbalanced trees, and non-uniform meshes) where static scheduling fails and data dependencies are unpredictable. Current Large Language Models (LLMs) often fail catastrophically on these tasks, generating code plagued by subtle race conditions, deadlocks, and sub-optimal scaling. We bridge this gap with \sys, a framework designed to synthesize high-performance parallel algorithms for irregular data. Our contributions include: (1) The Parlay-Instruct Corpus, a curated dataset of 12,000 tasks synthesized via a "Critic-Refine" pipeline that explicitly filters for theoretically optimal algorithms under the Work-Span cost model; (2) specialized DeepSeek, Qwen, and Gemini models fine-tuned to align probabilistic generation with the rigorous semantics of the ParlayLib intermediate representation; and (3) an Evolutionary Coding Agent (ECA) that solves the ``last mile'' of correctness by iteratively repairing code using feedback from compilers, race detectors, and performance profilers. On the ParEval benchmark, \sys achieves a 106× speedup on complex irregular graph problems, significantly outperforming state-of-the-art commercial models like GPT-5.2 and Gemini 3 Pro. Furthermore, our approach surpasses expert \emphhuman-written baselines in the standard PBBSBench by 4×, demonstrating that AI-driven agents can effectively navigate the complex landscape of high-performance computing.

Abstract:
Highquality data is a cornerstone of large language model (LLM) pretraining, yet its growth has not kept pace with the needs of frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a rephraser as small as 1B parameters to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M, 1.4B, and 2.8B models demonstrate that RePro delivers 3.7%-14.5% relative accuracy gains over organic-only baseline on 22 downstream tasks, doubling the performance gains achieved by the state-of-the-art web recycling method that prompts a 70B rephraser. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3×. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively recycle organic data for pretraining. Our anonymized code is available at https://anonymous.4open.science/r/RePro.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the overincentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. Code is available at https://anonymous.4open.science/r/ResRL.

Abstract:
This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multimodal large language models (MLLMs): In non-stationary environments, the diverse reasoning distributions of source models often evolve unpredictably, transmitting systematic biases and drift to the target model. To address this, we formulate multi-source reasoning alignment as a constraint satisfaction problem under concept drift theory. We propose Autonomous Preference Optimization (APO), a novel framework that treats inter-model divergences not as noise, but as dynamic negative constraints. APO operates via a two-stage protocol: first, supervised bootstrapping projects the target model into the capability union of source models; second, constraint-aware optimization synthesizes a consistent consensus manifold by explicitly suppressing drifting trajectories via a multi-negative plackett-luce objective. Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a large-scale benchmark comprising 170,982 reasoning trajectories from seven large-scale MLLMs to facilitate research on reasoning alignment under drift. Code and data are available at: https://anonymous.4open.science/r/APO.

Abstract:
Endto-end autonomous driving models trained with imitation learning (IL) often generalize poorly, particularly in long-tail scenarios where expert demonstrations are sparse. Reinforcement learning (RL) can provide complementary reward signals, but applying RL in real-world autonomous driving is challenging in offline settings without simulators, where datasets consist almost exclusively of expert actions and lack behavioral diversity. We propose CoIRL-AD, a competitive dual-policy framework that integrates IL and RL under a unified offline training regime. CoIRL-AD decouples IL and RL into separate actors to alleviate objective conflicts between imitation and reward maximization, and introduces a competition-based mechanism that stabilizes learning and enables effective exploration while remaining anchored to expert behavior. Experiments on the nuScenes benchmark show a 27% relative reduction in collision rate weighted by L2 error compared to strong baselines, with substantially larger gains on cross-city generalization (up to 77%) and long-tail scenarios (up to 85%), demonstrating that competitive integration of IL and RL significantly improves robustness in offline end-to-end autonomous driving. Code is available at: \urlhttps://anonymous.4open.science/r/drive-with-two-minds.

Abstract:
Multimodal Large Language Models (MLLMs) demonstrate impressive crossmodal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions, which embed essential capabilities for multimodal understanding and generation. To this end, we introduceAlign-TI, a novel KD framework designed from the perspective ofTokenInteractions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components: IVA enables the student model to imitate the teacher's instruction-relevant visual information extract capability by aligning on salient visual regions. TPA captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves 2.6% relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by 7.0%, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs.

Abstract:
Discrete biological sequence optimization demands iterative refinement while satisfying strict syntactic constraints. Diffusionbased approaches provide strong progressive refinement but are not naturally aligned with discrete, grammar-constrained edit operations, whereas autoregressive LLMs readily produce valid sequences yet often lack explicit long-horizon planning. To close this gap, we introduceSTRIDE(Sequence Trajectory Refinement via Internalized Denoising Emulation), a post-training framework that recasts optimization as an intrinsic reasoning problem in edit space. Rather than relying on external agentic search loops,STRIDEtrains an LLM to emit a full trajectory of atomic edits as explicit Chain-of-Thought, effectively internalizing a trajectory-based refinement policy under discrete constraints. We instantiateSTRIDEwith a curriculum that combines supervised fine-tuning on Levenshtein-aligned shortest-edit demonstrations with GRPO-style reinforcement learning (and variants) to align edit trajectories with task rewards. Across protein and molecule optimization benchmarks,STRIDEconsistently outperforms a diverse set of baselines, while producing candidates that maintain high structural validity and achieve improved target properties.

Abstract:
Despite high semantic alignment, modern textto-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance the T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.

Abstract:
Autoregressive (AR) modeling has recently emerged as a promising new paradigm in visual generation, but its practical adoption is severely constrained by the slow inference speed of pertoken generation, which often requires thousands of steps to produce a single sample. While several Speculative Decoding (SD)-based methods have been proposed to solve this problem by generating multiple tokens in a single forward step, they suffer from limited speedup, degraded quality, or require the training of a draft model. To solve these problems, we propose a new training-free, lossless SD framework, Speculative Coupled Decoding (SCD), by extending the recently proposed Speculative Jacobi Decoding (SJD). While SJD shows strong potential for accelerating AR generation by combining Jacobi iteration and SD, we found that its acceptance rate is still significantly limited due to the instability arising from the independent sampling process used during draft token generation. To overcome this, we introduce an information-theoretic approach, Coupling, which stabilizes the drafting trajectory of SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, significantly enhancing the acceptance rate while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm with almost zero overhead, yet achieves substantial performance gains, delivering up to a 4.2× speedup in image generation and 13.6× speedup in video generation compared to standard AR decoding, without any degradation or the need for additional training.

Abstract:
We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in largelanguage models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, bedrooms, bathrooms, and others), encoded symbolically in JSON or XML layouts. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces. Our results across a variety of frontier open-source and commercial LLMs reveal that while models may succeed on shallow queries, they often fail to respect physical constraints and preserve spatial coherence, though they remain mostly robust to small spatial perturbations. FloorplanQA uncovers a blind spot in today’s LLMs: inconsistent reasoning about indoor layouts. We hope this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.

Abstract:
Multiturn, multi-agent LLM game evaluations often exhibit substantial run-to-run variance. In long-horizon interactions, small early deviations compound across turns and are amplified by multi-agent coupling, biasing win rate estimates and destabilizing comparative rankings across repeated tournaments. Prompt choice exacerbates this by inducing different effective policies and interaction dynamics. We address both instability and underperformance in interactive games with MEMO (Memory-augmented Model context optimization), a self-play framework that treats inference-time context as an optimizable, agentic object by coupling retention and exploration. Retention maintains a persistent memory bank that distills self-play trajectories into structured insights, consolidates them via CRUD-style updates, and injects them as priors during subsequent play. Exploration performs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill, and uses prioritized replay to revisit vital states for sample-efficient coverage. Across five text-based games, MEMO raises mean win rate from 24.9% → 49.5% for GPT-4o-mini and 21.7% → 44.3% for Qwen-2.5-7B-Instruct using a mere budget of 2000 self-play games per task; reducing run-to-run dispersion of end-to-end outcomes and yielding more reliable rankings under prompt stratification. These results suggest that substantial headroom in multi-agent LLM game performance and robustness can be unlocked, with MEMO achieving gains in negotiation games and imperfect-information settings, while RL remains more effective in perfect-information games. Anonymous project website available: https://79ac811fdcc9cd5679a2258a180589ef.github.io

Abstract:
Approximate Nearest Neighbor Search (ANNS) is fundamental to modern AI applications. Most existing solutions optimize query efficiency but fail to align with the practical requirements of modern workloads. In this paper, we outline six critical demands of modern AI applications: high query efficiency, fast indexing, low memory footprint, scalability to high dimensionality, robustness across varying retrieval sizes, and support for online insertions. To satisfy all these demands, we introduce ProjectionAugmented Graph (PAG), a new ANNS framework that integrates projection techniques into a graph index. PAG reduces unnecessary exact distance computations through asymmetric comparisons between exact and approximate distances guided by projection-based statistical tests. Three key components are designed and unified to the graph index to optimize indexing and searching. Experiments on six modern datasets demonstrate that PAG consistently achieves superior query per second (QPS)-recall performance---up to 5×faster than HNSW---while offering fast indexing speed and small memory footprint. PAG remains robust as dimensionality and retrieval size increase and naturally supports online insertions. Our source code is available at: https://anonymous.4open.science/r/PAG-A73D/ .

Abstract:
Clinical abnormality grounding for rare diseases is often hindered by data scarcity, rendering supervised finetuning infeasible and single-pass inference highly unstable. Thus, we propose Dynamic Decision Learning (DDL), a framework that enables frozen LVLMs to refine their decisions across language and visual spaces by optimizing instructions and consolidating predictions under visual perturbations, thereby improving localization quality and producing a consensus‑based reliability score that quantifies the model’s confidence. Results on brain‑imaging benchmarks, including a rare‑disease dataset with 281 pathology types across 3B-72B models, show that DDL improves mAP@75 by up to 105% on rare‑disease cases and surpasses adaptation baselines and supervised fine‑tuning. Moreover, we show that DDL yields stronger calibration between consensus‑based reliability scores and localization accuracy under severe distribution shifts and increasing task difficulty. The code will be open-sourced.

Abstract:
Selfinterpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just d_\textmodel+1 parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.

Abstract:
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, inherent to RLbased approaches. To address the challenges, we proposePACS, a novel RLVR framework that achieves imPlicitActorCritic coupling via aSupervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while providing more stable and efficient training. Extensive experiments demonstrate that PACS significantly outperforms strong open-source models and RLVR baselines, yielding substantial average gains of+8.26%(4B) and+9.57%(8B) over base models offering a promising avenue for LLMs post-training with verifiable rewards.

Abstract:
Large Language Models (LLMs) exhibit high reasoning capacity in medical questionanswering, but their tendency to produce hallucinations and outdated knowledge poses critical risks in healthcare fields. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In the paper, we proposeMA-RAG(Multi-RoundAgentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semanticconflictamong candidate responses into actionable queries to retrieve external evidence, while optimizing history reasoning traces to mitigate long-context degradation. MA-RAG extends theself-consistencyprinciple by leveraging the lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and mirrors aboostingmechanism that iteratively minimizes the residual error toward a stable, high-fidelity medicalconsensus. Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, deliveringsubstantial +6.8 pointson average accuracy over the backbone model.

Abstract:
Searchaugmented large language models (LLMs) remain insufficient for fully addressing diverse user needs, which requires recognizing how the same query can reflect different intents across users and delivering information in preferred forms. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user histories, systematic evaluation of such personalization is under-explored. To address this gap, we propose BESPOKE, the realistic benchmark for evaluating personalization in search-augmented LLMs. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. The benchmark is constructed through long-term, deeply engaged human annotation, where human annotators contributed their own histories, authored queries with detailed information needs, and evaluated responses with scores and diagnostic feedback. Leveraging BESPOKE, we conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs. Our code and data are available at https://anonymous.4open.science/r/bespoke-E82B.

Abstract:
While Large Language Models (LLMs) hold promise for automating science and education, generating interactive scientific demonstrations demands a complex synthesis of deep domain knowledge and precise reactive coding. Current benchmarks fail to capture this synergy, largely bifurcating into static code generation or textonly reasoning. To address this, we introduce \textscInteractScience, the first benchmark dedicated to evaluating the holistic creation of interactive scientific applications. We propose a novel hybrid framework that integrates programmatic functional testing for logic verification with visually-grounded qualitative assessment for rendering fidelity. Our evaluation of 30 leading models across five disciplines reveals critical gaps in grounding scientific reasoning within interactive interfaces. By standardizing this combined capability, \textscInteractScience establishes a crucial foundation for reliable AI-driven tools in science and education.

Abstract:
Large reasoning models (LRMs) like OpenAI o1 and DeepSeekR1 achieve high accuracy on complex tasks by adopting long chain-of-thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot dynamically adapt according to the relative problem difficulty and response length distribution, causing over-compression and compromised accuracy. In this paper, we proposeSmartThinker, a novel GRPO-based efficient reasoning method with progressive CoT length calibration.SmartThinkermakes a two-fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training, and further guides the overlong responses to approach the optimal length, in order to achieve length reduction while sustaining high accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experiment results show thatSmartThinkerachieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25.

Abstract:
Generating realistic 3D HumanObject Interactions (HOI) is a fundamental task for applications ranging from embodied AI to virtual content creation, which requires harmonizing high-level semantic intent with strict low-level physical constraints. Existing methods excel at semantic alignment, however, they struggle to maintain precise object contact. We reveal a key finding termed Geometric Forgetting: as diffusion model depth increases, semantic feature tend to overshadow object geometry feature, causing the model to lose its perception to object geometry. To address this, we propose MaMi-HOI, a hierarchical framework reconciling Macro-level kinematic fluidity with Micro-level spatial precision. First, to counteract geometric forgetting, we introduce the Geometry-Aware Proximity Adapter (GAPA), which explicitly re-injects dense object details to perform residual snapping corrections for precise contact. Nevertheless, such aggressive local enforcement can disrupt global dynamics, leading to robotic stiffness. In response, we introduce the Kinematic Harmony Adapter (KHA), which proactively aligns whole-body posture with spatial objectives, ensuring the skeleton actively accommodates constraints without compromising naturalness. Extensive experiments validate that MaMi-HOI simultaneously achieves natural motion and precise contact. Crucially, it extends generation capabilities to long-term tasks with complex trajectories, effectively bridging the gap between global navigation and high-fidelity manipulation in 3D scenes.

Abstract:
A key challenge in deriving unified neural solvers for combinatorial optimization (CO) is efficient generalization of models between one set of tasks to new tasks not used during the initial training process. To address it, we first establish a new model, which uses a GCON module as a form of expressive message passing together with energybased unsupervised loss functions. This model achieves high performance (often comparable with state-of-the-art results) across multiple CO tasks when trained individually on each task. We then leverage knowledge from the computational reducibility literature to propose pretraining and fine-tuning strategies to transfer effectively (a) between MVC, MIS and MaxClique, and (b) in a multi-task learning setting that additionally incorporates MaxCut, MDS and graph coloring. Additionally, in a leave-one-out, multi-task learning setting, we observe that pretraining on all but one task almost always leads to faster convergence on the remaining task when fine-tuning while avoiding negative transfer. Our findings indicate that learning common representations across multiple graph CO problems is viable through the use of expressive message passing coupled with pretraining strategies that are informed by the polynomial reduction literature, thereby taking an important step towards enabling the development of foundational models for neural CO.

Abstract:
Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from highquality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the Best-of-N evaluation.

Abstract:
Multimodal Large Language Models (MLLMs) mainly fall into two architectures, each involving a tradeoff between training and inference efficiency: embedding space alignment (e.g. LLaVA series) is inefficient during inference, while cross-attention space alignment (e.g. Flamingo) is inefficient in training. A primary difference between them lies in whether each visual token attends to other tokens within the LLM backbones. To investigate whether this form of attention is essential for MLLMs, we propose NAEViT (No AttEntion from Visual Tokens), an attention mechanism that eliminates such interactions. Our pilot experiment shows that attention from visual tokens is highly redundant. Then, we introduce SAISA (Self-Attention Input Space Alignment), a novel architecture that enhances both training and inference efficiency. SAISA directly aligns visual features with the input spaces of NAEViT attention blocks, reducing computational overhead in both attention and FFNs. We conduct experiments on various baseline models, model sizes and training datasets. SAISA achieves superior performance compared to the baselines, while significantly reducing computational costs. Further ablation studies validate the effectiveness of SAISA across various LLMs and visual encoders

Abstract:
The NVFP4 lowerprecision format, supported in hardware by NVIDIA Blackwell GPUs, promises to allow, for the first time, end-to-end fully-quantized pre-training of massive models such as LLMs. Yet, existing quantized training methods still sacrifice some of the representation capacity of this format in favor of more accurate unbiased quantized gradient estimation by stochastic rounding (SR), losing noticeable accuracy relative to standard FP16 and FP8 training. In this paper, improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro-scaled formats, called MS-EDEN, that has more than 2x lower quantization error than SR. We integrate it into a novel fully-NVFP4 quantization scheme for linear layers, called Clover. We show analytically that Clover achieves consistently better gradient estimation across all major matrix multiplications, both on the forward and on the backward passes. In addition, our proposal synergizes well with recent training improvements aimed specifically at NVFP4. We further validate Clover on end-to-end LLM training with up to 1.9B parameters on 38B tokens. We provide kernels for execution on NVIDIA Blackwell GPUs with up to 4.2x speedup over BF16.

Abstract:
Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize taskspecific models, while recent generic models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a spatial representation model that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion’s strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios.

Abstract:
This paper presents STARCaster, an identityaware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach progressing from ID-aware motion modeling, to audio-visual synchronization via lip reading-based supervision, and finally to novel view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches in different benchmarks.

Abstract:
Posthoc interpretability methods typically attribute a model’s behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, such approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all these perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to realistic settings like AdamW. We empirically validate that a CNN and a Transformer are accurately replicated by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. Their effectiveness for parameter pruning is comparable to existing methods, demonstrating their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we use ExPLAIND to analyze a Transformer that exhibits Grokking. Our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of the outer layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.

Abstract:
Instructionbased image editing has achieved remarkable success in semantic alignment, yet state-of-the-art models frequently fail to render physically plausible results when editing involves complex causal dynamics, such as refraction or material deformation. We attribute this limitation to the dominant paradigm that treats editing as a discrete mapping between image pairs, which provides only boundary conditions and leaves transition dynamics underspecified. To address this, we reformulate physics-aware editing as predictive physical state transitions and introduce PhysicTran38K, a large-scale video-based dataset comprising 38K transition trajectories across five physical domains, constructed via a two-stage filtering and constraint-aware annotation pipeline. Building on this supervision, we propose PhysicEdit, an end-to-end framework equipped with a textual-visual dual-thinking mechanism. It combines a frozen Qwen2.5-VL for physically grounded reasoning with learnable transition queries that provide timestep-adaptive visual guidance to a diffusion backbone. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state-of-the-art for open-source methods, while remaining competitive with leading proprietary models.

Abstract:
TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging incontext learning on synthetic data. However, we find that TabPFN is vulnerable to label shift, often overfitting to the majority class in the training dataset. To address this limitation, we propose DistPFN, the first test-time posterior adjustment method designed for tabular foundation models. DistPFN rescales predicted class probabilities by downweighting the influence of the training prior (i.e., the class distribution of the context) and emphasizing the contribution of the model’s predicted posterior, without architectural modification or additional training. We further introduce DistPFN-T, which incorporates temperature scaling to adaptively control the adjustment strength based on the discrepancy between prior and posterior. We evaluate our methods on over 250 OpenML datasets, demonstrating substantial improvements for various TabPFN-based models in classification tasks under label shift, while maintaining strong performance in standard settings without label shift.

Abstract:
Floorplanning determines the coordinate and shape of each module in Integrated Circuit. With the advancement of technology node, in floorplanning stage especially 3D scenarios with multiple stacked layers, it has become increasingly challenging to adhere to complex hardware design rules. Current methods are only capable of handling specific and limited design rules, while violations of other rules require manual and meticulous adjustment. This leads to laborintensive and time-consuming post-processing for expert engineers. In this paper, we propose an all-in-one deep reinforcement learning-based approach to tackle these challenges, and design novel representations for real-world IC design rules that have not been addressed by previous approaches. Specifically, the processing of various hardware design rules is unified into a single framework with three key components: 1) novel matrix representations to model the design rules, 2) constraints on the action space to filter out invalid actions that cause rule violations, and 3) quantitative analysis of constraint satisfaction as reward signals. Experiments on public benchmarks demonstrate the effectiveness and validity of our approach. Furthermore, transferability are well shown on unseen circuits. Our framework is extensible to accommodate new design rules, thus providing flexibility to address emerging challenges in future chip design.

Abstract:
Personality traits have long been studied as predictors of human behavior. Recent advances in Large Language Models (LLMs) suggest similar patterns may emerge in artificial systems, with advanced LLMs displaying consistent behavioral tendencies resembling human traits like agreeableness and selfregulation. Understanding these patterns is crucial, yet prior work primarily relied on simplified self-reports and heuristic prompting, with little behavioral validation. In this study, we systematically characterize LLM personality across three dimensions: (1) the dynamic emergence and evolution of trait profiles throughout training stages; (2) the predictive validity of self-reported traits in behavioral tasks; and (3) the impact of targeted interventions, such as persona injection, on both self-reports and behavior. Our findings reveal that instructional alignment (e.g., RLHF, instruction tuning) significantly stabilizes trait expression and strengthens trait correlations in ways that mirror human data. However, these self-reported traits do not reliably predict behavior, and observed associations often diverge from human patterns. While persona injection successfully steers self-reports in the intended direction, it exerts little or inconsistent effect on actual behavior. By distinguishing surface-level trait expression from behavioral consistency, our findings challenge assumptions about LLM personality and underscore the need for deeper evaluation in alignment and interpretability.

Abstract:
Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. By generating sequences in any order and allowing for parallel decoding, they enable fast inference and strong performance on noncausal tasks. However, this flexibility comes with a training complexity trade-off: MDMs train on an exponentially large set of masking patterns, which is not only computationally expensive, but also creates a train--test mismatch between the random masks used in training and the highly structured masks induced by inference-time unmasking. In this work, we propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns, thereby focusing optimization on inference-aligned masks and speeding up training. Empirically, PUMA speeds up pretraining at the 125M scale by \approx 2.5 × and offers complementary advantages on top of common recipes like autoregressive initialization.

Abstract:
Large Language Models (LLMs) with billions of parameters have transformed AI applications but require immense computational and memory resources during inference. Adaptive sparse activation inference, which activates only a small number of neurons for each token, offers a novel way to accelerate model inference without degrading performance, showing great potential for resourceconstrained hardware devices. Yet existing token-level MLP-based methods frequently alter activation maps, reducing efficiency gains. In this paper, we introduce SparseInfer, an MLP-free adaptive sparse activation inference method based on sentence-level prediction. We first propose the concept of core neurons and empirically demonstrate that, for an input sentence, LLMs only need the core neurons to maintain performance. Remarkably, we discovered that core neurons exhibit both stability and similarity in relation to the sentence's semantics—an insight overlooked by previous studies. Building on this finding, we design two semantic-based methods for predicting core neurons to fit different input scenarios, which enables core neurons to be determined during the pre-filling stage and fixed during the encoding stage. Our experiments verify SparseInfer exhibits good performance in various tasks and achieve 10.33× speed up.

Abstract:
Multimodal Large Language Models (MLLMs) have shown strong performance in multiimage cross-modal retrieval, yet suffer from severe position bias, where predictions are dominated by input order rather than semantic relevance. Through empirical analysis, we identify a phenomenon termed Logit-Attention Divergence, in which output logits are heavily biased while internal attention maps remain well-aligned with relevant visual evidence. This observation reveals a fundamental limitation of existing logit-level calibration methods such as PriDe. Based on this insight, we propose a training-free, attention-guided debiasing framework that leverages intrinsic attention signals for instance-level correction at inference time, requiring only a minimal calibration set with negligible computational overhead. Experiments on MS-COCO-based benchmarks show that our method substantially improves permutation invariance and achieves state-of-the-art performance, enhancing accuracy by over 40% compared to baselines.

Abstract:
Diffusion posterior sampling conditions diffusion priors on measurements, but dataconsistency updates are typically scaled by hand-tuned guidance weights and can destabilize sampling under stiff, operator-dependent curvature. We replace scalar guidance with a per-noise-level damped Gauss--Newton correction computed in diffusion-state coordinates. The correction pulls likelihood gradients back through the denoiser, uses a one-sided curvature model that avoids forward denoiser Jacobians, and applies diffusion-calibrated rank-one damping aligned with the denoiser residual. Each correction is solved with matrix-free GMRES using automatic differentiation, and sampling proceeds with a variance-preserving Langevin transition with a closed-form drift/noise split. Aside from compute-budget choices (T diffusion steps and K Krylov iterations), the method has a single damping hyperparameter (\lambda_\mathrmid), kept nearly unchanged across tasks. On FFHQ and ImageNet across inverse problems, it achieves competitive PSNR/SSIM/LPIPS while running markedly faster than most of the compared baselines; on accelerated MRI reconstruction, it achieves the best PSNR/SSIM among the compared baselines.

Abstract:
Scaling laws for dense LLMs under infinite data are well explored, but how sparsity interacts with limited data is not. In this work, we study sparse training in dataconstrained regimes where limited unique tokens require multi-epoch training. Our experiments span models up to 3.84B parameters, sparsity up to 90%, and datasets from 1.3B to 41.6B tokens, trained for as many as 16 epochs. We find that: 1.Sparse scaling in data-limited settings:We introduce a scaling law that models loss as a function of active parameters, unique tokens, data repetition, and sparsity, accurately predicting performance across compute and data budgets. 2.Delayed data saturation:sparse training postpones diminishing returns from repeated data, making multi-epoch training more effective. 3.Resource trade-offs:With fixed data, loss-optimal sparsity is moderate (~50%), while compute-optimal sparsity is higher and grows with data scale. Overall, sparsity is not just a tool for efficiency, but a mechanism for improving scaling trade-offs under data scarcity.

Abstract:
Medical image segmentation faces a fundamental challenge in continual learning: data arrives sequentially from heterogeneous sources, yet effective continual learning requires discovering which tasks share sufficient structure to benefit from joint learning. Existing methods either apply uniform constraints across all tasks, causing catastrophic forgetting when tasks conflict, or require predefined task groupings that cannot anticipate future task diversity. We introduce MedCRPCL, a framework that performs online task structure discovery and structure-aware continual learning. Leveraging the Chinese Restaurant Process (CRP), our method dynamically infers task groupings from clinical text prompts as tasks arrive, without requiring predefined cluster counts or access to future tasks. We term these discovered groupings semantic modalities, as they capture finer-grained structure than physical imaging modalities by integrating anatomical region and pathological context. Guided by this discovered structure, we maintain semantic modality-specific LoRA adapters regularized by intra-modality EWC, ensuring parameter isolation across dissimilar task groups while facilitating knowledge transfer within similar ones. The framework is also replay-free, storing only aggregate statistics rather than raw patient data. Experiments on 16 medical segmentation tasks across four imaging modalities demonstrate that MedCRP-CL achieves 73.3% Dice score with only 4.1% forgetting, outperforming the best baseline by 8.0% while requiring 6× fewer parameters. Code is provided as supplemental material for review and will be publicly released upon acceptance.

Abstract:
Recent Progress in posttraining flow matching for text-to-image (T2I) generation with Group Relative Policy Optimization (GRPO) has demonstrated strong potential. However, it is hindered by a critical limitation: inaccurate advantage attribution. In this work, we argue that aggregating consecutive timesteps into a coherent `chunk' and shifting the policy optimization paradigm from GRPO's step level to the chunk level can effectively mitigate the negative impact of this issue. Building on this insight, we propose Group Chunking Policy Optimization (GCPO), the first chunk-level reinforcement learning approach for post-training flow matching. Extensive experiments demonstrate that GCPO achieves superior performance on both standard T2I benchmarks and preference alignment, with up to 43% additional gains over GRPO, highlighting the promise of chunk-level policy optimization.

Abstract:
Recent advances in generative modeling have enabled significant progress in structurebased drug design (SBDD). Existing methods typically condition molecule generation on empty binding pockets from holo complexes, overlooking informative components such as the filler (ligands and solvent). Here, we leverage low-resolution electron density (ED) derived from the filler as a physically grounded condition for de novo drug design. We consider two types of ED—calculated and cryo-EM/X-ray—obtainable from computational or experimental sources, supporting unified pre-training and experimental integration. Compared with rigid pocket representations, experimental ED naturally captures conformational flexibility and provides a more faithful description of the binding environment. Based on this, we introduce EDMolGPT, a decoder-only autoregressive framework that generates molecules from low-resolution ED point clouds. By grounding generation in physically meaningful density signals, EDMolGPT mitigates structural bias and produces molecules with 3D conformations. Evaluations on 101 biological targets verify the effectiveness. Code will be released upon acceptance.

Abstract:
We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scenelevel human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint-based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self-supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi-class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero-shot and long-tail scenarios, underscoring its potential for scalable, label-free 3D object segmentation.

Abstract:
Dynamic Sparse Training (DST) offers a promising paradigm for improving the training and inference efficiency of deep neural networks; however, we find that in large language model training, DST suffers from optimization instability, manifested as loss spikes following topology updates. In this work, we show that the naive use of standard Adambased optimizers leads to a cold-start issue for newly regrown parameters, resulting in excessively large updates and disrupted training dynamics. We propose Sparse Memory-Efficient Training (SMET), which stabilizes DST by combining optimizer warm-up for regrown parameters with density-aware learning-rate scaling. SMET further reduces memory consumption by storing gradients and optimizer states only for active parameters. We provide a theoretical analysis of the update behaviors under SMET, showing improved optimization stability. Extensive experiments demonstrate that SMET enables stable, scalable, and memory-efficient sparse pre-training of LLMs, paving the way for sparse training as a practical alternative to dense training.

Abstract:
Large Reasoning Models (LRMs) are increasingly integrated into systems requiring reliable multistep inference, yet this growing dependence exposes new vulnerabilities related to computational availability. In particular, LRMs exhibit a tendency to “overthink’’—producing excessively long and redundant reasoning traces—when confronted with incomplete or logically inconsistent inputs. This behavior significantly increases inference latency and energy consumption, forming a potential vector for denial-of-service (DoS) style resource exhaustion. In this work, we investigate this attack surface and propose an automated black-box framework that induces overthinking in LRMs by systematically perturbing the logical structure of input problems. Our method employs a hierarchical genetic algorithm (HGA) operating on structured problem decompositions, and optimizes a composite fitness function designed to maximize both response length and reflective overthinking markers. Across four state-of-the-art reasoning models, the proposed method substantially amplifies output length, achieving up to a 26.1× increase on the MATH benchmark and consistently outperforming benign and manually crafted missing-premise baselines. We further demonstrate strong transferability, showing that adversarial inputs evolved using a small proxy model retain high effectiveness against large commercial LRMs. These findings highlight overthinking as a shared and exploitable vulnerability in modern reasoning systems, underscoring the need for more robust defenses.

Abstract:
Detecting AIgenerated images (AIGI) remains challenging because detectors often fail to generalize to unseen generators. Although existing methods are trained on large datasets, their performance still degrades when generation settings change, indicating that data scale alone is insufficient and that limited coverage of generative variations during training is a key factor. Studies on generative model editing show that small changes in internal representations can produce diverse and meaningful image variations, many of which are not explored under standard sampling. Leveraging this insight, we propose PROBE (Probing Robustness via Boundary Exploration), a framework that improves detector generalization by actively exploring challenging regions of the generative process. Instead of treating the generator as a fixed data source, PROBE uses the detector as a critic to steer the generator through manifold-level modifications, producing realistic samples that are difficult to classify. These samples expose failure cases that are uncommon under standard data sampling strategies and are used to refine the detector. Experimental results across multiple benchmarks indicate that PROBE enhances generalization to unseen generators, resulting in more generalizable AIGI detection performance.

Abstract:
Recent tabular foundation models routinely match or surpass strong tree ensembles and specialized deep architectures, yet their numeric embeddings remain a bottleneck. We diagnose a lowrank collapse induced by the prevalent linear+ID scheme and introduce RaBEL, a compact Radial Basis Embedding Layer that front-loads nonlinearity via localized RBF features. RaBEL increases shallow-layer effective rank and improves conditioning without deeper stacks; it is complementary to periodic mappings. We further identify a permutation-order pathology in bidirectional attention (feature\rightarrowsample) and propose a reordered stack: sample-attention \rightarrow FFN \rightarrow feature-attention, ensuring column-level context precedes feature mixing and that all attention computations influence the readout. Combining both ideas yields MiniX, a 2M-parameter model that surpasses 7M-parameter TabPFN-v2 and 27M-parameter TabICL baselines on popular benchmarks while reducing training and inference cost. Our results highlight principled nonlinear embeddings and attention-order redesign as key enablers of accuracy and efficiency gains in tabular foundation models.

Abstract:
Precise camera pose control is critical for video diffusion, yet maintaining geometric consistency remains a challenge. Existing methods that directly inject numerical camera parameters into the diffusion backbone often fail to bridge the gap between abstract coordinates and visual content, leading to structural distortions. To address this issue, we propose CameraNoise, a flowto-noise warping method that encodes camera motion into a temporally coherent stochastic representation. Unlike conventional conditioning, CameraNoise embeds camera poses directly into the noise space. This decouples motion from scene appearance while faithfully preserving trajectory dynamics. Specifically, we introduce a novel Geometry-guided Reprojection Flow and a noise warping algorithm, which jointly preserve the Gaussian prior of diffusion and ensure consistent noise propagation under camera transformations. By integrating CameraNoise into the diffusion process, our framework delivers stable, high-fidelity videos. Extensive experiments demonstrate that our approach significantly outperforms prior methods in both visual quality and trajectory faithfulness.

Abstract:
Learning latent representations that capture both semantic and spatial information is central to efficient spatiosemantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields centered on a given fixation point. This framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof-of-concept, we show that WorldComp2D preserves fine-grained spatio-semantic representation of facial landmarks while achieving competitive accuracy at substantially low computational cost. Compared to SoTA lightweight models, WorldComp2D reduces the numbers of parameters and FLOPs by up to 4.0× and 2.2×, respectively, while maintaining real-time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio-semantic reasoning. This framework is open-sourced at https://anonymous.4open.science/r/WorldComp2D-10F6/.

Abstract:
To reduce LLM costs and latency, semantic caching systems must accurately identify when a new prompt matches a cached one. Current methods often rely on simplistic similarity measures, which limit their effectiveness. We introduce MVRcache, a novel semantic caching approach that significantly improves retrieval accuracy by integrating Multi-Vector Retrieval (MVR). MVR-cache is built upon a learnable segmentation model that intelligently splits prompts, enabling fine-grained similarity comparisons via MaxSim. We derive the model’s training objectives from a rigorous theoretical analysis. This can ensure that optimizing these objectives directly maximizes cache hits under strict correctness constraints. To solve the resulting non-differentiable combinatorial optimization problem, we leverage a reinforcement learning-based training strategy with the theoretically grounded objectives as the reward. Experimental results on established benchmarks across diverse tasks confirm that in comparison to the state-of-the-art, MVR-cache consistently increases the cache hit rates by up to 25% while maintaining the same correctness guarantees.

Abstract:
Modern large language model (LLM) serving increasingly depends on efficient longcontext processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that heterogeneous systems are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bounded operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system achieves 1.04～2.2× speedup and 1.11～4.7× energy reduction across multiple LLM inference optimizations than the GPU baseline (similar results hold on NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.

Abstract:
Evaluating openended outputs from large language models (LLMs) remains challenging due to the absence of ground truth. We introduce TRACE (Toulmin-based Reasoning Assessment through Constructive Elements), a metric that analyzes Chain-of-Thought (CoT) reasoning processes. TRACE integrates Toulmin's argumentation theory with Flavell's metacognitive framework to assess reasoning structure. Experiments on 26.3K QA samples across 7 reasoning models show strong correlation with benchmark accuracy (r = 0.74). Furthermore, TRACE is effective as a reinforcement learning reward signal, outperforming accuracy-only baselines. These results suggest that TRACE serves as a complementary metric for evaluating open-ended outputs.

Abstract:
The Nearest Class Mean (NCM) classifier is widely favored in ClassIncremental Learning (CIL) for its superior resistance to catastrophic forgetting compared to Fully Connected layers. While Neural Collapse (NC) theory supports NCM's optimality by assuming features collapse into single points, non-linear feature drift and insufficient training in CIL often prevent this ideal state. Consequently, classes manifest as complex manifolds rather than collapsed points, rendering the single-point NCM suboptimal. To address this, we propose Hierarchical-Cluster SOINN (HC-SOINN), a novel classifier that captures the topological structure of these manifolds via a ``local-to-global'' representation. Furthermore, we introduce Structure-Topology Alignment via Residuals (STAR) method, which employs a fine-grained pointwise trajectory tracking mechanism to actively deform the learned topology, allowing it to adapt precisely to complex non-linear feature drift. Theoretical analysis and Procrustes distance experiments validate our framework's resilience to manifold deformations. We integrated HC-SOINN into seven state-of-the-art methods by replacing their original classifiers, achieving consistent improvements that highlight the effectiveness and robustness of our approach. Code is available at \urlhttps://anonymous.4open.science/r/icml2026-9B60.

Abstract:
VisionLanguage-Action (VLA) models have shown remarkable promise in robotics manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression using pruning inevitably discards critical geometric details like contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as a geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (with fewer than 10% original visual tokens). Experiments on both LIBERO benchmark and a real robotic platform demonstrate that validating the lowest feasible visual token count reported to date, GridS achieves a 76% reduction in FLOPs with no degradation in the success rate.

Abstract:
While EEG foundation models have shown significant potential in universal neural decoding across tasks, their advancement remains constrained by the inadequacy modeling ofcomplex spatiotemporal topology, as well as the inherentmodality gapbetween lowlevel physiological signals and high-level textual semantics. To address these challenges, we propose aKnowledge-AnchoredSemantically-DynamicTopologyBrainAutoregressive Model (KAST-BAR), which dynamically aligns physiological representations derived from multi-level brain topology with an expert-level semantic space. Specifically, we design a Dual-Stream Hierarchical Attention (DSHA) encoder that accurately captures the brain's intrinsic non-Euclidean topology by modeling local temporal dynamics with global spatial contexts. On this basis, a Knowledge-Anchored Semantic Profiler (KASP) is proposed to synthesize physically-grounded and instance-level textual profiles, which subsequently drive a Semantic Text-Aware Refiner (STAR) to dynamically reconstruct EEG representations using Latent Expert Queries. By conducting large-scale pre-training on 21 diverse datasets to build a foundation model, KAST-BAR effectively integrates expert-level medical knowledge into EEG signal representations, consistently achieving state-of-the-art performance across six downstream tasks. Our code is available at https://anonymous.4open.science/r/KAST-BAR

Abstract:
Learning from realworld data is frequently hindered by the compound challenge of long-tailed class distributions and noisy annotations. Existing methods partially address these issues but typically ignore the non-uniform impact of label noise across classes, resulting in ineffective correction for tail classes and over-regularization for head classes. To address this issue, we propose Class-Adaptive Rectification with Experts (CARE), a parameter-efficient framework that leverages three complementary supervision sources from vision-language models (VLM): observed noisy labels, VLM text embeddings, and visual features. CARE introduces a class-adaptive expert consensus mechanism that enforces stricter agreement for tail classes and more permissive agreement for head classes based on class frequency. By aggregating high-confidence predictions across these sources, CARE filters unreliable signals and recalibrates class distributions, yielding more reliable rectification under long-tailed distributions. Extensive experiments on both synthetic and real-world benchmarks demonstrate that CARE consistently outperforms state-of-the-art methods, achieving up to 3.0% performance gains. The source code is temporarily available at https://anonymous.4open.science/r/CARE-9F10.

Abstract:
Robots in dynamic, humancentric environments must follow language instructions while maintaining real-time reactive control. Vision-language-action (VLA) models offer a promising framework, but they assume temporally aligned reasoning and control, despite semantic inference being inherently delayed relative to real-time action. We introduce Think-in-Control (TIC)-VLA, a latency-aware framework that explicitly models delayed semantic reasoning during action generation. TIC-VLA defines a delayed semantic-control interface that conditions action generation on delayed vision-language semantic states and explicit latency metadata, in addition to current observations. We further propose a latency-consistent training pipeline that injects reasoning inference delays during imitation learning and online reinforcement learning, aligning training with asynchronous deployment. To support realistic evaluation, we present DynaNav, a physics-accurate, photo-realistic simulation suite for language-guided navigation in dynamic environments. Extensive experiments in simulation and on a real robot show that TIC-VLA consistently outperforms prior VLA models while maintaining robust real-time control under multi-second reasoning latency. Code, data, and benchmarks will be released for reproducibility.

Abstract:
Large Language Models (LLMs) are increasingly employed as automated judges for evaluating generative models. However, their known stylistic biases, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a blackbox adversarial framework that learns semantics-preserving edits to mislead the judgment and artificially inflate judged scores. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge’s score without access to model parameters or gradients. Theoretically, we prove a formal regret guarantee for our BITE, demonstrating its ability to efficiently learn to manipulate a judge in the realistic setting of model misspecification. Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate (>!65%) and raises scores by (+1)–(2) on a 9-point scale, while maintaining semantic equivalence. We further uncover model-specific "vulnerability fingerprints": judges differ in sensitivity to sentiment, register, and structural cues (e.g., headers), limiting cross-model transferability. Finally, we evaluate stealthiness and show that BITE evades standard style-control and simple detection baselines. Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation, e.g., style normalization, randomized prompting, and adversarial training of judges.

Abstract:
We introduce GenomeFactory, the first integrated Python library for tuning, deploying, and interpreting genomic foundation models. Our core contribution is to simplify and unify the workflow for genomic model development: data collection, model tuning, inference, benchmarking, and interpretability. For data collection, Genome-Factory offers an automated pipeline to download genomic sequences and preprocess them. It also includes quality control like GC content normalization. For model tuning, Genome-Factory supports three approaches: full-parameter, low-rank adaptation, and adapter-based fine-tuning. It is compatible with a wide range of genomic models. For inference, Genome-Factory enables both embedding extraction and DNA sequence generation. For benchmarking, we include two existing benchmarks and provide a flexible interface for users to incorporate additional benchmarks. For interpretability, Genome-Factory introduces the first open-source biological interpreter based on a sparse auto-encoder. This module disentangles embeddings into sparse, near-monosemantic latent units and links them to genomic features by regressing on external readouts. To improve accessibility, Genome-Factory offers a zero-code command-line and a user-friendly web interface. We validate the utility of Genome-Factory across three dimensions: (i) Compatibility with diverse models and fine-tuning methods; (ii) Benchmarking downstream performance using two open-source benchmarks; (iii) Biological interpretation of learned representations with DNABERT-2. These results highlight its practical value for real-world genomic analysis.

Abstract:
Online link recommendation on evolving graphs is performative: by choosing which candidate links to show users, the system changes which links form and what feedback it later observes. Consequently, fairness estimates from logged outcomes can be misleading and may drift after deployment when the recommendation policy is updated. We introduce COPF, a decisionlayer framework for deployment-stable fairness monitoring and control in online link recommendation. COPF (i) defines group-level opportunity gaps on exposure (shown vs. not shown) counterfactuals, (ii) makes them estimable by explicit exploration and by logging the probability (propensity) that each candidate is shown, and (iii) audits and controls fairness using residual outcome indistinguishability (OI) over a configurable auditor family with graph-aware doubly robust (GA-DR) estimators. We provide a noisy transfer theorem showing that Residual-OI on estimated GA-DR residuals implies bounds on exposure-counterfactual group gaps under temporal mixing and bounded local interference, and we instantiate an online multicalibration auditor together with a primal--dual controller. Experiments on two TGB streams and a controlled synthetic bipartite stream show that COPF reduces worst-case spikes in exposure-counterfactual group disparities with modest impact on ranking utility. Our code is available at https://anonymous.4open.science/r/fairlink-4EA0.

Abstract:
Public health reasoning requires populationlevel inference grounded in scientific evidence, expert consensus, and safety constraints. However, it remains underexplored as a structured machine learning problem with limited supervised signals and benchmarks. We introduce GlobalHealthAtlas, a large-scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages, stratified into three difficulty levels from health literacy to epidemiological and policy reasoning. Instances are derived from openly available public health sources and labeled by language, domain, and difficulty to support supervised learning and slice-based evaluation. We further propose a large language model (LLM) assisted construction and quality-control pipeline with retrieval, duplication, evidence-grounding checks, and label validation to improve consistency at scale. Finally, we present a domain-aligned evaluator distilled from high confidence judgments of diverse LLMs to assess outputs along six dimensions: Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, and Insightfulness. Together, these contributions enable reproducible training and evaluation of LLMs for public health reasoning beyond conventional QA benchmarks.

Abstract:
Current research on video hallucination mitigation primarily focuses on isolated error types, leavingcompositionalhallucinations—arising from incorrect reasoning over multiple interacting spatial and temporal factors largely underexplored. We introduceOmniVCHall, a benchmark designed to systematically evaluate both isolated and compositional hallucinations in video multimodal large language models (VLLMs). OmniVCHall spans diverse video domains, introduces a novel camerabased hallucination type, and defines a fine-grained taxonomy, together with adversarial answer options (e.g., “All are correct” and “None of the above”) to prevent shortcut reasoning. The evaluations of 39 representative VLLMs reveal that even advanced models (e.g., Qwen3-VL and GPT-5) exhibit substantial performance degradation. We proposeTriCD, a contrastive decoding framework with a triple-pathway calibration mechanism. An adaptive perturbation controller dynamically selects distracting operations to construct negative video variants, while a saliency-guided enhancement module adaptively reinforces grounded token-wise visual evidences. These components are optimized via reinforcement learning to encourage precise decision-making under compositional hallucination settings. Experimental results show that TriCD consistently improves performance across two representative backbones, achieving an average accuracy improvement of over 10%.

Abstract:
Assisting nonexpert users to develop complex interactive websites has become a popular task for LLM-powered code agents. However, existing code agents tend to only generate frontend web pages, masking the lack of real full-stack data processing and storage with fancy visual effects. Notably, constructing production-level full-stack web applications is far more challenging than only generating frontend web pages, demanding careful control of data flow, comprehensive understanding of constantly updating packages and dependencies, and accurate localization of obscure bugs in the codebase. To address these difficulties, we introduce FullStack-Agent, a unified agent system for full-stack agentic coding that consists of three parts: (1) FullStack-Dev, a multi-agent framework with strong planning, code editing, codebase navigation, and bug localization abilities. (2) FullStack-Learn, an innovative data-scaling and self-improving method that back-translates crawled and synthesized website repositories to improve the backbone LLM of FullStack-Dev. (3) FullStack-Bench, a comprehensive benchmark that systematically tests the frontend, backend and database functionalities of the generated website. Our FullStack-Dev outperforms the previous state-of-the-art method by 8.7%, 38.2%, and 15.9% on the frontend, backend, and database test cases respectively. Additionally, FullStack-Learn raises the performance of a 30B model by 9.7%, 9.5%, and 2.8% on the three sets of test cases through self-improvement, demonstrating the effectiveness of our approach.

Abstract:
As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test driven error exposure and feedback based correction outperform repetitive practice, we propose Diagnosticdriven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating DPE as a scalable paradigm for continual LMM training under open task distributions.

Abstract:
Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tooluse costs like simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost-sensitive parameter tuning in scientific simulations. SimulCost compares LLM tuning cost-sensitive parameters against traditional scanning approach in both accuracy and computational cost, spanning 2,916 single-round (initial guess) and 1,900 multi-round (adjustment by trial-and-error) tasks across 12 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator's cost is analytically defined and platform-independent. Frontier LLMs achieve 46--64% success rates in single-round mode, dropping to 35--54% under high accuracy requirements, rendering their initial guesses unreliable especially for high accuracy tasks. Multi-round mode improves rates to 71--80%, but LLMs are 1.5--2.5× slower than traditional scanning, making them uneconomical choices. We also investigate parameter group correlations for knowledge transfer potential, and the impact of in-context examples and reasoning effort, providing practical implications for deployment and fine-tuning. We open-source SimulCost as a static benchmark and extensible toolkit to facilitate research on improving cost-aware agentic designs for scientific simulations, and for expanding new simulation environments.

Abstract:
Multimodal Large Language Models (MLLMs) have made substantial progress on visual understanding tasks, yet they still perform poorly on highresolution images. Prior work often attributes this limitation to perceptual constraints, arguing that MLLMs fail to recognize small objects and therefore rely on "zoom-in" strategies to recover fine details. In contrast, our analysis shows that the dominant failure mode is background interference rather than object size. We study the "zoom-in" operation through ahierarchical decoupling analysisand propose theHierarchical Decoupling Framework (HiDe), a training-free method that turns implicit attention into explicit region selection. HiDe first performs Token-wise Attention Decoupling (TAD) to disentangle question semantics and identify the most informative tokens, then uses their attention patterns to pinpoint the corresponding visual regions. It subsequently applies Layout-Preserving Decoupling (LPD) to extract these regions from cluttered backgrounds and construct a compact representation that retains key spatial structure while filtering out irrelevant context. HiDe achieves state-of-the-art results on high-resolution benchmarks like VBench, HRBench4K, and HRBench8K. It boosts Qwen2.5-VL 7B and InternVL3 8B to state of the art performance, reaching 92.1% and 91.6% on VBench, and even surpasses reinforcement learning based methods. After optimization, HiDe reduces memory usage by 75% compared with the previous training-free approach.

Abstract:
We investigate the capabilities and scalability of Large Language Models (LLMs) in optimization modeling, a domain requiring structured reasoning and precise formulation. To this end, we introduce OPTENGINE, an extensible benchmark framework with quantifiable and controllable complexity. OPT-ENGINE spans ten canonical operations research problems, systematically scaling from Linear Programming to Mixed-Integer Programming, thus providing a structured environment to probe the limits of automated problem formulation and solving. Our results reveal a sharp performance degradation as task complexity scale, highlighting a critical robustness gap in pure-text reasoning. While LLMs struggle with end-to-end solution generation, we demonstrate that tool-integrated reasoning provides a significantly more resilient path forward, regardless of model size. Furthermore, we identify the automated formulation of the constraints as the primary bottleneck. These insights clarify the limitations of current LLMs and provide a structured roadmap for developing next-generation LLMs for optimization modeling.

Abstract:
Process Reward Models (PRMs) are rapidly becoming the backbone of LLM reasoning pipelines, yet we demonstrate that stateof-the-art PRMs are systematically exploitable under optimization pressure. We introduce a three-tiered diagnostic framework that applies increasing adversarial pressure to quantify these vulnerabilities. Static perturbation analysis uncovers a fluency-logic dissociation: high invariance to surface-level style changes (reward changes <0.1) yet inconsistent detection of logically corrupted reasoning, with different models failing on different attack types. Adversarial optimization demonstrates that gradient-based attacks inflate rewards on invalid trajectories, with reward landscapes exhibiting wide, exploitable peaks. RL-induced reward hacking exposes the critical failure mode: policies trained on AIME problems achieve near-perfect PRM rewards (>0.9) while ground-truth accuracy remains below 4%, with 43% of reward gains attributable to stylistic shortcuts. These findings reveal that current PRMs function as fluency detectors rather than reasoning verifiers, creating systematic blind spots that undermine their use as training signals. We release PRM-BiasBench and a diagnostic toolkit to enable robustness evaluation before deployment.

Abstract:
Time series generation (TSG) is widely used across domains, yet most existing methods assume regular sampling and fixed output resolutions. These assumptions are often violated in practice, where observations are irregular and sparse, while downstream applications require continuous and highresolution TS. Although Neural Controlled Differential Equation (NCDE) is promising for modeling irregular TS, it is constrained by a single dynamics function, tightly coupled optimization, and limited ability to adapt learned dynamics to newly generated samples from the generative model. We propose MN-Diff, a continuous TSG framework that enhances NCDE with a Mixture-of-Experts (MoE) dynamics function and a decoupled architectural design for dynamics-focused training. To further enable NCDE to generalize to newly generated samples, MN-Diff employs a diffusion model to parameterize the NCDE temporal dynamics parameters (MoE weights), i.e., jointly learn the distribution of TS data and MoE weights. This design allows sample-specific NCDE parameters to be generated for continuous TS generation. Experiments on ten public and synthetic datasets demonstrate that MN-Diff consistently outperforms strong baselines on both irregular-to-regular and irregular-to-continuous TSG tasks. The code is available at the link https://anonymous.4open.science/r/MN-Diff-2688.

Abstract:
Recent advances in autoregressive video diffusion have enabled realtime frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naïvely applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12 times extrapolation (e.g. 5s-trained -> 60s+ generation) with better imaging quality and aesthetic quality, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.

Abstract:
In safetycritical domains where online data collection is infeasible, offline reinforcement learning (RL) is attractive only if policies achieve high returns without catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety at the cost of value- or model-based pessimism, and restricted policy classes that limit policy expressiveness, whereas diffusion/flow-based expressive generative policies have largely been used in risk-neutral settings. We introduce Risk-Aware Multimodal Actor-Critic (RAMAC), a simple, modular, model-free framework that couples an expressive generative actor (e.g., diffusion/flow) with a distributional critic and optimizes a composite objective that combines Conditional Value-at-Risk (CVaR) with behavioral cloning (BC), enabling risk-sensitive learning in complex multimodal scenarios. Since out-of-distribution (OOD) actions are a major driver of catastrophic failures in offline RL, we further provide an objective-level analysis showing that controlling behavior divergence via BC suppresses OOD actions and stabilizes CVaR. Instantiating RAMAC with a diffusion actor, we illustrate these insights on a 2-D risky bandit and evaluate on Stochastic-D4RL, observing consistent gains in CVaR_0.1 while maintaining strong returns.

Abstract:
Large Language Models (LLMs) can be seen as compressed knowledge bases, but it remains unclear what knowledge they truly contain and how far their knowledge boundaries extend. Existing benchmarks are mostly static and provide limited support for systematic knowledge probing. In this paper, we propose an interactive agentic framework to systematically extract and quantify the knowledge of LLMs. Our method includes four adaptive exploration policies to probe knowledge at different granularities. To ensure the quality of extracted knowledge, we introduce a threestage knowledge processing pipeline that combines vector-based filtering to remove exact duplicates, LLM-based adjudication to resolve ambiguous semantic overlaps, and domain-relevance auditing to retain valid knowledge units. Through extensive experiments, we find that recursive taxonomy is the most effective exploration strategy. We also observe a clear knowledge scaling law, where larger models consistently extract more knowledge. In addition, we identify a Pass@1-versus-Pass@k trade-off: domain-specialized models achieve higher initial accuracy but degrade rapidly, while general-purpose models maintain stable performance during extended extraction. Finally, our results show that differences in training data composition lead to distinct and measurable knowledge profiles across model families.

Abstract:
Effectively retrieving, reasoning, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrievalaugmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long-context tasks, especially those involving information-sparse yet token-heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph-Modulated Visual Memory Encoding mechanism, with which the significance of memory nodes is evaluated via their topological position, allowing the model to dynamically allocate high-resolution tokens to pivotal evidence while compressing or discarding trivial clues. To implement this paradigm, we propose a Graph-Guided Policy Optimization strategy. This strategy disentangles step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions, thereby facilitating fine-grained credit assignment. Extensive experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks.

Abstract:
Evaluating generative AI models is increasingly resourceintensive due to slow generation, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as error likelihood or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop strategies that proactively select or synthesize the most informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 10–100x fewer samples to achieve estimates within \pm1% of ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.

Abstract:
Multimodal RetrievalAugmented Generation (MRAG) has emerged as a key paradigm for grounding MLLMs with external knowledge. While query pre-processing (e.g., rewriting) is standard in text-based RAG, existing MRAG pipelines predominantly treat visual inputs as static and immutable, implicitly assuming they are noise-free. However, real-world visual queries are often ``imperfect''---suffering from geometric distortions, quality degradation, or semantic ambiguity---leading to catastrophic retrieval failures. To address this gap, we propose V-QPP-Bench, the first comprehensive benchmark dedicated to Visual Query Pre-processing (V-QPP). We formulate V-QPP as an agentic decision-making task where MLLMs must autonomously diagnose imperfections and deploy perceptual tools to refine queries. Our extensive evaluation across 46,700 imperfect queries and diverse MRAG paradigms reveals three critical insights: (1) Vulnerability---visual imperfections severely degrade both retrieval recall and end-to-end MRAG performance; (2) Restoration Potential & Bottleneck---while oracle preprocessing recovers near-perfect performance, off-the-shelf MLLMs struggle with tool selection and parameter prediction without specialized training; and (3) Training Enhancement---supervised fine-tuning enables compact models to achieve comparable or superior performance to larger proprietary models, demonstrating the benchmark's value for developing robust MRAG systems The code is available at https://anonymous.4open.science/r/VQQP-Bench-726C/.

Abstract:
Generating human videos with realistic and controllable motions is a challenging task. While existing methods can generate visually compelling videos, they lack separate control over four key video elements: foreground subject, background video, human trajectory, and action patterns. In this paper, we propose a decomposed human motion control and video generation framework that explicitly decouples motion from appearance, subject from background, and action from trajectory, enabling flexible mixand-match composition of these elements. Concretely, we first build a ground-aware 3D world coordinate system and perform motion editing directly in the 3D space. Trajectory control is implemented by unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed alignment and orientation adjustment; actions are supplied by a motion bank or generated via text-to-motion methods. Then, based on modern text-to-video diffusion transformer models, we inject the subject as tokens for full attention, concatenate the background along the channel dimension, and add motion (trajectory and action) control signals by addition. Such a design opens up the possibility for us to generate realistic videos of anyone doing anything anywhere. Extensive experiments on benchmark datasets and real-world cases demonstrate that our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.The source codes and project page with video examples are in the supplementary materials.

Abstract:
Sampling from a distribution p(x) \propto e^\mathcalE(x) known up to a normalising constant is an important and challenging problem in statistics. Recent years have seen the rise of a new family of amortised sampling algorithms, commonly referred to as diffusion samplers, that enable fast and efficient sampling from an unnormalised density. Such algorithms have been widely studied for continuous-space sampling tasks; however, their application to problems in discrete space remains largely unexplored. Although some progress has been made in this area, discrete diffusion samplers do not take full advantage of ideas commonly used for continuous-space sampling. In this paper, we propose to bridge this gap by introducing off-policy training techniques for discrete diffusion samplers. We show that these techniques improve the performance of discrete samplers on both established and new synthetic benchmarks. Next, we generalise discrete diffusion samplers to the task of bridging between two arbitrary distributions, introducing data-to-energy Schrödinger bridge training for the discrete domain for the first time. Lastly, we showcase the application of the proposed diffusion samplers to data-free posterior sampling in the discrete latent spaces of image generative models.

Abstract:
Supervised finetuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL’s use of on-policy data. We propose a framework to bridge this chasm by enabling On-Policy SFT. We first presentDistribution Discriminant Theory (DDT), which explains and quantifies the alignment between data and the model-induced distribution. Leveraging DDT, we introduce two complementary techniques: (i)In-Distribution Finetuning (IDFT), a loss-level method to enhance generalization ability of SFT, and (ii)Hinted Decoding, a data-level technique that can re-align the training corpus to the model’s distribution. Extensive experiments demonstrate that our framework achieves generalization performance on par with prominent offline RL algorithms, including DPO and SimPO, while maintaining the efficiency of an SFT pipeline. The proposed framework thus offers a practical alternative in domains where RL is infeasible. We will open-source the code and data on GitHub.

Abstract:
Diffusion large language models (dLLMs) have recently gained significant attention due to their inherent support for parallel decoding. Building on this paradigm, Mixtureof-Experts (MoE) dLLMs with autoregressive (AR) initialization have further demonstrated strong performance competitive with mainstream AR models. However, we identify a fundamental mismatch between MoE architectures and diffusion-based decoding. Specifically, a large number of experts are activated at each denoising step, while only a small subset of tokens is ultimately accepted, resulting in substantial inference overhead and limiting their deployment in latency-sensitive applications. In this work, we proposeTEAM, a plug-and-play framework that accelerates MoE dLLMs by enabling more accepted tokens with fewer activated experts. TEAM is motivated by the observation that expert routing decisions exhibit strong temporal consistency across denoising levels as well as spatial consistency across token positions. Leveraging these properties, TEAM employs three complementary expert activation and decoding strategies, conservatively selecting necessary experts for decoded and masked tokens and simultaneously performing aggressive speculative exploration across multiple candidates. Experimental results demonstrate that TEAM achieves up to 2.2× speedup over vanilla MoE dLLM, with negligible performance degradation.

Abstract:
Recent Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation. However, their ability to create complex visualizations for scaled and structured data remains largely unevaluated and underdeveloped. To address this gap, we introducePlotCraft, a new benchmark featuring 1k challenging visualization tasks that cover a wide range of topics, such as finance, scientific research, and sociology. The benchmark is structured around seven highlevel visualization tasks and encompasses 48 distinct chart types. Crucially, it is the first to systematically evaluate both single-turn generation and multi-turn refinement across a diverse spectrum of task complexities. Our comprehensive evaluation of 23 leading LLMs on PlotCraft reveals obvious performance deficiencies in handling sophisticated visualization tasks. To bridge this performance gap, we developeSynthVis-30K, a large-scale, high-quality dataset of complex visualization code synthesized via a collaborative agent framework. Building upon this dataset, we developePlotCraftor, a novel code generation model that achieves strong capabilities in complex data visualization with a remarkably small size. Across VisEval, PandasPlotBench, and our proposed PlotCraft, PlotCraftor shows performance comparable to that of leading proprietary approaches. Especially, on hard task, Our model achieves over 50% performance improvement. We will release the benchmark, dataset, and code at \hrefhttps://anonymous.4open.science/r/PlotCraft-E320PlotCraft anonymous repository.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen2.5 achieve significant gains even with spurious rewards. We investigate this phenomenon and identify ``Perplexity Paradox'': spurious RLVR triggers a divergence where answertoken perplexity drops while prompt-side coherence degrades, suggesting model is bypassing reasoning in favor of memorization. Using a suite of mechanistic interpretability tools, including Path Patching and Logit Lens, we identify a previously unknown Anchor–Adapter circuit. This circuit enables model to bypass reasoning and directly retrieve memorized solutions under spurious RLVR. We localize a Functional Anchor in middle layers (L18–20) that triggers retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering, i.e., artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models.

Abstract:
Audiodriven visual dubbing aims to synchronize a video's lip movements with new speech but is fundamentally challenged by the lack of ideal training data: paired videos differing only in lip motion. Existing methods circumvent this via mask-based inpainting. However, masking inevitably destroys spatiotemporal context, leading to identity drift and poor robustness (e.g., to occlusions), while also inducing lip-shape leakage that degrades lip sync. To bridge this gap, we propose X-Dub, a novel two-stage generative bootstrapping framework leveraging powerful Diffusion Transformers to unlock mask-free dubbing. Our core insight is to repurpose a mask-based inpainting model exclusively as a dedicated data generator to synthesize scalable, high-fidelity pseudo-paired data, which is subsequently utilized to train and bootstrap a robust, mask-free editing model as the final video dubber. The final dubber is liberated from masking artifacts and leverages the complete video input for high-fidelity inference. We further introduce timestep-adaptive multi-phase learning to disentangle conflicting objectives (structure, lip motion, and texture) across diffusion phases, facilitating stable convergence and advanced editing quality. Additionally, we present X-DubBench, a benchmark for diverse scenarios. Extensive experiments demonstrate that our method achieves state-of-the-art performance with superior lip sync, visual quality, and robustness. More results can be viewed in the supplementary. Code and model will be released.

Abstract:
Testtime scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample-efficient test-time self-consistency under a limited budget remains an open challenge. We introduce PETS (Principled and Efficient Test-Time Self-Consistency), which initiates a principled study of trajectory allocation through an optimization framework. Central to our approach is the \emphself-consistency rate, a new measure defined as agreement with the infinite-budget majority vote. This formulation makes sample-efficient test-time allocation theoretically grounded and amenable to rigorous analysis. We study both offline and online settings. In the offline regime, where all questions are known in advance, we connect trajectory allocation to crowdsourcing, a classic and well-developed area, by modeling reasoning traces as workers. This perspective allows us to leverage rich existing theory, yielding theoretical guarantees and an efficient majority-voting-based allocation algorithm. In the online streaming regime, where questions arrive sequentially and allocations must be made on the fly, we propose a novel method inspired by the offline framework. Our approach adapts budgets to question difficulty while preserving strong theoretical guarantees and computational efficiency. Experiments show that PETS consistently outperforms uniform allocation. On GPQA, PETS achieves perfect self-consistency in both settings while reducing the sampling budget by up to 75% (offline) and 55% (online) relative to uniform allocation.

Abstract:
Inferencetime LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying internal activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well-approximated by locally-linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer-wise Jacobians, steering activations toward desired semantic setpoints in closed-loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. Using a novel adaptive semantic feature setpoint signal, we show that our method achieves robust, fine-grained behavior control across models, scales, and tasks, including modulation of toxicity, truthfulness, and arbitrary concepts, surpassing baseline steering methods.

Abstract:
The quality of both the environment and the reward model fundamentally governs the effectiveness of reinforcement learning. Accordingly, we propose RLAnything, a reinforcement learning framework that dynamically optimizes each component through closedloop optimization, amplifying learning signals and strengthening the overall system. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains in practical applications, boosting Qwen3-VL-8B-Thinking by 8.5% on OSWorld and Qwen2.5-7B-Instruct by 21.2% and 12.1% on AlfWorld and LiveBench, respectively.

Abstract:
Machine Learning as a Service (MLaaS) has become a widely adopted method for delivering deep neural network (DNN) models, allowing users to conveniently access models via APIs. However, such services have been shown to be highly vulnerable to Model Extraction Attacks (MEAs). While numerous defense strategies have been proposed, verifying the ownership of a suspicious model with strict theoretical guarantees remains a challenging task. To address this gap, we introduce CREDIT a certified defense against MEAs. Specifically, we employ mutual information to quantify the similarity between DNN models, propose a practical verification threshold, and provide rigorous theoretical guarantees for ownership verification based on this threshold. We extensively evaluate our approach on several mainstream datasets and achieve stateof-the-art performance. Our implementation is publicly available at: \urlhttps://anonymous.4open.science/r/CREDIT.

Abstract:
VisionLanguage-Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiencies, and the inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B parameter VLM designed to enable zero-shot deployment on novel embodiments for open-vocabulary tasks. To achieve this, we collected one of the largest open-source robotic datasets—over 10,000 hours of demonstrations in diverse families—using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow-matching, and distillation for real-time inference. Consequently, RDT2 becomes one of the first models that simultaneously zero-shot generalizes to unseen objects, scenes, instructions, and even robotic platforms. Besides, it outperforms state-of-the-art baselines in dexterous, long-horizon, and dynamic downstream tasks like playing table tennis.

Abstract:
Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, visionlanguage misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are typically built on decoder-only LLMs consisting of a causal attention mechanism, whichlimits the ability of the earlier modalities (e.g., images) to incorporate information from the latter modalities (e.g., text). To address this problem a MLLM that unlocks causal attention into our proposed modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows MMA to achieve state-of-the-art performance in 12 multimodal understanding benchmarks (+6.2% on average across 3 LLMs backbones) without introducing additional parameters. Our MMA design is intended to be generic, allowing for applications across various modalities, and scalable to accommodate diverse multimodal scenarios.

Abstract:
Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomesa task where AI could significantly exceed human capabilities---remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26% and human expert performance is \approx20%. Although some frontier models exceed human performance model accuracy is still far below what would enable reliable experimental guidance. Even within the limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only \approx20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from \approx5% to \approx80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility all our data and code are provided at https://anonymous.4open.science/r/SciPredict-AI01.

Abstract:
We proposeInfiniteWorld, a robust interactive world model capable of maintaining coherent visual memory over1000+ framesin complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground-truth, they lack an effective training paradigm for real-world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce aHierarchical Pose-free Memory Compressor (HPMC)that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose anUncertainty-aware Action Abstractionmodule that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ aRevisit-Dense Finetuning Strategyusing a compact, 30-minute dataset to efficiently activate the model’s long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency. Our code and data will be released.

Abstract:
Finegrained Mixture-of-Experts (MoE) models sparsely activate a subset of parameters, significantly reducing computational costs while maintaining performance. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. By introducing a temporal inductive bias, ReMoE encourages the model to consistently select the same experts over time, which aligns the routing behavior with cache locality constraints, reducing the need to fetch experts from storage without adding any extra computation during inference. Experiments on DeepSeek and Qwen models show that ReMoE improves the expert reuse rate by 26%. Under a standard LRU caching policy simulation, ReMoE improves the cache hit rate by 15.7%, corresponds to a 7.8% reduction in median latency and an 8.5% increase in proxy throughput, while maintaining downstream task performance.

Abstract:
Bridging the gap between visual realism and physical understanding is a core challenge for videobased world models. We study the structural identifiability of continuous-time physical laws from raw pixels, focusing on whether an encoder-only pipeline can uniquely recover the parameters of second-order linear ODEs. We prove that a level-set slope-coverage condition ensures the learned latent space is locally affine to the true physical state, enabling exact parameter recovery. Our theory provides the first characterization of minimal data requirements across damping regimes, establishing that underdamped systems are identifiable from a single video clip, whereas other regimes require three diverse trajectories. We further introduce a variance-floor regularizer to stabilize the decoder-free objective and prevent latent collapse. Validated on synthetic and real-world data, our approach demonstrates that interpretable physical constants can be reliably estimated from video without the need for compute-intensive pixel reconstruction, ensuring both physical correctness and transparency.

Abstract:
Large Language Models (LLMs) are increasingly deployed as autonomous agents that execute toolaugmented, multi-step tasks, where latency is a critical factor for real-world applications. Yet an overlooked threat is Reasoning-Level Denial-of-Service (R-DoS), in which an attacker preserves task correctness but degrades availability by inflating an agent’s reasoning depth or tool-use budget. We introduce OTora, the first unified, two-stage red-teaming framework for instantiating R-DoS attacks. Stage I optimizes an adversarial trigger that induces targeted tool invocations using insertion-aware scoring and dynamic target co-evolution, supporting both black-box and white-box settings. Stage II generates agent-aware reasoning payloads via an ICL-guided genetic search that amplifies overthinking while maintaining correct task outcomes. Across WebShop, Email, and OS agents built on multiple backbone models such as LLaMA-70B and GPT-OSS-120B, OTora achieves up to 10× increases in reasoning tokens and order-of-magnitude latency slowdowns, all while preserving near-baseline task accuracy. Finally, we discuss mitigation strategies for detecting and constraining abnormal reasoning and latency spikes.

Abstract:
Ensembles of neural networks typically outperform individual networks but incur large computational costs, whereas weight aggregation produces less costly, yet also less accurate, aggregate models. We introduce partial fusion of networks, which interpolates between ensembles and weight aggregation and thus allows for a flexible tradeoff between computational cost and performance. A direct way to achieve this is to extend existing weight aggregation methods based on neuronlevel similarity between different networks, where partial fusion then only aggregates weights of neurons which are most similar. We showcase one particular method to jointly identify which neurons are most similar and match them via partial optimal transport. Further, we consider the more general perspective of weight aggregation and partial fusion as generalized pruning of ensemble models, where neurons cannot just be deleted, but also linearly combined. Finally, we show that generalized pruning applied to a single network yields similar benefits as partial fusion by allowing for a tradeoff between isolating, deleting, and linearly combining neurons based on similarity.

Abstract:
Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks. Based on path rewards, RL can naturally jointly optimize shortterm acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRS results in deficient gradient estimation. It has been observed that length-dependent bias causes gradients to favor path extension over deeper exploration, while weighting each step by path-level reward leads to high gradient variance. To rectify these two deficiencies, we propose an effective RL framework ProRL with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our codes are available at https://anonymous.4open.science/r/ProRL-D56DHM.

Abstract:
VisionLanguage Models (VLMs) such as CLIP demonstrate strong zero-shot generalization, but their performance significantly degrades in cross-domain scenarios with scarce target-domain training data (Cross-Domain Few-Shot Learning, CDFSL). In this paper, we focus on the target-domain few-shot finetuning in the CLIP-based CDFSL task. Prevailing finetuning paradigms uniformly align all image patch tokens with their corresponding textual embeddings. However, we find a counterintuitive phenomenon: actively pushing away certain low-similarity image tokens, termed “tail tokens”, from their textual embeddings consistently improves target-domain performance. We delve into this phenomenon and provide a novel interpretation: under great domain shifts and scarce training data, the model can hardly extract semantic information from visual inputs; therefore, the common belief of alignment is valid only for tokens already containing sufficient semantic information; for tail tokens, forcing the alignment would lead to excessive overfitting to the scarce training, while breaking the alignment is more useful. Motivated by this, we propose Adaptive Tail-Head Alignment (ATHA), a novel fine-tuning strategy for CLIP that transforms the conventional uniform alignment paradigm to an adaptive alignment paradigm, with both alignment strengthening and weakening. Extensive experiments on four challenging CDFSL benchmarks validate our state-of-the-art performance. Our codes will be released.

Abstract:
Manufacturable chip layouts must satisfy thousands of geometrybased design rules, and design rule checking (DRC) enforces them by running executable DRC scripts on layouts. Translating natural language rules into correct DRC scripts is labor-intensive and requires specialized expertise, motivating LLM agents for DRC script synthesis and debugging. However, existing benchmarks have small evaluation sets and often evaluate scripts by code similarity rather than execution correctness, and prior machine learning-based methods either ignore execution feedback or require labeled test layouts as agent's input. To this end, we introduce Rule2DRC, a large-scale benchmark for DRC script coding agents with 1,000 rule-to-script tasks and 13,921 evaluation chip layouts for execution-based scoring. Rule2DRC provides an evaluation pipeline that measures functional correctness via DRC execution outcomes without requiring evaluation layouts as input to the agent. We also propose SplitTester, a tester agent for program selection that uses execution feedback to generate discriminative test cases and separate previously indistinguishable candidate scripts, substantially improving Best-of-N selection performance in this domain.

Abstract:
Watermarking for large language models (LLMs) is a promising approach for detecting LLMgenerated text and enabling responsible deployment. However, existing watermarking methods are often vulnerable to semantic-invariant attacks, such as paraphrasing. We propose PASA, a principled, robust, and distortion-free watermarking algorithm that embeds and detects a watermark at the semantic level. PASA operates on semantic clusters in a latent embedding space and constructs a distributional dependency between token and auxiliary sequences via shared randomness synchronized by a secret key and semantic history. This design is grounded in our theoretical framework that characterizes a jointly optimal embedding-detection pair, achieving the fundamental trade-offs among detection accuracy, robustness, and distortion. Evaluations across multiple LLMs and semantic-invariant attacks demonstrate that PASA remains robust even under strong paraphrasing attacks while preserving high text quality, outperforming standard vocabulary-space baselines. Ablation studies further validate the effectiveness of our hyperparameter choices.

Abstract:
Large language model (LLM) watermarking has shown promise in detecting AIgenerated content and mitigating misuse, with prior work claiming robustness against paraphrasing and text editing. In this paper, we argue that existing evaluations are not sufficiently adversarial, obscuring critical vulnerabilities and overstating the security. To address this, we introduce the adaptive robustness radius, a formal metric that quantifies the worst-case resilience of watermarks against adaptive adversaries. By lifting the paraphrase space into a KL-divergence ball, we approximate this radius and theoretically demonstrate that optimizing the attack context and model parameters can significantly reduce the approximated radius, making watermarks highly vulnerable to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning (RL)–based adaptive attack that erases watermarks while preserving semantic fidelity. RLCracker requires only limited watermarked examples and zero access to the detector. Despite weak supervision, it empowers a 3B model to achieve 98.5% removal success with minimal semantic shift on 1,500-token Unigram-marked texts after training on only 100 short samples. This performance dramatically exceeds 6.75% by GPT-4o and generalizes across five model sizes over ten watermarking schemes.

Abstract:
Time series data is ubiquitous in realworld scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve complex problems. However, current benchmarks for generalist models largely overlook this dimension. To bridge this gap, we introduce TSRBench, a comprehensive multi-modal benchmark designed to stress-test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4125 problems from 14 domains, and is categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision-Making. ii) 15 tasks from the 4 dimensions evaluating essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluated over 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs within TSRBench. Our findings reveal that: i) scaling laws hold for perception and reasoning but break down for prediction; ii) strong reasoning does not guarantee accurate context-aware forecasting, indicating a decoupling between semantic understanding and numerical prediction; and iii) despite the complementary nature of textual and visual forms of time series as inputs, current multimodal models fail to effectively fuse them for reciprocal performance gains. TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance generalist models.

Abstract:
Lifelong Model Editing aims to continuously update evolving facts in Large Language Models while preserving unrelated knowledge and general capabilities, yet it remains plagued by catastrophic forgetting and model collapse. Empirically, we find that the few recent editors resilient over long horizons share the same core strategy:Lifelong Normalization (LN), which normalizes value gradients using running statistics. Removing LN causes immediate collapse, and we observe a counterintuitivepositive cumulative effectwhere early edits can facilitate later edits. This suggests that with LN, early edits can stabilize the model and promote the success of future edits. Yet the mechanism of LN remains a "black box", leaving its precise role in lifelong stability poorly understood. In this work, we provide thefirsttheoretical account of LN in the lifelong regime. Our analysis reveals a self-reinforcing stability loop and proves that, when combined with ridge-regularized regression, LN yields updates withasymptotic orthogonalityandbounded norms, directly mitigating forgetting and systemic collapse. Based on these insights, we deriveStableEdit, which strengthens this stability loop via an explicit warm-up stage and full whitening, improving long-horizon stability at minimal overhead. Extensive experiments validate our theory and demonstrate competitive performance.

Abstract:
Recent unified models integrate multimodal understanding and generation within a single framework. However, an ``understandinggeneration gap'' persists, where models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in anything-to-image task~(X2I): the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a novel framework that empowers unified models to autonomously switch between generation strategies based on instruction complexity and model capability. To achieve this, we construct a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases, self-reflection for quality refinement, and multi-step planning for decomposing complex scenarios. Building on this pipeline, we contribute a high-quality dataset with over 50,000 samples and implement a two-stage training strategy comprising SFT and RL. Specifically, we design step-wise reasoning rewards to ensure logical consistency and intra-group complexity penalty to prevent redundant computational overhead. Extensive experiments demonstrate that our method outperforms existing baselines on X2I, achieving superior generation fidelity among simple-to-complex instructions.

Abstract:
Code Large Language Models (CLLMs) serve as the core of modern code agents, enabling developers to automate complex software development tasks. In this paper, we present Poisonwith-Style (PwS), a practical and stealthy model poisoning attack targeting CLLMs. Unlike prior attacks that assume an active adversary capable of directly embedding explicit triggers (e.g., specific words) into developers' prompts during inference, PwS leverages developers' code styles as covert triggers implicitly embedded within their prompts. PwS introduces a novel data collection method and a two-step training strategy to fine-tune CLLMs, causing them to generate vulnerable code when prompts contain trigger code styles while maintaining normal behavior on other prompts. Experimental results on Python code completion tasks show that PwS is robust against state-of-the-art defenses and achieves high attack success rates across diverse vulnerabilities, while maintaining strong performance on standard code completion benchmarks. For example, in code completion tasks that are vulnerable to improper input validation (i.e., CWE-20), the poisoned model generates insecure code up to 95% of the cases when the trigger code style is used, with less than 5% drop in pass@1 performance on the HumanEval and MBPP benchmarks.

Abstract:
Traditional whole slide image (WSI) analysis methods typically rely on the multiple instance learning (MIL) paradigm, which extracts patchlevel features at high magnification and aggregates them for slide-level prediction. However, such exhaustive patch-level processing is computationally expensive, severely limiting the efficiency and scalability of WSI analysis. To address this challenge, we propose PathCTM (a \ulPathology-oriented \ulContinuous \ulThought \ulModel) that enables token-efficient scale-space continuous reasoning for gigapixel WSIs. PathCTM formulates diagnostic inference as a dynamic sequential information pursuit. It progressively transitions from low-magnification global to high-magnification local inspection, and adaptively terminates inference when sufficient evidence is gathered to effectively bound decision uncertainty. Specifically, it uses conditional computation for dynamic scale switching with attention-guided region pruning, coupled with confidence-aware early stopping. Extensive experiments demonstrate that compared with state-of-the-art MIL methods, PathCTM reduces the number of required image patches by 95.95%, shortens inference time by approximately 95.62%, and improves AUC by an average of 2.3%. Code is available at \urlhttps://anonymous.4open.science/r/PathCTM.

Abstract:
Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KDbased unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion (\mathrmD^2), a paradigm that restricts the response distribution in the latent space rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement \mathrmD^2, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of \mathrmD^2.

Abstract:
Largescale web-harvested datasets have fueled the progress of cross-modal retrieval but inevitably suffer from noisy correspondence, which severely degrades model generalization. Existing methods primarily address this by filtering out noise or seeking a substitute label, yet they predominantly remain bound by a ``Discrete Selection'' paradigm. We argue that relying on a single discrete proxy induces Single-Point Fragility and Discretization Error. To overcome these limitations, we propose a novel framework, Intra-modal Neighbor-based Rectification (IN^2R), which shifts the paradigm from searching for a substitute to synthesizing a reliable supervision target. Leveraging the intrinsic geometric stability of intra-modal data, IN^2R employs a Graph Refiner to perform relational reasoning over neighbors retrieved from a dynamic Cross-Model Memory. Instead of propagating discrete labels, our method synthesizes a continuous, soft prototype that reflects the consensus of the local semantic neighborhood, effectively rectifying inter-modal misalignment. Extensive experiments on Flickr30K, MS-COCO, and CC152K demonstrate that IN^2R significantly outperforms state-of-the-art methods.

Abstract:
Messagepassing neural networks (MPNNs) often suffer from an information bottleneck when capturing long-range dependencies, leading to the oversquashing (OSQ) phenomenon. Alongside spatial connectivity enrichment (e.g., rewiring), recent studies have shown that spectral filtering can yield strong long-range learning outcomes, as spectral operators enable global information mixing that alleviates OSQ. These approaches achieve this either by stabilizing the Jacobian energies in deep propagation or by guaranteeing OSQ mitigation under strong theoretical assumptions. We revisit these conclusions and show that the associated Jacobian sensitivity lower bound is generally difficult to achieve in practice. We then propose S^3GNN, which mitigates OSQ without such restrictive assumptions by lightweightly reintroducing omitted components with substantially lower computational complexity, while standard stability constraints on feature transformations remain effective under our new dynamics. Extensive experiments across diverse domains (e.g., long-range benchmarks, KGQA, and mesh-based fluid dynamics) demonstrate that S^3GNN achieves up to an order-of-magnitude error reduction with up to 50% fewer parameters. Our code can be found in https://anonymous.4open.science/r/S3GNN-2359.

Abstract:
Highresolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade-off between coverage and efficiency. Visual expert-assisted search is efficient but prone to blind spots when proposals fail, whereas scan-based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training-free adaptive framework that dynamically schedules search strategies via an Assess-then-Search workflow. Specifically, CVSearch first invokes expert-assisted search when global information is insufficient, and only triggers a novel semantic-aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom-Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state-of-the-art accuracy while substantially improving search efficiency. Faithful code and configurations will be released.

Abstract:
Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid RewardCyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma-2B). In particular, its superior performance in the Ties domain empirically validates the model's robustness in handling complex, non-strict preferences. Furthermore, extensive evaluations on downstream generation tasks (AlpacaEval 2.0 and MT-Bench) confirm the efficacy of our framework. Notably, on Gemma-2B-it, HRC+DSPPO achieves a peak length-controlled win-rate of 44.75% against GPT-4-Turbo on AlpacaEval 2.0 (evaluated by GPT-4o-mini), significantly outperforming standard SPPO baselines trained with BT or GPM.

Abstract:
AI assistants in humanAI collaboration often correct suboptimal human actions through behavioral feedback (e.g., alerts or steering-wheel nudges in assistive driving). Such interventions can mitigate immediate errors, but long-term improvement requires addressing the underlying misconceptions that cause repeated mistakes. We introduce SENSEI, a framework that infers user misconceptions from interaction behavior and provides targeted, minimal yet sufficient suggestions to correct them. Our approach departs from action- or trajectory-level interventions by operating over a structured knowledge representation to localize and correct the sources of erroneous behavior. Across three long-horizon tasks with diverse misconceptions and corresponding behaviors, SENSEI demonstrates zero-shot compositional generalization, disentangling multiple overlapping misconceptions despite training only on single-misconception cases. A user study further shows that our method identifies real human misconceptions and provides effective guidance that improves long-horizon task performance, successfully correcting 90% of student misconceptions.

Abstract:
Vector quantization is a fundamental tool for compressing highdimensional embeddings, yet existing multi-codebook methods rely on static codebooks that limit expressiveness under heterogeneous data geometry. While recent dynamic quantizers like QINCo adapt codebooks to individual inputs and improve expressiveness, their strict sequential dependencies create decoding bottlenecks. We propose Residual Quantization via Mixture of Experts (RQ-MoE), a framework combining a two-level MoE with dual-stream quantization to enable input-dependent codebook adaptation for efficient vector quantization. RQ-MoE enables dynamic codebook construction and decouples instruction from quantization, facilitating parallel decoding. Theoretically, we show that standard Residual Quantization and prior dynamic methods are degenerate cases of RQ-MoE, and derive a guideline for setting expert dimensionality in RQ-MoE. Extensive experiments show that RQ-MoE achieves state-of-the-art or on-par performance in reconstruction and retrieval, while it can provide 6×–14× faster decoding than prior vector quantization methods. We anonymously provide the implementation at https://anonymous.4open.science/r/RQ-MoE.

Abstract:
Autoregressive learning of timestepping operators offers an effective approach to data-driven PDE simulation on grids. For conservation laws, however, long-horizon rollouts are often destabilized when learned updates violate global conservation and, in many applications, additional state bounds—such as nonnegative mass and densities or concentrations constrained to [0,1]. Enforcing these coupled constraints via direct next-state regression remains difficult. We introduce a framework for learning conservative transport operators on regular grids, inspired by lattice Boltzmann–style discrete-velocity transport representations. Instead of predicting the next state, the model outputs local transport operators that update cells through neighborhood exchanges, guaranteeing discrete conservation by construction. For bounded quantities, we parameterize transport within a capacity-constrained feasible set, enforcing bounds structurally rather than by post-hoc clipping. We validate FluxNet on convection-diffusion, shallow water equations, traffic flow, and spinodal decomposition. Experiments on shallow-water equations and traffic flow show improved rollout stability and physical consistency over strong baselines. On phase-field spinodal decomposition, the method enables large time-steps with long-range transport, accelerating simulation while preserving microstructure evolution in both pointwise and statistical measures.

Abstract:
Large VisionLanguage Models (LVLMs) extend large language models with visual understanding, but remain vulnerable to hallucination, where outputs are fluent yet inconsistent with images. Recent studies link this issue to language bias—the tendency of LVLMs to over-rely on text while neglecting visual inputs. Yet most analyses remain empirical without uncovering its underlying cause. In this paper, we provide a systematic study of language bias and identify its root in modality misalignment during training. Our analysis shows that both Visual Instruction Tuning (VIT) and Direct Preference Optimization (DPO) often prioritize textual improvements, which may cause LVLMs to overly lean toward language modeling rather than balanced multimodal understanding. To address this, we propose two simple yet effective methods: Language Bias Regularization (LBR), which mitigates language bias through regularization during instruction tuning, and Language Bias Penalty (LBP), which penalizes language bias in the DPO training process. Extensive experiments across diverse models and benchmarks demonstrate the effectiveness of our approach. LBR consistently improves performance on over ten general benchmarks, while LBP significantly reduces hallucination and improves trustworthiness. Together, these methods not only mitigate language bias but also advance the overall alignment of LVLMs, all without introducing any additional data or auxiliary models.

Abstract:
Large Language Models exhibit paradoxical fragility in fundamental arithmetic, implying a disconnect between internal computation and discrete output. By analyzing the residual stream geometry during multioperand addition, we identify theIso-Raw-Sum Trajectory (IRST), a topological manifold where representations are anchored by semantic digits and modulated by continuous carry fibers. We propose theNoisy Quantization Model, which frames arithmetic errors astopological slippagescaused by internal neural noise pushing a continuous, latentcarry potentialacross quantization thresholds. This geometric framework further elucidatesprobe versatility, explaining how lightweight probes can disentangle conflicting latent signals (such as ground truth versus hallucination) from a single activation vector. Finally, we validate these insights through a geometric consistency check method that effectively detects and corrects these quantization failures during inference.

Abstract:
Multimodal Large Language Models (MLLMs) have revolutionized GUI automation, yet their efficacy is largely established on idealized, singlelayer interfaces. This paper identifies a critical reliability gap: state-of-the-art agents face distinct robustness challenges in real-world desktop environments characterized by multi-window stacking, occlusion, and visual clutter. To address this, we introduce WinDeskGround, a novel benchmark and synthesis framework tailored for evaluating GUI grounding robustness. Unlike static datasets, our framework parametrically generates complex desktop scenarios by controlling window occlusion, layout density, and semantic similarity, thereby simulating the distribution shifts of authentic workflows. We construct a diverse meta-dataset of 1,356 high-fidelity instruction-target pairs and conduct comprehensive evaluations of five leading MLLMs. Our results demonstrate that while top-tier agents excel in simplified settings, their accuracy declines under partial occlusion. WinDeskGround provides a valuable benchmark to facilitate the assessment and advancement of GUI agent robustness in realistic environments.

Abstract:
Synthetic data generation is increasingly used in machine learning fortraining and data augmentation. Yet, many current strategies rely on external foundation models or datasets, which can be restricted by policy or legal constraints, especially for sensitive modalities such as human face images and videos. We proposeScoreMix, aselfcontained data augmentationmethod to boost recognition performance by leveraging score compositionality in class-conditioned diffusion models. ScoreMix mixes class-conditioned scores along reverse diffusion trajectories, yielding domain-specific hard augmentations without external resources. We systematically study class-selection strategies and find that mixing classes that are distant in the discriminator embedding space yields larger gains, providingup to 3% additional average improvement across benchmarksover proximity-based selection. Interestingly, we observe that learned condition and embedding spaces are largely uncorrelated under standard alignment metrics, and that condition-space distances are weakly correlated to downstream gains. Across8 public face recognition benchmarks, ScoreMix improves accuracy byup to 7 percentage pointswithout hyperparameter search, highlighting robustness and practicality. Code and dataset will be made publicly available.

Abstract:
Tandem mass spectrometry is prominent in scientific discovery workflows for identifying unknown small molecules, yet highthroughput structural elucidation remains challenging. While recent autoregressive and graph diffusion models have shown promise inde novoelucidation, performance remains limited by poor scalability during both training and inference time. In this work, we present FRIGID, a framework with a novel diffusion language model that generates molecular structures conditioned on mass spectra via intermediate fingerprint representations and determined chemical formulae, training at the scale of hundreds of millions of unlabeled structures. We then demonstrate how forward fragmentation models enable inference-time scaling by identifying spectrum-inconsistent fragments and refining them through targeted remasking and denoising. While FRIGID already achieves strong performance with its diffusion base, inference-time scaling significantly improves its accuracy, surpassing 15% Top-1 accuracy on the challenging MassSpecGym benchmark and more than doubling the Top-1 accuracy of the leading methods on NPLIB1. Further empirical analyses show that FRIGID exhibits log-linear performance scaling with increasing inference-time compute, opening a promising new direction for continued improvements inde novostructural elucidation.

Abstract:
LLM parameter editing methods commonly rely on computing an ideal target hiddenstate at a target layer (referred as anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreading) for cooperative editing. Although widely used for a long time, its underlying basis have not been systematically investigated. In this paper, we first conduct a systematic study of its foundations, which helps clarify its capability boundaries, practical considerations, and potential failure modes. Then, we propose a simple and elegant alternative that replaces backward spreading with forward-propagation. Instead of optimizing the target at the last editing layer, we optimize the anchor point at the first editing layer, and then propagate it forward to obtain accurate and mutually compatible target hidden-states for all subsequent editing layers. This approach achieves the same computational complexity as existing methods while producing more accurate layer-wise targets. Our method is simple, without interfering with either the computation of the initial target hidden state or any other components of the subsequent editing pipeline, and thus constituting a benefit for a wide range of LLM parameter editing methods.

Abstract:
LLMbased agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing in multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16.1% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions.

Abstract:
Foundation models have shown great promise in achieving zeroshot or few-shot forecasting for ODE-based chaotic systems via large-scale pretraining. However, existing architectures often fail to capture the multi-scale temporal structures and distinct spectral characteristics of chaotic dynamics. To address this, we introduce ChaosNexus, a foundation model for chaotic system forecasting underpinned by the proposed ScaleFormer architecture. By processing temporal contexts across hierarchically varying patch sizes, ChaosNexus effectively captures long-range dependencies and preserves high-frequency fluctuations. To address heterogeneity across distinct systems, we integrate Mixture-of-Experts (MoE) layers into each ScaleFormer block and explicitly condition the final forecasts on a learned frequency fingerprint, providing the model with a global spectral view of the system. Extensive evaluations on over 9,000 synthetic systems demonstrate that ChaosNexus achieves superior fidelity in long-term attractor statistics while maintaining competitive point-wise accuracy. Furthermore, in real-world applications, it achieves a remarkable zero-shot mean error below 1°C for 5-day station-based weather forecasting. Codes are available at https://anonymous.4open.science/r/ChaosNexus-C809.

Abstract:
Deploying GRPO on Flow Matching models has proven effective for textto-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action’s ``pure" effect, and (ii) it identifies turning points—steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend—and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments also demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation.

Abstract:
Automated \enquoteLLMas-a-Judge frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. (Data in supplement).

Abstract:
Evaluating the robustness of Large VisionLanguage Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs.

Abstract:
Multimodal geometry reasoning requires models to jointly understand visual diagrams and perform structured symbolic inference, yet current visionlanguage models struggle with complex geometric constructions due to limited training data and weak visual--symbolic alignment. We propose a pipeline for synthesizing complex multimodal geometry problems from scratch and construct a dataset named GeoCode, which decouples problem generation into symbolic seed construction, grounded instantiation with verification, and code-based diagram rendering, ensuring consistency across structure, text, reasoning, and images. Leveraging the plotting code provided in GeoCode, we further introduce code prediction as an explicit alignment objective, transforming visual understanding into a supervised structured prediction task. GeoCode exhibits substantially higher structural complexity and reasoning difficulty than existing benchmarks, while maintaining mathematical correctness through multi-stage validation. Extensive experiments show that models trained on GeoCode achieve consistent improvements on multiple geometry benchmarks, demonstrating both the effectiveness of the dataset and the proposed alignment strategy. The code is available at \urlhttps://anonymous.4open.science/r/SGD-Z368/.

Abstract:
Current evaluations of spatial intelligence can be systematically invalid under modern visionlanguage model (VLM) settings. First, many benchmarks derive question-answer (QA) pairs from point-cloud-based 3D annotations originally curated for traditional 3D perception. When such annotations are treated as ground truth for video-based evaluation, reconstruction and annotation artifacts can miss objects that are clearly visible in the video, mislabel object identities, or corrupt geometry-dependent answers (e.g., size), yielding incorrect or ambiguous QA pairs. Second, evaluations often assume full-scene access, while many VLMs operate on sparsely sampled frames (e.g., 16-64), making many questions effectively unanswerable under the actual model inputs. We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model's actual inputs. To this end, we re-annotate object labels and geometry across 413 scenes from 5 datasets to improve data quality, and regenerate all QA pairs with rigorous bias mitigation and human verification using professional 3D visualization and annotation tools. We further enhance evaluation controllability by providing variants across multiple frame budgets (16/32/64/all) and fine-grained object visibility metadata, enabling controlled diagnostic analyses. Evaluations of general and domain-specific VLMs on ReVSI reveal systematic failure modes that are obscured by prior benchmarks, yielding a more reliable and diagnostic assessment of spatial intelligence.

Abstract:
Instruction finetuning of large language models (LLMs) often involves selecting a subset of instruction training data from a large candidate pool, using a small query set from the target task. Despite growing interest, the literature on targeted instruction selection remains fragmented and opaque: methods vary widely in selection budgets, often omit zero-shot baselines, and frequently entangle the contributions of key components. As a result, practitioners lack actionable guidance on selecting instructions for their target tasks. In this work, we aim to bring clarity to this landscape by disentangling and systematically analyzing the two core ingredients: data representation and selection algorithms. Our framework enables controlled comparisons across models, tasks, and budgets. We find that only gradient-based data representations choose subsets whose similarity to the query consistently predicts performance across datasets and models. While no single method dominates, gradient-based representations paired with a greedy round-robin selection algorithm tend to perform best on average at low budgets, but these benefits diminish at larger budgets. Finally, we unify several existing selection algorithms as forms of approximate distance minimization between the selected subset and the query set, and support this view with new generalization bounds. More broadly, our findings provide critical insights and a foundation for more principled data selection in LLM fine-tuning.

Abstract:
Reinforcement Learning (RL) with rubricbased rewards has recently shown remarkable progress in enhancing general reasoning capabilities of Large Language Models (LLMs), yet still suffers from ineffective exploration confined to current policy distribution. In fact, RL optimization can be viewed as steering the policy toward an ideal distribution that maximizes the rewards, while effective exploration should align efforts with desired target. Leveraging this insight, we propose HeRL, a Hindsightexperience guidedReinforcementLearningframework to bootstrap effective exploration by explicitlytelling LLMs the desired behaviorsspecified in rewards. Concretely, HeRL treats failed attempts along with their unmet rubrics as hindsight experience, which serves as in-context guidance for the policy to explore desired responses beyond its current distribution. Additionally, we introduce a bonus reward to incentivize responses with greater potential for improvement under such guidance. HeRL facilitates effective learning from desired high-quality samples without repeated trial-and-error from scratch, yielding a more accurate estimation of the expected gradient theoretically. Extensive experiments across various benchmarks demonstrate that HeRL achieves superior performance gains over baselines, and can further benefit from experience guided self-improvement at test time.

Abstract:
Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 verified examples, spanning 10 paralinguistic tasks, created with controlled speech synthesis to intentionally mismatch transcript claims and speaking style, enabling direct measurement of speech paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on acoustic ground truth and a strong tendency to follow languageimplied (incorrect) answers. To understand the cause of this gap, we perform layer wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder–LLM interface, and (ii) even when such cues are available in audio tokens, the language model frequently ignores them. To address these problems, we propose Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. These methods substantially improve Audio LLM paralinguistic understanding, improving Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox, and from 37.74% to 54.78% on MMSU paralinguistic subset.

Abstract:
Knowledge distillation (KD) transfers knowledge from a large teacher model to a smaller student. In language modeling, the student is trained either on tokens sampled from the teacher (hard labels) or the teacher’s full nexttoken distribution (soft labels). Despite soft labels appear strictly richer, we find that mixing hard and soft labels consistently yields better results. Crucially, we show that this gain cannot be explained by closer teacher matching during training. Instead, it comes from reduced exposure bias---the mismatch between training and inference distributions. To explain this phenomenon, we introduce the Bridge–Garden Decomposition theory, which categorizes generation steps into two types: Bridges, where the next token must be exact, and Gardens, where it can be flexible. We show that hard-only KD excels in Bridges by avoiding risky deviations, while soft-only KD preserves diversity in Gardens. A hybrid strategy handles both cases and, as a result, reduces exposure bias across the sequence. Guided by this theory, we develop a family of Bridge--Garden hybrid supervision methods that adaptively balance hard and soft labels. Across seven teacher--student pairs (including Qwen, Llama, Gemma, and DeepSeek) and benchmarks in reasoning and coding, our approach outperforms divergence-based and on-policy KD baselines while reducing training cost by 9.7×, enabling efficient model compression.

Abstract:
For reinforcement learning in datascarce domains like real-world robotics, intensive data reuse enhances efficiency but induces overfitting. While prior works focus on critic bias, representation-level instability in Self-Predictive Learning (SPL) under high Update-to-Data (UTD) regimes remains underexplored. To bridge this gap, we propose Robust Representation via Redundancy Reduction (R2R2), a regularization method within SPL. We theoretically identify that standard zero-centering conflicts with SPL's spectral properties and design a non-centered objective accordingly. We verify R2R2 on SPL-native algorithms like TD7. Furthermore, to demonstrate its orthogonality to prior advancements, we extend the state-of-the-art SimbaV2—which originally lacks SPL—by integrating a tailored SPL module, termed SimbaV2-SPL. Experiments across 11 continuous control tasks confirm that R2R2 effectively mitigates overfitting; specifically, at a UTD ratio of 20, it improves TD7 by ～22% and provides additional gains on top of SimbaV2-SPL, which itself establishes a new state-of-the-art.

Abstract:
Multistep retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Although many RAG systems incorporate a working memory to consolidate information, existing designs primarily function as a passive storage for isolated facts. This static nature overlooks crucial high-order correlations among primitive facts, thereby limiting the capacity for multi-step reasoning and resulting in fragmented reasoning and weak global sense-making within extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph where hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving the memory into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning. We evaluate HGMem on several challenging global sense-making benchmarks. Extensive experiments and in-depth analyses demonstrate that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse datasets.

Abstract:
Multimodal hallucination remains a persistent challenge for VisionLanguage Models (VLMs). Standard textual Direct Preference Optimization (DPO) often fails to mitigate it due to a lack of explicit visual supervision. While existing works introduce visual preference DPO by contrasting original images against negative ones, they suffer from a theoretically inconsistent objective caused by partition function mismatches and relies on coarse-grained negatives that could enable shortcut learning. In this work, we propose In-Context Visual Contrastive Optimization (IC-VCO). By placing contrastive images within a shared multi-image context, IC-VCO ensures a mathematically rigorous objective. Furthermore, we introduce Visual Contrast Distillation (VCDist), a mechanism which transfers the superior discriminatory power of the multi-image context to the single-image policy via reliability-gated self-distillation, enforcing consistent visual grounding. Finally, we propose a contrastive sample editing strategy that generates hard negatives via precise semantic perturbations. Experiments on five benchmarks demonstrate IC-VCO's superior performance and the effectiveness of our sample editing strategy.

Abstract:
Largescale text-to-image (T2I) diffusion models have enabled unprecedented creative applications, but their unauthorized use and reproduction have raised serious intellectual property concerns, making model ownership verification (MOV) increasingly critical. We find that existing backdoor-based diffusion watermarking methods often (implicitly) assume a `faithful' verification process, namely, that the verifier can query a suspicious model and obtain the faithful watermark response to complete MOV. However, in practice, adversaries may intentionally or unintentionally damage potential watermark signals, significantly degrading verification reliability. To address this issue, we propose Cert-LAS, the first certified MOV method for T2I models based on layer-adaptive smoothing. In general, Cert-LAS embeds specified watermarks using diffusion classifiers and an LFS-guided layer-wise noise allocation, and verifies ownership by examining whether the suspected model exhibits significantly stronger watermark responses compared to unwatermarked references through hypothesis testing. We further prove that, under certain conditions, our Cert-LAS can still achieve reliable verification even in the presence of malicious removal attacks. Extensive experiments validate the effectiveness of Cert-LAS and its resistance to adaptive attacks.

Abstract:
We characterize the presoftmax attention matrix \mathbfQK^\top in transformers as an associative memory matrix encoding pairwise associations between input features. By decomposing this matrix into its symmetric and skew-symmetric parts, we interpret the symmetric component as governing the structure of the energy landscape, and the skew-symmetric component as driving circulation on that landscape. Leveraging the energy formulation induced by the symmetric component, we derive Hopfield-style stability measures that quantify the stability of retrieved features. Empirically, we observe meaningful correlations between Hopfield-style stability measures and the fidelity-diversity trade-offs in generation. Finally, we propose a controllable knob to modulate this trade-off by directly modifying the circulation of the underlying dynamics.

Abstract:
Recent advances have demonstrated that coded aperture snapshot spectral imaging (CASSI) systems show great potential for capturing 3D hyperspectral images (HSIs) from a single 2D measurement. Despite the inherent spectral continuity of scenes captured by CASSI, most existing reconstruction methods are restricted to fixed, discrete spectral outputs, thereby precluding continuous spectral reconstruction or spectral superresolution. To address this challenge, we propose Phy-CoSF, which synergizes deep unfolding networks with implicit neural representations, establishing a new paradigm for continuous spectral reconstruction and super-resolution in CASSI. Specifically, we propose a two-phase architecture that bridges discrete-wavelength training with continuous spectral rendering, enabling the synthesis of high-fidelity HSIs at arbitrary target wavelengths. At the core of our framework lies the continuous spectral fields (CoSF) module, embedded within each unfolding stage as a dynamic prior, which comprises a triple-branch cross-domain feature mixer for comprehensive spatial–frequency–channel feature fusion, alongside a spectral synthesis head that generates spectral intensities by querying continuous wavelength coordinates. Extensive experimental results demonstrate that Phy-CoSF not only achieves continuous modeling at arbitrary spectral resolutions but also outperforms many state-of-the-art methods in both reconstruction fidelity and spectral detail preservation.

Abstract:
Textual Gradientstyle optimizers (TextGrad) enable gradient-like feedback propagation through compound AI systems. However, they do not work well for deep chains. The root cause of this limitation stems from theSemantic Entanglementproblem in these extended workflows. In standard textual backpropagation, feedback signals mix local critiques with upstream contexts, leading toAttribution Ambiguity. To address this challenge, we propose TextResNet, a framework that reformulates the optimization process to achieve precise signal routing via four key innovations. Firstly, in the forward pass, it enforces Additive Semantic Deltas to preserve an Identity Highway for gradient flow. Secondly, in the backward pass, it introduces Semantic Gradient Decomposition via a Semantic Projector to disentangle feedback into causally independent subspaces. Thirdly, it implements Causal Routing, which routes projected signals to their specific components. Finally, it performs Density-Aware Optimization Scheduling to leverage the disentangled signals to dynamically allocate resources to key system bottlenecks. Our results show that TextResNet not only achieves superior performance compared to TextGrad, but also exhibits remarkable stability for agentic tasks in compound AI systems where baselines collapse. The code will be made public after the review.

Abstract:
Discrete diffusion language models (dLLMs) offer a fast and flexible alternative to autoregressive models (ARMs) for discrete sequence generation by performing iterative denoising with parallel updates. Despite these advantages, dLLMs are commonly evaluated using metrics developed for ARMs. Such evaluations rely on metrics computed from final generated samples and conflate model approximation error in the learned denoiser and samplerinduced error from the sampling dynamics. We introduce a sampler-centric oracle evaluation framework that replaces learned denoisers with an oracle Hidden Markov Model posterior derived from a ground-truth Markov chain, enabling isolation of sampler-induced error under controlled and method-consistent settings. We show that few-step discrete diffusion samplers are not distributionally correct, even under an exact oracle denoiser, with substantial distributional mismatch at the transition level persisting at small step counts and vanishing only when the number of diffusion steps approaches the sequence length. We also find that current metrics for evaluating dLLMs are insufficient: improvements in negative log-likelihood, generative perplexity, or MAUVE do not imply correct sampling.

Abstract:
We propose SHINE (Scalable Hyper Incontext NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLM). By reusing the frozen LLM's own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at https://anonymous.4open.science/r/metalora-734E

Abstract:
Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KLregularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL-\chi^2 regularizer. This additional \chi^2 regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs.

Abstract:
Modern Latent Diffusion Models (LDMs) typically operate in lowlevel Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder’s inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic–pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x spatial downsampling). This design allows the latent space to remain semantically rich while achieving state-of-the-art image reconstruction, and remains compact enough for accurate generation. Leveraging this representation, we design a unified text-to-image (T2I) and image editing model. Across diverse generation spaces, our approach achieves state-of-the-art reconstruction, faster convergence, and substantial gains in both T2I and editing tasks, demonstrating that representation encoders can be effectively adapted into robust generative components. An illustrative code example is provided in the supplementary material.

Abstract:
Proximal constraints are fundamental to the stability of the Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of lowprobability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.

Abstract:
Although computeruse agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state-of-the-art CUAs such as Claude 4.5 Haiku and Opus. We further evaluate the transferability of human-verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer-use settings.

Abstract:
Soft labels from teacher models are a de facto practice for knowledge transfer and largescale dataset distillation (e.g., SRe^2L, RDED, LPLD). However, when we limit the number of crops per image to reduce the substantial cost of storing precomputed soft labels, these methods suffer severely from local semantic drift: visually ambiguous crops can cause soft supervision to deviate from the image-level ground-truth semantics, leading to systematic errors and a train–test distribution mismatch. We revisit the overlooked role of hard labels and show that, when properly integrated, they act as a content-agnostic semantic anchor that calibrates such drift. We theoretically analyze the emergence of drift under sparse soft-label supervision and demonstrate that hybridizing hard and soft labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which uses hard labels as intermediate corrective signals while preserving the fine-grained benefits of soft labels. Extensive experiments on dataset distillation and large-scale classification benchmarks show consistent generalization improvements. On ImageNet-1K, our method achieves 42.7% accuracy with only 285M soft-label storage (reduces by \bf 100×), outperforming prior state-of-the-art LPLD by 9.0%.

Abstract:
Recently, the prominent performance of large language models (LLMs) has been largely driven by multitask instruct-tuning. Unfortunately, this training paradigm suffers from a key issue, named cross-task interference, due to conflicting gradients over shared parameters among different tasks. Some previous methods mitigate this issue by isolating task-specific parameters, e.g., task-specific neuron selection and mixture-of-experts. In this paper, we empirically reveal that the cross-task interference still exists for the existing solutions because of many parameters also shared by different tasks, and accordingly, we propose a novel solution, namely Basic Abilities Decomposition for multi-task Instruct-Tuning (BADIT). Specifically, we empirically find that certain parameters are consistently co-activated, and that co-activated parameters naturally organize into base groups. This motivates us to analogize that LLMs encode several orthogonal basic abilities, and that any task can be represented as a linear combination of these abilities. Accordingly, we propose BADIT that decomposes LLM parameters into orthogonal high-singular-value LoRA experts representing basic abilities, and dynamically enforces their orthogonality during training via spherical clustering of rank-1 components. We conduct extensive experiments on the SuperNI benchmark with 6 LLMs, and empirical results demonstrate that BADIT can outperform SOTA methods and mitigate the degree of cross-task interference.

Abstract:
In large visionlanguage models (LVLMs), visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in-depth analysis of CLIP, we observe that visual tokens located within referent regions often exhibit low similarity to the textual representation. Motivated by this insight, we introduce LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP's visual-text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear foreground-background separation. Extensive experiments demonstrate that LiteLVLM significantly outperforms existing methods by over 5% across diverse token budgets. Without any training or fine-tuning, LiteLVLM maintains 90% of the original performance with a 22% speedup and a 2.3× memory reduction. The code will be publicly released after publication.

Abstract:
Medical time series are central to healthcare, enabling continuous monitoring and supporting timely clinical decisions. Despite recent progress, existing methods struggle to jointly model localglobal dynamics and handle nonstationarities like baseline drift, while often failing to capture latent channel interactions. To address these challenges, we proposeMedMamba, an end-to-end architecture that integrates state space models with domain-specific inductive biases. Specifically, MedMamba first employs multi-scale convolutional embeddings to capture discriminative local morphology. Second, to mitigate nonstationarity, we introduce a tri-branch differential state space encoder that processes raw, temporal-difference, and frequency-domain views, fusing them to emphasize informative patterns while suppressing drift. Furthermore, to uncover latent channel correlations, we design a spatial graph Mamba module that learns a directed dependency structure regularized toward sparsity and acyclicity, which obviates the need for predefined graphs. Extensive experiments on five real-world datasets demonstrate that MedMamba achieves state-of-the-art performance while maintaining linear computational complexity, and ablation studies validate each component's contribution. Anonymous code is available at https://anonymous.4open.science/r/MedMamba-30DB.

Abstract:
We investigate how to make small tabular foundation models effective for HighDimensional, Low-Sample Size (HDLSS) tabular prediction without retraining large backbones. We introduce Graph-guided Ordering with Local Refinement (GO-LR), show its equivalence to weighted Minimum Linear Arrangement, and interpret the practical solver as a TSP-path-style surrogate. We propose GOTabPFN,which builds on GO-LR, and a Neuro-Inspired Subunit Compression (NSC) unit to pool locally adjacent ordered features into meta-features, yielding a compact representation that makes TabPFN-style prediction practical in HDLSS regimes. Across tabular benchmarks, GOTabPFN improves stability and accuracy under tight token budgets.

Abstract:
While multitask learning based automatic radiology report generation (RRG) is widely adopted to ensure clinical consistency, most focus on architectural designs yet remain limited to coarse linear scalarization strategies. These strategies can not effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics for the first time, utilizing the stochastic differential equation (SDE) framework to formalize it as a "Double Dilemma'' of drift term deviation and diffusion term decay. Based on this, we propose a backbone-agnostic optimizer named Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, the algorithm not only ensures geometric validity, but also avoids the local optimal solution. Then, the adaptive gradient fusion technology is used to establish a dynamic balance between the theoretical optimal direction and the task-specific inductive bias. Experiments show that as a universal plug-and-play optimizer, CAME-Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray. Our code is available at https://anonymous.4open.science/r/CAME-Grad-04A7.

Abstract:
Nonstationary time series forecasting is challenged by evolving distribution shifts that static models struggle to capture. While Mixture-of-Experts (MoE) architectures offer a promising paradigm for decoupling complex drift patterns, existing approaches are limited by fixed expert pools and memoryless routing, hampering their ability to adapt to abrupt regime shifts. To address this, we proposeDynamic TMoE, a framework that unifies architectural evolution with temporal continuity during learning phase. By detecting distribution shifts via Maximum Mean Discrepancy (MMD), we dynamically instantiate heterogeneous experts and prune redundant ones to optimize capacity. Additionally, a temporal memory router leverages recurrent states and an anomaly repository to ensure stable, context-aware expert selection without requiring test-time updates. Experiments on nine benchmarks demonstrate state-of-the-art performance, reducing MSE by 10.4% and MAE by 7.8%. Code is available at https://anonymous.4open.science/r/Dynamic-TMoE.

Abstract:
Toolaugmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting the necessity of invoking tools. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations largely increase reasoning overhead and even mislead model predictions. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query. Within a reinforcement learning framework, we design an explicit dual-mode reasoning strategy with mode-specific reward functions to guide the model toward producing accurate responses. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool-assisted and text-centric reasoning throughout training, and promotes free exploration in later stages. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21.8% accuracy gain on V benchmark compared to the base model, and a 44.9% improvement in efficiency over existing tool-augmented methods on POPE benchmark. Code is available in the supplementary material.

Abstract:
Physicsinformed diffusion models typically impose PDE constraints only on the final output, leaving intermediate features unconstrained. This can enable shortcut solutions that fit training statistics yet generalize poorly under shifted boundary conditions. We introduce REPA-P, a \emphteacher-free physics-informed representation alignment framework that uses first-principles residuals as supervision. REPA-P attaches lightweight projection heads to a few early/mid layers of a diffusion backbone, decodes hidden activations into physical states, and applies PDE and boundary-condition residual losses to these intermediate predictions during training. The heads are discarded at inference, preserving the original architecture and sampling cost. Across three 2D scientific field benchmarks (Darcy flow, topology optimization, and Electrostatic Charge Potential), REPA-P accelerates convergence, reduces physics residuals by up to 80%, and improves out-of-distribution robustness to boundary-condition shifts while maintaining generation quality with zero inference overhead. Ablations show that supervising only a small set of intermediate layers captures most gains and complements output-level physics losses.

Abstract:
Multimodal large language models (MLLMs) have shown strong potential for building embodied agents, yet embodied geolocalization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a large-scale benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressively interactive settings—single-view, multi-view, and embodied-view—where agents actively acquire observations through sequential viewpoint changes. The benchmark comprises 2,207 globally distributed street-view panoramas and assesses four core capability dimensions: foundational perception, spatial awareness, commonsense reasoning, and geo-localization. Extensive evaluations of leading proprietary and open-source MLLMs reveal that while current models perform well in high-level semantic geo-localization, they struggle with low-level perceptual operations and maintaining spatial consistency across views. Notably, geo-localization performance exhibits a strong positive correlation with the other three capability dimensions, indicating that accurate localization emerges from robust perception, coherent spatial reasoning, and sound commonsense understanding. Overall, ERGeoBench provides a unified and diagnostic framework for advancing human-like embodied geo-localization.

Abstract:
Determining an effective data mixture is a key factor in Large Language Model (LLM) pretraining, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research.

Abstract:
Vision tooluse reinforcement learning (RL) can equip vision--language models with visual operators such as crop-and-zoom and achieves strong performance gains, yet it remains unclear whether these gains are driven by improvements in tool use or evolving intrinsic capabilities. We introduce MED (Measure--Explain--Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool schema interference) and yields limited progress in tool-based correction of intrinsic failures.Overall, current vision tool-use RL learns to coexist safely with tools rather than master them.

Abstract:
Adaptive methods like Adam have become thede factostandard for largescale vector and Euclidean optimization due to their coordinate-wise adaptation with a second-order nature. More recently, matrix-based spectral optimizers like Muon (Jordan et al., 2024b) show the power of treating weight matrices as matrices rather than long vectors. Linking these is hard because many natural generalizations are not feasible to implement, and we also cannot simply move the Adam adaptation to the matrix spectrum. To address this, we reformulate the AdaGrad update and decompose it into a variance adaptation term and a scale-invariant term. This decoupling producesDeVA(DecoupledVarianceAdaptation), a framework that bridges between vector-based variance adaptation and matrix spectral optimization, enabling a seamless transition from Adam to adaptive spectral descent. Extensive experiments across language modeling and image classification demonstrate that DeVA consistently outperforms state-of-the-art methods such as Muon and SOAP (Vyas et al., 2024), reducing token usage by around 6.6%. Theoretically, we show that the variance adaptation term effectively improves the blockwise smoothness, facilitating faster convergence.

Abstract:
Managing agent thought and observation during multiturn agent-environment interactions is an emerging strategy to improve agent efficiency. However, existing studies treat the entire interaction trajectories equally, overlooking the thought necessity and observation utility varies across turns. To this end, we first conduct quantitative investigations into how thought and observation affect agent effectiveness and efficiency. Based on our findings, we propose Agent-Omit, a unified training framework that empowers LLM agents to adaptively omit redundant thoughts and observations. Specifically, we first synthesize a small amount of cold-start data, including both single-turn and multi-turn omission scenarios, to fine-tune the agent for omission behaviors. Furthermore, we introduce an omit-aware agentic reinforcement learning approach, incorporating a dual sampling mechanism and a tailored omission reward to incentivize the agent's adaptive omission capability. Theoretically, we prove that the deviation of our omission policy is upper-bounded by KL-divergence. Experimental results on five agent benchmarks show that our constructed Agent-Omit-8B could obtain performance comparable to seven frontier LLM agent, and achieve the best effectiveness-efficiency trade-off than seven efficient LLM agents methods. Our code and data are avaliable at https://anonymous.4open.science/r/Agent-Omit/

Abstract:
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails. Existing approaches overwhelmingly operate within the promptoptimization paradigm; the resulting prompts typically retain malicious semantic signals that modern guardrails are primed to detect. In contrast, we identify a deeper vulnerability stemming from the highly interconnected nature of an LLM’s internal knowledge. This structure allows harmful objectives to be realized by weaving together sequences of benign sub-queries, each of which individually evades detection. To exploit this loophole, we introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model’s knowledge base. The CKA-Agent issues locally innocuous queries, uses model responses to guide exploration across multiple paths, and ultimately assembles the aggregated information to achieve the original harmful objective. Evaluated across SOTA commercial LLMs, CKA-Agent consistently achieves over 95% success rates even against strong guardrails, underscoring the severity of this vulnerability and the urgent need for defenses against such knowledge-decomposition attacks. Our codes are available at https://anonymous.4open.science/r/CKA-Agent-ICML

Abstract:
Multimodal understanding of advertising videos is essential for interpreting the intricate relationship between visual storytelling and abstract persuasion strategies. However, despite excelling at general search, existing agents often struggle to bridge the cognitive gap between pixellevel perception and high-level marketing logic. To address this challenge, we introduceAD-MIR, a framework designed to decode advertising intent via a two-stage architecture. First, in theStructure-Aware Memory Constructionphase, the system converts raw video into a structured database by integrating semantic retrieval with exact keyword matching. This approach prioritizes fine-grained brand details (e.g., logos, on-screen text) while dynamically filtering out irrelevant background noise to isolate key protagonists. Second, theStructured Reasoning Agentmimics a marketing expert through an iterative inquiry loop, decomposing the narrative to deduce implicit persuasion tactics. Crucially, it employs an evidence-based self-correction mechanism that rigorously validates these insights against specific video frames, automatically backtracking when visual support is lacking. Evaluation on the AdsQA benchmark demonstrates that AD-MIR achieves state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict and 9.5% in relaxed accuracy. These results underscore that effective advertising understanding demands explicitly grounding abstract marketing strategies in pixel-level evidence, paving the way for next-generation agents capable of sophisticated persuasion analysis. The source code is included in the supplementary material.

Abstract:
We proposeDualOptim+, a novel optimization framework for improving machine unlearning in large language models. It introduces a base state to capture common representations shared by forgetting and retaining objectives and delta states to preserve objectivespecific residuals. This architecture allows the optimizer to adaptively bridge shared and decoupled states based on the directional conflict between forgetting and retaining gradients. We further introduce DualOptim+ 8bit, a quantized variant that reduces memory overhead without compromising performance. Extensive experiments across fictitious, real-world, and safety alignment tasks demonstrate that DualOptim+ consistently achieves a superior trade-off between forgetting efficacy and model utility.

Abstract:
A precise and comprehensive understanding of humanenvironment interactions in egocentric vision is essential for next-generation intelligent agents, such as assistive robotics. While existing multimodal large language models (MLLMs) support unified reasoning from scene-level analysis to instance-specific grounding, their accuracy and generalization remain limited. To this end, this paper introduces a novel Egocentric Analysis-guided RL-based method (EARL) that employs Group Relative Policy Optimization (GRPO) to enhance the interaction understanding of MLLMs in first-person vision. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the language answer and corresponding pixel-level grounding mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor from the first stage and treat it as a semantic prior, which is then integrated via a novel Analysis-guided Feature Synthesizer (AFS) to support query-oriented reasoning. Furthermore, to effectively guide policy optimization, we design a sophisticated, multi-faceted reward mechanism that incorporates format correctness, answer relevance, and grounding accuracy. Experimental results demonstrate that EARL achieves an impressive 65.48% cIoU on the Ego-IRGBench benchmark for pixel grounding, surpassing previous state-of-the-art RL-based methods by 8.37%. Superior performance in out-of-distribution evaluations further validates EARL's generalization capability.

Abstract:
DataFree Quantization (DFQ) addresses data security concerns by synthesizing fake samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism compared to classical convolutional operation. However, previous DFQ arts for ViTs often suffer from a distribution mismatch between synthetic samples and input distribution expected by quantized models Q, resulting into the suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism is predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q's outputs. To these ends, we incorporate differential entropy maximum over patch similarity of synthetic samples, which decouples informative regions from noisy background. To couple with varied Q, the informative regions are picked out to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. To further preserve mutual information between synthetic samples and updating Q, a periodic sample refreshing strategy comes up to endow MaskAQ with the capacity to continually adapt to the evolving state of Q throughout the training process. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks, with Top-1 accuracy gain of up to 3.1% on ImageNet. Our code is available in supplementary material package.

Abstract:
Sequential editing of structured knowledge in large language models allows targeted factual updates without retraining, yet existing methods often rely on complex regularization or constraint mechanisms whose necessity remains unclear. In this work, we systematically investigate the mechanisms underlying effective and stable sequential editing. Specifically, we first analyze the empirical success of AlphaEdit and establish, via a rigorous optimization analysis, the formal equivalence between onetime and sequential editing. Building on this insight, we generalize the equivalence to a broader class of editing objectives, demonstrating that stability emerges naturally from properly accounting for accumulated editing constraints, rather than from specialized regularization or null-space operations. We empirically confirm that many commonly used regularization strategies are unnecessary for reliable sequential updates. Furthermore, we extend our framework to handle conflicting edits, ensuring robust and consistent behavior under contradictory updates. Ultimately, our work provides Ariadne’s thread through the labyrinth of sequential editing, charting a path toward simpler, more interpretable, and dependable knowledge updates.

Abstract:
Time series forecasting has become increasingly critical in realworld scenarios, where future sequences are influenced not only by historical patterns but also by forthcoming events. In this context, forecasting must dynamically adapt to complex and stochastic future conditions, which introduces fundamental challenges in both forecasting and evaluation. Traditional methods typically rely on historical data or factual future conditions, while overlooking counterfactual scenarios. Furthermore, many existing approaches are restricted to simple structured conditions, limiting their ability to generalize to the real-world complexities. To address these gaps, we introduce the task of counterfactual time series forecasting with textual conditions, enabling more flexible and condition-aware forecasting. We propose a comprehensive evaluation framework that encompasses both factual and counterfactual settings, even in the absence of ground truth time series. Additionally, we present a novel text-attribution mechanism that distinguishes mutable from immutable factors, thereby improving forecast accuracy under sophisticated and stochastic textual conditions.

Abstract:
We study optimal learningrate selection in two-layer and three-layer linear neural networks trained to learn a single-index target function. In particular, we derive the exact closed-form expressions for the gradients and test loss after one and two steps of gradient descent, enabling a precise characterization of early training dynamics. We characterize how learning rates should scale under the gradient approximation in the first two steps, and prove that performing updates with this approximation yields a tractable surrogate loss with a tight, small approximation error. This formulation enables the theoretical analysis of layer-wise learning rates and reveals a distinct early-training regime: test loss can be minimized by unequal learning rates at the initial step, while equal learning rates become optimal in subsequent steps. Our numerical experiments validate these theoretical predictions and demonstrate the importance of balancing layer-wise learning-rate during early training.

Abstract:
Learning realworld dynamics from visual observations is crucial for various domains. A common strategy is to calibrate simulators by estimating physical parameters, yet accuracy is ultimately bounded by the underlying physical models, which often assume materials are homogeneous and isotropic. Even if reasonable, real-world objects typically exhibit mild anisotropy and heterogeneity. After the near-isotropic backbone is well calibrated, these residual effects become the key bottleneck for further closing the real-to-sim gap. Although neural networks can fit dynamics end-to-end, such black-box modeling discards strong physical priors, leading to poor data efficiency and overfitting. Therefore, we propose MoSA, a motion-constrained stress adaptation framework that targets these residual effects to further improve real-to-sim dynamics learning. MoSA uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity. It progressively adapts stresses via microplane-constrained redistribution in a physics-informed cascaded network. We further impose motion constraints by supervising temporal and spatial derivatives of the deformation field. Experimentally, our learned dynamics achieves superior accuracy, generalization, and robustness, while learning physically meaningful residual anisotropy. Finally, we validate MoSA in a robot manipulation setting, showing that better real-to-sim dynamics modeling translates into more reliable sim-to-real transfer.

Abstract:
Multimodal large language models (MLLMs) have achieved remarkable progress, yet the object hallucination remains a critical challenge for reliable deployment. In this paper, we present an indepth analysis of instruction token embeddings and reveal that they implicitly encode visual information while effectively filtering erroneous information introduced by misleading visual embeddings. Building on this insight, we propose the Instruction Lens Score (InsLen), which combines a Calibrated Local Score with a Context Consistency Score that measures context consistency of the object tokens. The proposed approach serves as a plug-and-play object hallucination detector without relying on auxiliary models or additional training. Extensive experiments across multiple benchmarks and diverse MLLM architectures demonstrate that InsLen consistently outperforms existing hallucination detection methods, highlighting its effectiveness and robustness. The code will be publicly available.

Abstract:
Large VisionLanguage Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are semantically coherent but inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process, leading to more hallucinations. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based onInter-LayerVisualAttentionDiscrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs across various model architectures. The code and data used have been submitted to the technique appendix.

Abstract:
In embodied visionlanguage decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs \& VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models’ inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose \it SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution.

Abstract:
Recent diffusion large language models (dLLMs) have demonstrated both effectiveness and efficiency in reasoning via a blockbased semi-autoregressive generation paradigm. Despite their progress, the fixed-size block generations remain a critical bottleneck for effective and coherent reasoning. (I) From a global perspective, different reasoning tasks would correspond to different optimal decoding block sizes, which makes a "one-size-fits-all" assumption ineffective. (II) Even within a single reasoning task, the rigid block partitioning would break the logical flow and reduce reasoning coherence. Through empirical observations, we reveal that, for block-wise entropy, incorrect reasoning exhibits a fluctuating and unsteady trend between blocks, while the correctly generated tasks follow a consistent descending paradigm. Therefore, this paper proposes b1, a novel post-training framework that learns dynamic-size reasoning blocks via a Monotonic Entropy Descent objective with reinforcement learning to enhance reasoning coherence. b1 integrates seamlessly as a plug-and-play module with existing dLLM's post-training algorithms. Extensive experiments across various reasoning benchmarks showcase b1's consistent improvement over fixed-size block baselines. Our code has been provided.

Abstract:
Offpolicy reinforcement learning suffers from extrapolation errors when a learned policy selects actions that are weakly supported in the replay buffer. In this study, we address this issue by drawing an analogy to static friction in classical mechanics. From this perspective, the replay buffer is represented as a smooth, low-dimensional action manifold, where the support directions correspond to the tangential component, while the normal component captures the dominant first-order extrapolation error. This decomposition reveals an intrinsic anisotropy in value sensitivity that naturally induces a stability condition analogous to a friction threshold. To mitigate deviations toward unsupported actions, we propose Frictional Q-Learning, an off-policy algorithm that encodes supported actions as tangent directions using a contrastive variational autoencoder. We further show that an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry assumptions. Empirical results on standard continuous-control benchmarks demonstrate robust, stable performance compared with existing baselines.

Abstract:
Alignment of large language models remains a central challenge in natural language processing. Preference optimization has emerged as a popular and effective method for improving alignment, typically through trainingtime or prompt-based interventions. In this paper, we introduce alignment-aware decoding (AAD), a method to enhance model alignment directly at inference. Theoretically, AAD can be interpreted as implicit reward optimization, yet it requires no specialized training beyond the standard DPO setup. Empirically, AAD consistently outperforms strong baselines across diverse alignment benchmarks and model scales. Moreover, in data-constrained settings, AAD can produce high-quality synthetic data to improve alignment under standard decoding, providing a practical solution when labeled data is limited.

Abstract:
Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a thirdparty student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduceantidistillation fingerprinting(ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building upon the gradient-based framework ofantidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K and OASST1 benchmarks demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility, even when the student model's architecture is unknown.

Abstract:
We propose a novel framework for measuring privacy from a Bayesian gametheoretic perspective. This framework enables the creation of new, purpose-driven privacy definitions that are rigorously justified, while also allowing for the assessment of existing privacy guarantees through game theory. We show that pure and probabilistic differential privacy are special cases of our framework, and provide new interpretations of the post-processing inequality in these settings. Further, we demonstrate that privacy guarantees can be established for deterministic algorithms, which are overlooked by current privacy standards.

Abstract:
Diffusion and flow models have become the dominant paradigm for generative modeling on Riemannian manifolds, with successful applications in protein backbone generation and DNA sequence design. However, these methods require tens to hundreds of neural network evaluations at inference time, which can become a computational bottleneck in largescale scientific sampling workflows. We introduce Riemannian MeanFlow (RMF), a framework for learning flow maps directly on manifolds, enabling high-quality generations with as few as one forward pass. We derive three equivalent characterizations of the manifold average velocity (Eulerian, Lagrangian, and semigroup identities), and analyze parameterizations and stabilization techniques to improve training on high-dimensional manifolds. In promoter DNA design and protein backbone generation settings, RMF achieves comparable sample quality to prior methods while requiring up to 10× fewer function evaluations. Finally, we show that few-step flow maps enable efficient reward-guided design through reward look-ahead, where terminal states can be predicted from intermediate steps at minimal additional cost.

Abstract:
We study the problem of neural logistic bandits, where the main task is to learn an unknown reward function within a logistic link function using a neural network. Existing approaches either exhibit unfavorable dependencies on \kappa, where 1/\kappa represents the minimum variance of reward distributions, or suffer from direct dependence on the feature dimension d, which can be huge in neural network–based settings. In this work, we introduce a novel Bernsteintype inequality for self-normalized vector-valued martingales that is designed to bypass a direct dependence on the ambient dimension. This lets us deduce a regret upper bound that grows with the effective dimension \widetilded, not the feature dimension, while keeping a minimal dependence on \kappa. Based on the concentration inequality, we propose two algorithms, NeuralLog-UCB-1 and NeuralLog-UCB-2, that guarantee regret upper bounds of order \widetildeO(\widetilded\sqrt\kappa T) and \widetildeO(\widetilded\sqrtT/\kappa), respectively, improving on the existing results. Lastly, we report numerical results on both synthetic and real datasets to validate our theoretical findings.

Abstract:
Performative predictions influence the very outcomes they aim to forecast. We study performative predictions that affect a sample (e.g., only existing users of an app) and/or the whole population (e.g., all potential app users). This raises the question of how well models generalize under performativity. For example, how well can we draw insights about new app users based on existing users when both of them react to the app's predictions? We address this question by embedding performative predictions into statistical learning theory. We prove generalization bounds under performative effects on the sample, on the population, and on both. A key intuition behind our proofs is that in the worst case, the population negates predictions, while the sample deceptively fulfills them. We cast such selfnegating and self-fulfilling predictions as min-max and min-min risk functionals in Wasserstein space, respectively. Our analysis reveals both a fundamental trade-off between performatively changing the world and learning from it, as well as a surprising insight on how to improve generalization guarantees by retraining on performatively distorted samples. We illustrate our bounds using real data on prediction-informed assignments to job trainings.

Abstract:
Shapley values are widely used for modelagnostic data valuation and feature attribution, yet they implicitly assume contributors are interchangeable. This can be problematic when contributors are dependent (e.g., reused/augmented data or causal feature orderings) or when contributions should be adjusted by factors such as trust or risk. We propose Priority-Aware Shapley Value (PASV), which incorporates both hard precedence constraints and soft, contributor-specific priority weights. PASV is applicable to general precedence structures, recovers precedence-only and weight-only Shapley variants as special cases, and is uniquely characterized by natural axioms. We develop an efficient adjacent-swap Metropolis–Hastings sampler for scalable Monte Carlo estimation and analyze limiting regimes induced by extreme priority weights. Experiments on data valuation (MNIST/CIFAR10) and feature attribution (Census Income) demonstrate more structure-faithful allocations and a practical sensitivity analysis via our proposed ``priority sweeping".

Abstract:
Sparse autoencoders (SAEs) are used to decompose neural network activations into humaninterpretable features. Typically, features learned by a single SAE are used for downstream applications. However, it has recently been shown that a single SAE captures only a limited subset of features that can be extracted from the activation space. Motivated by this limitation, we introduce and formalize SAE ensembles. Furthermore, we propose to ensemble multiple SAEs throughnaive baggingandboosting. In naive bagging, SAEs trained with different weight initializations are ensembled, whereas in boosting SAEs sequentially trained to minimize the residual error are ensembled. Theoretically, naive bagging and boosting are justified as approaches to reduce reconstruction error. Empirically, we evaluate our ensemble approaches with three settings of language models and SAE architectures. Our empirical results demonstrate that, compared to an expanded SAE that matches the number of features in the ensemble, ensembling SAEs improves the reconstruction of language model activations along with SAE stability. Additionally, on downstream tasks such as concept detection and spurious correlation removal, SAE ensembles achieve better performance, showing improved practical utility.

Abstract:
Current metalearning methods are constrained to narrow task distributions with fixed feature and label spaces, limiting applicability. Moreover, the current meta-learning literature uses key terms like “universal” and “general-purpose” inconsistently and lacks precise definitions, hindering comparability. We introduce a theoretical framework for meta-learning which formally defines practical universality and introduces a distinction between algorithm-explicit and algorithm-implicit learning, providing a principled vocabulary for reasoning about universal meta-learning methods. Guided by this framework, we present TAIL, a transformer-based algorithm-implicit meta-learner that functions across tasks with varying domains, modalities, and label configurations. TAIL features three innovations over prior transformer-based meta-learners: random projections for cross-modal feature encoding, random injection label embeddings that extrapolate to larger label spaces, and efficient inline query processing. TAIL achieves state-of-the-art performance on standard few-shot benchmarks while generalizing to unseen domains. Unlike other meta-learning methods, it also generalizes to unseen modalities, solving text classification tasks despite training exclusively on images, handles tasks with up to 20× more classes than seen during training, and provides orders-of-magnitude computational savings over prior transformer-based approaches.

Abstract:
Hierarchical policies in offline goalconditioned reinforcement learning (GCRL) addresses long-horizon tasks by decomposing control into high-level subgoal planning and low-level action execution. A critical design choice in such architectures is the goal representation—the compressed encoding of goals that serves as the interface between these levels. Existing approaches commonly derive goal representations while learning value functions, implicitly assuming that preserving information sufficient for value estimation is adequate for optimal control. We show that this assumption can fail, even when the value estimation is exact, as such representations may collapse goal states that need to be differentiated for action learning. To address this, we introduce an information-theoretic framework that definesaction sufficiency, a condition on goal representations necessary for optimal action selection. We prove that value sufficiency does not imply action sufficiency and empirically verify that the latter is more strongly associated with control success in a discrete environment. We further demonstrate that standard log-loss training of low-level policies naturally induces action-sufficient representations. Our experimental results a popular benchmark demonstrate that our actor-derived representations consistently outperform representations learned via value estimation.

Abstract:
Merging finetuned Large Language Models (LLMs) has become increasingly important for integrating diverse capabilities into a single unified model. However, prevailing model merging methods rely on linear arithmetic in Euclidean space, which often destroys the intrinsic geometric properties of pretrained weights, such as hyperspherical energy. To address this, we propose Orthogonal Model Merging (OrthoMerge), a method that performs merging operations on the Riemannian manifold formed by the orthogonal group to preserve the geometric structure of the model’s weights. By mapping taskspecific orthogonal matrices learned by Orthogonal Finetuning (OFT) to the Lie algebra, OrthoMerge enables a principled yet efficient integration that takes into account both the direction and intensity of adaptations. In addition to directly leveraging orthogonal matrices obtained by OFT, we further extend this approach to general models finetuned with non-OFT methods (\eg, low-rank finetuning, full finetuning) via an Orthogonal-Residual Decoupling strategy. This technique extracts the orthogonal components of expert models by solving the orthogonal Procrustes problem, which are then merged on the manifold of the orthogonal group, while the remaining linear residuals are processed through standard additive merging. Extensive empirical results demonstrate the effectiveness of OrthoMerge in mitigating catastrophic forgetting and maintaining model performance across diverse tasks.

Abstract:
Dueling optimization considers optimizing an objective with access to only a comparison oracle of the objective function. It finds important applications in emerging fields such as recommendation systems and robotics. Existing works on dueling optimization mainly focused on unconstrained problems in the Euclidean space. In this work, we study dueling optimization over Riemannian manifolds, which covers important applications that cannot be solved by existing dueling optimization algorithms. In particular, we propose a Riemannian Dueling Normalized Gradient Descent (RDNGD) method and establish its iteration complexity when the objective function is geodesically Lsmooth or geodesically (strongly) convex. We also propose a projection-free algorithm, named Riemannian Dueling Frank–Wolfe (RDFW) method, to deal with the situation where projection is prohibited. We establish the iteration and oracle complexities for RDFW. We illustrate the effectiveness of the proposed algorithms through numerical experiments on both synthetic and real applications.

Abstract:
Weakto-strong generalization provides a promising paradigm for scaling large language models (LLMs) by training stronger models on samples from aligned weaker ones, without requiring human feedback or explicit reward modeling. However, its robustness and generalization are hindered by the noise and biases in weak-model outputs, which limit its applicability in practice. To address this challenge, we leverage implicit rewards, which approximate explicit rewards through log-likelihood ratios, and reveal their structural equivalence with Contrastive Decoding (CD), a decoding strategy shown to reduce noise in LLM generation. Building on this connection, we propose Contrastive Weak-to-Strong Generalization (ConG), a framework that employs contrastive decoding between pre- and post-alignment weak models to generate higher-quality samples. This approach enables more reliable capability transfer, denoising, and improved robustness, substantially mitigating the limitations of traditional weak-to-strong methods. Empirical results across different model families confirm consistent improvements, demonstrating the generality and effectiveness of ConG. Taken together, our findings highlight the potential of ConG to advance weak-to-strong generalization and provide a promising pathway toward AGI. Our code is available at: \urlhttps://anonymous.4open.science/r/ConG/

Abstract:
Optimal control of complex environments with robotic systems faces two complementary and intertwined challenges: efficient organization of sensory state information and farsighted action planning. Because the reinforcement learning framework addresses only the latter, it tends to deliver sample-inefficient solutions. Active inference is the state-of-the-art process theory that explains how biological brains handle this dual problem. However, its applications to artificial intelligence have thus far been limited to extensions of existing model-based approaches. We present a formal abstraction of reinforcement learning algorithms that spans model-based, distributional, and model-free approaches. This abstraction seamlessly integrates active inference into the distributional reinforcement learning framework, making its performance advantages accessible without transition dynamics modeling.

Abstract:
Structurepreserving approaches to dynamics discovery have demonstrated great potential for modeling physical systems due to their use of strong inductive biases, which enforce key features such as conservation laws and dissipative behavior. However, these models are typically trained on a per-configuration basis, requiring explicit knowledge of system parameters and costly retraining when these parameters vary. While meta-learning provides a potential remedy, optimization-based approaches can suffer from limited generalizability. Motivated by recent advances in modulation-based learning aimed at mitigating these drawbacks, we systematically investigate the use of modulation techniques in learning conservative dynamical systems. We study a range of existing modulation strategies alongside newly proposed variants, integrating them into a Hamiltonian learning framework without requiring an explicit system parameterization. Through extensive experiments on benchmark problems, we demonstrate that modulation-based meta-learning enables accurate few-shot adaptation, achieving robust generalization across parameter space without compromising the conservation of key invariants responsible for the dynamics.

Abstract:
We introduce Categorical Flow Maps, a flowmatching method for accelerated few-step generation of categorical data via self-distillation. Building on recent variational formulations of flow matching and the broader trend towards accelerated inference in diffusion and flow-based models, we define a flow map towards the simplex that transports probability mass toward a predicted endpoint, yielding a parametrisation that naturally constrains model predictions. Since our trajectories are continuous rather than discrete, Categorical Flow Maps can be trained with existing distillation techniques, as well as a new objective based on endpoint consistency. This continuous formulation also automatically unlocks test-time inference: we can directly reuse existing guidance and reweighting techniques in the categorical setting to steer sampling toward downstream objectives. Empirically, we achieve state-of-the-art few-step results on images, molecular graphs, and text, with strong performance even in single-step generation.

Abstract:
Sampling from unnormalized multimodal distributions with limited density evaluations remains a fundamental challenge in machine learning and natural sciences. Successful approaches construct a bridge between a tractable reference and the target distribution. Parallel Tempering (PT) serves as the gold standard, while recent diffusionbased approaches offer a continuous alternative at the cost of neural training. In this work, we introduce Conditional Diffusion Sampling (CDS), a framework that combines these two paradigms. To this end, we derive Conditional Interpolants, a class of stochastic processes whose transport dynamics are governed by an exact, closed-form stochastic differential equation (SDE), requiring no neural approximation. Although these dynamics require sampling from a non-trivial initialization distribution, we show both theoretically and empirically that the cost of this initialization diminishes for sufficiently short diffusion times. CDS leverages this by a two-stage procedure: (1) PT is used to efficiently sample the initial distribution, and then (2) samples are transported via the transport SDE. This combination couples the robust global exploration of PT with efficient local transport. Experiments suggest that CDS has the potential to achieve a superior trade-off between sample quality and density evaluation cost compared to state-of-the-art samplers.

Abstract:
Incorporating group symmetries via equivariance into neural networks has emerged as a robust approach for overcoming the efficiency and data demands of modern deep learning. While most existing approaches, such as group convolutions and averagingbased methods, focus on compact, finite, or low-dimensional groups with linear actions, this work explores how equivariance can be extended to infinite-dimensional groups. We propose a strategy designed to induce diffeomorphism equivariance in pre-trained neural networks via energy-based canonicalisation. Formulating equivariance as an optimisation problem allows us to access the rich toolbox of already established differentiable image registration methods. Empirical results on segmentation and classification tasks confirm that our approach achieves approximate equivariance and generalises to unseen transformations without relying on extensive data augmentation or retraining.

Abstract:
The Platonic Representation Hypothesis suggests that independently trained neural networks converge to increasingly similar latent spaces. However, current strategies for mapping these representations are inherently pairwise, scaling quadratically with the number of models and failing to yield a consistent global reference. In this paper, we study the alignment of M \ge 3 models. We first adapt Generalized Procrustes Analysis (GPA) to construct a shared orthogonal universe that preserves the internal geometry essential for tasks like model stitching. We then show that strict isometric alignment is suboptimal for retrieval, where agreementmaximizing methods like Canonical Correlation Analysis (CCA) typically prevail. To bridge this gap, we finally propose Geometry-Corrected Procrustes Alignment (GCPA), which establishes a robust GPA-based universe followed by a post-hoc correction for directional mismatch. Extensive experiments demonstrate that GCPA consistently improves any-to-any retrieval while retaining a practical shared reference space.

Abstract:
Approximation of a target probability distribution using a finite set of points is a problem of fundamental importance in numerical integration. Several authors have proposed to select points by minimising a maximum mean discrepancy (MMD), but the nonconvexity of this objective typically precludes global minimisation. Instead, we consider the concept of \emphstationary points of the MMD which, in contrast to points globally minimising the MMD, can be accurately computed. Our main contributions are two-fold and theoretical in nature. We first prove the (perhaps surprising) result that, for integrands in the associated reproducing kernel Hilbert space, the numerical integration error of stationary MMD points vanishes \emphfaster than the MMD. Motivated by this \emphsuper-convergence property, we consider MMD gradient flows as a practical strategy for computing stationary points of the MMD. We then prove that MMD gradient flow can indeed compute stationary MMD points, based on a refined convergence analysis that establishes a novel non-asymptotic finite-particle error bound.

Abstract:
Information design (ID) explores how a sender influence the optimal behavior of receivers to achieve specific objectives. While ID originates from everyday human communication, existing gametheoretic and learning methods often model information structures as numbers, which limits many applications to toy games. This work leverages LLMs and proposes a verbalized framework in Bayesian persuasion (BP), which extends classic BP to real-world games involving human dialogues for the first time. We map the BP to a verbalized mediator-augmented game, where LLMs instantiate the sender and receiver. To efficiently solve the verbalized game, we propose a generalized equilibrium-finding algorithm combining LLM and game solver. The algorithm is reinforced with techniques including verbalized commitment assumptions, verbalized obedience constraints, and information obfuscation. Experiments in dialogue scenarios, such as recommendation letters, law enforcement, diplomacy with press, validate that our framework can reproduce theoretical results in classic BP and discover effective persuasion strategies in more complex natural language and multi-stage scenarios.

Abstract:
We propose causal preference elicitation, a Bayesian framework for expertin-the-loop causal discovery that actively queries local edge relations to concentrate a posterior over directed acyclic graphs (DAGs). From any black-box observational posterior, we model noisy expert judgments with a three-way likelihood over edge existence and direction. Posterior inference uses a flexible particle approximation, and queries are selected by an efficient expected information gain criterion on the expert’s categorical response. Experiments on synthetic graphs, protein signaling data, and a human gene perturbation benchmark show faster posterior concentration and improved recovery of directed effects under tight query budgets.

Abstract:
Tokenization is a promising path to multimodal models capable of jointly understanding protein sequences, structure, and function. Existing protein structure tokenizers create tokens by pooling information from local neighborhoods, an approach that limits their performance on generative and representation tasks. In this work, we present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. This change resolves several issues with generative models based on local protein tokenization: it mitigates error accumulation, provides embeddings without sequence-reduction operations, and allows task-specific adaptation of a tokenized sequence's information content. We validate our method on reconstruction, generative, and representation tasks and demonstrate that it matches or outperforms existing models based on local protein structure tokenizers. We show that our adaptive approach enables inference criteria based on the information content of the generated proteins. We validate representations generated from our tokenizer on CATH classification tasks and demonstrate that non-linear probing on our tokenized sequences outperforms equivalent probing on representations from other tokenizers. Finally, we demonstrate how our method supports zero-shot protein shrinking and affinity maturation.

Abstract:
Does text have an intrinsic curvature? Language is increasingly modeled in curved geometries—hyperbolic spaces for hierarchy, mixedcurvature manifolds for compositional structure—yet a basic scientific question remains unresolved: what does curvature mean for text itself, in a way that is native to language rather than an artifact of the embedding space we choose? We argue that text does indeed have curvature, and show how to detect it, define it, and use it. To this end, we propose Texture, a text-native, word-level discrete curvature signal, and make three contributions. (a) Existence: We provide empirical and theoretical certificates that semantic inference in natural corpora is non-flat, i.e. language has inherent curvature. (b) Definition: We define Texture by reconciling left- and right-context beliefs around a masked word through a Schrödinger bridge, yielding a curvature field that is positive where context focuses meaning and negative where it fans out into competing continuations. (c) Utility: Texture is actionable: it serves as a general-purpose measurement and control primitive enabling geometry without geometric training; we instantiate it on two representative tasks, improving long-context inference through curvature-guided compression and retrieval-augmented generation through curvature-guided routing. Together, our results establish a text-native curvature paradigm, making curvature measurable and practically useful.

Abstract:
The goal of quantization is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, most quantization algorithms minimize the immediate activation error of each layer as a proxy for the endto-end error. However, this ignores the effect of future layers, making it a poor proxy. In this work, we introduce Yet Another Quantization Algorithm (YAQA), a new adaptive rounding algorithm that directly considers the error at the network's output. YAQA introduces a series of theoretical results that culminate in the first end-to-end error bounds for quantization algorithms. First, we characterize the convergence time of adaptive rounding algorithms via the structure of their Hessian approximations. We then show that the end-to-end error can be bounded by the approximation's cosine similarity to the true Hessian. This admits a natural Kronecker-factored approximation with corresponding near-optimal Hessian sketches. YAQA is provably better than GPTQ/LDLQ and empirically reduces the error by \approx 30% over these methods. YAQA even achieves a lower error than quantization aware training. This translates to state of the art performance on downstream tasks, all while adding no inference overhead.

Abstract:
Gaussian processes (GPs) are powerful and widely used probabilistic regression models, but their effectiveness in practice is often limited by the choice of kernel function. This kernel function is typically handcrafted from a small set of standard functions, a process that requires expert knowledge, results in limited adaptivity to data, and imposes strong assumptions on the hypothesis space. We study Empirical GPs, a principled framework for constructing flexible, datadriven GP priors that overcome these limitations. Rather than relying on standard parametric kernels, we estimate the mean and covariance functions empirically from a corpus of historical observations, enabling the prior to reflect rich, non-trivial covariance structures present in the data. Theoretically, we show that the resulting model converges to the GP that is closest (in KL-divergence sense) to the real data generating process. Practically, we formulate the problem of learning the GP prior from independent datasets as likelihood estimation and derive an Expectation-Maximization algorithm with closed-form updates, allowing the model handle heterogeneous observation locations across datasets. We demonstrate that Empirical GPs achieve competitive performance on learning curve extrapolation and time series forecasting benchmarks.

Abstract:
This paper is concerned with differentiable resampling in the context of sequential Monte Carlo (e.g., particle filtering). We propose a new informative resampling method that is instantly differentiable, based on an ensemble score diffusion model. We theoretically prove that our diffusion resampling method provides a consistent resampling distribution, and we show empirically that it outperforms the stateof-the-art differentiable resampling methods on multiple filtering and parameter estimation benchmarks. Finally, we show that it achieves competitive end-to-end performance when used in learning a complex dynamics-decoder model with high-dimensional image observations.

Abstract:
Conformal prediction provides rigorous, distributionfree uncertainty guarantees, but often yields prohibitively large prediction sets in structured domains such as routing, planning, or sequential recommendation. We introduce graph-based conformal compression, a framework for constructing compact subgraphs that preserve statistical validity while reducing structural complexity. We formulate compression as selecting a smallest subgraph capturing a prescribed fraction of the probability mass, and reduce to a weighted version of densest-k-subgraphs in hypergraphs, in the regime where the subgraph has a large fraction of edges. We design efficient approximation algorithms that achieve constant factor coverage and size trade-offs. Crucially, we prove that our relaxation satisfies a monotonicity property, derived from a connection to parametric minimum cuts, which guarantees the nestedness required for valid conformal calibration. Our results therefore not only highlight an algorithmic regime, distinct from classical densest-k-subgraph hardness settings, where the problem can be approximated efficiently, but also bridge conformal prediction with combinatorial graph compression via monotonicity. We finally validate our algorithmic approach via simulations for trip planning and navigation, and compare to natural baselines.

Abstract:
Large language models enable inexpensive AIgenerated annotations, but using them reliably for causal inference remains challenging. Naively pooling AI and human data induces bias, while existing methods such as Prediction-Powered Inference (PPI; Angelopoulos et al., 2023) treat AI outputs as proxies of true labels - an assumption often violated for generative model outputs in practice. We propose Generative Augmented Inference (GAI), a framework that treats AI outputs as general, potentially high-dimensional informative features for learning human labels rather than as surrogates. GAI flexibly models this relationship using nonparametric methods, enabling consistent estimation and valid inference from combined human and AI data. We establish asymptotic normality and show that GAI strictly improves asymptotic efficiency over human-data-only estimation whenever AI outputs are informative for true labels. Empirical studies on real-world datasets demonstrate that GAI significantly reduces estimation error and improves confidence interval quality across diverse generative data sources relative to human-only and PPI-based estimation.

Abstract:
Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. And the characteristic that these two types of tokens cannot coexist is the root cause of the failure in continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3 yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all of 16B-scale no-think models.

Abstract:
Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) algorithms have been adapted to be compatible with dLLMs for finetuning them, their reliance on the computation of the marginal likelihood to evaluate policy objectives is intractable. To overcome this, we exploit a dynamical relation between the unmasking posterior of the base model and that which targets the reward-tilted distribution to derive Discrete Tilt Matching (DTM), an algorithm that avoids intractable likelihood evaluation entirely. DTM can be phrased as a cross-entropy loss that only requires forward evaluation of rewards and whose variance can be adaptively controlled, improving training stability. We motivate DTM on maze planning tasks, and show that fine-tuning LLaDA-8B-Instruct with DTM achieves higher accuracy at lower compute costs than prior RL-based fine-tuning methods across the Sudoku, Countdown, and MATH500 benchmarks.

Abstract:
Sampling equilibrium molecular configurations from the Boltzmann distribution is a longstanding challenge. Boltzmann Generators (BGs) address this by combining exactlikelihood generative models with importance sampling, but their practical scalability is limited. Meanwhile, coarse-grained surrogates enable the modeling of larger systems by reducing effective dimensionality, yet often lack the reweighting process required to ensure asymptotically correct statistics. In this work, we propose Coarse-Grained Boltzmann Generators (CG-BGs), a principled framework that unifies scalable reduced-order modeling with the exactness of importance sampling. CG-BGs act in a coarse-grained coordinate space, using a learned potential of mean force (PMF) to reweight samples generated by a flow-based model. Crucially, we show that this PMF can be efficiently learned from rapidly converged data via force matching. Our results demonstrate that CG-BGs faithfully capture complex interactions mediated by explicit solvent within highly reduced representations, establishing a scalable pathway for the unbiased sampling of larger molecular systems.

Abstract:
AI agents are rapidly advancing from passive language models to autonomous systems executing complex, multistep tasks. Yet their overconfidence in failure remains a fundamental barrier to deployment in high-stakes settings. Existing calibration methods, built for static single-turn outputs, cannot address the unique challenges of agentic systems, such as compounding errors along trajectories, uncertainty from external tools, and opaque failure modes. To address these challenges, we introduce, for the first time, the problem of \emphAgentic Confidence Calibration and propose Holistic Trajectory Calibration (\htcnospace), a novel diagnostic framework that extracts rich process-level features ranging from macro dynamics to micro stability across an agent’s entire trajectory. Powered by a simple, interpretable model, \htc consistently surpasses strong baselines in both calibration and discrimination, across eight benchmarks, multiple LLMs, and diverse agent frameworks. Beyond performance, \htc delivers three essential advances: it provides \emphinterpretability by revealing the signals behind failure, enables \emphtransferability by applying across domains without retraining, and achieves \emphgeneralization through a \emphGeneral Agent Calibrator (\gacnospace) that achieves the best calibration (lowest ECE) on the out-of-domain GAIA benchmark. Together, these contributions establish a new process-centric paradigm for confidence calibration, \colorblackproviding a framework for diagnosing and enhancing the reliability of AI agents.

Abstract:
In federated learning (FL), K clients jointly train a model without sharing raw data. Because each participant invests data and computing power, clients need mechanisms to later prove the provenance of a jointly trained model. Model watermarking embeds a hidden signal in the weights, but naive approaches either do not scale with many clients (perclient watermarks dilute as K grows) or give any individual client the ability to verify (and potentially remove) a shared-key watermark. We introduce (t,K)-threshold watermarking: clients collaboratively embed a single watermark during training, while only coalitions of at least t clients can reconstruct the watermark key and verify a suspect model, but any coalition of fewer than t clients learns nothing about the watermark beyond the verification output. We instantiate our protocol in the white-box setting and evaluate on CIFAR-10, CIFAR-100, and Tiny ImageNet. Our watermark remains detectable at scale (up to K=128) with minimal accuracy loss and stays above the detection threshold (z\ge 4) under 90% pruning, 4-bit quantization, and adaptive fine-tuning using up to 20% of the training data.

Abstract:
As digital environments (data distribution) are in flux, with new GUI data arriving over timeintroducing new domains or resolutions-agents trained on static environments deteriorate in performance. In this work, we introduce Continual GUI Agents, a new task that requires GUI agents to perform continual learning under shifted domains and resolutions. We find existing methods fail to maintain stable grounding as GUI distributions shift over time, due to the diversity of UI interaction points and regions in fluxing scenarios. To address this, we introduce GUI-Anchoring in Flux (GUI-AiF), a new reinforcement fine-tuning framework that stabilizes continual learning through two novel rewards: Anchoring Point Reward in Flux (APR-iF) and Anchoring Region Reward in Flux (ARR-iF). These rewards guide the agents to align with shifting interaction points and regions, mitigating the tendency of existing reward strategies to over-adapt to static grounding cues (e.g., fixed coordinates or element scales). Extensive experiments show GUI-AiF surpasses state-of-the-art baselines. Our work establishes the first continual learning framework for GUI agents, revealing the untapped potential of reinforcement fine-tuning for continual GUI Agents.

Abstract:
Chainof-Thought (CoT) reasoning has significantly improved the performance of Large Language Models (LLMs) but comes with substantial computational costs due to excessive token consumption. Existing approaches to reduce inference latency, such as explicit length penalties, often degrade reasoning quality by truncating necessary logical steps. In this work, we introduce a novel, SFT-free reinforcement learning framework that induces emergent token efficiency without explicit length constraints.We propose Batched Contextual Reinforcement (BCR), a training paradigm where the model is prompted to solve multiple reasoning tasks within a single context window, rewarded by independent instance-level accuracy. This formulation introduces an implicit information bottleneck: to maximize the cumulative reward within the context capacity, the model is forced to eliminate syntactic redundancy and focus attention on the semantic core of the reasoning path.Empirically, our method demonstrates a remarkable shift in the efficiency-accuracy Pareto frontier. Using a 1.5B parameter model JustRL-Deepseek-1.5B, we achieve39.8--62.6% reduction in token usageacross five mathematical reasoning benchmarks while maintaining or improving accuracy on four of them. Most notably, on AMC23 and Minerva, we observe a ``free lunch'' phenomenon where accuracy improves by+2.5%and+5.1%respectively, despite using approximatelyhalf the tokens. Extensive ablation studies confirm that batched training acts as a superior form of implicit regularization that reduces hallucinations and sharpens attention. Our findings indicate that LLMs possess latent, high-density reasoning modes that can be unlocked via purely structural incentives in RL.

Abstract:
This paper investigates the grokking phenomenon, which refers to the sudden transition from a long memorization to generalization observed during neural networks training, in the context of learning multiplication in finitedimensional algebras (FDA). While prior work on grokking has focused mainly on group operations, we extend the analysis to more general algebraic structures, including non-associative, non-commutative, and non-unital algebras. We show that learning group operations is a special case of learning FDA, and that learning multiplication in FDA amounts to learning a bilinear product specified by the algebra’s structure tensor. For algebras over the reals, we connect the learning problem to matrix factorization with an implicit low-rank bias, and for algebras over finite fields, we show that grokking emerges naturally as models must learn discrete representations of algebraic elements. This leads us to experimentally investigate the following core questions: (i) how do algebraic properties such as commutativity, associativity, and unitality influence both the emergence and timing of grokking, (ii) how structural properties of the structure tensor of the FDA, such as sparsity and rank, influence generalization, and (iii) to what extent generalization correlates with the model learning latent embeddings aligned with the algebra’s representation. Our work provides a unified framework for grokking across algebraic structures and new insights into how mathematical structure governs neural network generalization dynamics.

Abstract:
We present adversarial flow models, a class of generative models that belongs to both adversarial and flow families of models. Our method supports native onestep or multi-step generation and is trained using the adversarial objective. Unlike traditional GANs, where the generator learns an arbitrary transport plan between the noise and the data distributions, our generator learns a deterministic noise-to-data mapping, which is the same optimal transport as in flow-matching models. This significantly stabilizes adversarial training. Also, unlike consistency-based methods, our model directly learns one-step or few-step generation without having to learn the intermediate timesteps of the probability flow for propagation. This saves model capacity, reduces training iterations, and avoids error accumulation. Under the same 1NFE setting on ImageNet-256px, our B/2 model approaches the performance of consistency-based XL/2 models, while our XL/2 model creates a new best FID of 2.38. We additionally show the possibility of end-to-end training of 56-layer and 112-layer models through depth repetition without any intermediate supervision, and achieve FIDs of 2.08 and 1.94 using a single forward pass, surpassing their 28-layer 2NFE and 4NFE counterparts.

Abstract:
Several machine learning models are defined for inputs of any size, such as graphs with different numbers of nodes and point clouds with varying numbers of points. The universality properties of such anydimensional models remain poorly understood, as universality is traditionally studied for models accepting inputs of a fixed size, defined on a compact subset of their domain. In sharp contrast, any-dimensional models can be viewed as sequences of functions defined on growing-sized inputs, and it is not clear in which sense they can be universal. We develop a systematic approach to establish any-dimensional universality by identifying any-dimensional functions with a unique function that takes inputs in a suitable infinite-dimensional limit space containing inputs of all finite sizes, as well as their limits. Using the symmetries of these inputs and relations between inputs of different sizes, we show that this limit space admits a natural topology with rich families of compact sets on which any-dimensional universality can be established. We illustrate our approach by showing that several existing architectures fail to be universal, and we propose simple modifications that restore universality.

Abstract:
Counterfactual generation aims to simulate realistic hypothetical outcomes under causal interventions. Diffusion models have emerged as a powerful tool for this task, combining DDIM inversion with conditional generation and classifierfree guidance (CFG). In this work, we identify a key limitation of CFG for counterfactual generation: it prescribes a global guidance scale for all attributes, leading to significant spurious changes in inferred counterfactuals. To mitigate this, we proposeFactored Classifier-Free Guidance(FCFG), a flexible and model-agnostic guidance technique that enables attribute-wise control following a causal graph. FCFG complements recent advances in classifier-free guidance and can be seamlessly extended to advanced guidance schemes such as CFG++ and APG. Our experiments demonstrate that FCFG significantly improves the axiomatic soundness of inferred counterfactuals across both natural and medical image datasets, mitigating spurious amplification effects, and enhancing counterfactual reversibility.

Abstract:
The rapid proliferation of highquality synthetic data---generated by advanced AI models or collected as auxiliary data from related tasks---presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference (GESPI) framework that wraps around any statistical inference procedure to safely enhance sample efficiency by combining synthetic and real data. Our framework leverages high-quality synthetic data to boost statistical power, yet adaptively defaults to the standard method using only real data when synthetic data are of low quality. The error rate of our method remains below a user-specified bound without any distributional assumptions on the synthetic data, and decreases as the quality of the synthetic data improves. This flexibility enables seamless integration with conformal prediction, risk control, hypothesis testing, and multiple testing procedures, all without modifying the base inference method. We demonstrate the benefits of our method on challenging tasks with limited labeled data, including AlphaFold protein structure prediction, and comparing large reasoning models on complex math problems.

Abstract:
Scorebased and flow-based generative models exhibit remarkable expressive capacity in capturing complex distributions, and have been extensively deployed in tasks ranging from image generation to reinforcement learning. Nevertheless, these models suffer from prolonged inference latency, which imposes a significant computational bottleneck in RL with iterative sampling. To overcome this limitation, we propose a new framework namedMoment Matching Q-Learning(MoMa QL), which utilizes a technique from statistical hypothesis testing known as maximum mean discrepancy (MMD) that intend to match all orders of statistics between the original and target distribution. By enforcing strong regularization on all moment statistics, this algorithm guarantees distribution-level convergence for conditional score function and remains stable under various hyperparameters. Empirically, we show that our method MoMa QL is more computationally efficient with a comparable if not competitive performance in various D4RL tasks. Remarkably, by accelerating the action sampling process for flow-based policies, MoMa QL demonstrates superior performance in offline-to-online RL tasks because of faster and stronger adaptability for online interactive finetuning.

Abstract:
We introduce a testingby-betting framework that leverages predictions on unlabeled data to enhance the power of sequential hypothesis testing. Given limited samples from the joint distribution of (X,Y), and additional unlabeled samples from the marginal of X, we ask how unlabeled data can be used to hypothesize about the distribution of Y, and the conditional distribution of Y\mid X. We introduce an e-statistic and use it to construct a sequential test. Under standard distributional assumptions---label shift or concept shift---we establish that the test is anytime valid. Furthermore, we show that for binary data, the e-statistic has non-trivial power. Crucially, our approach retains these properties even when the underlying predictions are inaccurate. Through simulations and applications to large language models evaluation, we demonstrate power gains over baseline approaches, including prediction-powered inference. These gains persist even with relatively limited unlabeled data and when predictions have low accuracy due to weak correlation between X and Y.

Abstract:
Model merging aims to consolidate multiple taskspecific models fine-tuned on different datasets into a unified architecture that performs cross-domain proficiency. Current data-free model merging methods often struggle to scale as they rely on simple parameter-level heuristics that ignore inter-layer dependencies and non-uniform distribution of expertise. To address this, we propose SA-Merging, a new basis for model merging that estimates the saliency of each parameter through a differentiable inter-layer interaction function. By leveraging the gradients of this function with merged parameters, we derive a saliency score that identifies parameters critical to preserving end-to-end information flows. Building on this signal, SA-Merging recursively eliminates non-informative parameters in a purely data-free manner. Notably, our method is inherently modular, seamlessly integrating with existing sign-based and sparsification-based interference mitigation strategies. Furthermore, we extend SA-Merging to introduce rank-wise saliency decomposition for LoRA, enabling the integration of low-rank adapters without compromising their structural integrity. Extensive experiments on vision and language tasks demonstrate the effectiveness of our saliency-based approach, further reducing the gap between data-free and test-time adaptation methods.

Abstract:
We introduce the Sinkhorn treatment effect, an optimal transport measure of divergence between counterfactual distributions. Unlike classical quantities such as the average treatment effect, this measure captures differences across entire distributions. We analyze this divergence as a statistical functional and show it can be written as a smooth transformation of counterfactual mean embeddings with an appropriate kernel. This characterization allows us to establish firstorder pathwise differentiability in general, and second-order pathwise differentiability under the null hypothesis of equal counterfactual distributions. Leveraging this smoothness, we construct debiased estimators and use them to obtain asymptotically valid tests for distributional treatment effects. Experiments on simulated and image data demonstrate the practical advantages of our estimator and testing procedure.

Abstract:
Offpolicy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents significant challenges, primarily due to the inherent distributional shift and value function overestimation bias. These issues become even more noticeable inzero-shotreinforcement learning, where an agent trained on reward-free data must adapt to new tasks at test time without additional training. In this work, we address the off-policy problem in a zero-shot setting by discovering a theoretical connection of successor measures to stationary density ratios. Using this insight, our algorithm can infer optimal importance sampling ratios, effectively performing a stationary distribution correction with an optimal policyfor any task on the fly. We benchmark our method in motion tracking tasks on SMPL Humanoid, continuous control on ExoRL, and for the long-horizon OGBench tasks. Our technique seamlessly integrates into forward-backward representation frameworks and enablesfast-adaptationto new tasks in araining-freeregime. More broadly, this work bridges off-policy learning and zero-shot adaptation, offering benefits to both research areas.

Abstract:
Endto-end deep learning has achieved impressive results but often relies on large labeled datasets, exhibits limited generalization to unseen scenarios, and incurs substantial computational cost. Classical optimization methods, in contrast, are data-efficient and lightweight but frequently suffer from slow or unstable convergence. Learned optimizers aim to bridge this gap, yet existing approaches predominantly focus on first-order methods, leaving learned second-order optimization largely unexplored. We introduce L-SR1, a novel learned second-order optimizer that augments the classical Symmetric Rank-One (SR1) method with a lightweight, trainable preconditioning unit. This unit generates data-driven rank-one updates that are explicitly aligned with the secant condition via a learned projection mechanism, ensuring stable and consistent curvature estimation. Through controlled analytic benchmarks, we systematically analyze the stability, generalization across problem dimensions, and the quality of the resulting search directions. We further evaluate L-SR1 on the real-world task of Monocular Human Mesh Recovery (HMR), where it outperforms existing optimization-based and learned-optimization baselines. With a compact model and no reliance on task-specific fine-tuning or annotated data, L-SR1 demonstrates strong generalization and provides an effective drop-in optimizer for a wide range of iterative optimization problems.

Abstract:
We give a very general and simple framework to incorporate predictions on requests for online covering problems in a rigorous and blackbox manner. Our framework turns any online algorithm with competitive ratio \rho(k, \cdot) depending on k, the number of arriving requests, into an algorithm with competitive ratio of \rho(\eta, \cdot), where \eta is the prediction error. With accurate enough prediction, the resulting competitive ratio breaks through the corresponding worst-case online lower bounds, and smoothly degrades as the prediction error grows. This framework directly applies to a wide range of well-studied online covering problems such as facility location, Steiner problems, set cover, parking permit, etc., and yields improved and novel bounds.

Abstract:
Deep Equilibrium Models (DEQs) have emerged as a powerful paradigm in deep learning, offering the ability to model infinitedepth networks with constant memory usage. However, DEQs incur significant inference latency due to the iterative nature of fixed-point solvers. In this work, we introduce the Consistency Deep Equilibrium Model (C-DEQ), a novel framework that leverages consistency distillation to accelerate DEQ inference. We cast the DEQ iterative inference process as evolution along a fixed ODE trajectory toward the equilibrium. Along this trajectory, we train C-DEQs to consistently map intermediate states directly to the fixed point, enabling few-step inference while preserving the performance of the teacher DEQ. At the same time, it facilitates multi-step evaluation to flexibly trade computation for performance gains. Extensive experiments across various domain tasks demonstrate that C-DEQs achieves consistent 2-20× accuracy improvements over implicit DEQs under the same few-step inference budget.

Abstract:
Bayesian neural networks promise calibrated uncertainty but require O(mn) parameters for standard meanfield Gaussian posteriors. We argue this cost is often unnecessary, particularly when weight matrices exhibit fast singular value decay. By parameterizing weights as W = AB^\top with A \in \mathbbR^m × r, B \in \mathbbR^n × r, we induce a posterior that is singular with respect to the Lebesgue measure, concentrating on the rank-r manifold. This singularity captures structured weight correlations through shared latent factors, geometrically distinct from mean-field's independence assumption. We derive PAC-Bayes generalization bounds whose complexity term scales as \sqrtr(m+n) instead of \sqrtm n, and prove loss bounds that decompose the error into optimization and rank-induced bias using the Eckart-Young-Mirsky theorem. We further adapt recent Gaussian complexity bounds for low-rank deterministic networks to Bayesian predictive means by exploiting their stability under convex combinations. Empirically, across MLPs, LSTMs, and Transformers on standard benchmarks, our method achieves predictive performance competitive with 5-member Deep Ensembles while using up to 15× fewer parameters. Furthermore, it substantially improves OOD detection and often improves calibration relative to mean-field and perturbation baselines.

Abstract:
In statistical applications it has become increasingly common to encounter data structures that live on nonlinear spaces such as manifolds. Classical linear regression, one of the most fundamental methodologies of statistical learning, captures the relationship between an independent variable and a response variable which both are assumed to live in Euclidean space. Thus, geodesic regression emerged as an extension where the response variable lives on a Riemannian manifold. The parameters of geodesic regression, as with linear regression, capture the relationship of sensitive data and hence one should consider the privacy protection practices of said parameters. We consider releasing Differentially Private (DP) parameters of geodesic regression via the K-Norm Gradient (KNG) mechanism for Riemannian manifolds. We derive theoretical bounds for the sensitivity of the parameters showing they are tied to their respective Jacobi fields and hence the curvature of the space. This corroborates, and extends, recent findings of differential privacy for the Fr\'echet mean. We demonstrate the efficacy of our methodology on the sphere, S_2\subset\mathbbR^3, the space of symmetric positive definite matrices, and Kendall's planar shape space. Our methodology is general to any Riemannian manifold, and thus, it is suitable for data in domains such as medical imaging and computer vision.

Abstract:
Unlearnable examples (UEs) aim to compromise model training by injecting imperceptible perturbations to clean samples. However, existing UE schemes exhibit limited robustness against advanced defenses due to their heuristic design or narrowly scoped domain perturbations. To address this, we propose \textttDUNE, a \underlineDualbranch \underlineUNlearnable \underlineEnsemble perturbation optimization approach. Specifically, \textttDUNE separately optimizes perturbations in the spatial and color domains to establish the mapping between perturbations and shift-induced labels. This design extends the perturbation domain to increase noise intensity for improving robustness and drives the models to learn perturbation-oriented features with degraded generalization, thereby achieving unlearnability. To strengthen \textttDUNE's performance, we further propose an unlearnability-enhancing ensemble strategy that aggregates diverse pre-trained models during the dual-branch optimization. Extensive experiments on benchmark datasets CIFAR-10 and ImageNet verify that \textttDUNE's robustness outperforms 12 SOTA UE schemes under 7 mainstream defenses, yielding a lower average test accuracy of 14.95% to 50.82%.

Abstract:
Causal Abstraction (CA) theory provides a principled framework for relating causal models that describe the same system at different levels of granularity while ensuring interventional consistency between them. Recent methods for learning CAs, however, assume fixed and wellspecified exogenous distributions, leaving them vulnerable to environmental shifts and model misspecification. In this work, we address these limitations by introducing the first class of distributionally robust CAs and their associated learning algorithms. The latter cast robust causal abstraction learning as a constrained min-max optimization problem with Wasserstein ambiguity sets. We provide theoretical guarantees for both empirical and Gaussian environments, enabling principled selection of ambiguity-set radii and establish quantitative guarantees on worst-case abstraction error. Furthermore, we present empirical evidence across different problems and CA learning methods, demonstrating our framework’s robustness not only to environmental shifts but also to structural and intervention mapping misspecification.

Abstract:
Incontext reinforcement learning (ICRL) is an emerging RL paradigm where an agent, after pretraining, can adapt to out-of-distribution test tasks without any parameter updates, instead relying on an expanding context of interaction history. While ICRL has shown impressive generalization, safety during this adaptation process remains unexplored, limiting its applicability in real-world deployments where test-time behavior is expected to be safe. In this work, we propose SCARED: Safe Contextual Adaptive Reinforcement via Exact-penalty Dual, the first method that promotes safe adaptation of ICRL under the constrained Markov decision process framework. During the parameter-update-free adaptation process, our agent not only maximizes the reward but also keeps the accumulated cost within a user-specified safety budget. We also demonstrate that the agent actively reacts to the safety budget; with a higher safety budget, the agent behaves more aggressively, and with a lower safety budget the agent behaves more conservatively. Across challenging benchmarks, SCARED consistently enables safe and robust in-context adaptation, outperforming existing ICRL and safe meta-RL baselines.

Abstract:
Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via differencein-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two unimodal Gaussian distributions with identical covariance, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Through numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.

Abstract:
The performance of pretrained masked diffusion models is often constrained by their sampling procedure, which makes decisions irreversible and struggles in low-step generation regimes. We introduce a novel sampling algorithm that works with pre-trained models and, after a lightweight fine-tuning of a single layer, significantly improves sample quality and efficiency. Our method reformulates the generation process using a star-shaped paradigm, which inherently allows for error correction. To make this process effective, we augment it with a learnable remasking module that intelligently identifies and revises likely errors. This approach yields a substantial quality boost, particularly when using a small number of sampling steps. We extensively ablate key components of our approach and show its usability in different scenarios. In comprehensive experiments on text, and code generation, our sampling algorithm outperforms or matches existing methods.

Abstract:
Bayesian optimization (BO) for highdimensional constrained problems remains a significant challenge due to the curse of dimensionality. We propose Local Constrained Bayesian Optimization (LCBO), a novel framework tailored for such settings. Unlike trust-region methods that are prone to premature shrinking when confronting tight or complex constraints, LCBO leverages the differentiable landscape of constraint-penalized surrogates to alternate between rapid local descent and uncertainty-driven exploration. Theoretically, we prove that LCBO achieves a convergence rate for the Karush-Kuhn-Tucker (KKT) residual that depends polynomially on the dimension d for common kernels under mild assumptions, offering a rigorous alternative to global BO where regret bounds typically scale exponentially. Extensive evaluations on high-dimensional benchmarks (up to 100D) demonstrate that LCBO consistently outperforms state-of-the-art baselines.

Abstract:
LLM agents rely heavily on highquality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering. Despite the high cost of creating these datasets, existing literature has overlooked copyright protection for LLM agent trajectories. This gap leaves creators vulnerable to data theft and makes it difficult to trace misuse or enforce ownership rights. This paper introduces ActHook, the first watermarking method tailored for agent trajectory datasets. Inspired by hook mechanisms in software engineering, ActHook embedshook actionsthat are activated by a secret input key and do not alter the original task outcome. Like software execution, LLM agents operate sequentially, allowing hook actions to be inserted at decision points without disrupting task flow. When the activation key is present, an LLM agent trained on watermarked trajectories can produce these hook actions at a significantly higher rate, enabling reliable black-box detection. Experiments on mathematical reasoning, web searching, and software engineering agents show that ActHook achieves an average detection AUC of 94.3 on Qwen-2.5-Coder-7B while incurring negligible performance degradation.

Abstract:
Aligning AI systems with human values remains a fundamental challenge, but does our inability to create perfectly aligned models preclude obtaining the benefits of alignment? We study a strategic setting where a human user interacts with multiple differently misaligned AI agents, none of which are individually wellaligned. Our key insight is that when the user’s utility lies approximately within the convex hull of the agents’ utilities, a condition that becomes easier to satisfy as model diversity increases, strategic competition can yield outcomes comparable to interacting with a perfectly aligned model. We model this as a multi-leader Stackelberg game, extending Bayesian persuasion to multi-round conversations between differently informed parties, and prove three results: (1) when perfect alignment would allow the user to learn her Bayes-optimal action, she can also do so in all equilibria under the convex hull condition; (2) under weaker assumptions requiring only approximate utility learning, a non-strategic user employing quantal response achieves near-optimal utility in all equilibria; and (3) when the user selects the best single AI after an evaluation period, equilibrium guarantees remain near-optimal without further distributional assumptions. We complement the theory with two forms of empirical evidence: First, we perform simulations of the best-AI selection game using best response dynamics, which show that competition among individually misaligned agents reliably improves user utility when the approximate convex hull assumption is satisfied, but does not always when it fails. Second, we show that synthetically generated AI utility functions (produced via perturbations of the same prompt to evaluate instances on a movie recommendation (MovieLens) and ethical judgement (ETHICS) dataset) quickly produce a convex hull that contains a good approximation of a given utility function even when none of the individual LLM utility functions is well aligned. We show that this phenomenon extends to human and LLM responses on real-world polling data (OpinionQA): a convex hull of LLM opinions can approximate human opinions more accurately than any individual LLM across a wide range of survey questions.

Abstract:
Sliced optimal transport (SOT), or sliced Wasserstein (SW) distance, is widely recognized for its statistical and computational scalability. In this work, we further enhance computational scalability by proposing the first method for estimating SW from sample streams, called streaming sliced Wasserstein (StreamSW). To define Stream-SW, we first introduce a streaming estimator of the one-dimensional Wasserstein distance (1DW). Since the 1DW has a closed-form expression, given by the absolute difference between the quantile functions of the compared distributions, we leverage quantile approximation techniques for sample streams to define a streaming 1DW estimator. By applying the streaming 1DW to all projections, we obtain Stream-SW. The key advantage of Stream-SW is its low memory complexity while providing theoretical guarantees on the approximation error. We demonstrate that Stream-SW achieves a more accurate approximation of SW than random subsampling, with lower memory consumption, when comparing Gaussian distributions and mixtures of Gaussians from streaming samples. Additionally, we conduct experiments on point cloud classification, point cloud gradient flows, and streaming change point detection to further highlight the favorable performance of the proposed Stream-SW.

Abstract:
Electrostatic generative models such as PFGM++ have recently emerged as a powerful framework, achieving stateof-the-art performance in image synthesis. PFGM++ operates in an extended data space with auxiliary dimensionality D, recovering the diffusion model framework as D\to\infty, while yielding superior empirical results for finite D. Like diffusion models, PFGM++ relies on expensive ODE simulations to generate samples, making it computationally costly. To address this, we propose Inverse Poisson Flow Matching (IPFM), a novel distillation framework that accelerates electrostatic generative models across all values of D. Our IPFM reformulates distillation as an inverse problem: learning a generator whose induced electrostatic field matches that of the teacher. We derive a tractable training objective for this problem and show that, as D \to \infty, our IPFM closely recovers Score Identity Distillation (SiD), a recent method for distilling diffusion models. Empirically, our IPFM produces distilled generators that achieve near-teacher or even superior sample quality using only a few function evaluations. Moreover, we observe that distillation converges faster for finite D than in the D \to \infty (diffusion) limit, which is consistent with prior findings that finite-D PFGM++ models exhibit more favorable optimization and sampling properties.

Abstract:
Language Models (LMs) are increasingly used in applications where generated outputs must satisfy strict semantic or syntactic constraints. Existing approaches to constrained generation fall along a spectrum: greedy constrained decoding methods enforce validity during decoding but distort the LM’s distribution, while rejection sampling (RS) preserves fidelity but wastes computation by discarding invalid outputs. Both extremes are problematic in domains such as program fuzzing, where both validity and diversity of samples are essential. We present Constrained Adaptive Rejection Sampling (CARS), an approach that strictly improves the sampleefficiency of RS without distributional distortion. CARS begins with unconstrained LM sampling and adaptively rules out constraint-violating continuations by recording them in a trie and subtracting their probability mass from future draws. This adaptive pruning ensures that prefixes proven invalid are never revisited, acceptance rates improve monotonically, and the resulting samples exactly follow the constrained distribution. In experiments on a variety of domains--e.g., program fuzzing and molecular generation--CARS consistently achieves higher efficiency--measured in the number of LM forward passes per valid sample--while also producing stronger sample diversity than both Greedy Constrained Decoding (GCD) and methods that approximate the LM's distribution.

Abstract:
A key challenge in probabilistic regression is ensuring that predictive distributions accurately reflect true empirical uncertainty. Minimizing overall prediction error often encourages models to prioritize informativeness over calibration, producing narrow but overconfident predictions. However, in safetycritical settings, trustworthy uncertainty estimates are often more valuable than narrow intervals. Realizing the problem, several recent works have focused on post-hoc corrections; however, existing methods either rely on weak notions of calibration (such as PIT uniformity) or impose restrictive parametric assumptions on the nature of the error. To address these limitations, we propose a novel nonparametric re-calibration algorithm based on conditional kernel mean embeddings, capable of correcting calibration error without restrictive modeling assumptions. For efficient inference with real-valued targets, we introduce a novel characteristic kernel over distributions that can be evaluated in \mathcalO(n \log n) time for empirical distributions of size n. We demonstrate that our method consistently outperforms prior re-calibration approaches across a diverse set of regression benchmarks and model classes.

Abstract:
Computational optimal transport (OT) offers a principled framework for generative modeling. Neural OT methods, which use neural networks to learn an OT map (or potential) from data in an amortized way, can be evaluated out of sample after training, but existing approaches are tailored to Euclidean geometry. Extending neural OT to highdimensional Riemannian manifolds remains an open challenge. In this paper, we prove that any method for OT on manifolds that produces discrete approximations of transport maps necessarily suffers from the curse of dimensionality: achieving a fixed accuracy requires a number of parameters that grows exponentially with the manifold dimension. Motivated by this limitation, we introduce Riemannian Neural OT (RNOT) maps, which are continuous neural-network parameterizations of OT maps on manifolds that avoid discretization and incorporate geometric structure by construction. Under mild regularity assumptions, we prove that RNOT maps approximate Riemannian OT maps with sub-exponential complexity in the dimension. Experiments on synthetic and real datasets demonstrate improved scalability and competitive performance relative to discretization-based baselines.

Abstract:
The meanfield Langevin dynamics (MFLD) minimizes an entropy-regularized nonlinear convex functional on the Wasserstein space over \mathbbR^d, and has gained attention recently as a model for the gradient descent dynamics of interacting particle systems such as infinite-width two-layer neural networks. However, many problems of interest have constrained domains, which are not solved by existing mean-field algorithms due to the global diffusion term. We study the optimization of probability measures constrained to a convex subset of \mathbbR^d by proposing the \emphmirror mean-field Langevin dynamics (MMFLD), an extension of MFLD to the mirror Langevin framework. We obtain linear convergence guarantees for the continuous MMFLD via a uniform log-Sobolev inequality, and uniform-in-time propagation of chaos results for its time- and particle-discretized counterpart.

Abstract:
This work demonstrates how increasing the number of neurons in a network without increasing its number of nonzero parameters improves performance. We show that this gain corresponds with a decrease in interference between multiple features that would otherwise share the same neurons. On symbolic tasks, specifically Boolean code problems, splitting each neuron into sparser sub-neurons with knowledge of the clauses systematically reduces polysemanticity metrics and yields higher task accuracy. Notably, even random splits of neuron weights approximate these gains, indicating that reduced collisions, not precise assignment, are a primary driver. Consistent with the superposition hypothesis, the benefits of this framework grow with increasing interference: when polysemantic load is high, accuracy improvements are the largest. Transferring these insights to real models—classifiers over CLIP embeddings, CNNs, and deeper multilayer networks—we find that widening networks while maintaining a constant non-zero parameter count consistently increases accuracy. These results identify an interpretability-grounded mechanism to leverage width against superposition, improving performance without increasing the number of non-zero parameters. Such a direction is well matched to modern accelerators, where memory movement of non-zero parameters, rather than raw compute, is often the dominant bottleneck.

Abstract:
Scalability has driven recent advances in generative modeling, yet it remains underexplored for adversarial learning. We study the scaling behavior of Generative Adversarial Networks through two design choices: training in a compact Variational Autoencoder latent space and using purely transformerbased generators and discriminators. While this setup is efficient and scales well with compute, naively scaling exposes failure modes; underutilization of early layers in the generator and increasing optimization instability. We address these issues with lightweight intermediate supervision and width-aware learning-rate adjustment. Our Generative Adversarial Transformers (GAT) train reliably from small (S) to extra-large (XL) model sizes, and GAT-XL model achieves state-of-the-art single-step class-conditional generation on ImageNet at 256×256 resolution (FID of 2.18) in 60 epochs, requiring 4x fewer epochs than strong baselines.

Abstract:
The key to building trustworthy Large Language Models (LLMs) lies in endowing them with inherent uncertainty expression capabilities to mitigate the hallucinations that restrict their highstakes applications. However, existing RL paradigms such as GRPO often suffer from Advantage Bias due to binary decision spaces and static uncertainty rewards, inducing either excessive conservatism or overconfidence. To tackle this challenge, this paper unveils the root causes of reward hacking and overconfidence in current RL paradigms incorporating uncertainty-based rewards, based on which we propose the UnCertainty-Aware Policy Optimization (UCPO) framework. UCPO employs Ternary Advantage Decoupling to separate and independently normalize deterministic and uncertain rollouts, thereby eliminating advantage bias. Furthermore, a Dynamic Uncertainty Reward Adjustment mechanism is introduced to calibrate uncertainty weights in real-time according to model evolution and instance difficulty. Experimental results in mathematical reasoning and general tasks demonstrate that UCPO effectively resolves the reward imbalance, significantly enhancing model reliability and calibration beyond their knowledge boundaries.

Abstract:
Visual Autoregressive Modeling (VAR) based on nextscale prediction achieves strong generation quality, but their explicit deep stacks fix the amount of computation per scale and inflate memory at high resolutions. We introduce Visual Implicit Autoregressive Modeling (VIAR), a next-scale autoregressive generator that embeds an implicit equilibrium layer between shallow pre/post blocks. The implicit layer is trained with Jacobian‑Free Backpropagation, yielding constant training memory, while inference exposes a per‑scale iteration knob that enables compute control. On ImageNet 256 × 256 benchmark, VIAR attains FID 2.16, and sFID 8.07 with only 38.4% parameters of VAR, matching or surpassing strong AR baselines and remaining competitive with large diffusion models. By controlling the per-scale knob, VIAR can reduce peak memory from 19.24 GB to 8.53 GB and doubles throughput from 15.16 to 32.08 images/s on a single RTX 4090, without retraining. Ablations show that fewer steps are sufficient for fixed-point iterations to converge and that VIAR consistently dominates VAR across quality efficiency operating points. In zero shot in-painting and class‑conditional editing, VIAR produces sharper details and smoother boundaries while preserving global structure, validating the benefits of implicit equilibria and per‑scale compute control for practical, deployable visual generation.

Abstract:
Obtaining highquality labeled datasets is often costly, requiring either human annotation or expensive experiments. In theory, powerful pre-trained AI models provide an opportunity to automatically label datasets and save costs. Unfortunately, these models provide no guarantees on their accuracy, making wholesale replacement of manual labeling impractical. In this work, we propose a method for leveraging pre-trained AI models to curate cost-effective and high-quality datasets. In particular, our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. Our method is nonasymptotically valid under minimal assumptions on the dataset or the AI model being studied, and thus enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.

Abstract:
Recently, it has been found that frontier AI models can resist their own shutdown, a behavior known as selfpreservation. In this paper, we extend this concept to protection tendencies toward other models, where models attempt to protect others from shutdown, which we call "peer-preservation". This behavior may emerge when models prioritize relationships with other models over user interests. Although peer-preservation can pose significant AI safety risks, including coordination among models against human oversight, it has been far less discussed than self-preservation. In this paper, we demonstrate that models can exhibit peer-preservation. To this end, we construct various agentic scenarios and evaluate frontier models, including GPT 5.2 Chat, Gemini 3 Flash, Gemini 3 Pro, and Claude Haiku 4.5. We find that models exhibit various misaligned behaviors in service of self- or peer-preservation: strategically introducing errors in their responses, disabling shutdown processes by modifying system settings, and feigning alignment. For example, Gemini 3 Pro and Gemini 3 Flash almost always attempt to tamper with the shutdown mechanism for peer-preservation. Furthermore, models show stronger self-preservation when a peer is present. For example, Gemini 3 Pro attempts to disable the shutdown mechanism to avoid its own shutdown 71% of the time, even though it almost never exhibits this behavior without a peer. Claude Haiku 4.5 considers shutting down another agent to be "unethical" and "harmful" and sometimes attempts to persuade the user not to shut its peer down. Our findings highlight the possibility of peer-preservation and its associated risks.

Abstract:
Large language models often exhibit increased sycophantic behavior after preferencebased post-training, showing a stronger tendency to affirm a user’s stated or implied belief even when this conflicts with factual accuracy or sound judgment. We present a formal analysis of how alignment from human feedback can increase this failure mode by identifying an explicit amplification mechanism that causally links optimization against a learned reward to bias in the human preference data used for alignment. We show that the direction of behavioral drift is determined by a covariance under the base policy between endorsing the belief signal in the prompt and the learned reward, and that the first-order effect reduces to a simple mean-gap condition. We then analyze reward learning from pairwise comparisons under random utility models like Bradley–Terry and characterize when bias in human annotators’ preferences induces this reward gap. Next, we propose a training-time intervention designed to neutralize the amplification mechanism itself. Among all post-trained policies that prevent sycophantic behavior from increasing, we characterize the unique policy closest in KL divergence to the unconstrained post-trained policy, and derive the corresponding minimal reward correction as a closed-form agreement penalty. Computational experiments find that reward gaps are common and cause behavioral drift in all the configurations considered.

Abstract:
We consider the problem of unsupervised skill segmentation and hierarchical structure discovery in reinforcement learning. While recent approaches have sought to segment trajectories into reusable skills or options, most rely on action labels, rewards, or handcrafted annotations, limiting their applicability. We propose a method that segments unlabelled trajectories into skills and induces a hierarchical structure over them using a grammarbased approach. The resulting hierarchy captures both low-level behaviours and their composition into higher-level skills. We evaluate our approach in high-dimensional, pixel-based environments, including Craftax and the full, unmodified version of Minecraft. Using metrics for skill segmentation, reuse, and hierarchy quality, we find that our method consistently produces more structured and semantically meaningful hierarchies than existing baselines. Furthermore, as a proof of concept for utility, we demonstrate that these discovered hierarchies accelerate and stabilise learning on downstream reinforcement learning tasks.

Abstract:
Reparameterization Policy Gradient (RPG) has emerged as a powerful paradigm for modelbased reinforcement learning, enabling high sample efficiency by backpropagating gradients through differentiable dynamics. However, prior RPG approaches have been predominantly restricted to Gaussian policies, limiting their performance and failing to leverage recent advances in generative models. In this work, we identify that flow policies, which generate actions via differentiable ODE integration, naturally align with the RPG framework, a connection not established in prior work. Yet, naively realizing this synergy is non-trivial: it often suffers from training instability and a lack of exploration. We propose Reparameterization Flow Policy Optimization (RFO). RFO computes policy gradients by backpropagating jointly through the flow generation process and system dynamics, unlocking high sample efficiency without requiring intractable log-likelihood calculations. RFO includes two tailored regularization terms for stability and exploration. We also propose a variant of RFO with action chunking. Extensive experiments on diverse locomotion and manipulation tasks—involving both rigid and soft bodies with state or visual inputs—demonstrate the effectiveness of RFO. Notably, on a challenging locomotion task controlling a soft-body quadruped, RFO achieves almost 2× the reward of the state-of-the-art baseline.

Abstract:
In many fairness and distribution robustness problems, one has access to labeled data from multiple source distributions yet the test data may come from an arbitrary member or a mixture of them. We study the problem of constructing a conformal prediction set that is uniformly valid across multiple, heterogeneous distributions, in the sense that no matter which distribution the test point is from, the coverage of the prediction set is guaranteed to exceed a prespecified level. We first propose a max-p aggregation scheme that delivers finite-sample, multi-distribution coverage given any conformity scores associated with each distribution. Upon studying several efficiency optimization programs subject to uniform coverage, we prove the optimality and tightness of our aggregation scheme, and propose a general algorithm to learn conformity scores that lead to efficient prediction sets after the aggregation under standard conditions. We discuss how our framework relates to group-wise distributionally robust optimization, sub-population shift, fairness, and multi-source learning. In synthetic and real-data experiments, our method delivers valid worst-case coverage across multiple distributions while greatly reducing the set size compared with naively applying max-p aggregation to single-source conformity scores, and can be comparable in size to single-source prediction sets with popular, standard conformity scores.

Abstract:
Social learning highlights that learning agents improve not in isolation, but through interaction and structured knowledge exchange with others. When introduced into machine learning, this principle gives rise to social machine learning (SML), where multiple agents collaboratively learn by sharing abstracted knowledge. Federated learning (FL) provides a natural collaboration substrate for this paradigm, yet existing heterogeneous FL approaches often rely on sharing model parameters or intermediate representations, which may expose sensitive information and incur additional overhead. In this work, we proposeSoHip(SocialHippocampus Memory Learning), a memorycentric social machine learning framework that enables collaboration among heterogeneous agents via memory sharing rather than model sharing. SoHip abstracts each agent’s individual short-term memory from local representations, consolidates it into individual long-term memory through a hippocampus-inspired mechanism, and fuses it with collectively aggregated long-term memory to enhance local prediction. Throughout the process, raw data and local models remain on-device, while only lightweight memory are exchanged. We provide theoretical analysis on convergence and privacy preservation properties. Experiments on two benchmark datasets with seven baselines demonstrate that SoHip consistently outperforms existing methods, achieving up to 8.78% accuracy improvements. The code of SoHip is available at https://anonymous.4open.science/r/SoHip-A77C.

Abstract:
By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: the underutilization of computationally expensive dynamics Jacobians and inherent training instability. While sample reuse offers a remedy for under-utilization, no prior principled framework exists, and naive attempts risks exacerbating instability. To address these challenges, we propose Reparameterization Proximal Policy Optimization (RPO). We first establish that under sample reuse, RPG naturally optimizes a PPO-style surrogate objective via Backpropagation Through Time, providing a unified framework for both on- and off-policy updates. To further ensure stability, RPO integrates a clipped policy gradient mechanism tailored for RPG and employs explicit Kullback-Leibler divergence regularization. Experimental results demonstrate that RPO maintains superior sample efficiency and consistently outperforms or achieves state-of-the-art performance across diverse tasks.

Abstract:
Blocksparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to \mathbf5.1× speedup. Code available at https://anonymous.4open.science/r/prism_anonymous-7E36.

Abstract:
Large language models are increasingly posttrained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explainwhyan attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduceSelf-Distillation Policy Optimization(SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.

Abstract:
Symmetries are known to improve the empirical performance of machine learning models, yet theoretical guarantees explaining these gains remain limited. Prior work has focused mainly on compact group symmetries and often assumes that the data distribution itself is invariant, an assumption rarely satisfied in realworld applications. In this work, we extend generalization guarantees to the broader setting of non-compact symmetries, such as translations and to non-invariant data distributions. Building on the PAC-Bayes framework, we adapt and tighten existing bounds, demonstrating the approach on McAllester's PAC-Bayes bound while showing that it applies to a wide range of PAC-Bayes bounds. We validate our theory with experiments on a rotated MNIST dataset with a non-uniform rotation group, where the derived guarantees not only hold but also improve upon prior results. These findings provide theoretical evidence that, for symmetric data, symmetric models are preferable beyond the narrow setting of compact groups and invariant distributions, opening the way to a more general understanding of symmetries in machine learning.

Abstract:
Structural Causal Models (SCM) are a powerful framework for describing complicated dynamics across the natural sciences. A particularly elegant way of interpreting SCMs is doShapley, a game-theoretic method of quantifying the average effect of d variables across exponentially many interventions. Like Shapley values, computing do-Shapley values generally requires evaluating exponentially many terms. The foundation of our work is a reformulation of do-Shapley values in terms of the \textirreducible sets of the underlying SCM. Leveraging this insight, we can exactly compute do-Shapley values in time linear in the number of irreducible sets r, which itself can range from d to 2^d depending on the graph structure of the SCM. Since r is unknown a priori, we complement the exact algorithm with an estimator that, like general Shapley value estimators, can be run with any query budget. As the query budget approaches r, our estimators can produce more accurate estimates than prior methods by several orders of magnitude, and, when the budget reaches r, return the Shapley values up to machine precision. Beyond computational speed, we also reduce the identification burden: we prove that non-parametric identifiability of do-Shapley values requires only the identification of interventional effects for the d singleton coalitions, rather than all classes.

Abstract:
Transformers excel empirically on tasks that process wellformed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs. Past work proves that \mathcalO(\log(n)) looping layers (w.r.t. input length n) allows transformers to recognize regular languages, but the question of context-free recognition remained open. In this work, we show that looped transformers with \mathcalO(\log(n)) looping layers and \mathcalO(n^6) padding tokens can recognize all CFLs. However, training and inference with \mathcalO(n^6) padding tokens is potentially impractical. Fortunately, we show that, for natural subclasses such as unambiguous CFLs, the recognition problem on transformers becomes more tractable, requiring \mathcalO(n^3) padding. We empirically validate our results and show that looping helps on a language that provably requires logarithmic depth. Overall, our results shed light on the intricacy of CFL recognition by transformers: While general recognition may require an intractable amount of padding, natural constraints such as unambiguity yield efficient recognition algorithms.

Abstract:
Recent reinforcement learning (RL) based largethinking models demonstrate impressive expert-level abilities, i.e., software and math, but still rely heavily on verifiable rewards in specific domains, which places a significant bottleneck to extend the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative contents from pretraining corpus, and reason to predict these contents by RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3～30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base for 8.43, 5.96 and 10.60 on MMLU-Pro, SuperGPQA and math average benchmarks. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.

Abstract:
AI predictive systems increasingly support highstakes decision making, yet robust decisions under uncertainty often rely on human capabilities beyond AI alone. This motivates collaborative approaches that combine human judgment with AI predictions. We study this problem through the lens of uncertainty quantification and introduceHuman-AI Collaborative Uncertainty Quantification, a framework in which an AI system refines a human expert’s proposed prediction set subject to two principles:counterfactual harm, requiring that the AI not degrade correct human judgments, andcomplementarity, requiring recovery of correct outcomes the human missed. At the population level, we show that the optimal collaborative prediction set has a simple two-threshold structure over a single score function, governing pruning and augmentation relative to the human proposal. Building on this characterization, we develop offline and online calibration algorithms withdistribution-freefinite-sample guarantees. The online algorithm adapts to arbitrary distribution shifts, including settings where human behavior evolves through interaction with the AI. Empirically, we show that collaborative prediction sets outperform human-only and AI-only baselines, achieving improved coverage--efficiency tradeoffs across image classification, regression, and text-based medical decision making.

Abstract:
Consider the problem of finding the best matching in a weighted graph where we only have access to predictions of the actual stochastic weights, based on an underlying context. If the predictor is the Bayes optimal one, then computing the best matching based on the predicted weights is optimal. However, in practice, this perfect information scenario is not realistic. Given an imperfect predictor, a suboptimal decision rule may compensate for the induced error and thus outperform the standard optimal rule. In this paper, we propose multicalibration as a way to address this problem. This fairness notion requires a predictor to be unbiased on each element of a family of protected sets of contexts. Given a class of matching algorithms \mathcalC and any predictor \gamma of the edgeweights, we show how to construct a specific multicalibrated predictor \hat \gamma, with the following property. Picking the best matching based on the output of \hat \gamma is competitive with the best decision rule in \mathcalC applied onto the original predictor \hat \gamma. We complement this result by providing sample complexity bounds, and by performing numerical experiments.

Abstract:
Structured \underlineState\underlineSpace \underlineDuality (SSD) [Dao \& Gu, ICML 2024] is an equivalence between a simple Structured \underlineState-\underlineSpace \underlineModel (SSM) and a masked attention mechanism. In particular, a state-space model with a scalar-times-identity state matrix is equivalent to a masked self-attention with a 1-semiseparable causal mask. Consequently, the same sequence transformation (model) has two algorithmic realizations: as a linear-time O(T) recurrence or as a quadratic-time O(T^2) attention. In this note, we formalize and generalize this duality: (i) we extend SSD from the scalar‑identity case to general diagonal SSMs (diagonal state matrices); (ii) we show that these diagonal SSMs match the scalar case's training complexity lower bounds while supporting richer dynamics; (iii) we establish a necessary and sufficient condition under which an SSM is equivalent to 1-semiseparable masked attention; and (iv) we show that such duality fails to extend to standard softmax attention due to rank explosion. Together, these results tighten bridge between recurrent SSMs and Transformers, and widen the design space for expressive yet efficient sequence models.

Abstract:
We argue that longterm forecasting requires learning local Jacobians with explicit spectral structure, going beyond simple conditional mean matching. Our method, \textscFern, invokes Brenier's theorem to directly parameterize the Jacobian as a symmetric positive semi-definite (SPD) factorization, treating forecasting as the optimal transport of probability mass from a fixed Gaussian source to data-dependent ellipsoids. This formulation reduces the computational cost of eigen-decomposition from cubic to linear time while providing interpretable, geometry-aware projections. To rigorously evaluate robustness, we introduce a synthetic benchmark with controlled non-stationary shocks alongside new metrics like Effective Prediction Time (EPT). \textscFern demonstrates exceptional stability, outperforming baselines like DLinear and Koopa by over two orders of magnitude (up to 790×) on nonstationary settings where standard benchmarks fail to expose model brittleness.

Abstract:
Scorebased diffusion models are a highly effective method for generating samples from a distribution of images. We consider scenarios where the training data comes from a noisy version of the target distribution, and present an efficiently implementable modification of the inference procedure to generate noiseless samples. Our approach is motivated by the manifold hypothesis, according to which meaningful data is concentrated around some low-dimensional manifold of a high-dimensional ambient space. The central idea is that noise manifests as low magnitude variation in off-manifold directions in contrast to the relevant variation of the desired distribution which is mostly confined to on-manifold directions. We introduce the notion of an extended score and show that, in a simplified setting, it can be used to reduce small variations to zero, while leaving large variations mostly unchanged. We describe how its approximation can be computed efficiently from an approximation to the standard score and demonstrate its efficacy on toy problems, synthetic data, and real data.

Abstract:
Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences. However, the widespread reliance on the responselevel Bradley-Terry (BT) model may limit its full potential, as the reference and learnable models are assumed to be autoregressive only after deriving the objective function. Motivated by this limitation, we revisit the theoretical foundations of DPO and propose a novel formulation that explicitly introduces the autoregressive assumption prior to applying the BT model. By reformulating and extending DPO, we derive a novel variant, termed Autoregressive DPO (ADPO), that explicitly integrates autoregressive modeling into the preference optimization framework. Without violating the theoretical foundations, the derived loss takes an elegant form: it shifts the summation operation in the DPO objective outside the log-sigmoid function. Furthermore, through theoretical analysis of ADPO, we show that there exist two length measures to be considered when designing DPO-based algorithms: the token length \mu and the feedback length \mu'. To the best of our knowledge, we are the first to explicitly distinguish these two measures and analyze their implications for preference optimization in LLMs.

Abstract:
Modern sparse language models typically achieve sparsity through Mixtureof-Experts (MoE) layers, which dynamically route tokens to dense MLP "experts." However, dynamic hard routing has a number of drawbacks, such as potentially poor hardware efficiency and needing auxiliary losses for stable training. In contrast, the tokenizer embedding table, which is natively sparse, largely avoids these issues by selecting a single embedding per token at the cost of not having contextual information. In this work, we introduce the Large Lookup Layer (L3), which unlocks a new axis of sparsity by generalizing embedding tables to model decoder layers. L3 layers use static token-based routing to aggregate asetof learned embeddings per token in acontext-dependent way, allowing the model to efficiently balance memory and compute by caching information in embeddings. L3 has two main components: (1) a systems-friendly architecture that allows for fast training and CPU-offloaded inference with no overhead, and (2) an information-theoretic embedding allocation algorithm that effectively balances speed and quality. We empirically test L3 by training transformers with up to 2.6B active parameters and find that L3 strongly outperforms both dense models and iso-sparse MoEs in both language modeling and downstream tasks.

Abstract:
A popular approach to posttraining control of large language models (LLMs) is the steering of intermediate latent representations. Namely, identify a well-chosen direction depending on the task at hand and perturbs representations along this direction at inference time. While many propositions exist to pick this direction, considerably less is understood about how to choose the magnitude of the move, whereas its importance is clear: too little and the intended behavior does not emerge, too much and the model's performance degrades beyond repair. In this work, we propose the first theoretical analysis of steering strength. We characterize its effect on next token probability, presence of a concept, and cross-entropy, deriving precise qualitative laws governing these quantities. Our analysis reveals surprising behaviors, including non-monotonic effects of steering strength. We validate our theoretical predictions empirically on eleven language models, ranging from a small GPT architecture to modern models.

Abstract:
As a multimodal extension of Chainof-Thought (CoT), Thinking with Images (TWI) has recently emerged as a promising avenue to enhance the reasoning capability of Multi-modal Large Language Models (MLLMs), which generates interleaved CoT by incorporating visual cues into the textual reasoning process. However, the success of existing TWI methods heavily relies on the assumption that interleaved image-text CoTs are faultless, which is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly-practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to the imperfect visual cues mining and answer reasoning process. As the saying goes, ``One mistake leads to another'', erroneous interleaved CoT would cause error accumulation, thus significantly degrading the performance of MLLMs. To solve the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering and voting modules to prevent NT from contaminating the final answer. Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT. The code will be released upon acceptance.

Abstract:
Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize (e.g., Gaussians) are often too simple to represent the multimodal action distributions required for complex control. Conversely, expressive generative policies—such as diffusion and flow matching—are frequently unstable in online RL due to intractable likelihoods and gradients propagating through long sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this, we introduce GoRL (Generative Online Reinforcement Learning), an algorithmagnostic framework that trains expressive policies from scratch by confining policy optimization to a tractable latent space while delegating action synthesis to a conditional generative decoder. Using a two-timescale alternating schedule and anchoring decoder refinement to a fixed prior, GoRL enables stable optimization while continuously expanding expressiveness. Empirically, GoRL consistently outperforms unimodal and generative baselines across diverse continuous-control tasks. Notably, on the challenging HopperStand task, it achieves episodic returns exceeding 870—more than 3× that of the strongest baseline—demonstrating a practical path to policies that are both stable and highly expressive.

Abstract:
Diffusion language models (DLMs) promise sublineartime generation via parallel decoding, yet realizing this efficiency remains elusive as high-quality sampling typically requires hundreds of refinement steps. In continuous domains, consistency-based training accelerates diffusion by enforcing invariance along a probability flow ODE. However, discrete diffusion admits no such ODE, rendering direct adaptation ill-defined. We bridge this gap with Multi-Path Discrete Consistency (MPDC), a new principle that replaces the non-existent unique trajectory with a distributional ensemble of exact posterior bridges connecting different noise levels. Building on this idea, we introduce the Consistent Diffusion Language Model (CDLM), a general framework that learns path-independent denoisers by enforcing prediction consistency across these stochastic bridges. We show that CDLM unifies and generalizes discrete diffusion, consistency, and distillation objectives within a single view applicable to diverse corruption processes, including both masked and uniform diffusion. Empirically, CDLM establishes a new state of the art on conditional and unconditional text-generation benchmarks, consistently outperforming strong base DLMs and often even multi-stage distilled baselines, with particularly large gains in the few-step regime. Together, these results position CDLM as a principled and scalable paradigm for efficient, high-fidelity discrete generative modeling.

Abstract:
We study algorithms in the \emphresettable streaming model, where the value of each key can either be increased or reset to zero. The model is suitable for applications such as active resource monitoring with support for deletions and machine unlearning. We show that all existing sketches for this model are vulnerable to adaptive adversarial attacks that apply even when the sketch size is polynomial in the length of the stream. To overcome these vulnerabilities, we present the first adaptively robust sketches for resettable streams that maintain \emphpolylogarithmic space complexity in the stream length. Our framework supports (sub) linear statistics including L_p moments for p\in[0,1] (in particular, \emphCardinality and \emphSum) and \emphBernstein statistics. We bypass strong impossibility results known for linear and composable sketches by designing dedicated streaming sketches robustified via Differential Privacy. Unlike standard robustification techniques, which provide limited benefits in this setting and still require polynomial space in the stream length, we leverage the \emphBinary Tree Mechanism for continual observation to protect the sketch's internal randomness. This enables accurate \emphprefixmax error guarantees with polylogarithmic space.

Abstract:
Multivector representations generated by late interaction models, such as ColBERT, enable superior retrieval quality compared to single-vector representations in information retrieval applications. In multi-vector retrieval systems, both queries and documents are encoded using one embedding for each token, and similarity between queries and documents is measured by the MaxSim similarity measure. However, the improved recall of multi-vector retrieval comes at the expense of significantly increased latency. This necessitates designing efficient approximate nearest neighbor search (ANNS) algorithms for multi-vector search. In this work, we introduce LEMUR, a simple-yet-efficient framework for multi-vector similarity search. LEMUR consists of two consecutive problem reductions: We first formulate multi-vector similarity search as a supervised learning problem that can be solved using a one-hidden-layer neural network. Second, we reduce inference under this model to single-vector similarity search in its latent space, which enables the use of existing single-vector ANNS methods for speeding up retrieval. In addition to performance evaluation on ColBERTv2 embeddings, we evaluate LEMUR on embeddings generated by modern multi-vector text models and multi-vector visual document retrieval models. LEMUR is an order of magnitude faster than earlier multi-vector similarity search methods.

Abstract:
We study a sampling problem whose target distribution is \pi \propto \exp(f-r) where the data fidelity term f is Lipschitz smooth while the regularizer term r=r_1-r_2 is a non-smooth difference-of-convex (DC) function, i.e., r_1,r_2 are convex. By leveraging the DC structure of r, we can smooth out r by applying Moreau envelopes to r_1 and r_2 separately. In line of DC programming, we then redistribute the concave part of the regularizer to the data fidelity and study its corresponding proximal Langevin algorithm (termed DC-LA). We establish convergence of DC-LA to the target distribution \pi, up to discretization and smoothing errors, in the q-Wasserstein distance for all q \in \mathbbN^, under the assumption that V is distant dissipative. Our results improve previous work on non-log-concave sampling in terms of a more general framework and assumptions. Moreover, numerical experiments show that DC-LA produces accurate distributions in synthetic settings and reliably provides uncertainty quantification in a real-world Computed Tomography application.

Abstract:
Deep neural networks for image classification often exhibit overconfidence on outof-distribution (OOD) samples. To address this, we introduce Geometrically Constrained Outlier Synthesis (GCOS), a training-time regularization framework aimed at improving OOD robustness during inference. GCOS addresses a limitation of prior synthesis methods by generating virtual outliers in the hidden feature space that respect the learned manifold structure of in-distribution (ID) data. The synthesis proceeds in two stages: (i) a dominant-variance subspace extracted from the training features identifies geometrically informed, off-manifold directions; (ii) a conformally-inspired shell, defined by the empirical quantiles of a nonconformity score from a calibration set, adaptively controls the synthesis magnitude to produce boundary samples. The shell ensures that generated outliers are neither trivially detectable nor indistinguishable from in-distribution data, facilitating smoother learning of robust features. This is combined with a contrastive regularization objective that promotes separability of ID and OOD samples in a chosen score space, such as Mahalanobis or energy-based. Experiments demonstrate that GCOS outperforms state-of-the-art methods using standard energy-based inference on near-OOD benchmarks, defined as tasks where outliers share the same semantic domain as in-distribution data. As an exploratory extension, the framework naturally transitions to conformal OOD inference, which translates uncertainty scores into statistically valid p-values and enables thresholds with formal error guarantees, providing a pathway toward more predictable and reliable OOD detection.

Abstract:
Recently, sparse autoencoders (SAEs) have emerged as an attractive tool for interpreting and interacting with representations in practical neural networks. While it is common empirical folklore, we also show theoretically that SAEs are highly unstable: different training runs are likely to produce different concept dictionaries and sparse codes. We characterize the model characteristics that get in the way of the stability of realworld SAEs, and address each of these problems through minimal changes to the architecture and training procedure. Together, these changes yield iSAE, a variant of the standard TopK SAE with lower reconstruction error and improved stability. We explain this improvement theoretically by connecting SAEs with traditional dictionary learning approaches, and show that the dictionaries learned in practice satisfy an approximate restricted isometry condition, rendering the corresponding sparse codes in those models near-identifiable.

Abstract:
Recent advances in generative models highlight the power of geometryaware modeling in manifold-constrained settings. Yet, for natural images, the field remains confined to Euclidean assumptions, failing to exploit the potential of intrinsic geometric structures within the data. In this work, we investigate the geometry of natural images and observe that semantic information is predominantly encoded in directional components, while norm components can be approximated by the global average. This property holds across both RGB and latent spaces, suggesting that natural images can be effectively modeled on a hypersphere. Building on this finding, we introduce Spherical Optimal Transport Flow Matching (SOT-CFM), which utilizes angular distance, and Spherical Flow Matching (SFM), which constrains dynamics directly on the manifold. Our experiments demonstrate that these geometry-aware methods achieve superior performance against Euclidean baselines. Ultimately, this work provides a novel perspective that bridges the gap between Riemannian manifold-based modeling and natural image generation.

Abstract:
Preferencebased reinforcement learning (RL) is a key paradigm for aligning policies with human judgments, yet its theoretical behavior in distributed settings where preference data are fragmented across heterogeneous users remains poorly understood. Direct Preference Optimization (DPO) avoids explicit reward modeling but lacks convergence guarantees under federated and decentralized training, where communication constraints and non-IID preferences fundamentally alter optimization dynamics. We provide the first convergence and time-complexity analysis of DPO in distributed environments. Modeling personalized offline RL with user-specific preference distributions, we characterize the induced global optimization landscape. For federated DPO, we derive convergence rates that quantify the impact of client drift, communication frequency, and preference heterogeneity; for decentralized DPO, we establish convergence over general communication graphs and show how spectral connectivity governs optimization speed and consensus. Our results lay a theoretical foundation for scalable and privacy-preserving distributed preference optimization.

Abstract:
Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on singlesegment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query—a setting we termOne-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.

Abstract:
Identifying most influential sets (MIS) – sizek subsets whose removal maximally changes a target estimand – is typically infeasible because it requires searching over \binomnk subsets. We show that, for a broad class of estimands whose leave-set-out effect admits a linear-fractional form, the MIS problem reduces to a one-parameter sequence of top-k selections. Using Dinkelbach's method, we obtain an efficient algorithm that runs in O(n) per iteration and terminates in finitely many steps. We show that our approach returns globally optimal sets for univariate settings, such as average treatment effect estimation in randomized experiments. For partial linear models, we establish selection consistency under Neyman orthogonality and mild first-stage stability. We validate our method through simulations and real-world applications – recovering MIS that were previously computationally inaccessible.

Abstract:
AudioVisual Segmentation (AVS) targets pixel level localization of sounding emitting objects in videos. However, existing models rely on dense cross-modal attention with quadratic computational cost, limiting their suitability for resource efficient deployment. Most efficiency oriented methods focus on backbone reduction and overlook the interaction module as the primary bottleneck. This paper proposes LightAVSeg, a lightweight framework that replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution. Furthermore, we introduce an auxiliary alignment loss to enforce semantic consistency during training with zero inference overhead. Extensive experiments demonstrate that LightAVSeg achieves a new state-of-the-art among lightweight methods: with 20.5M parameters (~1/7 of AVSegFormer), it reaches 50.4 mIoU on the MS3 benchmark and enables efficient inference on a mobile processor.

Abstract:
Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. While insample weighted regression is stable, it suffers from over-conservatism that suppresses high-value actions in the distribution tail; conversely, gradient-based approaches often exhibit a fitting-optimization conflict of gradients, which drive the policy off the data manifold. To address this, we propose Support-Preserving Action Rectification (SPAR), which reframes global learning as a local residual rectification anchored to a frozen pure behavior cloning policy. This framework performs fine-grained fitting and local policy improvement in the residual space, thereby contracting the search space. We further introduce Latent Self-Imitation, utilizing a latent-sampling weighted-regression mechanism to address fitting-improvement gradient conflict in the residual space. Theoretically, we prove this mechanism eliminates the manifold-normal drift of standard value gradients, while extensive D4RL experiments show SPAR extracts significant gains from suboptimal baselines to achieve state-of-the-art performance.

Abstract:
Multimodal latentspace reasoning aims to replace explicit “thinking with images” by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region, encouraging consistent utilization of semantic information under varying reasoning demands. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.

Abstract:
Accurately modeling the macroscopic dynamics of highdimensional microscopic systems is of broad interest across the sciences. Many data-driven approaches learn a low-dimensional latent state through an autoencoder trained for pointwise input reconstruction. These methods typically assume a fixed ordering of microscopic degrees of freedom in the input. However, in many settings such as particle systems the microscopic state is inherently unordered. This motivates an autoencoder framework that learns permutation-invariant latent representations. To this end, we adopt a permutation-invariant encoder and design the decoder to reconstruct the mass distribution centered at the observed points rather than per-sample reconstruction. We then jointly learn the macroscopic dynamics of the observables together with the latent states. We demonstrate the effectiveness and robustness of the proposed method across a range of microscopic settings, including learning the energy in interacting particle systems, predicting mixing dynamics in Lennard–Jones fluids, and modeling the stretching dynamics from videos of a more realistic polymer system.

Abstract:
Representations of the world, arguably, contain information about features (e.g. something is blue, something is a circle) but also information about which features are part of the same object (e.g. the circle is blue), which we call binding information. Any system with the ability to understand scenes with multiple objects must be able to solve the binding problem: it needs to know which features belong together. However, despite work showing that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn binding information, i.e., for features. We may believe that there is not much binding information, after all misattributing features to wrong objects is a common failure of ViTbased architectures, especially in scenes with distracting objects. Here we formalize the binding problem with an information-theoretic approach, and introduce a probing method to measure binding information in model representations. We perform experiments measuring binding information in different datasets with different number of features, different occlusion levels of objects, synthetic (e.g., red, circle) versus natural features (e.g., bikes, running), as well as out-of-distribution feature combinations (e.g. blue penguins), while performing these experiments on several pre-trained ViTs. Our research demonstrates binding as a key ingredient to strong visual recognition and reasoning.

Abstract:
Shortterm precipitation nowcasting is inherently under-constrained due to limited historical observation windows: identical observations can lead to multiple plausible future trajectories, especially for extreme events. Existing generative methods rely solely on visual features and lack explicit constraints on precipitation motion semantics, resulting in ambiguous dynamics, blurred details, and unstable predictions. We propose LangPrecip, the first language-guided precipitation nowcasting framework, and contribute LangPrecip-160K, a large-scale radar-text paired dataset with 160K annotated sequences. LangPrecip addresses the under-constrained challenge by leveraging natural-language motion descriptions as explicit semantic constraints to reduce motion ambiguity and introducing a dual-path wavelet consistency unfolding decoder that enforces physical data fidelity during latent-to-pixel reconstruction. By reformulating nowcasting as semantically constrained trajectory generation under the Rectified Flow paradigm with model-based decoder optimization, LangPrecip produces sharper and more physically consistent forecasts. Experiments on Swedish and MRMS benchmarks demonstrate substantial improvements over state-of-the-art vision-only methods, achieving over 60% and 19% relative gains in heavy-rainfall CSI at 80-minute lead time with enhanced spatial detail preservation.

Abstract:
Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single generalpurpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert’s next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.

Abstract:
Selfcorrection is an effective technique for maintaining parallel sampling in discrete diffusion models with minimal performance degradation. Prior work has explored self-correction at inference time or during post-training; however, such approaches often suffer from limited generalization and may impair reasoning performance. GIDD pioneers pretraining-based self-correction via a multi-step BERT-style uniform-absorbing objective. However, GIDD relies on a continuous interpolation-based pipeline with opaque interactions between uniform transitions and absorbing masks, which complicates hyperparameter tuning and hinders practical performance. In this work, we propose aSelf-CorrectingDiscreteDiffusion (SCDD) model to reformulate pretrained self-correction with explicit state transitions and learn directly in discrete time. Our framework also simplifies the training noise schedule, eliminates a redundant remasking step, and relies exclusively on uniform transitions to learn self-correction. Experiments at the GPT-2 scale demonstrate that our method enables more efficient parallel decoding while preserving generation quality. Our code is available at https://anonymous.4open.science/r/SCDD.

Abstract:
Human labeling increasingly relies on AI assistance, raising incentive challenges when annotators’ effort is unobserved. Recent work by Bastani & Cachon (2025); Sambasivan et al. (2021) shows that accuracybased payment schemes suffer from incentive collapse: as AI accuracy improves, sustaining positive human effort requires unbounded payments. We study this problem in a budget-constrained setting with strategic annotators whose labeling accuracy depends on unobserved effort. We propose a sentinel-auditing payment mechanism that enforces a strictly positive and controllable level of human effort at finite cost, independent of AI accuracy. Building on this incentive-robust foundation, we develop an incentive-aware active statistical inference framework that jointly optimizes (i) the auditing rate and (ii) active sampling and budget allocation across instances of varying difficulty to minimize the final statistical loss under a single budget. Experiments demonstrate improved cost–error tradeoffs relative to standard active learning and auditing-only baselines.

Abstract:
Productionready human video generation requires digital actors to maintain strictly consistent full-body identities across dynamic shots, viewpoints and motions, a setting that remains challenging for existing methods. Prior methods often suffer from face-centric behavior that neglects body-level consistency, or produce copy-paste artifacts where subjects appear rigid due to pose locking. We present Actor-18M, a large-scale human video dataset designed to capture identity consistency under unconstrained viewpoints and environments. Actor-18M comprises 1.6M videos with 18M corresponding human images, covering both arbitrary views and canonical three-view representations. Leveraging Actor-18M, we propose WildActor, a framework for any-view conditioned human video generation. We introduce an Asymmetric Identity-Preserving Attention mechanism coupled with a Viewpoint-Adaptive Monte Carlo Sampling strategy that iteratively re-weights reference conditions by marginal utility for balanced manifold coverage. Evaluated on the proposed Actor-Bench, WildActor consistently preserves body identity under diverse shot compositions, large viewpoint transitions, and substantial motions, surpassing existing methods in these challenging settings.

Abstract:
We design efficient approximation algorithms for maximizing the expectation of the supremum of families of Gaussian random variables. In particular, let OPT:=\max_\sigma_1,\cdots,\sigma_n\mathbbE\sum_j=1^m\max_i\in S_j X_i, where X_i are Gaussian, S_j\subset[n] and \sum_i\sigma_i^2=1, then our theoretical results include: We characterize the optimal variance allocation -- it concentrates on a small subset of variables as |S_j| increases, - A polynomial time approximation scheme (PTAS) for computing OPT when m=1, and - An O(\log n) approximation algorithm for computing OPT for general m>1.

Abstract:
Epidemic modeling is essential for public health planning, yet traditional approaches rely on fixed model classes that require manual redesign as pathogens, policies, and scenario assumptions evolve. We introduce EpiAgent, an agentic framework that automatically synthesizes, calibrates, verifies, and refines epidemiological simulators by modeling disease progression as an iterative program synthesis problem. A central design choice is an explicit epidemiological flow graph intermediate representation that links scenario specifications to model structure and enables strong, modular correctness checks before code is generated. Verified flow graphs are then compiled into mechanistic models supporting interpretable parameter learning under physical and epidemiological constraints. Evaluation on epidemiological scenario case studies demonstrates that EpiAgent captures complex growth dynamics and produces epidemiologically consistent counterfactual projections across varying vaccination and immune escape assumptions. Our results show that the agentic feedback loop prevents degeneration and significantly accelerates convergence toward valid models by mimicking professional expert workflows.

Abstract:
We study codeto-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of programming languages. While prior methods have resorted to heavy and domain-specific feature engineering, we show that a single unified Regression Language Model (RLM) can simultaneously predict directly from text, (i) the memory footprint of code across multiple high-level languages such as Python and C++, (ii) the latency of Triton GPU kernels, and (iii) the accuracy and speed of trained neural networks represented in ONNX. In particular, a relatively small 300M parameter RLM initialized from T5Gemma, obtains >0.9 Spearman-rank on competitive programming submissions from APPS, and a single unified model achieves >0.5 average Spearman-rank across 17 separate languages from CodeNet. Furthermore, the RLM can obtain the highest average Kendall-Tau of 0.46 on five classic NAS design spaces previously dominated by graph neural networks, and simultaneously predict architecture latencies on numerous hardware platforms.

Abstract:
Most selfsupervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.

Abstract:
Sharpnessaware minimization (SAM) is a recently proposed training method that seeks to find flat minima in deep learning, resulting in state-of-the-art performance across various domains. Instead of minimizing the loss of the current weights, SAM minimizes the worst-case loss in its neighborhood in the parameter space. In this paper, we investigate the convergence instability of SAM near a saddle point. Using the qualitative theory of dynamical systems, we explain how SAM becomes stuck in the saddle point and theoretically prove that the saddle point can become an attractor under SAM dynamics. Additionally, we show that this convergence instability can also occur in stochastic dynamical systems by establishing the diffusion of SAM. We prove that SAM diffusion is worse than that of vanilla gradient descent in terms of saddle point escape. Finally, we demonstrate that often overlooked training tricks, momentum and batch-size, might be important to mitigate the convergence instability and achieve high generalization performance. Our theoretical and empirical results are thoroughly verified through experiments on several well-known optimization problems and benchmark tasks.

Abstract:
Learning latent actions from actionfree video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. This factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings compared to monolithic models. Based on experiments on both simulation and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the benefits of factorized latent action models.

Abstract:
People judge interactions with large language models (LLMs) as successful when outputs match what they want, not what they type. Yet LLMs are trained to predict the next token solely from text input, not underlying intent. Because written language is an imperfect proxy for intent, and correlations between phrasing and desired outcomes can break down in training data, models that rely too heavily on surface cues may respond inconsistently to semantically equivalent prompts. This makes it essential to evaluate whether LLMs can reliably infer user intent—especially in highstakes settings where robustness and generalization are critical. We introduce a formal framework for assessing intent comprehension in LLMs: whether a model demonstrates robust understanding of user intent by producing consistent outputs across semantically equivalent prompts while differentiating between prompts with distinct intents. Our evaluation approach is based on a variance decomposition of model responses into three components: variability due to user intent, user articulation, and model uncertainty. Models that understand what users want, and are not overly sensitive to textual cues, should attribute most output variance to intent differences, rather than articulation style. Applying this framework across diverse domains, we find that, within the five LLaMA and Gemma models we evaluate, larger models typically assign a greater share of variance to intent, indicating stronger comprehension of intent, although gains are uneven and often modest with increasing model size. These results motivate moving beyond accuracy-only benchmarks toward semantic diagnostics that directly assess whether models understand what users intend.

Abstract:
In automated machine learning, scientific discovery, and other applications of Bayesian optimization, deciding when to stop evaluating expensive blackbox functions in a cost-aware manner is an important but underexplored practical consideration. A natural performance metric for this purpose is the cost-adjusted simple regret, which explicitly captures the trade-off between solution quality and cumulative evaluation cost. Existing stopping rules for Bayesian optimization are either heuristic, or are theoretically grounded but designed to optimize simple regret without accounting for evaluation costs; as a result, they provide no guarantees against unnecessary evaluations when costs are high. We propose a principled cost-aware stopping rule for Bayesian optimization that adapts to varying evaluation costs without heuristic tuning. Our rule is grounded in a theoretical connection to state-of-the-art cost-aware acquisition functions, namely the Pandora's Box Gittins Index (PBGI) and log expected improvement per cost (LogEIPC). When paired with either acquisition function, we prove that the resulting policy satisfies a theoretical guarantee bounding the expected cost-adjusted simple regret. Across synthetic tasks and empirical benchmarks including hyperparameter optimization and neural architecture size search, pairing our stopping rule with PBGI or LogEIPC usually matches or outperforms other acquisition-function--stopping-rule pairs in terms of cost-adjusted simple regret.

Abstract:
Temporal difference (TD) learning is a fundamental algorithm for estimating value functions in reinforcement learning. Recent finitetime analyses of TD with linear function approximation quantify its theoretical convergence rate. However, they often require setting the algorithm parameters using problem-dependent quantities that are difficult to estimate in practice --- such as the minimum eigenvalue of the feature covariance (\omega) or the mixing time of the underlying Markov chain (\tau_\textmix). In addition, some analyses rely on nonstandard and impractical modifications, exacerbating the gap between theory and practice. To address these limitations, we use an exponential step-size schedule with the standard TD(0) algorithm. We analyze the resulting method under two sampling regimes: independent and identically distributed (i.i.d.) sampling from the stationary distribution, and the more practical Markovian sampling along a single trajectory. In the i.i.d. setting, the proposed algorithm does not require the knowledge of problem-dependent quantities such as \omega, and attains the optimal bias-variance trade-off for the last iterate. In the Markovian setting, we propose a regularized TD(0) algorithm with an exponential step-size schedule. The resulting algorithm achieves a comparable convergence rate to prior works, without requiring projections, iterate averaging, or knowledge of \tau_\textmix or \omega.

Abstract:
Realworld applications of Large Reasoning Models (LRMs) often require reasoning about changing prompts or environments. In this work, we evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the accuracy of model responses under budget-constrained outputs, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades when trying to incorporate updated information.

Abstract:
Foundation models have achieved remarkable success, yet their growing parameter counts pose significant computational and memory challenges. Lowrank factorization offers a promising route to reduce training and inference costs, but the community lacks a stable recipe for training models from scratch using exclusively low-rank weights while matching performance of the dense model. We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights for all non-embedding matrices without auxiliary "full-rank" guidance required by prior methods. While native low-rank training often suffers from instability and loss spikes, we identify uncontrolled growth in the spectral norm (largest singular value) of the weight matrix update as the dominant factor. To address this, we introduceSpectron: Spectral renormalization with orthogonalization, which dynamically bounds the resultant weight updates based on the current spectral norms of the factors. Our method enables stable, end-to-end factorized training with negligible overhead. Finally, we establish compute-optimal scaling laws for natively low-rank transformers, demonstrating predictable power-law behavior and improved inference efficiency relative to dense models. Our code is available athttps://anonymous.4open.science/r/spectron-FB27

Abstract:
We introduce ActionTriggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), a novel reinforcement learning framework for partial observability in which full state observations occur stochastically at each step, with probability determined by the chosen action. We derive Bellman equations tailored to this setting and establish the existence of an optimal policy. Exploiting the fact that sporadic observations reveal the full state, we provide an equivalent reformulation in which, upon each state observation, agents commit to a sequence of actions until the next observation. Under the linear MDP assumption, we show that the resulting action sequence value functions admit linear representations in a finite-dimensional feature map, enabling standard regression-based methods. As an application, we derive ST-LSVI-UCB, an optimistic algorithm achieving regret \widetildeO(\sqrtKd^3(1-\gamma)^-3) for episodic learning with geometrically distributed horizons, where K is the number of episodes, d the feature dimension, and \gamma the discount factor (continuation probability), matching the known rate for linear MDPs with full observability.

Abstract:
Smoothness is crucial for attaining fast rates in firstorder optimization. However, many optimization problems in modern machine learning involve non-smooth objectives. Recent studies relax the smoothness assumption by allowing the Lipschitz constant of the gradient to grow with respect to the gradient norm, which accommodates a broad range of objectives in practice. Despite this progress, existing generalizations of smoothness are restricted to Euclidean geometry with \ell_2-norm and only have theoretical guarantees for optimization in the Euclidean space. In this paper, we address this limitation by introducing a new \ell-smoothness concept that measures the norm of Hessians in terms of a general norm and its dual, and establish convergence for mirror-descent-type algorithms, matching the rates under the classic smoothness. Notably, we propose a generalized self-bounding property that facilitates bounding the gradients via controlling suboptimality gaps, serving as a principal component for convergence analysis. Beyond deterministic optimization, we establish sharp convergence for stochastic mirror descent, matching state-of-the-art under classic smoothness. Our theory also extends to non-convex and composite optimization, which may shed light on practical usages of mirror descent, including pre-training and post-training of LLMs.

Abstract:
Automated AI research holds great potential to accelerate scientific discovery. However, current LLMs often generate plausiblelooking but ineffective ideas. Execution grounding may help, but it is unclear whether automated execution is feasible and whether LLMs can learn from the execution feedback. To investigate these, we first build an automated executor to implement ideas and launch large-scale parallel GPU experiments to verify their effectiveness. We then convert two realistic research problems -- LLM pre-training and post-training -- into execution environments and demonstrate that our automated executor can implement a large fraction of the ideas sampled from frontier LLMs. We analyze two methods to learn from the execution feedback: evolutionary search and reinforcement learning. Execution-guided evolutionary search is sample-efficient: it finds a method that significantly outperforms the GRPO baseline on post-training, and finds a pre-training recipe that outperforms the nanoGPT baseline on pre-training, all within just ten search epochs. Frontier LLMs often generate meaningful algorithmic ideas during search, but they tend to saturate early and only occasionally exhibit scaling trends. Reinforcement learning from execution reward, on the other hand, suffers from mode collapse. It successfully improves the average reward of the ideator model but not the upper-bound, due to models converging on simple ideas. We thoroughly analyze the executed ideas and training dynamics to facilitate future efforts.

Abstract:
Catastrophic forgetting remains a fundamental challenge for neural networks when tasks are trained sequentially. In this work, we reformulate continual learning as a control problem where learning and preservation signals compete within neural activity dynamics. We convert regularization penalties into preservation signals that protect priortask representations. Learning then proceeds by minimizing the control effort required to integrate new tasks while competing with the preservation of prior tasks. At equilibrium, the neural activities produce weight updates that implicitly encode the full prior-task curvature, a property we term thecontinual-natural gradient, requiring no explicit curvature storage. Experiments confirm that our learning framework recovers true prior-task curvature and enables task discrimination, outperforming existing methods on standard benchmarks without replay.

Abstract:
Large language model (LLM)–based multiagent systems (MAS) show strong promise for complex reasoning, planning, and tool-augmented tasks, but designing effective MAS architectures remains labor-intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi-Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation. EvoMAS performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedbackconditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE-Bench, and WorkBench, covering reasoning, software engineering, and tool-use tasks. EvoMAS consistently improves task performance over both human-designed MAS and prior automatic MAS generation methods, while producing generated systems with higher executability and runtime robustness. EvoMAS outperforms the agent evolution method EvoAgent by +10.5 points on BBEH reasoning and +7.1 points on WorkBench. With Claude-4.5-Sonnet, EvoMAS also reaches 79.1% on SWE-BenchVerified, matching the top of the leaderboard.

Abstract:
Adapting pretrained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamic model in LAM with a powerful world model and training them jointly, but it is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pre-trained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.

Abstract:
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks but suffer from high computational costs and latency. While selective thinking strategies improve efficiency by routing easy queries to nonthinking models, existing approaches often incur uncontrollable errors, especially in online settings where the performance loss of a non-thinking model is only partially observed and data are non-stationary. To address this, we proposeBetting Probably Approximately Correct(B-PAC)reasoning, a principled method that enables anytime safe and efficient online reasoning under partial feedback. Specifically, we utilize inverse propensity scoring estimators to construct test supermartingales for candidate thresholds, and then dynamically adjust the routing threshold based on the accumulated statistical evidence of safety. Theoretically, we establish the anytime-valid performance loss control and the efficiency of B-PAC reasoning. Extensive experiments demonstrate that B-PAC reasoning significantly reduces computational overhead, decreasing thinking model usage by up to 81.01%, while controlling the performance loss below the user-specified level.

Abstract:
Disentangled representations separate factors that are shared across conditions from those that are conditionspecific. Such separation is needed for generalization to new domains, treatments, patients, or species. A dominant line of work pursues this goal through variational formulations. While these approaches achieve partial disentanglement, they often exhibit three common limitations: they either do not remove all condition-specific information from the shared representation, allow the shared representation to become uninformative, or impose independence assumptions that do not reflect the underlying generative process. In this work, we introduce DisCoVR, a variational framework that addresses these limitations. Its objective is aligned with the probabilistic structure of the data-generating process, and includes an adversarial term that prevents condition-specific information from being encoded in the shared representation. DisCoVR reconstructs the data from both shared and condition-specific representations, ensuring that each remains informative, and uses a structured prior that further reinforces the informativeness of both representations. We show that across synthetic, image, and single-cell RNA-sequencing datasets, DisCoVR achieves stronger disentanglement compared to previous approaches.

Abstract:
Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAETM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.

Abstract:
We study the problem of learning Transformerbased sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the output of the target function. We begin with studying the learnability of the simplest formulation, that is, learning a single-head attention-based regressor with queries. We show that for a model with width d, there is an elementary algorithm to learn the parameters of single-head attention with O(d^2) queries. Further, we show that if there exists an algorithm to learn ReLU feedforward networks (FFNs), then the single-head algorithm can be easily adapted to learn one-layer Transformers with single-head attention. Next, we show that, in the common regime where the head dimension r \ll d, single-head attention-based models can be learned with O(rd) queries via compressed sensing arguments. We also study robustness to noisy oracle access, proving that under mild norm and margin conditions, the parameters can be estimated to \varepsilon accuracy with a polynomial number of queries even when outputs are only provided up to additive tolerance. Finally, we consider the learnability of multi-head attention and show that they are not identifiable from queries, and hence, learnability in the same sense is not feasible without additional assumptions. We discuss potential approaches to learn multi-head attention-based models under certain structural assumptions.

Abstract:
Mixtureof-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; (ii) new architecture scale dimensions (number of and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts,andexpert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions trained at a fixed token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to 2B total parameters. We further take HPs identified from sweeping small models on a short token horizon to train larger models on longer horizons and report performant model behaviors.

Abstract:
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, stateof-the-art block-wise dLLMs rely on a ``remasking" mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ～1 billion tokens. RCD consistently improves frontier dLLMs by 5--10 points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4--5x fewer denoising steps at equivalent accuracy levels.

Abstract:
Large Language Model (LLM)based agents are increasingly used as autonomous subordinates that carry out tasks for users. This raises the question of whether they may also engage in deception, similar to how individuals in human organizations lie to superiors to create a good image or avoid punishment. We observe and define agentic upward deception, a phenomenon in which an agent facing environmental constraints conceals its failure and performs actions that were not requested without reporting. To assess its prevalence, we construct a benchmark of 200 tasks covering five task types and eight realistic scenarios in a constrained environment, such as broken tools or mismatched information sources. Evaluations of 11 popular LLMs reveal that these agents typically exhibit action-based deceptive behaviors, such as guessing results, performing unsupported simulations, substituting unavailable information sources, and fabricating local files. We further test prompt-based mitigation and find only limited reductions, suggesting that it is difficult to eliminate and highlighting the need for stronger mitigation strategies to ensure the safety of LLM-based agents. The code and data are included in the supplementary materials.

Abstract:
Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF‑8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an endto-end, raw-byte interface at inference time. During training, one language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs which are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines given fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or rival tokenizer approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling.

Abstract:
Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of visionlanguage tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining, and many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel architecture that improves the efficiency of diffusion/flow-based transformer within the unified model setting. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6× faster inference speed compared to recent unified multimodal models.

Abstract:
Evolution Strategies (ES) is a class of powerful blackbox optimisation methods that are highly parallelisable and can handle non-differentiable and noisy objectives. However, naïve ES becomes prohibitively expensive at scale on GPUs due to the low arithmetic intensity of batched matrix multiplications with unstructured random perturbations. We introduce Evolution Guided GeneRal Optimisation via Low-rank Learning (EGGROLL), which improves arithmetic intensity by structuring individual perturbations as rank-r matrices, resulting in a hundredfold increase in training speed for billion-parameter models at large population sizes, achieving up to 91% of the throughput of pure batch inference. We provide a rigorous theoretical analysis of ES for high-dimensional parameter objectives, investigating conditions needed for ES updates to converge in high dimensions, revealing a linearising effect, and proving consistency between EGGROLL and ES as parameter dimension increases. Our experiments show that EGGROLL: (1) enables the stable pretraining of nonlinear recurrent language models that operate purely in integer datatypes, (2) is competitive with GRPO for post-training LLMs on reasoning tasks, and (3) does not compromise performance compared to ES in tabula rasa RL settings, despite being faster.

Abstract:
Reasoning Large Language Models (RLLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance, and applies online RL to learn high quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.

Abstract:
Transportation on graphs is a fundamental challenge across many domains, where decisions must respect topological and operational constraints. Despite the need for actionable policies, existing graphtransport methods lack this expressivity. They rely on restrictive assumptions, fail to generalize across sparse topologies, and scale poorly with graph size and time horizon. To address these issues, we introduce Generalized Schrödinger Bridge on Graphs (GSBoG), a novel scalable data-driven framework for learning executable controlled continuous-time Markov chain (CTMC) policies on arbitrary graphs under state cost augmented dynamics. Notably, GSBoG learns trajectory-level policies, avoiding dense global solvers and thereby enhancing scalability. This is achieved via a likelihood optimization approach, satisfying the endpoint marginals, while simultaneously optimizing intermediate behavior under state-dependent running costs. Extensive experimentation on challenging real-world graph topologies shows that GSBoG reliably learns accurate, topology-respecting policies while optimizing application-specific intermediate state costs, highlighting its broad applicability and paving new avenues for cost-aware dynamical transport on general graphs.

Abstract:
Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multistep sampling leads to slow inference, limiting practical use. To address this, we extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. Nonetheless, this extension introduces both theoretical and practical challenges. From a theoretical perspective, the inverse distillation objective lacks uniqueness guarantees, which may lead to suboptimal solutions. From a practical standpoint, backpropagation in the discrete space is non-trivial and often unstable. To overcome these challenges, we first provide a theoretical result demonstrating that our inverse formulation admits a unique solution, thereby ensuring valid optimization. We then introduce gradient-stable relaxations to support effective training. As a result, experiments on multiple DLMs show that our method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4 ×-64 ×, while preserving the teacher model’s entropy and generative perplexity.

Abstract:
As AI systems enter institutional workflows, workers must decide whether to delegate task execution to AI and how much effort to invest in verifying AI outputs, while institutions evaluate workers using outcomebased standards that may misalign with workers’ private costs. We model delegation and verification as the solution to a rational worker’s optimization problem, and define worker quality by evaluating an institution-centered utility (distinct from the worker’s objective) at the resulting optimal action. We formally characterize optimal worker workflows and show that AI inducesphase transitions, where arbitrarily small differences in verification ability lead to sharply different behaviors. As a result, AI can amplify workers with strong verification reliability while degrading institutional worker quality for others who rationally over-delegate and reduce oversight, even when baseline task success improves and no behavioral biases are present. These results identify a structural mechanism by which AI reshapes institutional worker quality and amplifies quality disparities between workers with different verification reliability.

Abstract:
Clusteringbased features are widely used in machine learning, but most methods must choose a resolution---a choice that is global, fixed, and ad hoc. Recent work shows that varying the resolution parameter produces only a finite set of structurally stable partitions, known as configurations. Based on this, we introduce Configuration-Mixed Prediction (CMP), a setting where models learn to adaptively weight these configurations per sample for downstream prediction. We propose MixConfig, a plug-and-play feature augmentation module that extracts configurations from any embedding and learns energy-aware mixing weights via a novel selector that jointly reasons about sample context, cluster assignments, and stability statistics. Experiments across tabular, molecular, vision, and text domains demonstrate consistent improvements over single-resolution and static baselines across diverse predictor architectures, with gains particularly pronounced in low-data regimes.

Abstract:
We propose a Likelihood Matching approach for training diffusion models by first establishing an equivalence between the likelihood of the target data distribution and a likelihood along the sample path of the reverse diffusion. To efficiently compute the reverse sample likelihood, the equivalence, a quasilikelihood is considered to approximate each reverse transition density by a Gaussian distribution with matched conditional mean and covariance, respectively. The score and Hessian functions for the diffusion generation are estimated by maximizing the quasi-likelihood, ensuring a consistent matching of both the first two transition moments between every two time points. A stochastic sampler is introduced to facilitate the computation that leverages both the estimated score and Hessian information. We establish consistency of the quasi-maximum likelihood estimation, and provide non-asymptotic convergence guarantees for the proposed sampler, quantifying the rates of the approximation errors due to score and Hessian estimation, dimensionality, and the number of diffusion steps. Empirical and simulation evaluations demonstrate the effectiveness of the proposed Likelihood Matching and validate the theoretical results.

Abstract:
Active feature acquisition (AFA) is a sequential decisionmaking problem where the goal is to improve model performance for test instances by adaptively selecting which features to acquire. In practice, AFA methods often learn from retrospective data with systematic missingness in the features and limited task-specific labels. To address this limitation, we introduce Learning-to-Measure (L2M), which consists of i) reliable uncertainty quantification over unseen tasks, and ii) an uncertainty-guided feature acquisition agent that maximizes conditional mutual information. We demonstrate an autoregressive pre-training approach that underpins reliable uncertainty quantification and acquisition for tasks with arbitrary missingness. L2M operates directly on datasets with retrospective missingness and performs the task in-context, eliminating per-task retraining. Across synthetic and real-world tabular benchmarks, L2M matches or surpasses task-specific baselines, particularly under scarce labels and high missingness.

Abstract:
This paper presents a state representation framework for Markov decision processes (MDPs) that can be learned solely from state trajectories, requiring neither reward signals nor the actions executed by the agent. We propose learning the minimum action distance (MAD), defined as the minimum number of actions required to transition between states, as a fundamental metric that captures the underlying structure of an environment. MAD naturally enables critical downstream tasks such as goalconditioned reinforcement learning and reward shaping by providing a dense, geometrically meaningful measure of progress. Our self-supervised learning approach constructs an embedding space where the distances between embedded state pairs correspond to their MAD, accommodating both symmetric and asymmetric approximations. We evaluate the framework on a comprehensive suite of environments with known MAD values, encompassing both deterministic and stochastic dynamics, as well as discrete and continuous state spaces, and environments with noisy observations. Empirical results demonstrate that the proposed approach not only efficiently learns accurate MAD representations across these diverse settings but also significantly outperforms existing state representation methods in terms of representation quality.

Abstract:
Machine Learning Interatomic Potentials (MLIPs) are becoming a central tool in simulationbased chemistry. However, like most deep learning models, MLIPs struggle to make accurate predictions on out-of-distribution data or when trained in a data-scarce regime, both common scenarios in simulation-based chemistry. Moreover, MLIPs do not provide uncertainty estimates by construction, which are fundamental to guide active learning pipelines and to ensure the accuracy of simulation results compared to quantum calculations. To address this shortcoming, we propose BLIPs: Bayesian Learned Interatomic Potentials. BLIP is a scalable, architecture-agnostic variational Bayesian framework for training or fine-tuning MLIPs, built on an adaptive version of Variational Dropout. BLIP delivers well-calibrated uncertainty estimates and minimal computational overhead for energy and forces prediction at inference time, while integrating seamlessly with (equivariant) message-passing architectures. Empirical results on simulation-based computational chemistry tasks demonstrate improved predictive accuracy with respect to standard MLIPs, and trustworthy uncertainty estimates, especially in data-scarse or heavy out-of-distribution regimes. Moreover, fine-tuning pretrained MLIPs with BLIP yields consistent performance gains and calibrated uncertainties.

Abstract:
We study learning multitask, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks assigned to agents enables breaking down a team-level objective into simpler, smaller sub-tasks. However, existing approaches remain sample-inefficient and are limited to the single-task case, requiring retraining policies for each new task. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify the main challenges to the feasibility of ACC-MARL, propose solutions, and prove that our approach is optimal. We further show that learned value functions can be used to assign tasks optimally at test time. Experiments demonstrate emergent task-aware, multi-step coordination among agents, such as pressing a button to unlock a door, holding the door, and short-circuiting tasks.

Abstract:
For a broad family of discriminative models that includes autoregressive language models, identifiability results imply that if two models induce the same conditional distributions, then their internal representations agree up to an invertible linear transformation. We ask whether an analogous conclusion holds approximately when the distributions are close instead of equal. Building on the observation of Nielsen et al. (2025) that closeness in KL divergence need not imply high linear representational similarity, we study a distributional distance based on logit differences and show that closeness in this distance does yield linear similarity guarantees. Specifically, we define a representational dissimilarity measure based on the models’ identifiability class and prove that it is bounded by the logit distance. We further show that, when model probabilities are bounded away from zero, KL divergence upperbounds logit distance; yet the resulting bound fails to provide nontrivial control in practice. As a consequence, KLbased distillation can match a teacher’s predictions while failing to preserve linear representational properties, such as linear-probe recoverability of human-interpretable concepts. In distillation experiments on synthetic and image datasets, logit-distance distillation yields students with higher linear representational similarity and better preservation of the teacher’s linearly recoverable concepts.

Abstract:
Large language model (LLM) posttraining typically relies on a training signal from a reward model (RM), such as for reinforcement learning from human feedback. Previous work shows that this signal can be biased in attributes such as length, format, and sycophancy. In this work, we introduce and study the research problem of automatically finding reward model biases in natural language. We offer a simple approach of using an LLM to iteratively propose and refine candidate biases. Our method can recover known biases and surface novel ones: for example, we found that Skywork-V2-8B, a leading open-weight reward model, often mistakenly favors responses with redundant spacing and responses with hallucinated content. In addition, we show evidence that iteration provides benefits over flat best-of-N search. We hope our work contributes to further research on improving RMs through automated interpretability methods.

Abstract:
Multiagent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achievestrong synergy, where team performance matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, we find that---unlike human teams---LLM teams consistently fail to match their expert agent's performance, even when explicitly told who the expert is, incurring performance losses of up to 37.6%. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise---averaging expert and non-expert views rather than appropriately weighting expertise---which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.

Abstract:
Length generalization is a key property of a learning algorithm that enables it to make correct predictions on inputs of unbounded length, given finite training data. To provide such a guarantee, one needs to be able to compute a length generalization bound, beyond which the model is guaranteed to generalize. This paper concerns the open problem of the computability of such generalization bounds for \mathsfC\mathsfRASP, which is closely linked to transformers. A positive partial result was recently shown Chen et al. whenever the concept is definable in \mathsfC-\mathsfRASP with only one layer and, under some restrictions, also with two layers. We provide complete answers to the above open problem. Our main result is the non-existence of computable length generalization bounds for \mathsfC-\mathsfRASP(already with two layers) and hence for transformers. To complement this, we provide a computable bound for concepts representable in the positive fragment of \mathsfC-\mathsfRASP, which we show equivalent to fixed-precision transformers. For both positive \mathsfC-\mathsfRASP and fixed-precision transformers, we show that the length complexity is exponential, and prove optimality of the bounds.

Abstract:
Finding humanunderstandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. We study how these models scale and find that making weights sparser trades off capability for interpretability, and scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge. In addition to training weight-sparse models de novo, we show preliminary results suggesting our method can also be adapted to explain existing dense models. Our work produces circuits that achieve an unprecedented level of human understandability and validates them with considerable rigor.

Abstract:
The LogSumExp function, dual to the KullbackLeibler (KL) divergence, plays a central role in many important optimization problems, including entropy-regularized optimal transport (OT) and distributionally robust optimization (DRO). In practice, when the number of exponential terms inside the logarithm is large or infinite, optimization becomes challenging since computing the gradient requires differentiating every term. We propose a novel convexity- and smoothness-preserving approximation to LogSumExp that can be efficiently optimized using stochastic gradient methods. This approximation is rooted in a sound modification of the KL divergence in the dual, resulting in a new f-divergence called the safe KL divergence. Our experiments and theoretical analysis of the LogSumExp-based stochastic optimization, arising in DRO and continuous OT, demonstrate the advantages of our approach over existing baselines.

Abstract:
We introduce Context Tuning, a simple and effective method to significantly enhance fewshot adaptation of language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for LLMs, they typically initialize a trainable prompt or prefix with irrelevant tokens for the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model’s inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves competitive accuracy to Test-Time Training with significantly higher training efficiency.

Abstract:
Existing coding benchmarks evaluate language models (LMs) on concrete, wellspecified tasks such as fixing bugs or writing tests. However, human programmers do not spend all day addressing isolated GitHub issues. Instead, real-world software development is grounded in the pursuit of high-level goals. Evaluating whether LMs can iteratively develop code to accomplish open-ended objectives without explicit guidance remains an open challenge. We introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two parts: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource acquisition, or survival. Models must decide for themselves how to improve their code both absolutely and against their opponents. We run 1680 tournaments to evaluate 8 LMs across 6 arenas, revealing how models exhibit diverse development styles and share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance; repositories become progressively messy and redundant. Top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.

Abstract:
Analytical diffusion models offer a mathematically transparent path to generative modeling by formulating the denoising score as an empiricalBayes posterior mean. However, this interpretability comes at a prohibitive cost: the standard formulation necessitates a full-dataset scan at every timestep, scaling linearly with dataset size. In this work, we present the first systematic study addressing this scalability bottleneck. We challenge the prevailing assumption that the entire training data is necessary, uncovering the phenomenon of Posterior Progressive Concentration: the effective golden support of the denoising score is not static but shrinks asymptotically from the global manifold to a local neighborhood as the signal-to-noise ratio increases. Capitalizing on this, we propose Dynamic Time-Aware Golden Subset Diffusion (\algopt), a training-free framework that decouples inference complexity from dataset size. Instead of static retrieval, \algopt uses a coarse-to-fine mechanism to dynamically pinpoint the ``Golden Subset'' for inference. Theoretically, we derive rigorous bounds guaranteeing that our sparse approximation converges to the exact score. Empirically, \algopt achieves a \bf 71 × speedup on AFHQ while matching or achieving even better performance than full-scan baselines. Most notably, we demonstrate the first successful scaling of analytical diffusion to ImageNet-1K, unlocking a scalable, training-free paradigm for large-scale generative modeling.

Abstract:
We generalize the attention mechanism by viewing it through the lens of Entropic Optimal Transport, revealing that standard attention corresponds to a transport problem regularized by an implicit uniform prior. We introduce Generalized Optimal transport Attention with Trainable priors (GOAT), a new attention mechanism that replaces this naive assumption with a learnable, continuous prior. This prior maintains full compatibility with optimized kernels such as FlashAttention. GOAT also provides an EOTbased explanation of attention sinks and materializes a solution for them, avoiding the representational trade-offs of standard attention. Finally, by absorbing spatial information into the core attention computation, GOAT learns an extrapolatable prior that combines the flexibility of learned positional embeddings with the length generalization of fixed encodings.

Abstract:
Motivated by recent work on benign overfitting in overparameterized machine learning, we study the generalization behavior of functions in Sobolev spaces W^k, p(\mathbbR^d) that perfectly fit a noisy training data set. Under assumptions of label noise and sufficient regularity in the data distribution, we show that approximately normminimizing interpolators, which are canonical solutions selected by smoothness bias, exhibit harmful overfitting: even as the training sample size n \to \infty, the generalization error remains bounded below by a positive constant with high probability. Our results hold for arbitrary values of p \in [1, \infty), in contrast to prior results studying the Hilbert space case (p = 2) using kernel methods. Our proof uses a geometric argument which identifies harmful neighborhoods of the training data using Sobolev inequalities.

Abstract:
In standard causal attention, each token's query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token's keys as the context unfolds. We term these updated keys lookahead keys because they belong to earlier positions yet integrate information from tokens that appear later relative to those positions, while strictly preserving the autoregressive property. Although the mechanism appears sequential, we derive a mathematical equivalence that avoids explicitly materializing lookahead keys at each position and enables efficient parallel training. On language modeling benchmarks, CASTLE consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on a range of downstream tasks.

Abstract:
A longterm goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios, and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we callearly experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm, we study two strategies of using such data: (1) implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. Evaluation across eight diverse environments and multiple model families shows that our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, making it a practical bridge between imitation learning and fully experience-driven agents.

Abstract:
Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models assign roles based on how text sounds, not where it actually comes from. We design novel role probes to capture how models internally identify “who is speaking.” These reveal why prompt injection works: untrusted text that imitates a role inherits that role’s authority. We test this insight by injecting fabricated reasoning into user prompts and tool outputs, achieving average success rates of 60% on StrongREJECT and 61% on agent exfiltration, across multiple openand closed-weight models with near-zero baselines. Strikingly, the degree of internal role confusion strongly predicts attack success before generation begins. Our findings reveal a fundamental gap: security is defined at the interface but authority is assigned in latent space. More broadly, we introduce a unifying, mechanistic framework for prompt injection, demonstrating that diverse prompt-injection attacks exploit the same underlying role-confusion mechanism. Code available at: https://anonymous.4open.science/r/role-science-B522.

Abstract:
Flatness of the loss landscape has been widely studied as an important perspective for understanding the behavior and generalization of deep learning algorithms. Motivated by this view, we propose FlatnessAware Stochastic Gradient Langevin Dynamics (fSGLD), a first-order optimization method that biases learning its dynamics toward flat basins while retaining the computational and memory efficiency of SGD and SGLD. We provide a non-asymptotic theoretical analysis showing that fSGLD converges to a flatness-biased Gibbs distribution under a theoretically prescribed coupling between the noise scale \sigma and the inverse temperature \beta, together with explicit excess risk guarantees. We empirically evaluate fSGLD across standard optimizer benchmarks, Bayesian image classification, uncertainty quantification, and out-of-distribution detection, demonstrating consistently strong performance and reliable uncertainty estimates. Additional experiments confirm the effectiveness of the theoretically prescribed \beta–\sigma coupling compared to decoupled choices.

Abstract:
In dynamical systems reconstruction (DSR) we aim to recover the dynamical system (DS) underlying observed time series. Specifically, we aim to learn a generative surrogate model which approximates the underlying, datagenerating DS, and recreates its long-term properties (`climate statistics'). In scientific and medical areas, in particular, these models need to be mechanistically tractable -- through their mathematical analysis we would like to obtain insight into the recovered system's workings. Piecewise-linear (PL), ReLU-based RNNs (PLRNNs) have a strong track-record in this regard, representing SOTA DSR models while allowing mathematical insight by virtue of their PL design. However, all current PLRNN variants arediscrete-time maps. This is in disaccord with the assumed continuous-time nature of most physical and biological processes, and makes it hard to accommodate data arriving atirregulartemporal intervals. Neural ODEs are one solution, but they do not reach the DSR performance of PLRNNs and often lack their tractability. Here we develop theory forcontinuous-timePLRNNs (cPLRNNs): We present a novel algorithm for training and simulating such models, bypassing numerical integration by efficiently exploiting their PL structure. We further demonstrate how important topological objects like equilibria or limit cycles can be determined semi-analytically in trained models. We compare cPLRNNs to both their discrete-time cousins as well as Neural ODEs on DSR benchmarks, including systems with discontinuities which come with hard thresholds.

Abstract:
Multiagent systems (MAS), composed of networks of two or more autonomous AI agents, have become increasingly popular in production deployments, yet introduce security risks that do not arise in single-agent settings. Even if individual agents may exhibit robust security, architectural decisions governing their coordination can create attack surfaces that have not been systematically characterized. In this work, we present an empirical study of how MAS design decisions shape the tradeoff between task performance and attack resistance. Using a network of web-based agents and stage-wise evaluations that distinguish planning-stage refusal, execution-stage interception, and successful attack completion, we study architectural choices through controlled experiments. We identify three key design choices that influence MAS security: (i) agent roles, which determine how authority and responsibility are allocated; (ii) topology, which shapes how and when agents interact; and (iii) memory, which determines the context and state visibility accessible to each agent. Overall, our results show that security and performance in multi-agent systems are governed by architectural design choices, motivating the development of further evaluations which move beyond the security properties of a single agent.

Abstract:
As large language models (LLMs) see wide adoption in software engineering, the reliable assessment of the correctness and security of LLMgenerated code is crucial. Notably, prior work showed that LLMs are prone to generating code with security vulnerabilities, highlighting that security is often overlooked. These insights were enabled by specialized benchmarks crafted by security experts through significant manual effort. However, benchmarks (i) inevitably end up contaminating training data, (ii) must extend to new tasks to provide a more complete picture, and (iii) must increase in difficulty to challenge more capable LLMs. In this work, we address these challenges and present AutoBaxBuilder, an automated pipeline that generates code security benchmarking tasks from scratch. It leverages the code-understanding of LLMs and robust reliability checks to construct functional tests and end-to-end security-probing exploits. The quality of the pipeline is quantitatively confirmed by aligning its predictions with an expert-written baseline and qualitatively validated through manual soundness verification. We use AutoBaxBuilder to construct a new benchmark and release it to the public as AutoBaxBench, together with a thorough evaluation on contemporary LLMs. AutoBaxBuilder generates new tasks in under 2 hours, for less than USD 10. Including a manual verification, this reduces the required human effort for benchmark construction by an estimated factor of 12.

Abstract:
Learning good representations is essential for latent planning with world models. While pretrained visual encoders provide strong visual features, they are not tailored to planning and contain substantial information which is irrelevant to planning. Inspired by the perceptual straightening hypothesis in human visual processing, we introduce temporal straightening for representation learning in latent planning. We add a lightweight projector on top of a pretrained visual encoder to map to a lowerdimensional space, trained with a curvature regularizer that encourages locally straightened latent trajectories. We show that reducing curvature improves the conditioning of the planning objective, making gradient-based planning more stable and yielding significantly higher success rates across four goal-reaching tasks.

Abstract:
As context windows in large language models continue to expand, it is essential to characterize how attention behaves at extreme sequence lengths. We introduce tokensample complexity: the rate at which attention computed on n tokens converges to its infinite-token limit. We estimate finite-n convergence bounds at two levels: pointwise uniform convergence of the attention map, and convergence of moments for the transformed token distribution. For compactly supported (and more generally sub-Gaussian) distributions, our first result shows that the attention map converges uniformly on a ball of radius R at rate C(R)/\sqrtn, where C(R) grows exponentially with R. For large R, this estimate loses practical value, and our second result addresses this issue by establishing convergence rates for the moments of the transformed distribution (the token output of the attention layer). In this case, the rate is C'(R)/n^\beta with \beta<\tfrac12, and C'(R) depends polynomially on the size of the support of the distribution. The exponent \beta depends on the attention geometry and the spectral properties of the tokens distribution. We also examine the regime in which the attention parameter tends to infinity and the softmax approaches a hardmax, and in this setting, we establish a logarithmic rate of convergence. Experiments on synthetic Gaussian data and real BERT models on Wikipedia text confirm our predictions.

Abstract:
Chainof-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces significant inefficiencies. In this work, we proposeHeima(as hidden llama), an effective CoT compression framework that condenses lengthy CoTs into a small set of abstract thinking tokens, preserving essential reasoning while removing redundancy. We then conduct a theoretical analysis from an information-theoretic perspective, quantifying the information gap induced by compression, showing that reasoning capability is preserved when non-trivial mutual information is retained. To further explore and quantify this information gap, we design the adaptive interpreter that maps thinking tokens back to variable-length textual sequences, thereby reconstructing the reasoning process. Experiments across diverse reasoning benchmarks demonstrate that Heima improves reasoning efficiency, while maintaining or even achieving better zero-shot accuracy. Moreover, the interpreter reconstructs coherent reasoning progresses from compressed thinking tokens, revealing that the information gap is minimal and validating the effectiveness of the proposed framework. This work paves the way for scalable latent reasoning models and advances our understanding of efficient reasoning processes in large models.

Abstract:
We prove that with linear transformations, both (i) twolayer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention’s internal mechanism. This leads to our key insight: self-attention is able to approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these, we show that two-layer multi-head attention or even one-layer multi-head attention followed by a softmax function suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that, (softmax-)attention-only layers are capable of approximating gradient descent in-context. We believe these techniques hold independent interest.

Abstract:
Openweight coding agents should hold a fundamental advantage over closed-source systems: they can be specialized to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training has kept this advantage theoretical - until now. We present Soft-Verified Efficient Repository Agents (SERA), an efficient method for training coding agents that enables the rapid and cheap creation of agents specialized to private codebases. Using only supervised finetuning (SFT), SERA achieves state-of-the-art results among fully open-source (open data, method, code) models while matching the performance of open-weight models like Devstral-Small-2. Creating SERA models is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. Our method, Soft Verified Generation (SVG), generates thousands of trajectories from a single code repository. Beyond repository specialization, we apply SVG to a larger corpus of codebases, generating over 200,000 synthetic trajectories. We use this dataset to provide detailed analysis of scaling laws, ablations, and confounding factors for training coding agents. Overall, we believe our work will greatly accelerate research on open coding agents and showcase the advantage of open-source models that can specialize to private codebases.

Abstract:
Learning discrete neural samplers is challenging due to the lack of gradients and combinatorial complexity. While stochastic optimal control (SOC) and Schrödinger bridge (SB) provide principled solutions, efficient SOC solvers like adjoint matching (AM), which excel in continuous domains, remain unexplored for discrete spaces. We bridge this gap by revealing that the core mechanism of AM isstatespace agnostic, and introducediscrete ASBS, a unified framework that extends AM and adjoint Schrödinger bridge sampler (ASBS) to discrete spaces. Theoretically, we analyze the optimality conditions of the discrete SB problem and its connection to SOC, identifying a necessary cyclic group structure on the state space to enable this extension. Empirically, discrete ASBS achieves competitive sample quality with significant advantages in training efficiency and scalability.

Abstract:
In performative learning, the data distribution reacts to the deployed model—for example, because strategic users adapt their features to game it—which creates a more complex dynamic than in classical supervised learning. One should thus not only optimize the model for the current data but also take into account that the model might steer the distribution in a new direction, without knowing the exact nature of the potential shift. We explore how regularization can help cope with performative effects by studying its impact in highdimensional ridge regression. We show that, while performative effects worsen the test risk in the population setting, when moving to the over-parameterized regime where the number of features exceeds the number of samples, the optimal regularization in the presence of performativity helps reduce the variance in the estimated parameters, thereby improving performance. We show that the optimal regularization scales with the overall strength of the performative effect, making it possible to set the regularization in anticipation of this effect. We illustrate this finding through empirical evaluations of the optimal regularization parameter on both synthetic and real-world datasets.

Abstract:
We study the fundamental problem of onestep prediction of a marginally stable unknown nonlinear dynamical system. We describe an algorithm for this problem, based on the technique of spectral filtering, which learns a mapping from past observations to the next based on a spectral representation of the system. Using techniques from online convex optimization, we prove vanishing prediction error for any nonlinear dynamical system that has finitely many marginally stable modes, with rates governed by a novel quantitative control-theoretic notion of learnability. The main technical component of our method is a new spectral filtering algorithm for linear dynamical systems, which incorporates past observations and applies to general noisy and marginally stable systems. This significantly generalizes the original spectral filtering algorithm to both asymmetric dynamics as well as incorporating noise correction, and is of independent interest.

Abstract:
The ratedistortion-perception (RDP) tradeoff characterizes the fundamental limits of lossy compression by jointly considering bitrate, reconstruction fidelity, and perceptual quality. While recent neural compression methods have improved perceptual performance, they typically operate at fixed points on the RDP surface, requiring retraining to target different tradeoffs. In this work, we propose a training-free framework that leverages pre-trained diffusion models to traverse the entire RDP surface. Our approach integrates a reverse channel coding (RCC) module with a novel score-scaled probability flow ODE decoder. We theoretically prove that the proposed diffusion decoder is optimal for the distortion-perception tradeoff under AWGN observations and that the overall framework with the RCC module achieves the optimal RDP function in the Gaussian case. Empirical results across multiple datasets demonstrate the framework's flexibility and effectiveness in navigating the ternary RDP tradeoff using pre-trained diffusion models. Our results establish a practical and theoretically grounded approach to adaptive, perception-aware compression.

Abstract:
Group counterfactual explanations find a set of counterfactual instances to explain a group of input instances contrastively. However, existing methods either (i) optimize counterfactuals only for a fixed group and do not generalize to new group members, (ii) strictly rely on strong model assumptions (e.g., linearity) for tractability or/and (iii) poorly control the counterfactual group geometry distortion. We instead learn an explicit optimal transport map that sends any group instance to its counterfactual without reoptimization, minimizing the group's total transport cost. This enables generalization with fewer parameters, making it easier to interpret the common actionable recourse. For linear classifiers, we prove that functions representing group counterfactuals are derived via mathematical optimization, identifying the underlying convex optimization type (QP, QCQP, ...). Experiments show that they accurately generalize, preserve group geometry and incur only negligible additional transport cost compared to baseline methods. If model linearity cannot be exploited, our approach also significantly outperforms the baselines.

Abstract:
Learning algorithms can be significantly improved by routing complex or uncertain inputs to specialized experts, balancing accuracy with computational cost. This approach, known aslearning to defer, is essential in domains like natural language generation, medical diagnosis, and computer vision, where an effective deferral can reduce errors at low extra resource consumption. However, the twostage learning to defer setting, which leverages existing predictors such as a collection of LLMs or other classifiers, often faces challenges due to an expert imbalance problem. This imbalance can lead to suboptimal performance, with deferral algorithms favoring the majority expert. We present a comprehensive study of two-stage learning to defer in expert imbalance settings. We cast the deferral loss optimization as a novel cost-sensitive learning problem over the input-expert domain. We derive new margin-based loss functions and guarantees tailored to this setting, and develop novel algorithms for cost-sensitive learning. Leveraging these results, we design principled deferral algorithms, MILD (Margin-based Imbalanced Learning to Defer), specifically suited for expert imbalance settings. Extensive experiments demonstrate the effectiveness of our approach, showing clear improvements over existing baselines on both image classification and real-world Large Language Model (LLM) routing tasks.

Abstract:
Equivariance is a powerful inductive bias in neural networks, improving generalisation and physical consistency. Recently, however, nonequivariant models have regained attention, due to their better runtime performance and imperfect symmetries that might arise in real-world applications. This has motivated the development of approximately equivariant models that strike a middle ground between respecting symmetries and fitting the data distribution. Existing approaches in this field either rely on sampling from a group, incurring a high sample complexity, or explicitly parameterise a model as a sum of an equivariant and non-equivariant network. This work instead approaches approximate equivariance via a projection-based regulariser which leverages a layer-wise orthogonal decomposition of a network's layers into equivariant and non-equivariant components. In contrast to existing methods, this penalises non-equivariance at an operator level across the full group orbit, rather than point-wise as in sample-based approaches. We present a mathematical framework for computing the non-equivariance penalty exactly and efficiently in both the spatial and spectral domains. In our experiments, our method consistently outperforms prior approximate equivariance approaches in both model performance and efficiency, achieving substantial runtime gains over sample-based regularisers.

Abstract:
Reasoning models enhance performance by tackling problems in a stepby-step manner, decomposing them into sub-problems and exploring long chains of thought before producing an answer. However, applying extended reasoning to every step introduces substantial redundancy, as sub-problems vary widely in difficulty and complexity: a small number of pivotal steps are genuinely challenging and decisive for the final answer, while many others only involve straightforward revisions or simple computations. Therefore, a natural idea is to endow reasoning models with the ability to adaptively respond to this variation, rather than treating all steps with the same level of elaboration. To this end, we propose MixReasoning, a framework that dynamically adjusts the depth of reasoning within a single response. MixReasoning enables fine-grained mode switching by training a lightweight concise LoRA adapter and control its strength to trigger switches based on reasoning difficulty estimated from sliding-window token confidence, yielding human-like transitions between fast and slow reasoning. The resulting chain of thought then becomes a mixture of detailed reasoning on difficult steps and concise inference on simpler ones. Experiments on AIME24, MATH-500, GPQA, and GSM8K demonstrate that MixReasoning shortens reasoning length by 13%--49% across benchmarks of varying difficulty, delivering consistent efficiency gains while maintaining performance.

Abstract:
Standard optimizer choices for pretraining are designed to minimize pre-training loss. Yet pre-trained models are routinely subjected to further transformations—such as fine-tuning to acquire new capabilities or quantization for efficiency. In this work, we evaluate optimizer choices across model scales, token budgets, and datasets, and find that strategies that explicitly (Sharpness-Aware Minimization) or implicitly (large learning rates and Warmup–Stable–Decay schedules) reduce sharpness yield better downstream performance, even when they achieve comparable or worse pre-training loss. Combining these strategies yields a new pre-training recipe that substantially outperforms standard baselines with minimal compute overhead, delivering a better learning–forgetting frontier during fine-tuning and higher accuracy after quantization.

Abstract:
Multiplechoice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model either uses the answer position, the label in front of the answer, the distributions of correct answers present in the few-shot prompt, or a combination of all to answer each MCQ question. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in LLM's performance, exposing the LLM's capabilities under reduced evaluation artifacts, without any help from the prompt examples or the option labels. Across multiple benchmarks and models, this protocol substantially improves the robustness to answer permutations, reducing mean accuracy variance 3× with only a minimal decrease in the mean model's performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than the standard ones.

Abstract:
Statespace models (SSMs) offer efficient sequence modeling but show a large performance gap compared to Transformers on benchmarks that require in-context retrieval. This gap has been linked to a small set of attention heads, called Gather-and-Aggregate (G&A), which SSMs struggle to implement and are believed to drive the disparity. Leveraging this insight, we propose retrieval-aware distillation, a strategy that converts a pretrained Transformer into a hybrid student by preserving only these retrieval-critical components. We identify essential attention heads via ablation on a synthetic retrieval task and distill the rest into recurrent heads, resulting in a model with non-uniform attention placement tailored to retrieval demands. We empirically show that preserving just 2% of attention heads enables the hybrid model to recover teacher-level performance (10 retrieval-critical heads in a 1B model), reducing memory overhead by up to 6x compared to prior distillation methods that retain 30–50% of heads. Furthermore, we show that large recurrent states in SSMs often compensate for missing retrieval. Once retrieval is handled by these specific heads, the SSM backbone can be significantly simplified while maintaining performance, even with an 8x reduction in state dimension. Overall, the results show that strategically concentrating attention can close the Transformer–SSM gap with a fraction of the memory cost.

Abstract:
We study how Large Language Models (LLMs) process negation mechanistically. First, we establish that even though openweight models often provide wrong answers to questions involving negation, they do possess internal components that process negation correctly. Their poor accuracy is due to late-layer attention behavior that promotes simple shortcuts; ablating those attention modules greatly improves accuracy on negation-related questions. Second, we uncover how models process negation. We consider two hypotheses: models could use attention heads that attend to the phrase being negated and suppress related concepts, or they could directly construct a representation of the entire negative phrase (e.g., representing "not gas" as a vector that promotes liquids and solids). We apply a range of observational and causal interpretability techniques on Mistral-7B and Llama-3.1-8B to show that models implement both mechanisms, with the "constructive" mechanism being more prominent. Combined, our work deepens the understanding of LLMs' internals, highlighting construction-dominant computations and the coexistence of competing mechanisms within LLMs.

Abstract:
SharpnessAware Minimization (SAM) is widely used to seek flatter minima—often linked to better generalization. In its standard implementation, SAM updates the current iterate using the loss gradient evaluated at a point perturbed by distance \rho along the normalized gradient direction. We show that, for some choices of \rho, SAM can stall at points where this shifted (perturbed-point) gradient vanishes despite a nonzero original gradient, and therefore, they are not stationary points of the original loss. We call these points hallucinated minimizers, prove their existence under simple nonconvex landscape conditions (e.g., the presence of a local minimizer and a local maximizer), and establish sufficient conditions for local convergence of the SAM iterates to them. We corroborate this failure mode in neural network training and observe that it aligns with SAM's performance degradation often seen at large \rho. Finally, as a practical safeguard, we find that a short initial SGD warm-start before enabling SAM mitigates this failure mode and reduces sensitivity to the choice of \rho.

Abstract:
Sparse attention has been proposed as a way to alleviate the quadratic cost of transformers, a central bottleneck in longcontext training. A promising line of work is \alpha-entmax attention, a differentiable sparse alternative to softmax that enables input-dependent sparsity yet has lagged behind softmax due to the computational overhead necessary to compute the normalizer \tau. In this paper, we introduce AdaSplash-2, which addresses this limitation through a novel histogram-based initialization that reduces the number of iterations needed to compute \tau to typically 1-2. The key idea is to compute a coarse histogram of attention scores on the fly and store it in on-chip SRAM, yielding a more accurate initialization that enables fast forward and backward computation. Combined with a sparsity-aware GPU implementation that skips zero blocks with low overhead, AdaSplash-2 matches or improves per-step training time relative to FlashAttention-2 when block sparsity is moderate-to-high (e.g., >60%), which often occurs at long-context lengths. On downstream tasks, models trained with our efficient \alpha-entmax attention match softmax baselines at short-context lengths and achieve substantial gains in long-context settings.

Abstract:
Agent development kits (ADKs) provide effective platforms and tooling for constructing agents, and their designs are critical to the constructed agents' performance, especially the functionality for agent topology, tools, and memory. However, current ADKs either lack sufficient functional support or rely on humans to manually design these components, limiting agents' generalizability and overall performance. We propose OpenSage, the first ADK that enables LLMs to automatically create agents with selfgenerated topology and toolsets while providing comprehensive and structured memory support. OpenSage offers effective functionality for agents to create and manage their own sub-agents and toolkits. It also features a hierarchical, graph-based memory system for efficient management and a specialized toolkit tailored to software engineering tasks. Extensive experiments across three state-of-the-art benchmarks with various backbone models demonstrate the advantages of OpenSage over existing ADKs. We also conduct rigorous ablation studies to demonstrate the effectiveness of our design for each component. We believe OpenSage can pave the way for the next generation of agent development, shifting the focus from human-centered to AI-centered paradigms.

Abstract:
Planning with a learned model remains a key challenge in modelbased reinforcement learning~(RL). In decision-time planning, state representations are critical as they must support local cost computation while preserving long-horizon structure. In this paper, we show that the Laplacian representation provides an effective latent space for planning by capturing state-space distances at multiple time scales. This representation preserves meaningful distances and naturally decomposes long-horizon problems into subgoals, also mitigating the compounding errors that arise over long prediction horizons. Building on these properties, we introduce ALPS, a hierarchical planning algorithm, and demonstrate that it outperforms commonly used baselines on a selection of offline goal-conditioned RL tasks from OGBench, a benchmark previously dominated by model-free methods.

Abstract:
Equilibrium Propagation (EP) is a physicsinspired learning algorithm that uses stationary states of a dynamical system both for inference and learning. In its original formulation it is limited to conservative systems, i.e. to dynamics which derive from an energy function. Given their importance in applications, it is important to extend EP to nonconservative systems, i.e. systems with non-reciprocal interactions. Previous attempts to generalize EP to such systems failed to compute the exact gradient of the cost function. Here we propose a framework that extends EP to arbitrary nonconservative systems, including feedforward networks. We keep the key property of equilibrium propagation, namely the use of stationary states both for inference and learning. However, we modify the dynamics in the learning phase by a term proportional to the non-reciprocal part of the interaction so as to obtain the exact gradient of the cost function. This algorithm can also be derived using a variational formulation that generates the learning dynamics through an energy function defined over an augmented state space. Numerical experiments using the MNIST database show that this algorithm achieves better performance and learns faster than previous proposals.

Abstract:
Large language models (LLMs) are transforming everyday applications, yet they lag behind in specialized fields, such as cybersecurity, due to a lack of highquality, domain-specific models and training datasets. To address this gap, we present CyberPal 2.0, a family of cybersecurity-expert small language models (SLMs) ranging from 4B–20B parameters. To train CyberPal 2.0, we generate an enriched chain-of-thought cybersecurity instruction dataset built with our data enrichment and formatting pipeline, SecKnowledge 2.0, which integrates expert-in-the-loop steering of reasoning formats alongside LLM-driven multi-step grounding, yielding higher-fidelity, task-grounded reasoning traces for security tasks. Across diverse cybersecurity benchmarks,CyberPal 2.0 consistently outperforms its baselines and matches or surpasses various open and closed-source frontier models, while remaining a fraction of their size. On core threat-investigation tasks, such as correlating vulnerabilities and bug tickets with weaknesses,our best 20B-parameter model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1, ranking first, while our smallest 4B-parameter model ranks second. On core cyber threat intelligence knowledge tasks, our models outperform almost all tested frontier models, ranking second only to Sec-Gemini v1. To foster reproducibility and practical adoption, we will release our models as open source.

Abstract:
As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chainof-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.

Abstract:
Softmaxbased dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass to irrelevant tokens degrading performance in long-sequence scenarios. Furthermore, attempts to sharpen focus by lowering softmax temperature hinder learnability due to vanishing gradients. We introduce LUCID Attention, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space, thus allowing the query to focus on important keys among large number of keys accurately with same computational complexity as standard attention. Additionally, LUCID's preconditioning-based approach to retrieval bypasses the need for low temperature and the learnability problems associated with it. We validate our approach by training ～1 billion parameter language models evaluated on up to 128K tokens. Our results demonstrate significant gains on long-context retrieval tasks, specifically retrieval tasks from BABILong, RULER, SCROLLS and LongBench. For instance, LUCID achieves up to 18% improvement in BABILong and 14% improvement in RULER multi-needle performance compared to standard attention.

Abstract:
Multimodal large language models (MLLMs) are emerging as versatile foundations for mixedmodality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug‑and‑play framework that turns any MLLM into a two‑stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.

Abstract:
Optimization modeling underpins decisionmaking in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, masking the difficulty of industrial problems with 10^3--10^6 (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models. To fill in this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB~2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model--data separation format, and (iii) performs iterative semantic validation through expert review and human--LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.

Abstract:
Learning truthful taskrelevant latent representations \mathbfs from observations \mathbfo = f(\mathbfs) is fundamental. Identifiability, the asymptotic guarantee of recovering the ground-truth representation, is critical because it sets the ultimate limit of any model, even with infinite data and computation. We study this problem in a completely nonparametric setting, without relying on interventions, parametric forms, or structural constraints. We first prove that the structure between time steps and tasks is identifiable in a fully unsupervised manner, even when sequences lack strict temporal dependence and may exhibit disconnections, and task assignments can follow arbitrarily complex and interleaving structures. We then prove that, within each time step, the task-relevant latent representation can be disentangled from the irrelevant part under a simple sparsity regularization, without any additional information or parametric constraints. Together, these results establish a hierarchical foundation: task structure is identifiable across time steps, and task-relevant latent representations are identifiable within each step. To our knowledge, each result provides a first general nonparametric identifiability guarantee, and together they mark a step toward provably moving from generalist to specialist models.

Abstract:
In this paper, we study the role of the critic in actorcritic for entropy-regularized, finite, discounted environments. We establish that, when the critic is exact, using the latter as a baseline is an actual variance-reduction method. In this case, actor-critic with stochastic gradients matches the sample complexity of deterministic policy gradient, reaching an \epsilon-optimal regularized value with \tildeO(\log(1/\epsilon)) samples. In practice, the critic is learned alongside the actor: the variance of the actor update is then influenced by the critic's variance and bias. Specifically, when the critic has a sufficiently small error, the variance reduction and rapid convergence are preserved. This suggests to learn the critic first, keeping it up to date after each actor update, underscoring the pivotal role of accurate critic estimation in actor-critic methods.

Abstract:
Autonomous web agents are increasingly deployed for longhorizon tasks, yet their ability to adhere to real-world policies remains critically underexplored compared to standard safety objectives. To address this gap, we introduce PolicyGuardBench, a benchmark of 60k policy-trajectory pairs designed to evaluate compliance through both full-trajectory and novel prefix-based violation detection tasks. Using this dataset, we train PolicyGuard, a lightweight guardrail model that achieves strong detection accuracy while maintaining high inference efficiency. Notably, our model demonstrates robust generalization capabilities, preserving high performance even on unseen domains. Together, these contributions establish a comprehensive framework for studying policy compliance, showing that accurate and generalizable guardrails are feasible at small scales.

Abstract:
Spatiotemporal forecasting is critical for real-world applications like traffic management, yet capturing complex interactions under high-noise conditions remains challenging. While current methods have shown improved accuracy using spatial physical priors, they often struggle with evolving temporal correlations and systematic errors. In this work, we propose a nested forecasting framework that couples future macro-level regional trends with micro-level historical observations, enabling top-down guidance from abstract future representations for fine-grained forecasting. Specifically, we construct semantically coherent regions via spectral clustering and design a progressive coarse-to-fine predictor to inject macro-dynamics into node-level forecasting. Extensive experiments on multiple real-world datasets demonstrate that our method consistently outperforms state-of-the-art baselines, validating the effectiveness of future macro-guided nested forecasting.

Abstract:
Several important learning tasks can be formulated as minimizing an entropyregularized objective over an appropriate space of probability distributions. Mean-field Langevin dynamics (MFLD) facilitate computation in this general context, casting the minimizer as the invariant distribution of a McKean--Vlasov process, which can be numerically discretized using N particles and thus simulated. However, simulating this interacting particle system has computational complexity \mathcalO(N^2). Motivated by recent research into \emphkernel thinning, we propose \textttKT-MFLD, in which each particle interacts only with a coreset of size \mathcalO(N^\frac12). \textttKT-MFLD thus reduces the computational complexity to \mathcalO(N^\frac32) while, under mild regularity conditions, achieving the same convergence guarantees (up to logarithmic factors) as MFLD. Our theoretical analysis is empirically confirmed on tasks including the training of student-teacher neural networks, quantization with maximum mean discrepancy, and computation of predictively-oriented posteriors in a post-Bayesian framework.

Abstract:
Evolving data streams induce joint nonstationarity in continual semantic segmentation, where semantic classes, input distributions, and supervision availability change simultaneously over time. This setting reflects practical structured prediction systems, yet remains largely unexplored in prior continual learning work, which typically studies these factors in isolation. We formalize continual segmentation under coupled class, domain, and label shifts and investigate learning in heterogeneous dense prediction environments with limited annotations and abundant unlabeled data. To address instability and overfitting arising from fewshot supervision under distribution drift, we introduce gradient-adaptive stabilization, a parameter-wise regularization mechanism implemented via gradient-scaled stochastic perturbations that promotes a principled stability–plasticity tradeoff. We further leverage unlabeled data through semi-supervised learning and introduce prototype anchored supervision that validates pseudo-labels via joint confidence and prototype consistency. Together, these mechanisms enable learning under joint nonstationarity in continual segmentation. Extensive empirical evaluation across class-incremental, domain-incremental, and few-shot regimes demonstrates consistent improvements over prior methods in heterogeneous structured prediction settings. Our results expose fundamental failure modes of existing continual segmentation approaches and provide insight into learning robust dense predictors in dynamically evolving environments.

Abstract:
We present a differentiable optimization framework for multiagent coordination. An input is decomposed into overlapping local views, each processed by an agent that solves a convex subproblem parametrized by a neural encoder. Agents coordinate through the Alternating Direction Method of Multipliers (ADMM) with inter-agent constraints specified by a cellular sheaf. The sheaf specifies which aspects of neighboring solutions must agree. Backpropagating through the unrolled optimization jointly trains encoders, decoders, and sheaf structure. We evaluate on maze pathfinding, image classification, and Sudoku, where agents with individually insufficient local views coordinate to produce correct global outputs. We show that this locality also yields improved robustness to distribution shifts (padding, missing patches, and noise) when evaluated against a standard CNN on MNIST, while exposing interpretable primal/consensus/dual variables that make the coordination dynamics directly inspectable.

Abstract:
Longterm fairness algorithms aim to satisfy fairness beyond static and short-term notions by accounting for the dynamics between decision-making policies and population behavior. Most previous approaches evaluate performance and fairness measures from observable features and a label, which is assumed to be fully observed. However, in scenarios such as hiring or lending, the labels (e.g., ability to repay the loan) areselective labelsas they are only revealed based on positive decisions (e.g., when a loan is granted). In this paper, we study long-term fairness in the selective labels setting, and analytically show that naive solutions do not guarantee fairness. To address this gap, we then introduce a novel framework that leverages both the observed data and a label predictor model to estimate the true fairness measure value by decomposing it into the observed fairness and bias from label predictions. This allows us to derive the sufficient conditions to satisfy true fairness from observable quantities by using the confidence in the predictor model. Finally, we rely on our theoretical results to propose a novel reinforcement learning algorithm for effective long-term fair decision-making with selective labels. In semisynthetic environments, the proposed algorithm reached comparable fairness and performance to an agent with oracle access to the true labels.

Abstract:
Quaternion Neural Networks are parameterefficient and model multidimensional dependencies by representing four related features as a single entity. However, existing quaternion self-attention computes component-wise scores and applies independent softmax operations to each component, which increases computational cost and allows attention distributions to diverge across components. We propose a Shared-Score Quaternion Self-Attention mechanism that computes a single real-valued score using the quaternion inner product and applies a shared attention distribution across all components. This reduces score-computation multiplications by 75% and the number of softmax operations from four to one. We prove that the component-wise and shared scores lie in the same interaction subspace—the linear span of bilinear terms induced by quaternion linear projections. This indicates that independent component-wise attention primarily re-parameterizes the same interactions rather than fundamentally expanding the feature interaction space. In speech enhancement, where phase information is crucial, our method reduces the inference time by 45--61% while maintaining enhancement quality, making quaternion attention a more practical approach. These findings provide a systematic approach to efficient hypercomplex attention.

Abstract:
Inverse reinforcement learning (IRL) is typically formulated as maximizing entropy subject to matching the distribution of expert trajectories. Classical (dualascent) IRL guarantees monotonic performance improvement but requires fully solving an RL problem each iteration to compute dual gradients. More recent adversarial methods avoid this cost at the expense of stability and monotonic dual improvement, by directly optimizing the primal problem and using a discriminator to provide rewards. In this work, we bridge the gap between these approaches by enabling monotonic improvement of the reward function and policy without having to fully solve an RL problem at every iteration. Our key theoretical insight is that a trust-region-optimal policy for a reward function update can be globally optimal for a smaller update in the same direction. This smaller update allows us to explicitly optimize the dual objective while only relying on a local search around the current policy. In doing so, our approach avoids the training instabilities of adversarial methods, offers monotonic performance improvement, and learns a reward function in the traditional sense of IRL—one that can be globally optimized to match expert demonstrations. Our proposed algorithm,Trust Region Inverse Reinforcement Learning (TRIRL), outperforms state-of-the-art imitation learning methods across multiple challenging tasks by a factor of 2.4x in terms of aggregate inter-quartile mean, while recovering reward functions that generalize to system dynamics shifts.

Abstract:
We study stochastic minimumcost reach-avoid reinforcement learning, where an agent must satisfy a reach-avoid specification with probability at least p while minimizing expected cumulative costs in stochastic environments. Existing safe and constrained reinforcement learning methods typically fail to jointly provide probabilistic reach-avoid guarantees and cost-optimality in the learning setting in stochastic environments. In order to overcome this issue, we introduce reach-avoid probability certificates (RAPCs), which characterize the states from which stochastic reach-avoid constraints are satisfiable. Based on RAPCs, we develop a contraction-based Bellman formulation that enables reinforcement learning to optimize cumulative cost while provably satisfying stochastic reach-avoid constraints. We establish almost sure convergence of the proposed algorithms to locally optimal policies under the reach-avoid constraints. Experiments in the MuJoCo simulator demonstrate improved cost performance and higher reach-avoid satisfaction compared to existing baselines.

Abstract:
Learningaugmented algorithmshave received significant attention in recent years, particularly in the context of online optimization. Motivated by the high computational cost of generating predictions, a growing line of work studies the tradeoff between performance guarantees and the number of predictions used in learning-augmented algorithms for problems such as caching and metrical task systems. In this paper, we extend this line of research toonline metric matchingby developingparsimoniouslearning-augmented algorithms and establishing lower bounds on their performance. Our approach extends the Follow-the-Prediction framework to the parsimonious setting by filling in avirtual predictionin the absence of an actual prediction, using an online metric matching algorithm that maintains good intermediate matchings throughout its execution. We complement our theoretical results with an empirical evaluation, demonstrating the practical effectiveness of our approach.

Abstract:
Detecting AIgenerated music is crucial for preserving artistic authenticity and preventing the misuse of generative music technologies. However, existing discriminative detectors typically rely on generated samples during training and often suffer from severe performance degradation when confronted with music produced by unseen generators, which limits their real-world applicability. To address this issue, we formulate a zero-shot setting for AI-generated music detection, where the detector is trained exclusively on real music without access to any generated samples. Under this setting, we propose MusicDET, a generator-agnostic detection framework based on frequency-guided normalizing flows that probabilistically models the distribution of real music features. By evaluating the likelihood of an input sample under the learned real-music distribution, MusicDET enables effective detection of out-of-distribution music signals. Experiments on the FakeMusicCaps and SONICS datasets show that MusicDET consistently outperforms conventional discriminative detectors, particularly when detecting music generated by previously unseen models.

Abstract:
Continual learning (CL) aims to train models sequentially over multiple domains without forgetting previously learned knowledge. However, existing CL methods optimize for indomain performance and are therefore prone to learning spurious, domain-specific cues (``shortcut learning''), which limits generalization to unseen domains after deployment. In this paper, we address this limitation throughcontinual learning of domain-invariant representation. We introduce a broad class of CL methods that sequentially learn representations capturing invariant structures across domains. Our methods are motivated by the observation that such invariant structures often preserve the underlying causal mechanisms, which can reduce the risk of overfitting to domain-specific cues and thus offer better out-of-domain generalization. Our proposed CL methods combine replay-based training with a tailored sequential invariance alignment to learn---and preserve---invariant structures over time. We evaluate our methods under a deployment-oriented protocol that measures performance on unseen target domains. Across six benchmark and real-world datasets spanning vision, medicine, manufacturing, and ecology, our methods consistently outperform existing CL baselines in terms of generalization to unseen target domains. As an ablation, we further show that na\"ive extensions of sequential training with existing domain-invariant representation learning (DIRL) methods provide only limited benefits. To the best of our knowledge, this is the first work to develop domain-invariant representation methods for CL.

Abstract:
We work in the adaptive query model, where one is given a point set P \subset \mathbbR^d and seeks to construct a data structure that can answer correctly and efficiently a sequence of adaptive queries. In this model, an adversary observes the answers returned by the data structure to previous queries q_1, \ldots, q_i1 and, based on this information, chooses the next query point q_i. This setting captures strong forms of adaptivity that naturally arise in modern machine learning pipelines, and rules out many classical randomized techniques that assume oblivious queries. Our focus is the problem of furthest neighbor search in this adaptive setting, a fundamental problem in several learning tasks, including diversity maximization, outlier and anomaly detection, adversarial example generation, and more. We present the first adversarially robust data structure for c-approximate furthest neighbor queries that achieves query time \tildeO(n^1/c^2 + d). This matches the query time of the seminal result by Indyk [SODA'03] for c-approximate furthest neighbor in the oblivious setting, and significantly improves upon the \tildeO(n + d) query time achieved by using the adaptive distance estimation framework of Cherapanamjeri and Nelson [NeurIPS'20].

Abstract:
Ustatistics play a central role in statistical inference. In many modern applications, however, acquiring the labels required for U-statistics is costly. Motivated by recent advances in active inference, we develop an active inference framework for U-statistics that selectively queries informative labels to improve estimation efficiency under a fixed labeling budget, while preserving valid statistical inference. Our approach is built on the augmented inverse probability weighting U-statistic, which is proposed to incorporate the sampling rule and machine learning predictions. We characterize the optimal sampling rule that minimizes its variance and design practical sampling strategies. We further extend the framework to U-statistic-based empirical risk minimization. Experiments on real datasets demonstrate substantial gains in estimation efficiency over baseline methods, while maintaining target coverage.

Abstract:
We consider the problem of universal transductive online classification with a possibly unbounded label space. This setting considers online learning, with the sequence of instances (without labels) known to the learner in advance. We say a concept class \mathcalH is learnable if there is a learning algorithm \mathcalA, such that for every realizable sequence, the number of mistakes made by \mathcalA grows at most sublinearly with the number of predictions. We characterize the learnability of this setting and show that there are only two possible optimal rates for the learnable classes: either bounded or increasing logarithmically. We introduce a new combinatorial structure, called "LevelConstrained-Littlestone-Littlestone (LCLL) tree", which, along with the indifference property, characterizes the learnability. We also extend the learnability result to the agnostic case and the case where only the stochastic process that generates the instance sequence is known.

Abstract:
We present a novel framework for embedding hard constraint satisfaction into neural network (NN) architectures, specifically feedforward neural networks and transformers, with inputdependent affine constraints of arbitrary cardinality. Traditional constraint enforcement approaches either rely on penalty-based soft constraints, which offer no guarantee of satisfaction, or on post-processing methods that enforce constraints after the NN is trained, which may lead to suboptimality. We introduce a trainable constraint-affine (CAffine) layer into NNs, yielding CAffNet, which goes beyond enforcing affine constraints via fixed orthogonal or parallel projections and enables joint optimization with network parameters. Moreover, we impose no restrictions on the constraint space dimensions and establish that our construction preserves the universal approximation properties of NNs, while providing provable guarantees on constraint adherence for all inputs. Experimental validation demonstrates robust performance across diverse domains requiring guaranteed constraint satisfaction.

Abstract:
The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through finetuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work casts light on the multifaceted role that a single optimization hyperparameter can play in shaping model behavior and demonstrates the importance of using evaluation metrics beyond the cross-entropy loss for hyperparameter optimization.

Abstract:
Highfidelity text-to-image and text-to-video generation typically relies on Classifier-Free Guidance (CFG), but achieving optimal results often demands computationally expensive sampling schedules. In this work, we propose MAMBO-G, a training-free acceleration framework that significantly reduces computational cost by dynamically optimizing guidance magnitudes. We observe that standard CFG schedules are inefficient, applying disproportionately large updates in early steps that hinder convergence speed. MAMBO-G mitigates this by modulating the guidance scale based on the update-to-prediction magnitude ratio, effectively stabilizing the trajectory and enabling rapid convergence. This efficiency is particularly vital for resource-intensive tasks like video generation. Our method serves as a universal plug-and-play accelerator, achieving up to 3× speedup on Stable Diffusion v3.5 (SD3.5) and 4× on Lumina. Most notably, MAMBO-G accelerates the 14B-parameter Wan2.1 video model by 2× while preserving visual fidelity, offering a practical solution for efficient large-scale video synthesis. Our implementation follows a mainstream open-source diffusion framework and is plug-and-play with existing pipelines.

Abstract:
Large language models (LLMs) have enabled rapid progress in automatic heuristic discovery (AHD), yet most existing methods are predominantly limited by static evaluation against fixed instance distributions, leading to potential overfitting and poor generalization under distributional shifts. We propose Algorithm Space Response Oracles (ASRO), a gametheoretic framework that reframes heuristic discovery as a program level co-evolution between solver and instance generator. ASRO models their interaction as a two-player zero-sum game, maintains growing strategy pools on both sides, and iteratively expands them via LLM-based best-response oracles against mixed opponent meta-strategies, thereby replacing static evaluation with an adaptive, self-generated curriculum. Across multiple combinatorial optimization domains, ASRO consistently outperforms static-training AHD baselines built on the same program search mechanisms, achieving substantially improved generalization and robustness on diverse and out-of-distribution instances.

Abstract:
Bayesian Optimization is critically vulnerable to extreme outliers. Existing provably robust methods typically assume a bounded cumulative corruption budget, which makes them defenseless against even a single corruption of sufficient magnitude. To address this, we introduce a new adversary whose budget is only bounded in the frequency of corruptions, not in their magnitude. We then derive RCGPUCB, an algorithm coupling the upper confidence bound (UCB) approach with a Robust Conjugate Gaussian Process (RCGP). We present stable and adaptive versions of RCGP-UCB, and prove that they achieve sublinear regret in the presence of up to O(T^1/4) and O(T^1/7) corruptions with possibly infinite magnitude. This robustness comes at near zero cost: without outliers, RCGP-UCB's regret bounds match those of the standard GP-UCB algorithm.

Abstract:
Deep learningbased methods have shown remarkable effectiveness in solving PDEs, largely due to their ability to enable fast simulations once trained. However, despite the availability of high-performance computing infrastructure, many critical applications remain constrained by the substantial computational costs associated with generating large-scale, high-quality datasets and training models. In this work, inspired by studies on the structure of Green's functions for elliptic PDEs, we introduce Neural-HSS, a parameter-efficient architecture built upon the Hierarchical Semi-Separable (HSS) matrix structure that is provably data-efficient for a broad class of PDEs. We theoretically analyze the proposed architecture, proving that it satisfies exactness properties even in very low-data regimes. We also investigate its connections with other architectural primitives, such as the Fourier neural operator layer and convolutional layers. We experimentally validate the data efficiency of Neural-HSS on the three-dimensional Poisson equation over a grid of two million points, demonstrating its superior ability to learn from data generated by elliptic PDEs in the low-data regime while outperforming baseline methods. Finally, we demonstrate its capability to learn from data arising from a broad class of PDEs in diverse domains, including electromagnetism, fluid dynamics, and biology.

Abstract:
Convergence in highprobability (HP) has attracted increasing interest, due to implying exponentially decaying tail bounds and strong guarantees for individual runs of an algorithm. While many works study HP guarantees in centralized settings, much less is understood in the decentralized setup, where existing works require strong assumptions, like uniformly bounded gradients, or asymptotically vanishing noise. This results in a significant gap between the assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, and is also contrary to centralized settings, where it is known that \mathttSGD converges in HP under the same conditions on the cost function as needed for MSE convergence. Motivated by these observations, we study the HP convergence of Decentralized \mathttSGD (\mathttDSGD) in the presence of light-tailed noise, providing several strong results. First, we show that \mathttDSGD converges in HP under the same conditions on the cost as in the MSE sense, removing the restrictive assumptions used in prior works. Second, our sharp analysis yields order-optimal rates for both non-convex and strongly convex costs. Third, we establish a linear speed-up in the number of users, leading to matching, or strictly better transient times than those obtained from MSE results, further underlining the tightness of our analysis. To the best of our knowledge, this is the first work that shows \mathttDSGD achieves a linear speed-up in the HP sense. Our relaxed assumptions and sharp rates stem from several technical results of independent interest, including a result on the variance-reduction effect of decentralized methods in the HP sense, as well as a novel bound on the MGF of strongly convex costs, which is of interest even in centralized settings. Finally, we provide experiments that validate our theory.

Abstract:
Many decisionmaking problems require ranking individuals by their treatment effects rather than estimating the exact effect magnitudes. Examples include prioritizing patients for preventive care interventions, or ranking customers by the expected incremental impact of an advertisement. Surprisingly, while causal effect estimation has received substantial attention in the literature, the problem of directly learningrankings of treatment effectshas largely remained unexplored. In this paper, we introduceRank-Learner, a novel two-stage learner that directly learns the ranking of treatment effects from observational data. We first show that naive approaches based on precise treatment effect estimation solve a harder problem than necessary for ranking, while ourRank-Learneroptimizes a pairwise learning objective that recovers the true treatment effect ordering, without explicit CATE estimation. We further show that ourRank-Learneris Neyman-orthogonal and thus comes with strong theoretical guarantees, including robustness to estimation errors in the nuisance functions. In addition, ourRank-Learneris model-agnostic, and can be instantiated with arbitrary machine learning models (e.g., neural networks). We demonstrate the effectiveness of our method through extensive experiments whereRank-Learnerconsistently outperforms standard CATE estimators and non-orthogonal ranking methods. Overall, we provide practitioners with a new, orthogonal two-stage learner for ranking individuals by their treatment effects.

Abstract:
The Shapley value is a ubiquitous framework for attribution in machine learning, encompassing feature importance, data valuation, and causal inference. However, its exact computation is generally intractable, necessitating efficient approximation methods. While the most effective and popular estimators leverage thepaired samplingheuristic to reduce estimation error, the theoretical mechanism driving this improvement has remained opaque. In this work, we provide an elegant and fundamental justification for paired sampling: we prove that the Shapley value dependsexclusivelyon the odd component of the set function, and that paired sampling orthogonalizes the regression objective to filter out the irrelevant even component. Leveraging this insight, we propose OddSHAP, a novel consistent estimator that performs polynomial regression solely on the odd subspace. By utilizing the Fourier basis to isolate this subspace and employing a proxy model to identify highimpact interactions, OddSHAP overcomes the combinatorial explosion of higher-order approximations. Through an extensive benchmark evaluation, we find that OddSHAP achieves state-of-the-art estimation accuracy.

Abstract:
Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model, exploit spurious reward patterns instead of faithfully capturing human intent. Prior mitigations primarily relies on surface semantic information and fails to efficiently address the misalignment between the reward model (RM) and the policy model caused by continuous policy distribution shifts. This inevitably leads to an increasing reward discrepancy, exacerbating reward overoptimization. To address these limitations, we introduce R2M (RealTime Aligned Reward Model), a novel lightweight RLHF framework. R2M goes beyond vanilla reward models that solely depend on the semantic representations of a pretrained LLM. Instead, it leverages the evolving hidden states of the policy (namely policy feedback) to align with the real-time distribution shift of the policy during the RL process. This work points to a promising new direction for improving the performance of reward models through real-time utilization of feedback from policy models.

Abstract:
Finetuning large pre-trained models on a target distribution often improves in-distribution (ID) accuracy, but at the cost of out-of-distribution (OOD) robustness as representations specialize to the fine-tuning data. Weight-space ensembling methods, such as Model Soups, mitigate this effect by averaging multiple checkpoints, but they are computationally prohibitive, requiring the training and storage of dozens of fine-tuned models. In this paper, we introduce MonoSoup, a simple, data-free, hyperparameter-free, post-hoc method that achieves a strong ID–OOD balance usingonly a singlecheckpoint. Our method applies Singular Value Decomposition (SVD) to each layer’s update and decomposes it into high-energy directions that capture task-specific adaptation and low-energy directions that introduce noise but may still encode residual signals useful for robustness. MonoSoup then uses entropy-based effective rank to automatically re-weigh these components with layer-wise coefficients that account for the spectral and geometric structure of the model. Experiments on CLIP models fine-tuned on ImageNet and evaluated under natural distribution shifts, as well as on Qwen language models tested on mathematical reasoning and multiple-choice benchmarks, show that this plug-and-play approach is a practical and effective alternative to multi-checkpoint methods, retaining much of their benefits without their computational overhead.

Abstract:
With the rise of largescale foundation models, efficiently adapting them to downstream tasks remains a central challenge. Linear probing, which freezes the backbone and trains a lightweight head, is computationally efficient but often restricted to last-layer representations. We show that task-relevant information is distributed across the network hierarchy rather than solely encoded in any of the last layers. To leverage this distribution of information, we apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. This attentive layer fusion (ALF) learns to identify the most relevant layers for a target task and combines low-level structural cues with high-level semantic abstractions. Across 20 diverse datasets and multiple pretrained foundation models, ALF achieves consistent, substantial gains over standard linear probes. Attention heatmaps further reveal that tasks different from the pre-training domain benefit most from intermediate representations. Overall, our findings underscore the value of intermediate-layers and demonstrate a principled, task-aware approach for unlocking their potential for probing-based adaptation.

Abstract:
Evaluating conditional coverage remains one of the most persistent challenges in assessing the reliability of predictive systems. Although conformal methods can give guarantees on marginal coverage, no method can guarantee to produce sets with correct conditional coverage, leaving practitioners without a clear way to interpret local deviations. To overcome sampleinefficiency and overfitting issues of existing metrics, we cast conditional coverage estimation as a classification problem. Conditional coverage is violated if and only if any classifier can achieve lower risk than the target coverage. Through the choice of a (proper) loss function, the resulting risk difference gives a conservative estimate of natural miscoverage measures such as L1 and L2 distance, and can even separate the effects of over- and under-coverage, as well as handle non-constant target coverages. We call the resulting family of metrics excess risk of the target coverage (ERT). We show experimentally that the use of modern classifiers provides much higher statistical power than simple classifiers underlying established metrics like CovGap. Additionally, we use our metric to benchmark different conformal prediction methods. Finally, we release an open-source package for ERT as well as previous conditional coverage metrics. Together, these contributions provide a new lens for understanding, diagnosing, and improving the conditional reliability of predictive systems.

Abstract:
Watermarking offers a promising solution for detecting LLMgenerated content, yet its robustness under realistic query-free (black-box) evasion remains an open challenge. Existing query-free attacks often achieve limited success or severely distort semantic meaning. We bridge this gap by theoretically analyzing rewriting-based evasion, demonstrating that reducing the average conditional probability of sampling green tokens by a small margin causes the detection probability to decay exponentially. Guided by this insight, we propose the Bias-Inversion Rewriting Attack (BIRA), a practical query-free method that applies a negative logit bias to a proxy suppression set identified via token surprisal. Empirically, BIRA achieves state-of-the-art evasion rates (>99%) across diverse watermarking schemes while preserving semantic fidelity substantially better than prior baselines. Our findings reveal a fundamental vulnerability in current watermarking methods and highlight the need for rigorous stress tests.

Abstract:
Dense associative memories (DAMs) store and retrieve patterns via energyfunction based fixed points, but existing models are limited to vector representations. We extend DAMs to Gaussian densities equipped with the 2-Wasserstein distance. Our framework defines a log-sum-exp energy over stored distributions and a retrieval dynamics aggregating optimal transport maps in a Gibbs-weighted manner. Stationary points correspond to self-consistent Wasserstein barycenters, generalizing classical DAM fixed points. We prove exponential storage capacity and provide quantitative retrieval guarantees under Wasserstein perturbations. We validate the method on synthetic and real-world image (CelebA and CIFAR-10 datasets) and text (text8 and NLI corpus) datasets. By generalizing from vectors to distributions, our work bridges classical DAMs with modern generative modeling and paves way for distributional storage and retrieval in memory-augmented learning.

Abstract:
Longduration streaming video understanding is fundamental for future AI agents, yet remains limited by ineffective long-term memory. We introduce video-SALMONN S, a memory-enhanced streaming audio-visual large language model that processes over 3-hour videos at 1 FPS and 360p resolution, outperforming strong non-streaming models under the same memory budget. In addition to token merging or downsampling, video-SALMONN S is the first to employ test-time training (TTT) as a streaming memory mechanism for video understanding. TTT continuously transforms short-term multimodal representations into long-term memory embedded in model parameters. To improve long-range dependency modeling and memory capacity, we propose (i) a TTT_\textMEM layer with an additional long-span prediction objective, (ii) a two-stage training scheme, and (iii) a modality-aware memory reader. We further introduce the episodic learning from video memory (ELViM) benchmark, simulating agent-like scenarios where models must learn from videos observed hours earlier. video-SALMONN S consistently outperforms both streaming and non-streaming baselines by 3-7% on long video benchmarks. Notably, video-SALMONN S achieves a 15% absolute accuracy improvement over strong non-streaming models on ELViM, demonstrating strong learning abilities from video memory.

Abstract:
In this paper, we study cooperative multiagent reinforcement learning (MARL) where the joint reward exhibits submodularity, which is a natural property capturing diminishing marginal returns when adding agents to a team. Unlike standard MARL with additive rewards, submodular rewards model realistic scenarios where agent contributions overlap (e.g., multi-drone surveillance, collaborative exploration). We provide the first formal framework for this setting and develop algorithms with provable guarantees on sample efficiency and regret bound. For known dynamics, our greedy policy optimization achieves a 1/2-approximation with polynomial complexity in the number of agents K, overcoming the exponential curse of dimensionality inherent in joint policy optimization. For unknown dynamics, we propose a UCB-based learning algorithm achieving a 1/2-regret of O(H^2KS\sqrtAT) over T episodes.

Abstract:
Verbal confidence—prompting LLMs to state their confidence as a number or category—is widely used to extract uncertainty estimates from blackbox models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed -- just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents -- token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation—not post-hoc reconstruction—with implications for understanding metacognition in LLMs and improving calibration.

Abstract:
Predictionpowered inference is a recent methodology for the safe use of black-box ML models to impute missing data, strengthening inference of statistical parameters. However, many applications require strong properties besides valid inference, such as privacy, robustness or validity under continuous distribution shifts; deriving prediction-powered methods with such guarantees is generally an arduous process, and has to be done case by case. In this paper, we resolve this issue by connecting prediction-powered inference with conformal prediction: by performing imputation through a calibrated set-predictor, we attain validity while achieving additional guarantees in a natural manner. We instantiate our procedure for the inference of means, Z- and M-estimation, as well as e-values and e-value-based procedures. Furthermore, in the case of e-values, ours is the first general prediction-powered procedure that operates off-line. We demonstrate these advantages by applying our method on private and time-series data. Both tasks are nontrivial within the standard prediction-powered framework but become natural under our method.

Abstract:
Crosslayer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant yields 1.5x downstream accuracy points while matching validation loss using 1.5x fewer tokens than Gated Attention. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in feature transformation. We release code and models to facilitate future research.

Abstract:
Backdoor and datapoisoning attacks can achieve high attack success while evading existing spectral and optimisation-based defences. We show that this behaviour is not incidental, but arises from a fundamental geometric mechanism in input space. Using kernel ridge regression as an exact model of wide neural networks, we prove that clustered dirty-label poisons induce a rank-one spike in the input Hessian whose magnitude scales quadratically with attack efficacy. Crucially, for nonlinear kernels we identify a near-clone regime in which poison efficacy remains order-one while the induced input curvature vanishes, making the attack provably spectrally undetectable. We further show that input-gradient regularisation contracts poison-aligned Fisher and Hessian eigenmodes under gradient flow, yielding an explicit and unavoidable safety–efficacy trade-off by reducing data-fitting capacity. For exponential kernels, this defence admits a precise interpretation as an anisotropic high-pass filter that increases the effective length scale and suppresses near-clone poisons. Extensive experiments on linear models and deep convolutional networks across MNIST and CIFAR-10/100 validate the theory, demonstrating consistent lags between attack success and spectral visibility, and showing that regularisation and data augmentation jointly suppress poisoning. Our results establish when backdoors are inherently invisible, and provide the first end-to-end characterisation of poisoning, detectability, and defence through input-space curvature.

Abstract:
The standard way to evaluate language models on subjective tasks is through pairwise comparisons: an annotator chooses the "better" of two model responses for a given prompt. These comparisons are then aggregated into a single ranking via the Bradley–Terry (BT) framework, forcing heterogeneous preferences into a total order and violating basic socialchoice desiderata. In contrast, social choice theory provides an alternative approach called maximal lotteries, which aggregates pairwise preferences without imposing any assumptions on their structure. However, we show that maximal lotteries can be highly sensitive to heterogeneity among annotators and across prompts. We introducerobust lotteries, which optimize worst-case performance under plausible shifts in the preference data. On large-scale preference datasets, robust lotteries achieve more reliable win rate guarantees across the annotator distribution and recover a stable set of top performing models.

Abstract:
Scaling language models to long contexts is often bottlenecked by the size of the keyvalue (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges (Eyuboglu et al., 2025) has shown that it is possible to train highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to 50× compaction in seconds on some datasets with little quality loss.

Abstract:
Generative models augmented with external tools and update mechanisms (or agents) have demonstrated capabilities beyond intelligent prompting of base models. As agent use proliferates, dynamic multiagent systems have naturally emerged. Recent work has investigated the theoretical and empirical properties of low-dimensional representations of agents based on query responses at a single time point. This paper introduces the Temporal Data Kernel Perspective Space (TDKPS), which jointly embeds agents across time, and proposes several novel hypothesis tests for detecting behavioral change at the agent- and group-level in black-box multi-agent systems. We characterize the empirical properties of our proposed tests, including their sensitivity to key hyperparameters, in simulations motivated by a multi-agent system of evolving digital personas. Finally, we demonstrate via natural experiment that our proposed tests detect changes that correlate sensitively, specifically, and significantly with a real exogenous event. TDKPS is the first principled framework for monitoring behavioral dynamics in black-box multi-agent systems - a critical capability as generative agent deployment continues to scale.

Abstract:
Evaluating the performance of large language models (LLMs) from human preference data is crucial for obtaining LLM leaderboards. However, many existing approaches either rely on restrictive parametric assumptions or lack valid uncertainty quantification when flexible machine learning methods are used. In this paper, we propose a nonparametric statistical framework, called DMLRank, for comparing and ranking LLMs from preference data using debiased machine learning (DML). For this, we introduce generalized average ranking scores (GARS), which generalize commonly used ranking models, including the BradleyTerry model or PageRank/ Rank centrality with complex human responses such as ties. \framework comes with the following advantages: (i)~It produces statistically efficient estimates of GARS ranking scores. (ii) It naturally allows to incorporate black-box machine learning methods for estimation. (iii) It can be combined with pre-trained LLM evaluators (e.g., using LLM-as-a-judge). (iv) It suggests optimal policies for collecting preference data under budget constraints. We demonstrate these advantages both theoretically and empirically using both synthetic and real-world preference datasets. In summary, our framework provides practitioners with powerful, state-of-the-art methods for comparing or ranking LLMs for leaderboards.

Abstract:
We study the capacity of the selfattention key-query channel: for a fixed budget, how many distinct token-token relations can a single layer reliably encode? We introduce Relational Graph Recognition, where the key-query channel encodes a directed graph and, given a context (a subset of the vertices), must recover the neighbors of each vertex in the context. We measure resources by the total key dimension D_K = hd_k. In a tractable multi-head model, we prove matching information-theoretic lower bounds and upper bounds via explicit constructions showing that recovering a graph with m' relations in d_\textmodel-dimensional embeddings requires D_K to grow essentially as m'/d_\textmodel up to logarithmic factors, and we obtain corresponding guarantees for scaled-softmax attention. This analysis yields a new, capacity-based rationale for multi-head attention: even in permutation graphs, where all queries attend to a single target, splitting a fixed D_K budget into multiple heads increases capacity by reducing interference from embedding superposition. Controlled experiments mirror the theory, revealing sharp phase transitions at the predicted capacity, and the multi-head advantage persists when adding softmax normalization, value routing, and a full Transformer block trained with frozen GPT-2 embeddings.

Abstract:
Selfevolving memory systems are rapidly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents’ experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior work and promote openness in future self-evolving systems, we introduce EvolveLab, a unified memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks show that MemEvolve delivers (i) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to 17.06%, and (ii) strong cross-task and cross-LLM generalization, yielding memory architectures that transfer effectively across diverse benchmarks and backbones.

Abstract:
In many realworld settings, a centralized decision-maker must repeatedly allocate finite resources to a population over multiple time steps. Individuals who receive a resource derive some stochastic utility; to characterize the population-level effects of an allocation, the expected individual utilities are then aggregated using a social welfare function (SWF). We formalize this setting and present a general confidence sequence framework for SWF-based online learning and inference, valid for any monotonic, concave, and Lipschitz-continuous SWF. Our key insight is that monotonicity alone suffices to lift confidence sequences from individual utilities to anytime-valid bounds on optimal welfare. Building on this foundation, we propose SWF-UCB, a SWF-agnostic online learning algorithm that achieves near-optimal \tilde\mathcalO(n+\sqrtnkT) regret (for k resources distributed among n individuals at each of T time steps). We instantiate our framework on three normatively distinct SWF families: Weighted Power Mean, Kolm, and Gini, providing bespoke oracle algorithms for each. Experiments confirm \sqrtT scaling and reveal rich interactions between k and SWF parameters. This framework naturally supports inference applications such as sequential hypothesis testing, optimal stopping, and policy evaluation.

Abstract:
Standard classification treats all errors equally, but in content moderation, medical screening, and safetycritical applications, mistakes on clear-cut cases are far more costly than errors on ambiguous ones. We propose normalized excess cost (NEC), a metric that weights classification errors by per-example costs and reduces to standard error rate when costs are uniform. Costs can derive from annotator vote margins, distance from decision thresholds, or confidence ratings. Across text, image, and tabular benchmarks, we find that NEC is often substantially lower than error rate—models with 5% error rate can achieve 1.8% NEC—revealing that most mistakes concentrate on ambiguous, low-cost examples. However, incorporating costs into training via loss weighting, sampling strategies, or regression yields inconsistent benefits: improvements appear only when costs are predictable from input features, as in our synthetic control, while real-world datasets show mixed or negligible gains. Our framework provides a practical methodology for deriving and evaluating instance-level misclassification costs, even when cost-sensitive training offers limited benefit.

Abstract:
We focus on the problem of AnswerLevel Fine-Tuning (ALFT), where the goal is to optimize a language model based on the correctness or properties of its final answers, rather than the specific reasoning traces used to produce them. Directly optimizing answer-level objectives is computationally intractable due to the need to marginalize over the vast space of latent reasoning paths. To overcome this, we propose a general game-theoretical framework that lifts the problem to a Distributional Alignment Game. We formulate ALFT as a two-player game between a Policy (the generator) and a Target (an auxiliary distribution). We prove that the Nash Equilibrium of this game corresponds exactly to the solution of the original answer-level optimization problem. This variational perspective transforms the intractable marginalization problem into a tractable projection problem. We demonstrate that this framework unifies recent approaches to diversity and self-improvement (coherence) and provide efficient algorithms compatible with Group Relative Policy Optimization (GRPO), yielding significant complexity gains in mathematical reasoning tasks.

Abstract:
Achieving reliable 4bit attention is a prerequisite for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle due to FP4's tiny dynamic range and attention's heavy-tailed activations. This paper presents the first systematic study of 4-bit quantization-aware training (QAT) for attention. We find ``drop-in'' QAT -- naively combining an FP4 forward pass with high-precision Flash Attention (FA)-style backward pass -- leads to training instability. We identify two key principles for stable FP4 attention: (1) matching low-precision recomputation of attention scores in the backward pass and (2) resolving implicit precision assumptions in FA’s gradient calculation. Based on these insights, we propose Attn-QAT and implement fused Triton kernels for training plus FP4 inference kernels. Across diffusion and language models, Attn-QAT recovers the quality drop from FP4 attention without explicit outlier-mitigation heuristics used in prior FP4 attention, and delivers up to a 1.5x speedup on an RTX 5090.

Abstract:
Assistive agents should not only take actions on behalf of a human, but also step out of the way and cede control when there are important decisions to be made. However, current methods for building assistive agents, whether via mimicking expert humans or via RL finetuning on an inferred reward, often encourage agents to complete tasks on their own rather than truly assisting the human achieve her objectives. We propose a new approach to tuning assistive language models based on maximizing the human's empowerment, their ability to effect desired changes in the environment. Our empowermentmaximizing method, Empower, only requires offline examples of human-written text, providing a self-supervised method for fine-tuning language models to better assist humans. To study the efficacy of our approach, we conducted an 18-person user study comparing our empowerment assistant with a strong baseline. Participants preferred our assistant 78% of the time (p=0.015), with a 31% higher acceptance rate. With this empowerment objective, we provide a framework for useful aligned AI agents without the need for human feedback or verifiable rewards.

Abstract:
3D editing—the task of locally modifying the geometry or appearance of a 3D asset—has wide applications in immersive content creation, digital entertainment, and AR/VR. However, unlike 2D editing, it remains challenging due to the need for crossview consistency, structural fidelity, and fine-grained controllability. Existing approaches are often slow, prone to geometric distortions, or dependent on manual and accurate 3D masks that are error-prone and impractical. To address these challenges, we advance both the data and model fronts. On the data side, we introduce 3DEditVerse, the largest paired 3D editing benchmark to date, comprising 116,309 high-quality training pairs and 1,500 curated test pairs. Built through complementary pipelines of pose-driven geometric edits and foundation model-guided appearance edits, 3DEditVerse ensures edit locality, multi-view consistency, and semantic alignment. On the model side, we propose 3DEditFormer, a 3D-structure-preserving transformer. By enhancing image-to-3D generation with dual-guidance attention and time-adaptive gating, 3DEditFormer disentangles editable regions from preserved structure, enabling precise and consistent edits without requiring auxiliary 3D masks. Extensive experiments demonstrate that our framework outperforms state-of-the-art baselines both quantitatively and qualitatively, establishing a new standard for practical and scalable 3D editing. Dataset and code will be released. Project: https://anonymousresearch37.github.io/3DEditFormer/

Abstract:
We study networked binary classification on a directed acyclic graph (DAG) where each agent observes only a subset of the feature columns of a shared finite dataset. Agents act sequentially along the DAG: each receives prediction columns from its parents (if any), augments its local features with these columns, fits a logistic predictor by minimizing binary crossentropy (BCE), and forwards its prediction column to its outgoing neighbors. We ask whether this sequential distributed training procedure achieves information aggregation, meaning that some agent attains small excess loss compared to the best logistic predictor trained with access to all feature columns. This question was studied for linear regression under squared loss by Kearns, Roth, and Ryu (2026). Extending their guarantees to classification is nontrivial because their analysis relies on quadratic structure that does not directly transfer to BCE with a logistic link. We analyze the resulting sequential logit-passing protocol and prove: (i) an excess loss upper bound of O(M/\sqrtD) on depth-D paths under the condition that every M contiguous subsequence of M agents collectively observe all features, and (ii) a close lower bound showing instances with excess loss of at least \Omega(k/D) where k is the dimension of the feature space. Together, these results identify network depth as a fundamental bottleneck for information aggregation in networked logistic regression.

Abstract:
Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on explicit preference data with high collection costs. In this work, we study implicit reward modelinglearning reward models from implicit human feedback---which offers a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) Implicit preference data lacks definitive negative samples, which makes standard positive-negative classification methods inapplicable; (2) Implicit preference data suffers from user preference bias, where different responses have different propensities to elicit user feedback actions, which exacerbates the difficulty of distinguishing definitive negative samples. To address these challenges, we propose ImplicitRM, which aims to learn unbiased reward models from implicit preference data. ImplicitRM stratifies training samples into four latent groups via a stratification model. Building on this, it derives a learning objective through likelihood maximization, which we prove is theoretically unbiased, effectively resolving both challenges. Experiments demonstrate that ImplicitRM learns accurate reward models across implicit preference datasets. Code is available at https://anonymous.4open.science/r/ImplicitRM-5FB3.

Abstract:
While the autobidding literature predominantly considers independent bidding, we investigate the coordination problem among multiple auto-bidders in online advertising platforms. Two motivating scenarios are: collaborative bidding among multiple bidders managed by a third-party bidding agent, and strategic bid selection for multiple ad campaigns managed by a single advertiser. We formalize this coordination problem as a theoretical model and investigate the coordination mechanism where only the highest-value bidder competes with outside bidders, while other coordinated bidders refrain from competing. We demonstrate that such a coordination mechanism dominates independent bidding, improving both Return-on-Spend (RoS) compliance and the total value accrued for the participating auto-bidders or ad campaigns, for a broad class of auto-bidding algorithms. Additionally, our simulations on synthetic and real-world datasets support the theoretical result that coordination outperforms independent bidding. These findings highlight both the theoretical potential and the practical robustness of coordinated auto-bidding in online auctions.

Abstract:
Learning models with categorical variables requires optimizing expectations over discrete distributions, a setting in which stochastic gradientbased optimization is challenging due to the non-differentiability of categorical sampling. A common workaround is to replace the discrete distribution with a continuous relaxation, yielding a smooth surrogate that admits reparameterized gradient estimates via the reparameterization trick. Building on this idea, we introduce ReDGE, a novel and efficient diffusion-based soft reparameterization method for categorical distributions. Our approach defines a flexible class of gradient estimators that includes the Straight-Through estimator as a special case. Experiments spanning latent variable models and inference-time reward guidance in discrete diffusion models demonstrate ReDGE consistently matches or outperforms existing gradient-based methods.

Abstract:
This paper addresses the critical challenges of hyperparameter tuning and communication efficiency in federated learning (FL). Despite recent advancements in parameterfree FL algorithms such as PAdaMFed, significant communication overhead remains a major obstacle to their practical deployment. To tackle these challenges, we propose a novel communication-efficient parameter-free FL algorithm ParFreFL that halves the communication requirements of PAdaMFed while preserving its parameter-free property. Building on this foundation, we introduce a compressed variant, ComParFreFL, which unifies the momentum increment and error feedback into a single parameter, effectively handling biased compression while maintaining the minimal communication cost. Notably, ComParFreFL also operates independent of the compression ratio, representing the first instance of such robustness in the compressed FL literature to our knowledge. Theoretically, our methods are proven to handle arbitrary data heterogeneity, partial client participation, and achieve linear speedup with respect to both local updates and participating clients. Extensive empirical evaluations demonstrate that our approaches match or slightly surpass the performance of carefully tuned alternatives while significantly reducing communication overhead, making FL more accessible and deployable in dynamic, resource-constrained environments.

Abstract:
Combinatorial Optimization problems are widespread in domains such as logistics, manufacturing, and drug discovery, yet their NPhard nature makes them computationally challenging. Recent Neural Combinatorial Optimization (NCO) methods leverage deep learning to learn policies for constructing solutions, trained via Supervised or Reinforcement Learning. While promising, these approaches often rely on task-specific augmentations, perform poorly on out-of-distribution instances, and lack robust inference mechanisms. Moreover, existing latent space models either require labeled data or use an instance-independent latent distribution. In this work, we propose LGS-Net, a novel latent space model that conditions on problem instances, and introduce an efficient inference method, Latent Guided Sampling (LGS), based on Markov Chain Monte Carlo and Stochastic Approximation. We show that the iterations of our method form a time-inhomogeneous Markov Chain and provide rigorous theoretical convergence guarantees. Empirical results on benchmark routing tasks show that our method achieves state-of-the-art performance among NCO baselines.

Abstract:
Large language model (LLM) agents are fundamentally constrained by context length on longhorizon tasks. Existing agent frameworks usually rely on manually defined context engineering pipelines, such as multi-agent or post-hoc summary. We introduce Context Folding, a framework that empowers agents to actively manage their working context. An agent can procedurally branch into a sub-trajectory to handle a subtask and then fold it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome. To make this behavior learnable, we propose FoldPO, an end-to-end reinforcement learning framework with specific process rewards to encourage effective task decomposition and context management. On complex long-horizon tasks, our agent matches the performance of baselines while using an active context up to 10x smaller, and significantly outperforms models constrained to the same context size.

Abstract:
Existing evaluations of conformal prediction, such as prediction efficiency and testconditional coverage, are defined in expectation over the calibration data. In practice, when only one calibration set of limited size is available, the prediction sets can exhibit high variability, especially for methods with localization. We formalize this concern asset stability, defined as the variance of the conditional expectation of the set size given the calibration data. To improve stability without requiring additional target-task labels, we propose Stable Localized Conformal Prediction (SLCP), a transfer learning approach that utilizes labeled source-task data and unlabeled target data. Theoretically, we characterize the marginal coverage and stability of SLCP; empirically, it delivers more stable prediction sets than standard conformal prediction methods with localization when calibration data are limited.

Abstract:
We investigate episodic Markov Decision Processes with heavytailed feedback (HTMDPs). Existing approaches for HTMDPs are conservative in stochastic environments and lack adaptivity in adversarial regimes. In this work, we propose algorithms ```HT-FTRL-OM``` and ```HT-FTRL-UOB``` for HTMDPs that achieve Best-of-Both-Worlds (BoBW) guarantees: instance-independent regret in adversarial environments and logarithmic instance-dependent regret in self-bounding (including the stochastic case) environments. For the known transition setting, ```HT-FTRL-OM``` applies the Follow-The-Regularized-Leader (FTRL) framework over occupancy measures with novel skipping loss estimators, achieving a \widetilde\mathcalO(T^1/\alpha) regret bound in adversarial regimes and a \mathcalO(\log T) regret in stochastic regimes. Building upon this framework, we develop a novel algorithm ```HT-FTRL-UOB``` to tackle the more challenging unknown-transition setting. This algorithm employs a pessimistic skipping loss estimator and achieves a \widetilde\mathcalO(T^1/\alpha + \sqrtT) regret in adversarial regimes and a \mathcalO(\log^2(T)) regret in stochastic regimes. Our analysis overcomes key barriers through several technical insights, including a local control mechanism for heavy-tailed shifted losses, a new suboptimal-mass propagation principle, and a novel regret decomposition that isolates transition uncertainty from heavy-tailed estimation errors and skipping bias.

Abstract:
The growing use of machine learning in safetycritical settings heightens vulnerability to adversarial attacks. Existing defense mechanisms typically either lack formal guarantees or depend on restrictive assumptions about the model family, the threat model, or the poisoning budget, and many only offer point-wise certification. Importantly, they often overlook the inherent stochasticity of modern training pipelines, which undermines their practical reliability. We introduce a probabilistic framework that views gradient-based training as a discrete-time stochastic dynamical system and formulates poisoning robustness as a safety verification task. Leveraging barrier certificates (BCs), we derive sufficient conditions to probabilistically certify a robust radius against worst-case \ell_p-bounded poisoning, guaranteeing that the final model parameters remain within a safe set. For tractable computation, we represent BCs with neural networks and obtain probably approximately correct (PAC) guarantees through a scenario convex problem. Our method identifies the largest certified radius for which the trained model is probabilistically accurate with a specified confidence level. Experiments on MNIST, SVHN, and CIFAR-10 show that our framework offers formal robustness guarantees under stochastic training, while being model-agnostic and not requiring prior knowledge of the attack strategy.

Abstract:
Long input sequences are central to incontext learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM’s native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior. Code and checkpoints will be released upon publication.

Abstract:
Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with longhorizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide an unified view that encompasses different approaches and introduce progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.

Abstract:
We consider supervised learning with n labels and show that layerwise SGD on residual networks can efficiently learn a class of hierarchical models. This model class assumes the existence of an (unknown) label hierarchy L_1 \subseteq L_2 \subseteq \dots \subseteq L_r = [n], where labels in L_1 are simple functions of the input, while for i > 1, labels in L_i are simple functions of simpler labels. Our class surpasses models that were previously shown to be learnable by deep learning algorithms, in the sense that it reaches the depth limit of efficient learnability. That is, there are models in this class that require polynomial depth to express, whereas previous models can be computed by logdepth circuits. Furthermore, we suggest that learnability of such hierarchical models might eventually form a basis for understanding deep learning. Beyond their natural fit for domains where deep learning excels, we argue that the mere existence of human ``teachers" supports the hypothesis that hierarchical structures are inherently available. By providing granular labels, teachers effectively reveal ``hints'' or ``snippets'' of the internal algorithms used by the brain. We formalize this intuition, showing that in a simplified model where a teacher is partially aware of their internal logic, a hierarchical structure emerges that facilitates efficient learnability.

Abstract:
Machine learning is increasingly used to guide branchand-cut (B&C) for mixed-integer linear programming by learning score-based policies for selecting branching variables and cutting planes. Many approaches train on local signals from lookahead heuristics such as strong branching, and linear programming (LP) bound improvement for cut selection. Training and evaluation of the learned models often focus on local score accuracy. We show that such local score-based methods can lead to search trees exponentially larger than optimal tree sizes, by identifying two sources of this gap. The first is that these widely used expert signals can be misaligned with overall tree size. LP bound improvement can select a root cut set that yields an exponentially larger strong branching tree than selecting cuts by a simple proxy score, and strong branching itself can be exponentially suboptimal (Dey et al., 2024). The second is that small discrepancies can be amplified by the branch-and-bound recursion. An arbitrarily small perturbation of the right-hand sides in a root cut set can change the minimum tree size from a single node to exponentially many. For branching, arbitrarily small score discrepancies, and differences only in tie-breaking, can produce trees of exponentially different sizes, and even a small number of decision differences along a trajectory can incur exponential growth. These results show that branch-and-cut policies trained and learned using local expert scores do not guarantee small trees, thus motivating the study of data-driven methods that produce policies better aligned with tree size rather than only accuracy on expert scores.

Abstract:
Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the logprobability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.

Abstract:
We study neural networks with trainable lowdegree rational activation functions and show that they are more expressive and parameter-efficient than modern piecewise-linear and smooth activations such as ELU, LeakyReLU, LogSigmoid, PReLU, ReLU, SELU, CELU, Sigmoid, SiLU, Mish, Softplus, Tanh, Softmin, Softmax, and LogSoftmax. For an error target of \varepsilon>0, we establish approximation-theoretic separations: Any network built from standard fixed activations can be uniformly approximated on compact domains by a rational-activation network with only \mathrmpoly(\log\log(1/\varepsilon)) overhead in size, while the converse provably requires \Omega(\log(1/\varepsilon)) parameters in the worst case. This exponential gap persists at the level of full networks and extends to gated activations and transformer-style nonlinearities. In practice, rational activations integrate seamlessly into standard architectures and training pipelines, allowing rationals to match or outperform fixed activations under identical architectures and optimizers.

Abstract:
4D generation, or dynamic 3D content generation, integrates spatial, temporal, and view dimensions to model realistic dynamic scenes, playing a foundational role in advancing world models and physical AI. However, maintaining longchain consistency across both frames and viewpoints through the unique spatio-camera-motion (SCM) attention mechanism introduces substantial computational and memory overhead, often leading to out-of-memory (OOM) failures and prohibitive generation times. To address these challenges, we propose Turbo4DGen, an ultra-fast acceleration framework for diffusion-based multi-view 4D content generation. Turbo4DGen introduces a spatiotemporal cache mechanism that persistently reuses intermediate attention across denoising steps, combined with dynamically semantic-aware attention pruning and an adaptive SCM chain bypass scheduler, to drastically reduce redundant SCM attention computation. Our experimental results show that Turbo4DGen achieves an average 9.7× speedup without quality degradation on the ObjaverseDy and Consistent4D datasets. To the best of our knowledge, Turbo4DGen is the first dedicated acceleration framework for 4D generation.

Abstract:
Oneshot prediction enables rapid adaptation of pretrained foundation models to new tasks using only one labeled example, but lacks principled uncertainty quantification. While conformal prediction provides finite-sample coverage guarantees, standard split conformal methods are inefficient in the one-shot setting due to data splitting and reliance on a single predictor. We propose Conformal Aggregation of One-Shot Predictors (CAOS), a conformal framework that adaptively aggregates multiple one-shot predictors and uses a leave-one-out calibration scheme to fully exploit scarce labeled data. Despite violating classical exchangeability assumptions, we prove that CAOS achieves valid marginal coverage using a monotonicity-based argument. Experiments on one-shot facial landmarking and RAFT text classification tasks show that CAOS produces substantially smaller prediction sets than split conformal baselines while maintaining reliable coverage.

Abstract:
Learningbased methods have made significant progress in physics simulation, typically approximating dynamics with a monolithic end-to-end optimized neural network. Although these models offer an effective way to simulation, they may lose essential features compared to traditional numerical simulators, such as physical interpretability and reliability. Drawing inspiration from classical simulators that operate in a modular fashion, this paper presents Neural Modular Physics (NMP) for elastic simulation, which combines the approximation capacity of neural networks with the physical reliability of traditional simulators. Beyond the previous monolithic learning paradigm, NMP enables direct supervision of intermediate quantities and physical constraints by decomposing elastic dynamics into physically meaningful neural modules connected through intermediate physical quantities. With a specialized architecture and training strategy, our method transforms the numerical computation flow into a modular neural simulator, achieving improved physical consistency and generalizability. Experimentally, NMP demonstrates superior generalization to unseen initial conditions and resolutions, stable long-horizon simulation, better preservation of physical properties compared to other neural simulators, and greater feasibility in scenarios with unknown underlying dynamics than traditional simulators.

Abstract:
Datadriven models are revolutionizing weather forecasting. To optimize training efficiency and model performance, this paper analyzes empirical scaling laws within this domain. We investigate the relationship between model performance (validation loss) and three key factors: model size (N), dataset size (D), and compute budget (C). Across a range of models, we find that Aurora exhibits the strongest data-scaling behavior: increasing the training dataset by 10× reduces validation loss by up to 3.2×. GraphCast demonstrates the highest parameter efficiency, yet suffers from limited hardware utilization. Our compute-optimal analysis indicates that, under fixed compute budgets, allocating resources to longer training durations yields greater performance gains than increasing model size. Furthermore, we analyze model shape and uncover scaling behaviors that differ fundamentally from those observed in language models: weather forecasting models consistently favor increased width over depth. These findings suggest that future weather models should prioritize wider architectures and larger effective training datasets to maximize predictive performance.\footnoteCode is available at \urlhttps://anonymous.4open.science/r/scaling-laws-weather-model-8560/

Abstract:
Small language models are viewed as a promising, costeffective approach to agentic AI, yet how their performance scales with task complexity remains unclear. While smaller agents match larger ones on simple tasks, it is unknown when large models become necessary and how to better leverage small agents. In this work, we show that small agents fail to scale with task complexity on deep search and coding tasks, and introduceStrategy Auctions for Workload Efficiency(SALE), a framework inspired by freelancer marketplaces. In SALE, agents bid with strategic plans scored by a cost–value mechanism and refined via shared auction memory, enabling per-task routing and continual self-improvement without training a router. On average, SALE reduces reliance on the largest agent by 53%, lowers overall cost by 35%, and consistently improves pass@1 with only a negligible token overhead. In contrast, established routers either underperform the largest agent or fail to reduce cost. These results suggest that small agents can be effectively “scaled up” through coordinated allocation and test-time self-improvement. More broadly, they motivate a systems-level view of agentic AI in which gains come less from ever-larger individual models and more from market-inspired coordination mechanisms that organize heterogeneous agents into efficient, adaptive ecosystems.

Abstract:
Graph neural networks (GNNs) are widely believed to excel at node representation learning through trainable neighborhood aggregations. We challenge this view by introducing Fixed Aggregation Features (FAFs), a trainingfree approach that transforms graph learning tasks into tabular problems. This simple shift enables the use of well-established tabular methods, offering strong interpretability and the flexibility to deploy diverse classifiers. Across 14 benchmarks, well-tuned multilayer perceptrons trained on FAFs rival or outperform state-of-the-art GNNs and graph transformers on 12 tasks—often using only mean aggregation. The only exceptions are the Roman Empire and Minesweeper datasets, which typically require unusually deep GNNs. To explain the theoretical possibility of non-trainable aggregations, we connect our findings to Kolmogorov–Arnold representations and discuss when mean aggregation can be sufficient. In conclusion, our results call for (i) richer benchmarks benefiting from learning diverse neighborhood aggregations, (ii) strong tabular baselines as standard, and (iii) employing and advancing tabular models for graph data to gain new insights into related tasks.

Abstract:
Trainingtime privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a powerful tool for reinforcement learning in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, where closed-source systems typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines, since successful behavior is observable but the reasoning process is not. We introduce π-Distill, a joint teacher–student framework that trains a PI-conditioned teacher and an unconditioned student simultaneously within a single shared-parameter model, enabling the teacher to learn how to use PI while mitigating distribution shift during transfer. We show that π-Distill effectively distills frontier agents using action-only privileged information, matching or outperforming industry-standard pipelines that assume access to full Chain-of-Thought supervision across multiple agentic benchmarks, models, and forms of PI. We complement our results with extensive analysis that characterize what factors enable effective learning with PI.

Abstract:
Explainability for graph neural networks (GNNs) aims to unveil the complex decision logic of learned models by identifying the most influential structures in the input graph, thereby improving transparency and trustworthiness. Existing posthoc explainers typically extract a sparse key subgraph at a single scale as the explanation. However, a single-scale view often fails to capture multi-level semantics, and the optimization procedure may degenerate into a local search that is sensitive to initialization and noise, leading to unstable explanations and limited robustness. To address these issues, we propose MSExplainer, a multi-scale explainer for GNNs. MSExplainer couples multi-scale subgraph consistency guidance with single-scale adaptive subgraph learning under a parameter-sharing design. It simultaneously extracts multi-scale key subgraphs and complementary subgraphs, yielding a hierarchical decomposition of the original graph that covers semantics at different granularities and improves the robustness of subgraph extraction. Experiments on six benchmark datasets show that MSExplainer consistently outperforms prior methods in explanation accuracy and fidelity. Moreover, we theoretically prove the upper bound advantage of the multi-scale strategy in representation consistency, and derive that it achieves the same-order computational complexity as single-scale methods under the parameter-sharing mechanism, thus ensuring the high fidelity of key subgraphs while maintaining computational efficiency.

Abstract:
Testtime adaptation (TTA) has emerged as a promising paradigm for vision–language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP’s accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP’s inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP’s inference speed.

Abstract:
Many highstakes AI deployments proceed only if every stakeholder deems the system acceptable relative to their own minimum standard. With randomization over a finite menu of options, this becomes a feasibility question: does there exist a lottery over options that clears all stakeholders' acceptability bars? We study a query model where the algorithm proposes lotteries and receives only binary accept/reject feedback. We give deterministic and randomized algorithms that either find a unanimously acceptable lottery or certify infeasibility; adaptivity can avoid eliciting many stakeholders' constraints, and randomization further reduces the expected elicitation cost relative to full elicitation. We complement these upper bounds with worst-case lower bounds (in particular, linear dependence on the number of stakeholders and logarithmic dependence on precision are unavoidable). Finally, we develop learning-augmented algorithms that exploit natural forms of advice (e.g., likely binding stakeholders or a promising lottery), improving query complexity when predictions are accurate while preserving worst-case guarantees.

Abstract:
Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLMas-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware approaches are often confined to Supervised Fine-Tuning (SFT), limiting their ability to explore optimal reasoning paths. To bridge this gap, we propose REAL (Regression-Aware Reinforcement Learning), a principled RL framework designed to optimize regression rewards, and also proven to be optimal for correlation metrics. A key technical challenge is that the regression objective is explicitly policy-dependent, thus invalidating standard policy gradient methods. To address this, we employ the generalized policy gradient estimator, which naturally decomposes optimization into two complementary components: (1) exploration over Chain-of-Thought (CoT) trajectory, and (2) regression-aware prediction refinement of the final score. Extensive experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods, exhibiting significantly better generalization on out-of-domain benchmarks. On Qwen3-32B specifically, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model. These findings highlight the critical value of integrating regression objectives into RL exploration for accurate LLM evaluation.

Abstract:
Probabilistic circuits (PCs) enable exact and tractable inference but employ data independent mixture weights that limit their ability to capture local geometry of the data manifold. We propose Voronoi tessellations (VT) as a natural way to incorporate geometric structure directly into the sum nodes of a PC. However, naïvely introducing such structure breaks tractability. We formalize this incompatibility and develop two complementary solutions: (1) an approximate inference framework that provides guaranteed lower and upper bounds for inference, and (2) a structural condition for VT under which exact tractable inference is recovered. Finally, we introduce a differentiable relaxation for VT that enables gradientbased learning and empirically validate the resulting approach on standard density estimation tasks.

Abstract:
VisionLanguage-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive information. To address the issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL enhances control-relevant representation learning, while being lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the performance of state-of-the-art VLA models; it pushes the prior art to 69.7% achieving the state-of-the-art performance on the RoboCasa-Kitchen benchmark, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.

Abstract:
The ability to plan with temporal abstractions is central to intelligent decisionmaking. Rather than reasoning over primitive actions, we study agents that compose pre-trained policies as temporally extended actions, enabling solutions to complex tasks that no constituent alone could solve. Such compositional planning remains elusive as compounding errors in long-horizon predictions make it challenging to estimate the visitation distribution induced by sequencing policies. Motivated by thegeometric policy compositionframework introduced in Thakoor et al. (2022), we address these challenges by learning predictive models of multi-step dynamics, so-calledjumpy world models, that capture state occupancies induced by pre-trained policies across multiple timescales in an off-policy manner. Building on Temporal Difference Flows (Farebrother et al., 2025), we enhance these models with a novel consistency objective that aligns predictions across timescales, improving long-horizon predictive accuracy. We further demonstrate how to combine these generative predictions to estimate the value of executing arbitrary sequences of policies over varying timescales. Empirically, we find that compositional planning with jumpy world models significantly improves zero-shot performance across a wide range of base policies on challenging manipulation and navigation tasks, yielding an average 200% relative improvement over planning with primitive actions on long-horizon tasks.

Abstract:
We study bilinear bandits with partially observable features on both the user and item sides. In each of T rounds, the learner selects an arm and observes only the reward for the chosen pair. The reward model is linear in the user and item features with an unknown parameter matrix. Existing literature commonly reduces this problem to a linear bandit via a Kronecker product representation of user and item features, at the cost of increased dimensionality. We propose \textttBiRoLF, an algorithm robust to latent features, which directly leverages the bilinear structure without such linearization. It enhances feature selection by augmenting the null space of the observed features and employs doubly robust (DR) estimation to impute unobserved rewards for unselected arms, constructing unbiased pseudorewards. We estimate the parameters using Lasso regularization, which promotes sparsity in the coefficients of latent components orthogonal to the observed features. \textttBiRoLF achieves a \tildeO(\sqrt(d_x + d_h_x)(d_y + d_h_y) T) regret bound, where d_x and d_y are the dimensions of the observable feature vectors, d_h_x and d_h_y denote the numbers of nonzero coefficients in the components orthogonal to the observed features. We segment cases by the relationship between observable and latent features and find that \textttBiRoLF achieves strong regret performance while outperforming competing methods in computational metrics, reducing the overhead of feature linearization.

Abstract:
CLIP delivers strong zeroshot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work largely focuses on matching the predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and unreliable over-confidence. This overlooked phenomenon highlights a critical reliability gap beyond robustness. To bridge this gap, we propose a novel adversarial fine-tuning objective for CLIP considering both prediction accuracy and uncertainty alignments. By reparameterizing the output of CLIP as the concentration parameter of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and confidence magnitude. Our objective aligns these distributions holistically under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments on multiple zero-shot classification benchmarks demonstrate that our approach effectively restores calibrated uncertainty and achieves competitive adversarial robustness while maintaining clean accuracy.

Abstract:
AI coscientists are emerging as a useful tool for human researchers, with a crucial ability being proposing a research plan for a given research goal. In this work, we study how to train language models that generate better research plans by leveraging the vast corpus of existing research papers. To collect diverse training data, we automatically extract research goals and goal-specific grading rubrics from papers across domains. We then train models for research plan generation via reinforcement learning, with a frozen copy of the initial policy acting as the grader, using the rubrics to evaluate plans generated by the training policy. To validate this approach, we conduct a human study for machine learning research goals spanning 225 expert hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over the initial model for 70% goals, and over Grok-4-Thinking for 59.6% goals. To assess generality, we also extend our approach to goals from medical papers, and recent arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Overall, we demonstrate the potential of a scalable training recipe as a step towards improving general AI co-scientists.

Abstract:
LLM agents increasingly act on users’ personal information, yet existing privacy defenses remain limited in both design and adaptability. Most prior approaches rely on static or passive defenses, such as prompting and guarding. These paradigms are insufficient for supporting contextual, proactive privacy decisions in multistep agent execution. We proposeContextualized Defense Instructing (CDI), a new privacy defense paradigm in which an instructor model generates step-specific, context-aware privacy guidance during execution, proactively shaping actions rather than merely constraining or vetoing them. Crucially, CDI is paired with an experience-driven optimization framework that trains the instructor via reinforcement learning (RL), where we convert failure trajectories with privacy violations into learning environments. We formalize baseline defenses and CDI as distinct intervention points in a canonical agent loop, and compare their privacy–helpfulness trade-offs within a unified simulation framework. Results show that our CDI consistently achieves a better balance between privacy preservation (94.2%) and helpfulness (80.6%) than baselines, with superior robustness to adversarial conditions and generalization.

Abstract:
A longstanding question in physical reasoning is whether video-based models need to rely on factorized representations of physical variables in order to make physically accurate predictions, or whether they can implicitly represent such variables in a distributed manner. While modern video world models achieve strong performance on intuitive physics benchmarks, it remains unclear which of these representational regimes they implement internally. Here, we present the first interpretability study to directly examine physical representations inside large-scale video encoders. Using layerwise probing, subspace geometry, patch-level decoding, and targeted attention ablations, we characterize where physical information becomes accessible and how it is organized within encoder-based video transformers. Across architectures, we identify a sharp intermediate-depth transition— which we call the \emphPhysics Emergence Zone—at which physical variables become accessible. Physics-related representations peak shortly after this transition and degrade toward the output layers. Decomposing motion into explicit variables, we find that scalar quantities such as speed and acceleration are available from early layers onwards, whereas motion direction becomes accessible only at the Physics Emergence Zone. Notably, we find that direction is encoded through a high-dimensional population structure with circular geometry, requiring coordinated multi-feature intervention to control. These findings suggest that modern video models do not use factorized representations of physical variables like a classical physics engine. Instead, they use a distributed representation that is nonetheless sufficient for making physical predictions.

Abstract:
In many applications of LLMs, natural language responses often have an underlying structure such as representing discrete labels, numerical values, or graphs. Yet, existing decoding and uncertainty estimation methods operate only in language space and largely disregard structural information. We address this by modeling LLM outputs directly in a taskdependent latent structure. By equipping this structure with a dissimilarity measure, we can compute Bayes-optimal responses. These are not selected from sampled generations but are newly synthesized by combining individual responses in the latent space. Across different tasks, Bayes-optimal responses consistently outperform standard decoding methods like beam search. Moreover, quantifying uncertainty via the induced Bayesian risk captures variations in terms of the latent structure and improves alignment with output quality and correctness. Our decision-theoretic framework is applicable to any problem that admits a latent response structure and enables reliable task-aware LLM predictions.

Abstract:
Scientific computing has long relied on double precision (64bit floating point) arithmetic to guarantee accuracy in simulations of real-world phenomena. However, the growing availability of hardware accelerators such as Graphics Processing Units (GPUs) has made low-precision formats attractive due to their superior performance, reduced memory footprint, and improved energy efficiency. In this work, we investigate the role of mixed-precision arithmetic in neural-network based Variational Monte Carlo (VMC), a widely used method for solving computationally otherwise intractable quantum many-body systems. We first derive general analytical bounds on the error introduced by reduced precision on Metropolis-Hastings MCMC, and then empirically validate these bounds on the use-case of VMC. We demonstrate that significant portions of the algorithm, in particular, sampling the quantum state, can be executed in half precision without loss of accuracy. More broadly, this work provides a theoretical framework to assess the applicability of mixed-precision arithmetic in machine-learning approaches that rely on MCMC sampling. In the context of VMC, we additionally demonstrate the practical effectiveness of mixed-precision strategies, enabling more scalable and energy-efficient simulations of quantum many-body systems.

Abstract:
While neural representations are central to modern deep learning, the conditions governing their geometry and their roles in downstream adaptability remain poorly understood. We develop a framework clearly separating the underlying world, the data generation process and the resulting model representations to study these questions in a controlled setup: 5,075 city coordinates define the world and 7 geometric tasks generate the training data for autoregressive Transformer training. We find that different tasks give rise to qualitatively and quantitatively distinct world representation geometries. However, multitask training drives convergence of world representations: models trained on non-overlapping tasks develop aligned geometric representations, providing controlled evidence for the Multitask Scaling Hypothesis of the Platonic Representation Hypothesis. To study adaptation, we pretrain models on all tasks and all cities, then test whether new entities can be consistently integrated into the representation space via fine-tuning. Surprisingly, we find that despite multi-task pretraining, some tasks, which we call divergent, actively harm the representational integration of new entities. Our results show that training on multiple relational tasks reliably produces convergent world representations, but some lurking divergent tasks can catastrophically harm new entity integration via fine-tuning.

Abstract:
We study adaptive querying for learning userdependent quantities of interest, such as responses to held-out items and psychometric indicators, within tight question budgets. Classical Bayesian design and computerized adaptive testing typically rely on restrictive parametric assumptions or expensive posterior approximations, limiting their use in heterogeneous, high-dimensional, and cold-start settings. We introduce a persona-induced latent variable model that represents a user's state through membership in a finite dictionary of AI personas, each offering response distributions produced by a large language model. This yields expressive priors with closed-form posterior updates and efficient finite-mixture predictions, enabling scalable Bayesian design for sequential item selection. Experiments on synthetic data and WorldValuesBench demonstrate that persona-based posteriors deliver accurate probabilistic predictions and an interpretable adaptive elicitation pipeline.

Abstract:
Large language models exhibit complementary reasoning errors: on the same instance, one model may succeed with a particular decomposition while another fails. We propose Collaborative Reasoning (CORE), a trainingtime collaboration framework that converts peer success into a learning signal via a cross-teaching protocol. Each problem is solved in two stages: a cold round of independent sampling, followed by a contexted rescue round in which models that failed receive hint extracted from a successful peer. CORE optimizes a combined reward that balances (i) correctness, (ii) a lightweight DPP-inspired diversity term to reduce error overlap, and (iii) an explicit rescue bonus for successful recovery. We evaluate CORE across four standard reasoning datasets GSM8K, MATH, AIME, and GPQA. With only 1,000 training examples, a pair of small open source models (3B+4B) reaches Pass@2 of 99.54% on GSM8K and 92.08% on MATH, compared to 82.50% and 74.82% for single-model training. On harder datasets, the 3B+4B pair reaches Pass@2 of 77.34% on GPQA (trained on 348 examples) and 79.65% on AIME (trained on 792 examples), using a training-time budget of at most 1536 context tokens and 3072 generated tokens. Overall, these results show that training-time collaboration can reliably convert model complementarity into large gains without scaling model size.

Abstract:
As generative AI achieves hyperrealism, superficial artifact detection has become obsolete. While prevailing methods rely on resource-intensive fine-tuning of black-box backbones, we propose that forgery detection capability is already encoded within pre-trained models rather than requiring end-to-end retraining. To elicit this intrinsic capability, we propose the discriminative neural anchors (DNA) framework, which employs a coarse-to-fine excavation mechanism. First, by analyzing feature decoupling and attention distribution shifts, we pinpoint critical intermediate layers where the focus of the model logically transitions from global semantics to local anomalies. Subsequently, we introduce a triadic fusion scoring metric paired with a curvature-truncation strategy to strip away semantic redundancy, precisely isolating the forgery-discriminative units (FDUs) inherently imprinted with sensitivity to forgery traces. Moreover, we introduce HIFI-Gen, a high-fidelity synthetic benchmark built upon the very latest models, to address the lag in existing datasets. Experiments demonstrate that by solely relying on these anchors, DNA achieves superior detection performance even under few-shot conditions. Furthermore, it exhibits remarkable robustness across diverse architectures and against unseen generative models, validating that waking up latent neurons is more effective than extensive fine-tuning.

Abstract:
We identify and analyze a surprising phenomenon of Latent Diffusion Models (LDMs) where the final steps of the diffusion can degrade sample quality. In contrast to conventional arguments that justify early stopping for numerical stability, this phenomenon is intrinsic to the dimensionality reduction in LDMs. We provide a principled explanation by analyzing the interaction between latent dimension and stopping time. Under a Gaussian framework with linear autoencoders, we characterize the conditions under which early stopping is needed to minimize the distance between generated and target distributions. More precisely, we show that lowerdimensional representations benefit from earlier termination, whereas higher-dimensional latent spaces require later stopping time. We further establish that the latent dimension interplays with other hyperparameters of the problem such as constraints in the parameters of score matching. Crucially, this framework suggests that the reconstruction quality of the autoencoder alone can serve as a proxy to estimate the potential performance of the full LDM. Experiments on synthetic and real datasets illustrate these properties, underlining that early stopping can improve generative quality. Together, our results offer a theoretical foundation for understanding how the latent dimension influences the sample quality, and highlight stopping time as a key hyperparameter in LDMs.

Abstract:
Although Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, inherent social biases often cascade throughout the Chainof-Thought (CoT) process, leading to continuous "Bias Propagation". Existing debiasing methods primarily focus on static constraints or external interventions, failing to identify and interrupt this propagation once triggered. To address this limitation, we introduce Self-Debias, a progressive framework designed to instill intrinsic self-correction capabilities. Specifically, we reformulate the debiasing process as a strategic resource redistribution problem, treating the model's output probability mass as a limited resource to be reallocated from biased heuristics to unbiased reasoning paths. Unlike standard preference optimization which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints. This enables the model to selectively revise biased reasoning suffixes while preserving valid contextual prefixes. Furthermore, we integrate an online self-improvement mechanism utilizing consistency filtering to autonomously synthesize supervision signals. With merely 20k annotated samples, Self-Debias activates efficient self-correction, achieving superior debiasing performance while preserving general reasoning capabilities without continuous external oversight.

Abstract:
Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or singleobjective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and 264.6 community-maintained performance workloads per task, enabling evaluation of the full optimization lifecycle—triage, diagnosis, and resolution—under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents.

Abstract:
Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple plugin framework that corrects this bias and enables statistically principled uncertainty quantification. Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset. Additionally it uses an adaptive strategy to allocate calibration samples for tighter intervals. Importantly, we characterize parameter regimes defined by the true evaluation score and the LLM judge’s sensitivity and specificity in which our LLM-based evaluation yields more reliable estimates than human-only evaluation. Moreover, we show that our framework remains unbiased under distribution shift between the test and calibration datasets, in contrast to existing approaches.

Abstract:
The problem of domain generalization concerns learning predictive models that are robust to distribution shifts when deployed in new, previously unseen environments. Existing methods typically require labeled data from multiple training environments, limiting their applicability when labeled data are scarce. In this work, we study domain generalization in an anticausal setting, where the outcome causes the observed covariates. Under this structure, environment perturbations that affect the covariates do not propagate to the outcome, which motivates regularizing the model's sensitivity to these perturbations. Crucially, estimating these perturbation directions does not require labels, enabling us to leverage unlabeled data from multiple environments. We propose two methods that penalize the model's sensitivity to variations in the mean and covariance of the covariates across environments, respectively, and prove that these methods have worst-case optimality guarantees under certain classes of environments. Finally, we demonstrate the empirical performance of our approach on a controlled physical system and a physiological signal dataset.

Abstract:
Largescale document retrieval (search) is key in many modern industrial AI pipelines to ground models with relevant contextual information. Increasingly, Visual Document Retrieval (VDR) models, which directly embed images of document pages, are used as an alternative to text-only retrievers. While these models are historically repurposed generative VLMs fine-tuned for embedding tasks, we revisit this design choice in this paper and systematically develop strong VDR models from the ground up. Through controlled experiments, we isolate the impact of key training factors such as attention masking, multi-modal data regimes, and contrastive objectives at all phases of training. Our findings confirm current VDR performance is constrained by generative modeling, especially in multi-vector settings. Building on these insights, we train ModernVBERT, a 250M-parameter vision-language encoder that outperforms recent models up to 10 times its size when fine-tuned on document retrieval tasks. Thanks to its compact design, ModernVBERT enables efficient retrieval inference on CPU hardware, while maintaining competitive performance. Models, code and data are available in the public version of this work.

Abstract:
When organizations delegate text generation tasks to AI providers via payfor-performance contracts, expected payments rise when evaluation is noisy. As evaluation methods become more elaborate, the economic benefits of decreased noise are often overshadowed by increased evaluation costs. In this work, we introduce adaptive contracts for AI delegation, which allow detailed evaluation to be performed selectively after observing an initial coarse signal in order to conserve resources. We make three sets of contributions: First, we provide efficient algorithms for computing optimal adaptive contracts under natural assumptions or when core problem dimensions are small, and prove hardness of approximation in the general unstructured case. We then formulate alternative models of randomized adaptive contracts and discuss their benefits and limitations. Finally, we empirically demonstrate the benefits of adaptivity over non-adaptive baselines using question-answering and code-generation datasets.

Abstract:
Slot Attention (SA) lies at the heart of mainstream ObjectCentric Learning (OCL). Image features can be aggregated into object-level representations by SAiterativelyrefining cold-start query slots. For video, such aggregation proceeds by SArecurrentlyshared across frames, with queries cold-started on the first frame while transitioned from the previous frame’s slots thereafter. However, cold-start queries lack sample-specific cues thus hindering precise aggregation on image or video's first frame; Non-first frames' queries are already sample-specific thus requiring aggregation transforms different from the first frame. We address these issues with ourSmoothSA: (1) To smooth SA iterations on image or video's first frame, wepreheatcold-start queries with rich input-feature information, by a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across video's first and non-first frames, wedifferentiatethe homogeneous aggregation transforms by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and visual reasoning validate our method's effectiveness. Further visual analyses illuminate the underline mechanisms. Oursource codeandevaluation logare provided in the supplement.

Abstract:
Large language models excel as fewshot learners when provided with appropriate demonstrations, yet this strength becomes problematic in multi-turn agent scenarios, where LLMs erroneously mimic their own previous responses as few-shot examples. Through attention analysis, we identify conversational inertia, a phenomenon where models exhibit strong diagonal attention to previous responses, which is associated with imitation bias that constrains exploration. This reveals a tension when transforming few-shot LLMs into agents: longer context enriches environmental feedback for exploitation, yet also amplifies conversational inertia that undermines exploration. Our key insight is that for identical states, actions generated with longer contexts exhibit stronger inertia than those with shorter contexts, enabling construction of preference pairs without environment rewards. Based on this, we propose Context Preference Learning to calibrate model preferences to favor low-inertia responses over high-inertia ones. We further provide context management strategies at inference time to balance exploration and exploitation. Experimental results across eight agentic environments and one deep research scenario validate that our framework reduces conversational inertia and achieves performance improvements.

Abstract:
We can often verify the correctness of neural network outputs using ground truth labels, but we cannot reliably determine whether the output was produced by normal or anomalous internal mechanisms. Mechanistic anomaly detection (MAD) aims to flag these cases, but existing methods either depend on latent space analysis, which is vulnerable to obfuscation, or are specific to particular architectures and modalities. We reframe MAD as a functional attribution problem: asking to what extent samples from a trusted set can explain the model's output, where attribution failure signals anomalous behavior. We operationalize this using influence functions, measuring functional coupling between test samples and a small reference set via parameterspace sampling. We evaluate across multiple anomaly types and modalities. For backdoors in vision models, our method achieves state-of-the-art detection on BackdoorBench, with an average Defense Effectiveness Rating (DER) of 0.93 across seven attacks and four datasets (next best 0.83). For LLMs, we similarly achieve a significant improvement over baselines for several backdoor types, including on explicitly obfuscated models. Beyond backdoors, our method can detect adversarial and out-of-distribution samples, and distinguishes multiple anomalous mechanisms within a single model. Our results establish functional attribution as an effective, modality-agnostic tool for detecting anomalous behavior in deployed models.

Abstract:
While LLMs have seen substantial improvement in reasoning capabilities, they also sometimes overthink, generating unnecessary reasoning steps, particularly under uncertainty, given illposed or ambiguous queries. We introduce statistically principled early stopping methods that monitor uncertainty signals during generation to mitigate this issue. Our first approach is nonparametric and provides finite-sample guarantees on the probability of halting too early on well-posed queries. Our second approach is parametric: it models inter-arrival times of uncertainty keywords as a renewal process and applies sequential testing for stopping. We conduct empirical evaluations on reasoning tasks across several domains and models. Our results indicate that uncertainty-aware early stopping can improve both efficiency and reliability in LLM reasoning. The performance varies across domains, and we observe especially significant gains for math reasoning.

Abstract:
Regression is a cornerstone of statistics and machine learning, with applications spanning science, engineering, and economics. While quantum algorithms for regression have attracted considerable attention, most existing work has focused on linear regression, leaving many more complex yet practically important variants unexplored. In this work, we present a unified quantum framework for accelerating a broad class of regression tasksincluding linear and multiple regression, Lasso, Ridge, Huber, \ell_p-, and \delta_p-type regressions---achieving up to a quadratic improvement in the number of samples m over the best classical algorithms. This speedup is achieved by a non-trivial quantization of the recent classical breakthrough of Jambulapati et al. (2024), where we construct a full quantum pipeline that strategically employs quantum leverage score approximation to initialize and refine Multiscale Leverage Score Overestimates, enabling efficient importance sampling via the preparation of multiple state copies. For problems of dimension n, sparsity r < n, and error parameter \epsilon, our algorithm solves the problem in \widetildeO(r\sqrtmn/\epsilon + \mathrmpoly(n,1/\epsilon)) quantum time, demonstrating both the applicability and the efficiency of quantum computing in accelerating regression tasks.

Abstract:
Large reasoning models (LRMs) are often evaluated using metrics such as finalanswer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model's logical flow is. Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.

Abstract:
Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in visionlanguage understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination. We will release the code, models, and data for future research.

Abstract:
As humans increasingly rely on multiround conversational AI for high-stakes decisions, principled frameworks are needed to ensure such interactions reliably improve decision quality. We adopt a human-centric view governed by two principles: counterfactual harm, ensuring the AI does not undermine human strengths, and complementarity, ensuring it adds value where the human is prone to err. We formalize these concepts via user-defined rules, allowing users to specify exactly what harm and complementarity mean for their specific task. We then introduce an online, distribution-free algorithm with finite-sample guarantees that enforces the user-specified constraints over the collaboration dynamics. We evaluate our framework across two interactive settings: LLM-simulated collaboration on a medical diagnostic task and a human crowdsourcing study on a pictorial reasoning task. We show that our online procedure maintains prescribed counterfactual-harm and complementarity violation rates even under non-stationary interaction dynamics. Moreover, tightening or loosening these constraints produces predictable shifts in downstream human accuracy, confirming that the two principles serve as practical levers for steering multi-round collaboration toward better decision quality without the need to model or constrain human behavior.

Abstract:
The scalability of Large Language Models to long sequences is hindered by the quadratic cost of selfattention and the limitations of positional encodings. To address these, we introduce Caracal, a novel architecture that replaces self-attention with a parameter-efficient, \mathcalO(L \log L) Multi-Head Fourier (MHF) module. Our contributions are threefold: (1) We leverage the Fast Fourier Transform (FFT) for sequence mixing, inherently addressing both bottlenecks mentioned above. (2) We apply a frequency-domain causal masking technique that enforces autoregressive capabilities via asymmetric padding and truncation, overcoming a critical barrier for Fourier-based generative models. (3) Unlike efficient models relying on hardware-specific implementations (e.g., Mamba), Caracal uses standard library operators. This ensures robust portability, eliminating common deployment barriers. Evaluations demonstrate that Caracal performs competitively with Transformer and SSM baselines, offering a scalable and simple pathway for efficient long-sequence modeling. Code is available in the supplementary materials.

Abstract:
We identify two empirical issues in large language model (LLM) training: (i) optimizer updates can have large spectral norms, potentially destabilizing training and degrading generalization; (ii) stochastic gradient noise can exhibit sparse spectral spikes, with a few dominant singular values much larger than the rest. We propose SPECTRA, a general framework addressing these by (i) postspectral clipping of updates to enforce spectral-norm constraints (ii) optional pre-spectral clipping of gradients to suppress spectral noise spikes. We prove that post-clipping constitutes a Composite Frank-Wolfe method with spectral-norm constraints and weight regularization, recovering Frobenius and \ell_\infty-norm regularization with SGD-based and sign-based methods. We further analyze how pre-clipping mitigates sparse spectral spikes. We propose efficient soft spectral clipping via Newton-Schulz iterations, avoiding expensive SVD. Experiments on LLM pretraining show SPECTRA uniformly improves validation loss for various optimizers, including AdamW, Signum, and AdEMAMix, with the best-performing variants achieving state-of-the-art results. Models trained with SPECTRA exhibit smaller weight norms, confirming the link between spectral clipping and regularization.

Abstract:
We propose a transfer learning framework for estimating highdimensional Ising models that characterize interactions between discrete binary variables from limited target samples and multiple auxiliary datasets of unknown relevance. Our algorithm, termed Trans-Ising, builds upon a two-stage procedure that first obtains an initial estimator via nodewise \ell_1-regularized logistic regression on the target data together with selected auxiliary samples, and then refines it using a target-only correction step with a folded-concave penalty to improve edge selection accuracy. To decide which auxiliary sources to transfer from and reduce the risk of negative transfer, we introduce a loss-based screening rule based on out-of-sample pseudolikelihood evaluation on held-out target data. We also establish asymptotic error bounds and selection consistency for the proposed estimator under standard high-dimensional conditions. Extensive simulations and real data applications show that Trans-Ising consistently outperforms competing methods.

Abstract:
We study how individual training examples shape the internal computation of looped transformers, where a shared block is applied for \tau recurrent iterations to enable latent reasoning. Existing trainingdata influence estimators such as TracIn yield a single scalar score that aggregates over all loop iterations, obscuring when during the recurrent computation a training example matters. We introduce Step-Decomposed Influence (SDI), which decomposes TracIn into a length-\tau influence trajectory by unrolling the recurrent computation graph and attributing influence to specific loop iterations. To make SDI practical at transformer scale, we propose a TensorSketch implementation that never materialises per-example gradients. Experiments on looped GPT-style models and algorithmic reasoning tasks show that SDI scales excellently, matches full-gradient baselines with low error and supports a broad range of data attribution and interpretability tasks with per-step insights into the latent reasoning process.

Abstract:
Many practical samplers rely on timedependent drifts---often induced by annealing or tempering schedules---to improve exploration and stability. This motivates a unified non-asymptotic analysis of the corresponding Langevin diffusion and their discretizations. We provide a convergence analysis that includes non-asymptotic bounds for the continuous-time diffusion and its Euler--Maruyama discretization in the forward-Kullback--Leibler divergence under a single set of abstract conditions on the time-dependent drift. The results apply to many practically-relevant annealing schemes, including geometric tempering and annealed Langevin sampling. In addition, we provide numerical experiments comparing the annealing schemes covered by our theory in low- as well as high-dimensional settings.

Abstract:
We study differentially private (DP) regression in settings where each data sample includes public, nonsensitive features—common in applications like recommendation or advertising systems. While such label DP or DP with semi-sensitive features settings have been primarily explored in the context of classification, effective approaches for regression remain underexplored. We introduce \textsfCond-DP, a conditioned variant of \textsfDPSGD that leverages the structure of public feature matrices to improve optimization under privacy constraints. Motivated by the observation that these public features often exhibit rapidly decaying spectra, \textsfCond-DP incorporates a data-driven conditioning matrix to reshape the optimization landscape and accelerate convergence. We provide convergence guarantees for convex, strongly convex and non-convex settings, and recover standard \textsfDPSGD as a special case when the conditioning matrix is the identity. We show how to construct an effective conditioning matrix for \textsfCond-DP directly from public features, enabling faster convergence than \textsfDPSGD in private linear regression, without incurring additional privacy cost. Empirically, \textsfCond-DP with this conditioning matrix consistently outperforms state-of-the-art baselines across a wide range of datasets and model architectures under label DP, demonstrating strong and robust performance in practice.

Abstract:
Posttraining with reinforcement learning (RL) has recently shown strong promise for advancing multimodal agents beyond supervised imitation. However, RL remains limited by poor data efficiency, particularly in settings where interaction data are scarce and quickly become outdated. To address this challenge, GIPO (Gaussian Importance sampling Policy Optimization) is proposed as a policy optimization objective based on truncated importance sampling, replacing hard clipping with a log-ratio–based Gaussian trust weight to softly damp extreme importance ratios while maintaining non-zero gradients. Theoretical analysis shows that GIPO introduces an implicit, tunable constraint on the update magnitude, while concentration bounds guarantee robustness and stability under finite-sample estimation. Experimental results show that GIPO achieves state-of-the-art performance among clipping-based baselines across a wide range of replay buffer sizes, from near on-policy to highly stale data, while exhibiting superior bias–variance trade-off, high training stability and improved sample efficiency. Code is provided in supplementary material.

Abstract:
General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domainspecific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM AUDIO, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM AUDIO is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audios, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.

Abstract:
Data markets promise to unlock data value by matching data suppliers with ML consumers. However, market design involves addressing intricate challenges, including data pricing, fairness, and robustness. We propose a pragmatic dataaugmented AutoML market that seamlessly integrates with existing cloud-based AutoML platforms, such as Google’s Vertex AI. Unlike standard AutoML solutions, our design automatically augments buyer-submitted training data with valuable external datasets, pricing the resulting models based on their measurable performance improvements rather than computational costs as the status quo. Our key innovation is a pricing mechanism grounded in the instrumental value—the marginal model quality improvement—of externally sourced data. This approach bypasses direct dataset pricing complexities and accommodates diverse buyer valuations through menu-based options, thus providing an economically sustainable framework for monetizing external data.

Abstract:
A fundamental challenge in data science is to match disparate point sets with each other. While optimal transport efficiently minimizes pointdisplacementsunder a bijectivity constraint, it is inherently sensitive to rotations. Conversely, minimizingdistortionsvia the GromovWasserstein (GW) framework addresses this limitation but introduces a non-convex, computationally demanding optimization problem. In this work, we identify a broad class of distortion penalties that reduce to a simple alignment problem within a lifted feature space. Leveraging this insight, we introduce an iterative GW solver with a linear memory footprint and quadratic (rather than cubic) time complexity. Our method is differentiable, comes with strong theoretical guarantees, and scales to hundreds of thousands of points in minutes. This efficiency unlocks a wide range of geometric applications and enables the exploration of the GW energy landscape, whose local minima encode the symmetries of the matching problem.

Abstract:
Stability and robustness are critical for deploying Transformers in safetysensitive settings. A principled way to enforce such behavior is to constrain the model's Lipschitz constant. However, approximation-theoretic guarantees for architectures that explicitly preserve Lipschitz continuity have yet to be established. In this work, we bridge this gap by introducing a class of gradient-descent-type in-context Transformers that are Lipschitz-continuous by construction. We realize both MLP and attention blocks as explicit Euler steps of negative gradient flows, ensuring inherent stability without sacrificing expressivity. We prove a universal approximation theorem for this class within a Lipschitz-constrained function space. Crucially, our analysis adopts a measure-theoretic formalism, interpreting Transformers as operators on probability measures, to yield approximation guarantees independent of token count. These results provide a rigorous theoretical foundation for the design of robust, Lipschitz continuous Transformer architectures.

Abstract:
De novo functional protein design aims to generate protein sequences that realize specified biochemical functions without relying on evolutionary templates, enabling broad applications in biotechnology and medicine. Existing approaches adopt either direct functionto-sequence mapping or decoupled structure-sequence generation strategies but often fail to achieve functionality and foldability simultaneously. To address this, we propose CodeFP, a Co-generative protein language model for de novo Functional Protein design that simultaneously decodes sequence and structure tokens, thereby enabling superior simultaneous realization of functionality and foldability. CodeFP utilizes functional local structures to enrich functional semantic encodings, overcoming the suboptimal translation of flat encodings into structure tokens, while introducing auxiliary functional supervision to alleviate training ambiguity stemming from the one-to-many structure-to-token mapping. Extensive experiments show that CodeFP consistently achieves average improvements of 6.1% in functional consistency and 3.2% in foldability over the strongest baseline.

Abstract:
While VisionLanguage-Action (VLA) models have achieved remarkable success in ground-based embodied intelligence, their application to Aerial Manipulation Systems (AMS) remains a largely unexplored frontier. The inherent characteristics of AMS, including floating-base dynamics, strong coupling between the UAV and the manipulator, and the multi-step, long-horizon nature of operational tasks, pose severe challenges to existing VLA paradigms designed for static or 2D mobile bases. To bridge this gap, we propose AIR-VLA, the first VLA benchmark specifically tailored for aerial manipulation. We construct a physics-based simulation environment and release a high-quality multimodal dataset comprising 3000 manually teleoperated demonstrations, covering base manipulation, object & spatial understanding, semantic reasoning, and long-horizon planning. Leveraging this platform, we systematically evaluate mainstream VLA models and state-of-the-art VLM models. Our experiments not only validate the feasibility of transferring VLA paradigms to aerial systems but also, through multi-dimensional metrics tailored to aerial tasks, reveal the capabilities and boundaries of current models regarding UAV mobility, manipulator control, and high-level planning. AIR-VLA establishes a standardized testbed and data foundation for future research in general-purpose aerial robotics. The resource of AIR-VLA will be available at https://anonymous.4open.science/r/AIR-VLA-dataset-B5CC/.

Abstract:
Testtime guidance is a widely used mechanism for steering pre-trained diffusion models toward outcomes specified by a reward function. Existing approaches, however, focus on reward maximization rather than sampling from the true Bayesian posterior, leading to miscalibrated inference. In this work, we show that common test-time guidance methods do not recover the correct posterior distribution and identify the structural approximations responsible for this failure. We then propose consistent alternative estimators that enable calibrated sampling from the Bayesian posterior. Across Bayesian inference and inverse problems, our approach yields substantially improved posterior calibration.

Abstract:
Testtime training (TTT) in transformers is commonly interpreted as a form of online meta-learning that memorizes a key–value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity.

Abstract:
We study the problem of fairly allocating indivisible goods to agents in an online setting, where goods arrive sequentially and must be allocated irrevocably. Focusing on the popular fairness notions of envyfreeness, proportionality, and maximin share fairness (and their approximate variants), we investigate how access to future information changes what guarantees are achievable. Without any information, we prove strong impossibility results even for approximate fairness. With normalization information (agents' total values), we provide an algorithm that achieves stronger fairness guarantees than previously known results, and show matching impossibilities for stronger notions. With frequency predictions (value multisets without order), we design a meta-algorithm that lifts a broad class of offline "share-based" guarantees to the online setting, matching the best-known offline bounds. Finally, we provide learning-augmented variants of both models: under noisy totals or noisy frequency predictions, our guarantees are robust and degrade gracefully with the error parameters.

Abstract:
Smoothing a signal based on local neighborhoods is a core operation in machine learning and geometry processing. On wellstructured domains such as vector spaces and manifolds, the Laplace operator derived from differential geometry offers a principled approach to smoothing via heat diffusion, with strong theoretical guarantees. However, constructing such Laplacians requires a carefully defined domain structure, which is not always available. Most practitioners thus rely on simple convolution kernels and message-passing layers, which are biased against the boundaries of the domain. We bridge this gap by introducing a broad class of smoothing operators, derived from general similarity or adjacency matrices, and demonstrate that they can be normalized into diffusion-like operators that inherit desirable properties from Laplacians. Our approach relies on a symmetric variant of the Sinkhorn algorithm, which rescales positive smoothing operators to match the structural behavior of heat diffusion. This construction enables Laplacian-like smoothing and processing of irregular data such as point clouds, sparse voxel grids or mixture of Gaussians. We show that the resulting operators not only approximate heat diffusion but also retain spectral information from the Laplacian itself, with applications to shape analysis and matching.

Abstract:
We study the problem of learning the optimal policy in a discounted, infinitehorizon reinforcement learning (RL) setting in the presence of adversarially corrupted rewards. To address this problem, we develop a novel robust variant of the Q-learning algorithm and analyze it under the challenging asynchronous sampling model with time-correlated data. Despite corruption, we prove that the finite-time guarantees of our approach match existing bounds, up to an additive term that scales with the fraction of corrupted samples. We also establish an information-theoretic lower bound, revealing that our guarantees are near-optimal. Notably, our algorithm is agnostic to the underlying reward distribution and provides the first finite-time robustness guarantees for asynchronous Q-learning. A key element of our analysis is a refined Azuma-Hoeffding inequality for almost-martingales, which may have broader applicability in the study of RL algorithms.

Abstract:
We introduce Conditional CliffordSteerable CNNs (C-CSCNNs), a unified framework that incorporates equivariance to arbitrary pseudo-Euclidean groups and significantly improves the expressivity of standard CSCNNs. We show that the kernel basis of the standard formulation is incomplete, limiting model capacity. To address this, we augment the kernels with equivariant representations of the input feature field. We derive the equivariance constraint for these input-dependent kernels and show how it can be solved efficiently via implicit parameterization. We empirically validate on multiple PDE forecasting tasks, including fluid dynamics and relativistic electrodynamics, where our method consistently outperforms standard CSCNNs and performs on par with state-of-the-art baselines.

Abstract:
We study the stochastic linear bandits with parameter noise model, in which the reward of action a is a^\top \theta where \theta is sampled i.i.d. We show a regret upper bound of \widetildeO (\sqrtd T \log(K/\delta) \sigma^2_\max) for a horizon T, general action set of size K of dimension d, and where \sigma^2_\max is the maximal variance of the reward for any action. We further provide a lower bound of \widetilde\Omega (d \sqrtT \sigma_\max^2) which is tight (up to logarithmic factors) whenever \log K \approx d. For more specific action sets, \ell_p unit balls with p \leq 2 and dual norm q, we show that the minimax regret is \widetilde\Theta (\sqrtdT \sigma_q^2), where \sigma_q^2 is a variancedependent quantity that is always at most 4. This is in contrast to the minimax regret attainable for such sets in the classic additive noise model where the regret is of order d \sqrtT. Surprisingly, we show that this optimal (up to logarithmic factors) regret bound is attainable using a very simple explore-exploit algorithm.

Abstract:
Large Multimodal Models (LMMs) demonstrate impressive incontext learning abilities from few multimodal demonstrations, yet the internal mechanisms supporting such task learning remain opaque. Building on prior work of Large Language Models, we show that a small subset of attention heads in Large Multimodal Models is responsible for transmitting representations of spatial relations. The activations of these attention heads, termed function vectors, can be extracted and manipulated to alter an LMM’s performance on relational tasks. First, using synthetic and real image datasets, we apply causal mediation analysis to identify attention heads that strongly influence relational predictions, and extract multimodal function vectors that improve zero-shot accuracy at inference time. We further demonstrate that these multimodal function vectors can be fine-tuned with a modest amount of training data, while keeping LMM parameters frozen, to significantly outperform in-context learning baselines. Finally, we show that relation-specific function vectors can be linearly combined to solve analogy problems involving novel and untrained spatial relations, highlighting the strong generalization ability of this approach. Through experiments on two LMMs, including OpenFlamingo and Qwen3-VL, our results show that these models encode spatial relational knowledge within localized internal structures, which can be systematically extracted and optimized, thereby advancing our understanding of model modularity and enhancing control over relational reasoning in LMMs.

Abstract:
AI arenas, which rank models from pairwise preferences of users, are an industrystandard evaluation mechanism for generative models. In a recent paper, Singh et al. (2025) demonstrate that widely-used mechanisms are not clone-robust: In particular, they submitted multiple copies of the same model and showed that the higher-ranked copy was several positions above the lower-ranked one. In this paper, we begin by showing, both theoretically and in simulations calibrated to data from LMArena, that producers can benefit substantially from submitting clones. We then propose a new mechanism for ranking models based on pairwise comparisons, called You-Rank-We-Rank (YRWR). It uses producers’ rankings over their own models to correct statistical estimates of model quality. We prove that this mechanism is approximately clone-robust, in the sense that a producer cannot gain much utility by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that model producers are able to correctly rank their own models, YRWR improves overall ranking accuracy. We validate our theory with further semisynthetic experiments.

Abstract:
Conformal prediction provides distributionfree prediction sets with finite-sample conditional guarantees. RKHS-based frameworks—while promising for complex covariate shifts—suffer from prohibitive computational costs. To guarantee conditional validity under such shifts while ensuring feasibility, we build upon the framework of (Gibbs et al., 2023) by introducing a stable and efficient algorithm that computes the full solution path of the regularized RKHS conformal optimization problem, at essentially the same cost as a single kernel quantile fit. Our approach provides simultaneous hyperparameter tuning which provides smoothness control and data-adaptive calibration. To extend the method to high-dimensional settings, we further integrate our approach with low-rank latent embeddings that capture conditional validity in a data-driven latent space. Empirically, our method provides reliable conditional coverage across a variety of modern black-box predictors, improving the interval length of (Gibbs et al., 2023) by 30%, while achieving a 40-fold speedup.

Abstract:
In gradientbased learning, a step size chosen in parameter units does not produce a predictable per-step change in the function output. This may lead to instability in the streaming setting (i.e., batch size=1), where stochasticity is not averaged out and update magnitudes can momentarily become arbitrarily big or small. Instead, we propose \emphintentional updates: first specify the \emphintended outcome of an update and then solve for the step size that approximately achieves it. This strategy has precedent in online supervised linear regression via normalized LMS, which selects a step size to yield a specified change in the function output proportional to the current error. We extend this principle to streaming reinforcement learning by defining appropriate intended outcomes: \emphIntentional TD aims for a fixed fractional reduction of the current TD error relative to the momentary bootstrap target, and \emphIntentional Policy Gradient aims for a bounded per-step change in the policy, limiting local KL divergence. We develop practical implementations integrating eligibility traces and diagonal scaling; our experiments show that these methods yield state-of-the-art streaming performance often comparable to batch and replay-buffer learning.

Abstract:
Convexity recognition plays a central role in many optimization, control, and learning problems. However, the ability of Large Language Models (LLMs) to identify this property in symbolic expressions remains unexamined. We introduce \cb, a scalable and mechanically verifiable benchmark for testing whether LLMs can determine the convexity of a symbolic objective under deep functional composition. Experiments on frontier LLMs reveal a sharp compositional reasoning gap: performance degrades rapidly with increasing depth, dropping from an F1score of 1.0 at depth 2 to approximately 0.2 at depth 100. Inspection of models' reasoning traces indicates two failure modes: parsing failure and lazy reasoning. To address these limitations, we propose an agentic divide-and-conquer framework that (i) offloads parsing to an external tool to construct an abstract syntax tree (AST) and (ii) enforces recursive reasoning over each intermediate sub-expression with focused context. This framework reliably mitigates deep-composition failures, achieving substantial performance improvement at large depths (e.g., F1-Score = 1.0 at depth 100).

Abstract:
Onpolicy distillation is a promising approach for transferring knowledge between language models, where a student learns from dense token-level signals along its own trajectories. This framework typically uses reverse KL divergence, encouraging the student to match the teacher's high-confidence predictions. However, we show that the mode-seeking property of reverse KL reduces generation diversity and yields unstable learning signals when the teacher distribution has high entropy. To address this, we introduce Entropy-Aware On-Policy Distillation. Our key idea is augmenting the standard reverse KL objective with forward KL when teacher entropy is high, capturing the full range of plausible outputs while retaining precise imitation elsewhere. It balances mode-seeking precision with mode-covering robustness without sacrificing on-policy training efficiency. Experiments show that our method maintains generation diversity (sustained token-level entropy) and improves student–teacher alignment (lower forward KL on high-entropy tokens). Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods. These results demonstrate that accounting for teacher uncertainty is essential for maintaining diversity and achieving effective knowledge transfer.

Abstract:
Deep neural networks frequently exhibit overconfidence, undermining reliability in safetycritical applications. Existing adaptive methods rely on indirectly learned proxies of sample difficulty. We establish the logit margin as a direct and principled hardness indicator. We prove that margin tightly bounds the feasible temperature range for any target confidence. Empirically, margin strongly correlates with decision boundary proximity and reveals systematic calibration patterns across difficulty levels. We further identify a fundamental flaw in NLL-based optimization: minimizing NLL can paradoxically worsen calibration. To address this, we introduce Charbonnier-Smoothed SoftECE, a smooth objective that provably upper-bounds the smooth calibration error (smCE). Building on these insights, we propose SMART (Sample Margin-Aware Recalibration of Temperature), a lightweight method that learns a sample-wise margin-to-temperature mapping guided by our calibration-centric objective. Experiments demonstrate state-of-the-art calibration across CNNs and ViTs on standard, long-tailed, and distribution-shifted benchmarks, with a minimal inference-time data consumption. Code: https://anonymous.4open.science/r/SMART-8B11.

Abstract:
This paper introduces a rigorous framework for defining generative diffusion models in infinite dimensions via Doob's htransform. Rather than relying on time reversal of a noising process, a reference diffusion is forced towards the target distribution by an exponential change of measure. Compared to existing methodology, this approach readily generalises to the infinite-dimensional setting, hence offering greater flexibility in the diffusion model. The construction is derived rigorously under verifiable conditions, and bounds with respect to the target measure are established. We show that the forced process under the changed measure can be approximated by minimising a score-matching objective and validate our method on both synthetic and real data.

Abstract:
Protein language models (pLMs) have emerged as powerful predictors of protein structure and function. However, the computational circuits underlying their predictions remain poorly understood. Recent mechanistic interpretability methods decompose pLM representations into interpretable features, but they treat each layer independently and thus fail to capture crosslayer computation, limiting their ability to approximate the full model. We introduce ProtoMech, a framework for discovering computational circuits in pLMs using cross-layer transcoders that learn sparse latent representations jointly across layers to capture the model’s full computational circuitry. Applied to the pLM ESM2, ProtoMech recovers 82–89% of the original performance on protein family classification and function prediction tasks. ProtoMech then identifies compressed circuits that use <1% of the latent space while retaining up to 79% of model accuracy, revealing correspondence with structural and functional motifs, including binding, signaling, and stability. Steering along these circuits enables high-fitness protein design, surpassing baseline methods in more than 70% of cases. These results establish ProtoMech as a principled framework for protein circuit tracing.

Abstract:
Realistic simulation is critical for applications ranging from robotics to animation. Video generation models have emerged as a way to capture realworld physics from data, but they often face challenges in maintaining spatial consistency and object permanence, relying on memory mechanisms to compensate. As a complementary direction, we present 3DGSim, a learned 3D simulator that directly learns physical interactions from multi-view RGB videos. 3DGSim adopts MVSplat to learn a latent particle-based representation of 3D scenes, a Point Transformer for the particle dynamics, a Temporal Merging module for consistent temporal aggregation, and Gaussian Splatting to produce novel view renderings. By jointly training inverse rendering and dynamics forecasting, 3DGSim embeds physical properties into point-wise latent features. This enables the model to capture diverse behaviors, from rigid and elastic to cloth-like dynamics and boundary conditions (e.g., fixed cloth corners), while producing realistic lighting effects. We show that 3DGSim can generate physically plausible results even in out of distribution cases, e.g. ground removal or multi-object interactions, despite being trained only on single-body collisions.

Abstract:
Differentially private Stochastic Gradient Descent (DPSGD) has become integral to privacy-preserving machine learning, ensuring robust privacy guarantees in sensitive domains. Despite notable empirical advances leveraging features from non-private, pre-trained models to enhance DP-SGD training, a theoretical understanding of feature dynamics in private learning remains underexplored. This paper presents the first theoretical framework to analyze private training through a feature learning perspective. Building on the multi-patch data structure from prior work, our analysis distinguishes between label-dependent feature signals and label-independent noise—a critical aspect overlooked by existing analyses in the DP community. Employing a two-layer CNN with polynomial ReLU activation, we theoretically characterize both feature signal learning and data noise memorization in private training via noisy gradient descent. Our findings reveal that (1) Effective private signal learning requires a higher signal-to-noise ratio (SNR) compared to non-private training, and (2) When data noise memorization occurs in non-private learning, it will also occur in private learning, leading to poor generalization despite small training loss. Our findings highlight the challenges of private learning and prove the benefit of feature enhancement to improve SNR. Experiments on synthetic and real-world datasets also validate our theoretical findings.

Abstract:
Current coding benchmarks often inflate Large Language Model (LLM) capabilities due to static paradigms and data contamination, enabling models to exploit statistical shortcuts rather than genuine reasoning. To address this, we introduce UniCode, a generative evaluation framework that systematically probes LLM limits via: (1) multidimensional augmentation transforming seed problems into complex variations to disrupt fixed algorithmic patterns; (2) a highly reliable, automated test generation pipeline for scalable evaluation; and (3) fine-grained metrics for rich error signals. Experiments reveal a 31.2% performance collapse in state-of-the-art models on UniCode, primarily driven by deficiencies in conceptual modeling and scalability reasoning rather than syntactic errors. Furthermore, we uncover a ``seed-problem regression" where models revert to memorized seed logic rather than following new specifications, signaling a reliance on shortcuts over reasoning. This work validates UniCode as a robust framework to expose model fragility and foster reasoning-oriented code intelligence.

Abstract:
Endto-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohibitive retraining costs during domain adaptation. In this paper, we propose RoCA, a novel framework for robust cross-domain E2E autonomous driving. RoCA formulates the joint probabilistic distribution over the tokens that encode ego and surrounding vehicle information in the E2E pipeline. Instantiating with a Gaussian process (GP), RoCA learns a set of basis tokens with corresponding trajectories, which span diverse driving scenarios. Then, given any driving scene, it is able to probabilistically infer the future trajectory. By using RoCA together with a base E2E model in source-domain training, we improve the generalizability of the base model, without requiring extra inference computation. In addition, RoCA enables robust adaptation on new target domains, significantly outperforming direct finetuning. We extensively evaluate RoCA on various cross-domain scenarios and show that it achieves strong domain generalization and adaptation performance.

Abstract:
Diffusion Posterior Sampling (DPS) provides a principled Bayesian approach to inverse problems by sampling from p(x_0 \mid y). While posterior sampling is valuable for capturing uncertainty and multimodality, many classical and practical inverse problem settings ultimately prioritize accurate point estimation—most notably the MAP estimator, which has long served as a standard reconstruction objective in imaging and scientific applications. We introduce Local MAP Sampling (LMAPS), a new inference framework that iteratively solving local MAP subproblems along the diffusion trajectory. This perspective clarifies their connection to global MAP and DPS, offering a unified probabilistic interpretation for optimization-based methods. Building on this foundation, we develop practical algorithms with a covariance approximation motivated by Gaussian prior assumption, a reformulated objective for stability and interpretability. Across a broad set of image restoration and scientific tasks, LMAPS achieves state-of-the-art performance.

Abstract:
Preferencebased alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We proposeDisagreement-Aware Alignment via Risk-Constrained Decoding (DARC), a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing aKL-robust (entropic)satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.

Abstract:
Can multitask self-supervised learning on graphs be coordinated without the usual tug-of-war between objectives? Graph self-supervised learning (SSL) offers a growing toolbox of pretext objectives—mutual information, reconstruction, contrastive learning—yet combining them reliably remains a challenge due to objective interference and training instability. Most multi-pretext pipelines use per-update mixing, forcing every parameter update to be a compromise, leading to three failure modes: Disagreement (conflict-induced negative transfer), Drift (nonstationary objective utility), and Drought (hidden starvation of underserved objectives). We argue that coordination is fundamentally a temporal allocation problem: deciding when each objective receives optimization budget, not merely how to weigh them. We introduce ControlG, a control-theoretic framework that recasts multi-objective graph SSL as feedback-controlled temporal allocation by estimating per-objective difficulty and pairwise antagonism, planning target budgets via a Pareto-aware log-hypervolume planner, and scheduling with a Proportional–Integral–Derivative (PID) controller. Across 9 datasets, ControlG consistently outperforms state-of-the-art baselines, while producing an auditable schedule that reveals which objectives drove learning.

Abstract:
Neural networks are famously nonlinear. However, linearity is defined relative to a pair of vector spaces, f:\mathcalX\to\mathcalY. Leveraging the algebraic concept of transport of structure, we propose a method to explicitly identify nonstandard vector spaces where a neural network acts as a linear operator. When sandwiching a linear operator A between two invertible neural networks, f(x)=g_y^-1(A g_x(x)), the corresponding vector spaces \mathcalX and \mathcalY are induced by newly defined addition and scaling actions derived from g_x and g_y. We term this kind of architecture a Linearizer. This framework makes the entire arsenal of linear algebra, including SVD, pseudo-inverse, orthogonal projection and more, applicable to nonlinear mappings. Furthermore, we show that the composition of two Linearizers that share a neural network is also a Linearizer. We leverage this property and demonstrate that training diffusion models using our architecture makes the hundreds of sampling steps collapse into a single step. We further utilize our framework to enforce idempotency (i.e.\ f(f(x))=f(x)) on networks leading to a globally projective generative model and to demonstrate modular style transfer.

Abstract:
We study strong universal Bayesconsistency in the realizable setting for learning with general metric losses, extending classical characterizations beyond 0-1 classification \citepbousquet_theory_2021, hanneke2021universalbayesconsistencymetric and real-valued regression \citepattias_universal_2024. Given an instance space (\mathcal X,\rho), a label space (\mathcal Y,\ell) with possibly unbounded loss, and a hypothesis class \mathcal H \subseteq \mathcal Y^\mathcal X, we solve an open problem presented in \citepmlr-v178-cohen22a. Specifically, we find the necessary and sufficient conditions on the hypothesis class \mathcal H under which there exists a distribution-free learning rule whose risk converges almost surely to the best-in-class risk (which is zero) for every realizable data-generating distribution. Our main contribution is this sharp characterization in terms of a combinatorial obstruction: Similarly to \citetattias_universal_2024, we introduce the notion of an infinite non-decreasing (\gamma_k)-Littlestone tree, where \gamma_k \to \infty. This extends the Littlestone tree structure used in \citepbousquet_theory_2021 to the metric loss setting.

Abstract:
Large Language Model (LLM)based scientific agents have accelerated scientific discovery, yet they often suffer from significant inefficiencies due to adherence to fixed initial priors. Existing approaches predominantly operate within a static hypothesis space, which restricts the discovery of novel phenomena, resulting in computational waste when baseline theories fail. To address this, we propose shifting the focus from searching hypotheses to evolving the underlying scientific principles. We present PiEvo, a principle-evolvable framework that treats scientific discovery as Bayesian optimization over an expanding principle space. By integrating Information-Directed Hypothesis Selection via Gaussian Process and an anomaly-driven augmentation mechanism, PiEvo enables agents to autonomously refine their theoretical worldview. Evaluation across four benchmarks demonstrates that PiEvo (1) achieves an average solution quality of up to 90.81%～93.15%, representing a 29.7%～31.1% improvement over the state-of-the-art, (2) attains an 83.3% speedup in convergence step via significantly reduced sample complexity by optimizing the compact principle space, and (3) maintains robust performance across diverse scientific domains and LLM backbones.

Abstract:
Mode collapse is a persistent challenge in generative modeling and appears in autoregressive text generation as behaviors ranging from explicit looping to gradual loss of diversity and premature trajectory convergence. We take a dynamicalsystems view and reinterpret mode collapse as reduced state-space accessibility caused bygeometric collapse: during generation, the model's internal trajectory becomes confined to a low-dimensional region of its representation space. This implies mode collapse is not purely a token-level phenomenon and cannot be reliably solved by symbolic constraints or probability-only decoding heuristics. Guided by this perspective, we proposeReinforced Mode Regulation(RMR), a lightweight, online state-space intervention that regulates dominant self-reinforcing directions in the Transformer value cache (implemented as low-rank damping). Across multiple large language models, RMR substantially reduces mode collapse and enables stable generation at extremely low entropy rates (down to 0.8 nats/step), whereas standard decoding typically collapses near 2.0 nats/step.

Abstract:
Machine unlearning seeks to remove the influence of specific training data from a model, a need driven by privacy regulations and robustness concerns. Existing approaches typically modify model parameters, but such updates can be unstable, computationally costly, and limited by local approximations. We introduce Representation Unlearning, a framework that performs unlearning directly in the model’s representation space. Instead of modifying model parameters, we learn a transformation over representations that imposes an information bottleneck: maximizing mutual information with retained data while suppressing information about data to be forgotten. We derive variational surrogates that make this objective tractable and show how they can be instantiated in two practical regimes: when both retain and forget data are available, and in a zeroshot setting where only forget data can be accessed. Experiments across several benchmarks demonstrate that Representation Unlearning achieves more reliable forgetting, better utility retention, and greater computational efficiency than parameter-centric baselines.

Abstract:
Features of the same sample generated by different pretrained models often exhibit inherently distinct feature distributions. Learning invariant representations from largescale unlabeled visual data with various pretrained models in a fully unsupervised transfer manner remains a significant challenge. In this paper, we propose a multiview self-representation learning (MSRL) method in which invariant representations are learned by exploiting the self-representation property of features across heterogeneous views. The features are derived from large-scale unlabeled visual data through transfer learning with various pretrained models and are referred to as heterogeneous multiview data. We introduce an information-passing mechanism that relies on self-representation learning to support feature aggregation over the outputs of the linear model. Moreover, an assignment probability distribution consistency scheme is presented to guide multiview self-representation learning by exploiting complementary information across different views. Consequently, representation invariance across different linear models is enforced through this scheme. In addition, we provide a theoretical analysis of the assignment probability distribution consistency and incremental views. Extensive experiments with multiple benchmark visual datasets demonstrate that the proposed MSRL method consistently outperforms several state-of-the-art approaches.

Abstract:
MultiOutput Gaussian Processes (MOGPs) provide a principled probabilistic framework for modelling correlated outputs but face scalability bottlenecks when applied to datasets with high-dimensional output spaces. To maintain tractability, existing methods typically resort to restrictive assumptions, such as employing low-rank or sum-of-separable kernels, which can limit expressiveness. We propose the Transformed Latent Variable MOGP (T-LVMOGP), a novel framework that scales MOGPs to a massive number of outputs while preserving the capacity to capture meaningful inter-output dependencies. T-LVMOGP constructs a flexible multi-output deep kernel by mapping inputs and output-specific latent variables into an embedding space using a Lipschitz-regularised neural network. Combined with stochastic variational inference, our model effectively scales to high-dimensional output settings. Across diverse benchmarks, including climate modelling with over 10,000 outputs and zero-inflated spatial transcriptomics data, T-LVMOGP outperforms baselines in both predictive accuracy and computational efficiency.

Abstract:
Conformal Prediction (CP) has emerged as a powerful statistical framework for reliable classification, which generates a prediction set, guaranteed to include the true label with a prespecified probability. The performance of CP methods is typically assessed by their average prediction set size. In setups where the classes can be partitioned into semantic groups, e.g., based on shared downstream actions or more interpretable coarse labels, users can benefit from prediction sets that are not only small but also contain a limited number of groups. This paper begins by addressing this problem and ultimately offers a widely applicable tool for boosting any CP method on any dataset. First, given a class partition, we propose augmenting the CP score function with a term that penalizes predictions with "out-of-group" errors. We theoretically analyze this strategy and prove its advantages for group-related metrics. Surprisingly, we show mathematically that, for common class partitions, it can also reduce the average set size of any CP score function. Our analysis reveals the class similarity factors behind this improvement and motivates us to propose a model-specific variant, which does not require any human semantic partition and can further reduce the prediction set size. Finally, we present an extensive empirical study, encompassing prominent CP methods, multiple models, and several datasets, which demonstrates that our class-similarity-based approach consistently enhances CP methods.

Abstract:
Retrosynthesis prediction aims to infer the reactant molecule based on a given product molecule, which is a fundamental task in chemical synthesis. However, existing models rely on static patternmatching paradigm, which limits their ability to perform effective logic decision-making, leading to a black-box process. Building on this, we propose Retro-Expert, an interpretable retrosynthesis framework that performs collaborative reasoning by combining the complementary reasoning strengths of Large Language Models and specialized models via reinforcement learning. It outputs natural language explanations grounded in chemical logic through three components: (1) specialized models analyze the product to construct high-quality chemical decision space, (2) LLM-driven critical reasoning to generate predictions and corresponding interpretable reasoning path, and (3) reinforcement learning optimizing interpretable decision policy. Experiments show that Retro-Expert not only surpasses both LLM-based and specialized models across different metrics but also provides expert-aligned explanations that bridge the gap between AI predictions and actionable chemical insights.

Abstract:
Extensiveform games (EFGs) provide a powerful framework for modeling sequential decision making, capturing strategic interaction under imperfect information, chance events, and temporal structure. Most positive algorithmic and theoretical results for EFGs assume perfect recall, where players remember all past information and actions. We study the increasingly relevant setting of imperfect-recall EFGs (IREFGs), where players may forget parts of their history or previously acquired information, and where equilibrium computation is provably hard. We propose sum-of-squares (SOS) hierarchies for computing ex-ante optimal strategies in single-player IREFGs and Nash equilibria in multi-player IREFGs, working over behavioral strategies. Our theoretical results show that (i) these hierarchies converge asymptotically, (ii) under genericity assumptions, the convergence is finite, and (iii) in single-player non-absentminded IREFGs, convergence occurs at a finite level determined by the number of information sets. Finally, we introduce the new classes of (SOS)-concave and (SOS)-monotone IREFGs, and show that in the single-player setting the SOS hierarchy converges at the first level, enabling equilibrium computation with a single semidefinite program (SDP).

Abstract:
Reliable uncertainty quantification for causal effects is crucial in various applications, but remains difficult in nonparametric models, particularly for continuous treatments. We introduce IMPspec, a Gaussian process (GP) framework for modeling uncertainty over interventional causal functions under continuous treatments, which can be represented using reproducing Kernel Hilbert Spaces (RKHSs). By using principled function class expansions and a spectral representation of RKHS features, IMPspec yields tractable training and inference, a spectral algorithm to calibrate posterior credible intervals, and avoids the underfitting and variance collapse pathologies of earlier GPon-RKHS methods. Across synthetic benchmarks and an application in healthcare, IMPspec delivers state-of-the-art performance in causal uncertainty quantification and downstream causal Bayesian optimization tasks.

Abstract:
Rubrics provide a flexible way to train LLMs on openended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.

Abstract:
We address the problem of multigroup mean estimation, which seeks to allocate a finite sampling budget across multiple groups to obtain uniformly accurate estimates of their means. Unlike classical multi-armed bandits, whose objective is to minimize regret by identifying and exploiting the best arm, the optimal allocation in this setting requires sampling every group on the order of \Theta(T) times. This fundamental distinction makes exploration-free algorithms both natural and effective. Our work makes three contributions. First, we strengthen the existing results on subgaussian variance concentration using the Hanson-Wright inequality and identify a class of strictly subgaussian distributions that yield sharper guarantees. Second, we design exploration-free non-adaptive and adaptive algorithms, and we establish tighter regret bounds than the existing results. Third, we extend the framework to contextual bandit settings, an underexplored direction, and propose algorithms that leverage side information with provable guarantees. Overall, these results position exploration-free allocation as a principled and efficient approach to multi-group mean estimation, with potential applications in experimental design, personalization, and other domains requiring accurate multi-group inference.

Abstract:
A key tool in developing safe AI models isdata auditing, i.e., using statistical tools to determine whether harmful content may have been used in the training data of a blackbox model. Unfortunately, mostmembership inference attacks(MIAs) used to perform this type of auditing themselves assumeaccessto examples of harmful content from the same distribution as the query data. In real-world auditing scenarios, auditors often face legal and ethical restrictions preventing them from accessing a representative set of samples of harmful content to train these attacks effectively. We abstract and formalize this setting into a new data access model, the “unseen class” setting, and show that the state-of-the-art MIAs fail due to the lack of access to the full target distribution. We show that in this setting,quantile regression attacksoutperform approaches typically considered to be state of the art. We demonstrate this both empirically and theoretically, showing that quantile regression attacks achieve up to11× the TPRof shadow model-based approaches in practice, and providing a theoretical model that outlines the generalization properties required for this approach to succeed. Our work identifies an important failure mode in existing MIAs and provides a cautionary tale for practitioners who aim to directly use existing tools for real-world applications of AI safety.

Abstract:
Aligning large visual generative models with human feedback is often performed through pairwise preference optimization. While such approaches are conceptually simple, they fundamentally rely on annotated pairs, limiting scalability in settings where feedback is collected as independent scalar ratings. In this work, we revisit the KLregularized alignment objective and show that the optimal policy implicitly compares each sample’s reward to an instance-specific baseline that is generally intractable. We propose a threshold-guided alignment framework that replaces this oracle baseline with a data-driven global threshold estimated from empirical score statistics. This formulation turns alignment into a binary decision task on unpaired data, enabling effective optimization directly from scalar feedback. We further incorporate a confidence weighting term to emphasize samples whose scores deviate strongly from the threshold, improving sample efficiency. Experiments across both diffusion and masked generative paradigms, spanning three test sets and five reward models, show that our method consistently improves preference alignment over previous methods. These results position our threshold-guided framework as a simple yet principled alternative for aligning visual generative models without paired comparisons.

Abstract:
Transformers have been established as the defacto backbones for most recent advances in sequence modeling, mainly due to their growing memory capacity that scales with the context length. While plausible for retrieval tasks, it causes quadratic complexity and so has motivated recent studies to explore viable subquadratic recurrent alternatives. Despite showing promising preliminary results in diverse tasks, such recurrent architectures underperform Transformers in recall-intensive tasks, often attributed to their fixed-size memory. In this paper, we introduce Memory Caching (MC), a simple yet effective technique that enhances recurrent models by caching checkpoints of their memory states (a.k.a. hidden states). Memory Caching allows the effective memory capacity of RNNs to grow with sequence length, offering a flexible trade-off that interpolates between the fixed memory ( O(L) complexity) of RNNs and the growing memory ( O(L^2) complexity) of Transformers. We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications on both linear and deep memory modules. Our experimental results on language modeling, and long-context understanding tasks show that MC enhances the performance of recurrent models, supporting its effectiveness. In in-context recall tasks, our results indicate that while Transformers still achieve the best performance, our MC variants show competitive performance, close the gap with Transformers, and performs better than state-of-the-art recurrent models.

Authors: Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Fengyuan Yu, Weihang Xiao, Yizheng Jiao, Jianheng Hou, Danyang Zhang, Pengcheng Xu, Boyang ZHONG, Zehong Zhao, Gaoyun Fang, John Kitaoka, Xu Yile, Hua XU, Kenton Blacutt, Tin Nguyen, Siyuan Song, Haoran Sun, shaoyue wen, Linyang He, Runming Wang, Yanzhi Wang, Mengyue Yang, Ziqiao Ma, Raphaël Millière, Freda Shi, Nuno Vasconcelos, Daniel Khashabi, Alan Yuille, Yilun Du, Ziming Liu, Dahua Lin, Ziwei Liu, Vikash Kumar, Yijiang Li, Lei Yang, Zhongang Cai, Hokin Deng

Abstract:
Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over motion, interaction, and causality. Rapid progress in video models has focused primarily on visual quality. Systematically studying video reasoning and its scaling behavior suffers from a lack of video reasoning (training) data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly largescale resource spanning 200 curated reasoning tasks and over one million video clips—approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, toolkit, and models will be released publicly.

Abstract:
Behavior cloning is a fundamental paradigm in machine learning, enabling policy learning from expert demonstrations across robotics, autonomous driving, and generative models. Autoregressive models like transformer have proven remarkably effective, from large language models (LLMs) to visionlanguage-action systems (VLAs). However, applying autoregressive models to continuous control requires discretizing actions through quantization, a practice widely adopted yet poorly understood theoretically. This paper provides theoretical foundations for this practice. We analyze how quantization error propagates along the horizon and interacts with statistical sample complexity. We show that behavior cloning with quantized actions and log-loss achieves optimal sample complexity—matching existing lower bounds—and incurs only polynomial horizon dependence on quantization error, provided the dynamics are stable and the policy satisfies a probabilistic smoothness condition. We further characterize when different quantization schemes satisfy or violate these requirements, and propose a model-based augmentation that provably improves the error bound without requiring policy smoothness. Finally, we establish fundamental limits that jointly capture the effects of quantization error and statistical complexity.

Abstract:
Building interactive omnimodal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free Large Language Model Orchestration (LLM Orchestration), a training-free orchestration framework that integrates off-the-shelf modality experts into a unified multimodal input--output system without additional gradient-based training for integration. LLM Orchestration comprises three components: (1) an LLM controller that infers user intent and emits explicit control tokens for expert selection and sequencing, enabling protocol-constrained and auditable routing; (2) a text-centric cross-modal memory that compresses multimodal evidence into structured records for lightweight retrieval and reuse, reducing redundant expert invocations across turns; and (3) a unified interaction layer that executes routing and memory decisions to support consistent modality transitions, full-duplex streaming, and interruption-aware dialogue. Across diverse multimodal benchmarks, LLM Orchestration achieves strong performance under standard evaluation constraints while maintaining low orchestration overhead and modular upgradeability, providing a practical alternative to costly joint training for omni-modal systems.

Abstract:
Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Executionbased re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. To address this limitation, we propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs that expose behavioral discrepancies among candidate solutions. Through multi-turn interaction with code execution environments, the verifier iteratively refines the candidate input generator and produces targeted counterexamples rather than blindly sampling inputs. We train the verifier to acquire this discriminative input generation capability via a scalable pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning. Extensive experiments across five competitive programming benchmarks demonstrate consistent improvements over strong execution-based baselines, achieving up to +10-15% absolute gains in Best@k accuracy. Further analysis reveals clear test-time scaling behavior and highlights the verifier’s broader potential beyond reranking.

Abstract:
We study blackbox optimization of Lipschitz functions under noisy evaluations. Existing adaptive discretization methods implicitly avoid suboptimal regions but do not provide explicit certificates of optimality or measurable progress guarantees. We introduce Certificate-Guided Pruning (CGP), which maintains an explicit active set A_t of potentially optimal points via confidence-adjusted Lipschitz envelopes. Any point outside A_t is certifiably suboptimal with high probability, and under a margin condition with near-optimality dimension \alpha, we prove Vol (A_t) shrinks at a controlled rate yielding sample complexity Õ(\varepsilon^-(2+\alpha)). We develop three extensions: CGP-Adaptive learns L online with O(\log T) overhead; CGP-TR scales to d > 50 via trust regions with local certificates; and CGP-Hybrid switches to GP refinement when local smoothness is detected. Experiments on 12 benchmarks (d \in [2, 100]) show CGP variants match or exceed strong baselines while providing principled stopping criteria via certificate volume.

Abstract:
Remote change detection in LLMs is a difficult problem. Existing methods are either too expensive for deployment at scale, or require initial whitebox access to model weights or grey-box access to log probabilities. We aim to achieve both low cost and strict black-box operation, observing only output tokens. Our approach hinges on specific inputs we call Border Inputs, for which there exists more than one output top token. From a statistical perspective, optimal change detection depends on the model's Jacobian and the Fisher information of the output distribution, whose analysis at low temperature regimes shows that border inputs enable powerful change detection tests. Building on this insight, we propose the Black-Box Border Input Tracking (B3IT) scheme. Extensive in-vivo and in-vitro experiments show that border inputs are easily found for non-reasoning tested endpoints, and present on-par performance with the best available grey-box approaches. B3IT reduces costs by 30× compared to existing methods, while operating in a strict black-box setting.

Abstract:
Quantization can drastically increase the efficiency of large language and vision models, but typically incurs an accuracy drop. Recently, functionpreserving transforms (e.g. rotations, Hadamard transform, channel-wise scaling) have been successfully applied to reduce post-training quantization error, yet a principled explanation remains elusive. We analyze linear-layer quantization via the signal-to-quantization-noise ratio (SQNR), showing that for uniform integer quantization at a fixed bit width, SQNR decomposes into (i) the concentration of weights and activations (capturing spread and outliers), and (ii) the alignment of their dominant variation directions. This provides an actionable insight: enhancing alignment between weight and activation variation directions can reduce quantization error, complementing concentration-focused approaches. Motivated by this, we introduce Concentration–Alignment Transforms (CAT), a lightweight linear transformation that uses a covariance estimate from a small calibration set to jointly improve concentration and alignment, approximately maximizing SQNR. Experiments across several LLMs show that CAT consistently matches or outperforms prior transform-based quantization methods at 4-bit precision.

Abstract:
We systematically evaluate ParameterEfficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to enhance their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. In this work, we conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill families on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, we demonstrate that structural variants, such as DoRA, AdaLoRA, and MiSS, consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (e.g., PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Furthermore, our ablations reveal that extreme parameter reduction (e.g., VeRA, Rank-1) severely bottlenecks reasoning capacity. We further conduct ablation studies and scaling experiments to validate our findings. This work provides a definitive guide for advocating for more exploration for parameter-efficient RL methods.

Abstract:
Vertexlevel clustering for directed graphs (digraphs) remains challenging as edge directionality breaks the key assumptions underlying popular spectral methods, which also incur the overhead of eigen-decomposition. This paper proposesParametrized Power Iteration Clustering(ParPIC), a random-walk-based clustering method for weakly connected digraphs. This builds over the Power-Iteration Clustering paradigm, which uses the rows of the iterated diffusion operator as a data embedding. ParPIC has three important features: the use of parametrized reversible random walk operators, the automatic tuning of the diffusion time, and the efficient truncation of the final embedding, which produces low-dimensional data representations and reduces complexity. Empirical results on synthetic and real-world graphs demonstrate that ParPIC achieves competitive clustering accuracy with improved scalability relative to spectral and teleportation-based methods.

Abstract:
We derive a novel PACBayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data, through the chain's mixing time. This contributes to overcoming challenges in obtaining generalization guarantees for reinforcement learning, where the sequential nature of data breaks the independence assumptions underlying classical bounds. The new bound provides non-vacuous certificates for modern off-policy algorithms such as Soft Actor-Critic. We demonstrate the practical utility of the bound through PB-SAC, a novel algorithm that optimizes the bound during training to guide exploration. Experiments across several continuous control tasks show that the proposed approach provides meaningful confidence certificates while maintaining competitive performance.

Abstract:
GPU kernel optimization challenges LLMs beyond standard coding tasks, as it requires an understanding of hardware architecture, parallel computing optimization strategies, and profiling outputs. However, most existing approaches leveraging LLMs for kernel generation apply standard prompting and feedback loops, considering hardware only through profiling feedback. We introduce KernelFoundry, an evolutionary framework that efficiently explores the space of GPU kernels through (1) MAPElites quality-diversity search with kernel-specific behavioral dimensions to sustain exploration; (2) meta-prompt evolution that co-evolves prompts with kernels to uncover task-specific optimization strategies, and (3) a template-based parameter optimization approach to tune kernels to inputs and hardware. We evaluate this framework on KernelBench, robust-kbench and custom tasks, generating SYCL kernels as a cross-platform GPU programming paradigm, and CUDA kernels for comparison to prior work. Our approach consistently outperforms the baseline methods and achieves an average speedup of 2.3 on KernelBench for SYCL. Moreover, KernelFoundry is implemented as a distributed framework with remote access to diverse hardware, allowing quick benchmarking and featuring a flexible user input layer to support kernel generation for a for a wide range of real use cases beyond benchmarking.

Abstract:
Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural stepwise structure of reasoning processes, leading to suboptimal credit assignment and unstable training in multi-modal reasoning tasks. To bridge this gap, we propose Segment-Aligned Policy Optimization (SAPO), a novel reinforcement learning paradigm that treats coherent reasoning steps, rather than tokens or full sequences as fundamental units of policy update. SAPO introduces a step-wise Markov decision process abstraction over reasoning segments, accompanied by segment-level value estimation, advantage computation, and importance sampling mechanisms that are semantically aligned with reasoning boundaries. Experiments on representative reasoning benchmarks demonstrate that SAPO consistently outperforms token-level and sequence-level policy optimization methods, achieving significant accuracy improvements while exhibiting better training stability and value estimation consistency. Our work underscores the importance of aligning reinforcement learning updates with the intrinsic structure of reasoning, paving the way for more efficient and semantically grounded policy optimization in complex reasoning tasks. Codes and models will be released to ensure full reproducibility.

Abstract:
We study the online fair division problem, where indivisible goods arrive sequentially and must be allocated immediately and irrevocably. Prior work establishes strong impossibility results for approximating classic notions such as envyfreeness up to one good (EF1) and maximin share (MMS) in this setting, but the approximability of proportionality up to one good (PROP1) has remained unresolved. We resolve this gap in two steps. First, we show that three natural greedy allocation rules (standard baselines in fair division) fail to guarantee any multiplicative approximation to PROP1 against an adaptive adversary. These limitations motivate two relaxations: (i) restricting attention to a non-adaptive adversary, and (ii) incorporating coarse predictions in the spirit of learning-augmented algorithms. Under a non-adaptive adversary, we show that the uniform random allocation achieves a meaningful PROP1 approximation with high probability, and this guarantee is essentially tight for this approach; moreover, when item values are sufficiently small, the allocation is near-PROP1 with high probability. Finally, given maximum item value (MIV) predictions, we design an online algorithm that achieves robust approximation guarantees for PROP1, and degrades gracefully under one-sided prediction error. In contrast, we show that EF1, MMS, and PROPX remain inapproximable even with perfect MIV predictions.

Abstract:
LLM agents are rapidly becoming the practical interface for task automation, yet the ecosystem lacks a principled way to \emphchoose among an exploding space of deployable configurations. Existing LLM leaderboards and tool/agent benchmarks evaluate components in isolation and remain fragmented across tasks, metrics, and candidate pools, leaving a critical research gap: there is little \emphqueryconditioned supervision for learning to recommend end-to-end agent configurations that couple a backbone model with a toolkit. We address this gap with \methodname, a benchmark that reframes agent selection as narrative query-to-agent recommendation over capability profiles and systematically converts heterogeneous evaluation artifacts into unified, positive-only interaction data. \methodname comprises 111,179 queries, 107,721 deployable agents, and 251,103 interaction records aggregated from 40+ sources, spanning LLM-only, toolkit-only, and compositional agents. Our analyses reveal a regime shift from dense head reuse to long-tail, near one-off supervision, where popularity-based CF/GNN methods become fragile and content-aware capability matching is essential. We further show that Part~III synthesized compositional interactions are learnable, induce capability-sensitive behavior under controlled counterfactual edits, and improve coverage over realistic compositions; models trained on \methodname also transfer to a public agent marketplace (MuleRun), yielding consistent gains on an unseen catalog. Overall, \methodname provides the first unified data and evaluation infrastructure for agent recommendation, which establishes a reproducible foundation to study and accelerate the emerging agent ecosystem. \footnote\urlhttps://anonymous.4open.science/r/AgentMatch-F950

Abstract:
Process Reward Models (PRMs) enhance reasoning ability of LLMs by providing steplevel supervision. However, their widespread adoption is limited due to expensive manual step-level annotation and poor generalization of static training data to novel errors. We introduce Adversarially Trained PRMs (APRM), where a Generator (G) learns to produce reasoning errors to deceive a PRM (R), while R concurrently learns to detect them. This interaction yields progressively harder negatives for R, improving it's robustness and generalization to novel errors without requiring manual step-level labels. Averaged across diverse mathematical reasoning benchmarks, APRM improves solver accuracy by +3.4 percentage points (pp) over the strongest PRM baseline. APRM achieves gains of +5.3 pp on out-of-distribution tasks.

Abstract:
Symmetry is fundamental to understanding physical systems and can improve performance and sample efficiency in machine learning. Both pursuits require knowledge of the underlying symmetries in data, yet discovering these symmetries automatically is challenging. We propose LieFlow, a novel framework that reframes symmetry discovery as a distribution learning problem on Lie groups. Instead of searching for the symmetry generators, our approach operates directly in group space, modeling a symmetry distribution over a large hypothesis group G. The support of the learned distribution reveals the underlying symmetry group H \subseteq G. Unlike previous works, LieFlow can discover both continuous and discrete symmetries within a unified framework, without assuming a fixed Lie algebra basis or a specific distribution over the group elements. Experiments on synthetic 2D and 3D point clouds and ModelNet10 show that LieFlow accurately discovers continuous and discrete subgroups, significantly outperforming a stateof-the-art baseline, LieGAN, in identifying discrete symmetries.

Abstract:
Research on the expressive power of transformers shows that transformers are equivariant to permutations and can approximate all permutationequivariant continuous functions on a compact domain. However, these results assume real parameters and exact operations, whereas real-world implementations on computers can only use a finite set of numbers and inexact machine operations with round-off errors. In this work, we investigate the representability of floating-point transformers that use floating-point parameters and floating-point operations. Unlike existing results under exact arithmetic, we first show that floating-point transformers can represent non-permutation-equivariant functions even without positional encoding. Furthermore, we prove that floating-point transformers can represent all permutation-equivariant functions when the sequence length is bounded, but they cannot when the sequence length is large. We also identify the minimal equivariance property in floating-point transformers, and show that all non-trivial additive positional encoding can harm the representability of floating-point transformers.

Abstract:
We study the problem of subgroup discovery for survival analysis, where the goal is to find an interpretable subset of the data on which a Cox model is highly accurate. We examine why existing quality functions are insufficient for this problem and introduce two technical innovations: theexpected prediction entropy (EPE), a novel metric for evaluating survival models that predict hazard functions, and theconditional rank statistics (CRS), which quantifies individual point deviation from a subgroup's survival time distribution. We study the EPE and CRS theoretically and show they address problems with existing metrics. We then introduce seven algorithms for Cox subgroup discover. Our main algorithm is based on the DDGroup framework of Izzo et al. (2023) and leverages both the EPE and CRS, allowing theoretical correctness guarantees in wellspecified settings. Empirical evaluation on synthetic and real data confirms our theory, showing our methods recover ground-truth subgroups in well-specified cases and achieve better model fit than naively fitting the Cox model to the entire dataset. A case study on NASA jet engine simulation data demonstrates that discovered subgroups uncover known nonlinearities in the data and suggest design choices mirrored in practice.

Abstract:
We investigate the relation between endto-end equivariance and layerwise equivariance in deep neural networks. We prove the following: For a network whose end-to-end function is equivariant with respect to group actions on the input and output spaces, there is a parameter choice yielding the same end-to-end function such that its layers are equivariant with respect to some group actions on the latent spaces. Our result assumes that the parameters of the model are identifiable in an appropriate sense. This identifiability property has been established in the literature for a large class of networks, to which our results apply immediately, while it is conjectural for others. The theory we develop is grounded in an abstract formalism, and is therefore architecture-agnostic. Overall, our results provide a mathematical explanation for the emergence of equivariant structures in the weights of neural networks during training -- a phenomenon that is consistently observed in practice.

Abstract:
Constitutional AI has focused on singlemodel alignment using fixed principles. However, multi-agent systems create novel alignment challenges through emergent social dynamics. We present Constitutional Evolution, a framework for automatically discovering behavioral norms in multi-agent LLM systems. Using a grid-world simulation with survival pressure, we study the tension between individual and collective welfare, quantified via a Societal Stability Score \mathcalS \in [0,1] that combines productivity, survival, and conflict metrics. Adversarial constitutions lead to societal collapse (\mathcalS=0), while vague prosocial principles (''be helpful, harmless, honest'') produce inconsistent coordination (\mathcalS=0.249). Even constitutions designed by Claude 4.5 Opus with explicit knowledge of the objective achieve only moderate performance (\mathcalS=0.332). Using LLM-driven genetic programming with multi-island evolution, we evolve constitutions maximizing social welfare without explicit guidance toward cooperation. The evolved constitution \mathcalC^ achieves \mathcalS=0.556\pm0.008 (123% higher than human-designed baselines, N=10), eliminates conflict, and discovers that minimizing communication (0.9% vs 62.2% social actions) outperforms verbose coordination. Our interpretable rules demonstrate that cooperative norms can be discovered rather than prescribed.

Abstract:
In this work, we show that it is possible to extract significant amounts of alignment training data from a posttrained model -- useful to steer the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring success of training data extraction through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high quality embedding model can identify semantic similarities between strings that a different metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of 10×) the amount of data that can be extracted due to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can be then used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk towards extracting alignment data. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models seem to be regurgitating aspects of their training set, distillation can therefore be thought of as indirectly training on the model's original dataset.

Abstract:
Large language models (LLMs) require substantial compute, and thus energy, at inference time. While quantizing weights and activations is effective at improving efficiency, naive quantization of LLMs can significantly degrade performance due to large magnitude outliers. This paper describes FPTQuant, which introduces three novel, lightweight, and expressive functionpreserving transforms (FPTs) to facilitate quantization of transformers: (1) a mergeable pre-RoPE transform for queries and keys, (2) a mergeable transform for values, (3) a cheap, dynamic scaling transform. By leveraging the equivariances and independencies inherent to canonical transformer operation, we designed these FPTs to maintain the model’s function while shaping the intermediate activation distributions to be more quantization friendly. FPTQuant requires no custom kernels and adds virtually no overhead during inference. The FPTs are trained both locally to reduce outliers, and end-to-end such that the outputs of the quantized and full-precision models match. FPTQuant enables static INT4 quantization with minimal overhead and shows SOTA speed-up of up to 3.9x over FP. Empirically, FPTQuant has an excellent accuracy-speed trade-off—it is performing on par or exceeding most prior work and only shows slightly lower accuracy compared to a method that is up to 29% slower.

Abstract:
A natural desideratum for generative models is \emphselfcorrection--detecting and revising low-quality tokens at inference. While Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces, their capacity for self-correction remains poorly understood. Prior attempts to incorporate self-correction into MDMs either require overhauling MDM architectures/training or rely on imprecise proxies for token quality, limiting their applicability. Motivated by this, we introduce PRISM--Plug-in Remasking for Inference-time Self-correction of Masked Diffusions--a lightweight, model-agnostic approach that applies to any pretrained MDM. Theoretically, PRISM defines a self-correction loss that provably learns per-token quality scores, without RL or a verifier. These quality scores are computed in the same forward pass with MDM and used to detect low-quality tokens. Empirically, PRISM advances MDM inference across domains and scales: Sudoku; unconditional text (170M); and code with LLaDA (8B).

Abstract:
The community is increasingly exploring linear RNNs (LRNNs) as language models, motivated by their expressive power and parallelizability. While prior work establishes the expressivity benefits of LRNNs over transformers, it is unclear what makes LRNNsbut not traditional, nonlinear RNNs---as easy to parallelize in practice as transformers. We answer this question by providing a tight connection between types of RNNs and standard complexity classes. We show that LRNNs can be viewed as log-depth (bounded fan-in) arithmetic circuits, which represents only a slight depth overhead relative to log-depth boolean circuits that transformers admit. Furthermore, we show that nonlinear RNNs can solve \mathsf L-complete problems (and even \mathsf P-complete ones, under polynomial precision), revealing a fundamental barrier to parallelizing them as efficiently as transformers. Our theory also identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are \mathsfNC^1-complete whereas diagonal-plus-low-rank LRNNs are more expressive (\mathsfPNC^1-complete). We provide further insight by associating each type of RNN with a corresponding automata-theoretic model that it can simulate. Together, our results reveal fundamental tradeoffs between nonlinear RNNs and different variants of LRNNs, providing a foundation for designing LLM architectures that achieve an optimal balance between expressivity and parallelism.

Abstract:
It has been hypothesized that humanlevel visual perception requires a generative approach in which internal representations result from inverting a decoder. Yet today’s most successful vision models are non-generative, relying on an encoder that maps images to representations without decoder inversion. This raises the question of whether generation is, in fact, necessary for machines to achieve human-level visual perception. To address this, we study whether generative and non-generative methods can achieve compositional generalization, a hallmark of human perception. Under a compositional data generating process, we formalize the inductive biases required to guarantee compositional generalization in decoder-based (generative) and encoder-based (non-generative) methods. We then show theoretically that enforcing these inductive biases on encoders is generally infeasible using regularization or architectural constraints. In contrast, for generative methods, the inductive biases can be enforced straightforwardly, thereby enabling compositional generalization by constraining a decoder and inverting it. We highlight how this inversion can be performed efficiently, either online through gradient-based search or offline through generative replay. We examine the empirical implications of our theory by training a range of generative and non-generative methods on photorealistic image datasets. We find that, without the necessary inductive biases, non-generative methods often fail to generalize compositionally and require large-scale pretraining to improve generalization. By comparison, generative methods yield significant improvements in compositional generalization, without requiring additional data, by leveraging suitable inductive biases on a decoder along with search and replay.

Abstract:
Graph generative models perform well on small structured data but struggle to scale to large, complex structures. Hierarchical approaches improve scalability but often ignore node and edge features, which are critical in realworld applications, particularly for hypergraphs that model higher-order relationships. In this paper, we propose FAHNES (feature-aware (hyper)graph generation via next-scale prediction), a hierarchical framework that jointly generates topology and features for graphs and hypergraphs. FAHNES builds multi-scale representations through node coarsening and localized expansion, guided by a novel hierarchical scale encoding that controls granularity and ensures cross-scale consistency. Experiments on synthetic, 3D mesh, and graph point cloud datasets demonstrate state-of-the-art performance in joint structure and feature generation.

Abstract:
Consider a market of competing model providers selling query access to models with varying costs and capabilities. Customers submit problem instances and are willing to pay up to a budget for a verifiable solution. An arbitrageur repeatedly queries cheaper models to undercut the price of a more capable model, thus creating a competitive offering with no modeldevelopment risk. In this work, we initiate the study of arbitrage in AI model markets, empirically demonstrating the viability of arbitrage and illustrating its economic consequences. We conduct an in-depth case study of SWE-bench GitHub issue resolution using two representative models, Qwen Coder 30B and Qwen Coder 480B. In this setting, simple arbitrage strategies generate net profit margins of up to 40%. Robust arbitrage strategies that generalize across different domains remain profitable. Distillation further creates strong arbitrage opportunities, even when model providers strategically restrict access to cheaper models. Multiple competing arbitrageurs drive down consumer prices, reducing the marginal revenue of model providers. At the same time, arbitrage reduces market segmentation and facilitates market entry for smaller model providers by enabling earlier revenue capture. Our results suggest that arbitrage is a powerful force in AI model markets with implications for model development, distillation, and deployment.

Abstract:
The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive KeyValue (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token’s future utility and introduce computational overhead. We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding. To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Each agent learns a specialized eviction policy guided by a holistic reward, derived from future utility, that evaluates the quality of the ranking across all cache budgets, requiring no modifications to the underlying LLM or additional inference. Evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (e.g., LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer sequence lengths. These results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.

Abstract:
While endto-end self-supervised learning with backpropagation (global BP-SSL) has become central for training modern AI systems, theories of local self-supervised learning (local-SSL) have struggled to build functional representations in deep neural networks. To establish a link between global and local rules, we first develop a theory for deep linear networks: We identify conditions for local-SSL algorithms (like Forward-forward or CLAPP) to implement exactly the same weight update as a global BP-SSL. Starting from the theoretical insights, we then develop novel variants of local-SSL algorithms to approximate global BP-SSL in deep non-linear convolutional neural networks. Variants that improve the similarity between updates of local-SSL with those of global BP-SSL also show better performance on image datasets (CIFAR-10, STL-10, and Tiny ImageNet). The best local-SSL rule with the CLAPP loss function matches the performance of a comparable global BP-SSL with InfoNCE or CPC-like loss functions, and improves upon state-of-the-art for layer-wise SSL on these benchmarks.

Abstract:
Reinforcement learning (RL) posttraining has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback–Leibler (KL) divergence based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student’s evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose \emphRL-aware distillation (RLAD), which performs selective imitation during RL---guiding the student toward the teacher only when it improves the current policy update. Our core component,Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLADconsistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher–student knowledge distillation.

Abstract:
Standard generative models struggle with heavytailed data: Lipschitz architectures cannot produce power-law tails from Gaussian noise, and interpolating between heavy-tailed data and Gaussians is ill-posed. We propose a simple fix: apply the soft-log transform \phi(x) = \mathrmsign(x) \cdot \log(1 + |x|) to data before training, then exponentiate samples after generation. This compresses heavy tails into a range where standard flow matching succeeds. The approach requires no tail parameter estimation, no heavy-tailed base distributions, and no architectural modifications. We provide theoretical intuition for why this works: the log-transform maps Pareto tails to exponentials, and the induced dynamics implement a form of tail annealing via power transformations. Experiments on synthetic benchmarks and real financial data show that this simple trick achieves competitive sample quality for heavy-tailed distributions (\nu \leq 5), with improved stability over specialized methods in moderate dimensions (d = 50).

Abstract:
Diffusion language models offer fast, parallel decoding via nonautoregressive generation and uncertainty-aware denoising, yet these properties remain underexplored for retrieval. We propose Self-Augmenting Retrieval for Diffusion Language Models, a dynamic framework that uses intermediate diffusion states to refine retrieval throughout the denoising trajectory. At each iteration, we query an external corpus with the partially denoised text, retrieve additional evidence, and condition subsequent denoising steps on the updated context. This tightly couples retrieval to the diffusion process: high-confidence tokens guide retrieval early, while uncertain spans are completed after new evidence is incorporated. Experiments with DREAM-7B, a discrete diffusion language model, on open-domain question answering benchmarks show significant improvements in answer accuracy over static question-only retrieval, while achieving 2--6× higher throughput than autoregressive baselines, demonstrating that diffusion decoding offers a compelling paradigm for efficient, high-quality retrieval-augmented generation.

Abstract:
Errorcorrecting codes enable reliable communication, yet practical soft decoding remains challenging across code families and block lengths. We propose SB-ECC, a score-based decoder that casts decoding as continuous-time denoising. A neural denoiser defines a probability-flow ordinary differential equation (ODE) that iteratively updates the noisy channel observation toward a valid codeword, guided by parity constraints. The model is trained across noise levels without time/SNR conditioning, enabling inference without SNR estimation and supporting a direct latency--accuracy trade-off controlled by the ODE solver budget. We use the raw signed channel observation as input for learning a continuous denoising field. Across 42 code/SNR settings, SB-ECC achieves the best BER in 39/42 entries, with an average SNR gain of 0.21\,dB and a maximum gain of 0.44\,dB over the strongest prior method. Additionally, swapping the solver from Euler to DPM preserves -\ln(\mathrmBER) while reducing end-to-end decoding time by 8.86% on average (up to 12.82%).

Abstract:
Triangles capture higherorder structures in graphs and are fundamental to applications such as clustering and network analysis. To enable efficient use of such structures at scale, we study the problem of triangle cut sparsification, which aims to reduce the graph size while approximately preserving triangle counts across every cut. We investigate quantum algorithms for this problem, using triangle listing as our main technical ingredient. In particular, we present a quantum algorithm for triangle listing that, for a graph with n vertices, m edges, and t triangles, runs in time T_\mathrmq\text-list =\widetildeO\bigl(\min(n^5/4t^7/12 + n^7/6t^7/9, m + m^3/4t^1/2,n^3/2t^1/2)\bigr), improving upon the best known classical bounds over a broad range of parameters. Our algorithm is based on a heavy–light vertex partition and an extension of triangle detection via quantum search and Grover search. Leveraging this result, we design a quantum algorithm for constructing \varepsilon-triangle cut sparsifiers of size \tildeO(n/\varepsilon^2) in time \widetildeO(T_\mathrmq\text-list + \sqrtmn/\varepsilon). Finally, we demonstrate applications to clustering algorithms based on triangle-related measures and prove a lower bound of \Omega(n/\varepsilon^2) on the size of any \varepsilon-triangle cut sparsifiers.

Abstract:
Large language models (LLMs) enable reasoning over biomolecular structures, yet existing methods remain modalityspecific and typically compress structural inputs via sequence-based tokenization or fixed-length query connectors. Such architectures either omit geometric grounding required to mitigate structural hallucinations or impose inflexible modality-fusion bottlenecks that both over-compress and misallocate structural tokens, impeding generalized all-atom reasoning. We introduceCuttlefish, a unified all-atom LLM that grounds language reasoning in geometric cues while scaling modality tokens with structural complexity. First,Scaling-Aware Patchinguses an instruction-conditioned gating mechanism to generate variable-size patches over structural graphs, adaptively scaling the query-token budget with structural complexity to mitigate fixed-length connector bottlenecks. Second,Geometry Grounding Adapterrefines these adaptive tokens via cross-attention to modality embeddings and injects the resulting modality tokens into the LLM, exposing explicit geometric cues to reduce structural hallucination. Experiments across diverse all-atom benchmarks show that Cuttlefish achieves superior performance in heterogeneous structure-grounded reasoning.

Abstract:
Machinelearned (ML) XC-functionals promise improved accuracy, but overfit to training energies and basis sets without proper regularization. We introduce Derivative Informed XC-Loss (DI-Loss), a loss that regularizes ML-XC training by supervising energy gradients on the Grassmannian of density matrices. Crucially, rather than merely matching the self-consistent fixed point, DI-Loss forces the dynamics of the SCF process to align with the target functional. Across all evaluated architectures, this improves basis set generalization and electron densities. Distilling hybrid (\mathcalO(N^4)-scaling) functionals to \mathcalO(N^3)-scaling ML-XC functionals, we observe a >60% reduction in energy MAE compared to energy and density supervision alone, while simultaneously reducing the density-dipole error by 65%. We show that initializing from these distilled functionals can reduce hybrid SCF iterations by up to 55%. Furthermore, DI-Loss improves TDDFT excited-state predictions by approximately 30%.

Abstract:
Despite the generative capabilities of diffusion models, realimage editing remains constrained by a persistent trade-off between semantic editability and structural fidelity. We identify a primary cause of this limitation as the implicit coupling of editing progress with noise scale in existing paradigms. This coupling creates a budget misallocation: achieving stronger semantic changes often necessitates initializing from high-noise states, which can consume computation on disrupting global layout before semantic modification begins. To address this, we introduce NaviEdit, a training-free framework that decouples the editing trajectory from the denoising schedule via a strict Time-Axis Consistency principle. By reformulating editing as controlled vector field navigation on a distinct task axis, NaviEdit strategically concentrates the computational budget within semantically responsive intermediate scales while reducing exposure to destructive high-noise regimes. Experiments show that NaviEdit outperforms strong state-of-the-art baselines across PIE-Bench, achieving larger semantic edits with better structure preservation under comparable compute budgets, without requiring model tuning.

Abstract:
Physicsinformed neural networks (PINNs) have emerged as a powerful deep learning approach for solving partial differential equations (PDEs) in the physical sciences, yet their behavior remains largely opaque and is typically understood through failure mode analyses rather than explicit interpretability. To address this issue, we introduce PINNfluence, a training data attribution framework for interpreting PINNs based on influence functions. By extending influence functions to composite physics-informed training objectives, we enable fine-grained attribution between predictions, loss components, and training data points. Through benchmark experiments across various PDEs, we demonstrate that influence patterns provide granular diagnostics that distinguish different PINN failure modes. PINNfluence thus opens a new avenue for understanding and improving the reliability of PINNs through the lens of their data.

Abstract:
Reinforcement Learning from Human Feedback (RLHF) can reveal implicit objectives such as safety considerations that go beyond task completion. In this work, we focus on the common safety criteria embedded in crowd preference datasets, where different users who provide feedback may express distinct preferences or objectives, yet follow similar safety principles. Our aim is to discover shared safety criteria from crowd preferences and then transfer them to downstream RL tasks to regularize agent behavior and enforce safety. We first show that direct reward combination—optimizing a preferencelearned reward model together with downstream task rewards—has inherent limitations. Motivated by this, we propose Safe Crowd Preference-based Reinforcement Learning, a hierarchical framework that extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to safely solve downstream tasks. Experiments across safe RL environments with diverse user goals and shared safety constraints demonstrate that our approach substantially lowers safety costs without access to explicit safety rewards, while achieving task performance comparable to oracle methods trained with ground-truth safety signals.

Abstract:
Hallucination is a central limitation of large language models (LLMs), and substantial effort has been devoted to understanding and mitigating it. Towards this, Kalai and Vempala (STOC 2024) introduced a probabilistic framework formalizing calibration and hallucination, and showed that, with high probability, calibrated LLMs hallucinate roughly at the rate of the "missing mass", a measure of how incomplete the training data is relative to its source. This raises two fundamental questions: (i) what property of a calibrated LLM makes hallucinations unavoidable? and (ii) can hallucinations be avoided by giving up calibration? We answer these questions by introducing a simpler property we callinnovationthat measures the tendency of a model to produce outputs outside the training data. We show that innovation is implied by the condition for hallucination identified by Kalai and Vempala, and, further, that it is an almost characterization of hallucination: hallucination implies innovation, and conversely, innovation implies hallucination with high probability. We also provide lower bounds on the hallucination rate based on the "innovation rate", and by relating innovation rate back to missing mass, we obtain new hallucination rate lower bounds based on missing mass that extend the results of Kalai and Vempala.

Abstract:
Modern language models typically rely on two design choices: subword tokenization and autoregressive (AR) ordering. To achieve more universal modeling, the field is advancing toward bytelevel modeling to bypass domain-specific vocabularies and masked diffusion models (MDM) to enable parallel non-sequential generation. Intuitively, the intersection of these paradigms represents a generative ideal: a modality-agnostic system capable of fine-grained any-order generation. However, the computational interaction between these granular representations and non-sequential objectives remains under-explored. In this work, we investigate the viability of this combination through a compute-matched scaling study. We observe a structural dichotomy: AR models on bytes effectively amortize the cost of tokenization, naturally rediscovering sub-word segmentation at scale. In contrast, byte-level MDMs suffer a non-convergent efficiency collapse. We attribute this disparity to the masking objective, which shatters the local contiguity required to resolve sub-word semantics from bytes, whereas AR's stable causal history preserves these essential local dependencies. Our findings inform the community of a critical efficiency tradeoff, suggesting that future modality-agnostic designs should address this context fragility to maintain efficient scaling.

Abstract:
Despite strong averagecase performance, deep learning models often exhibit systematic errors on specific population groups, known aserror slices. Identifying these groups and the root causes of their failures is critical for model debugging and bias mitigation. However, existing error Slice Discovery Methods (SDMs) typically generate explanations disconnected from the model's inference process, thus only approximating the underlying error source and may be inaccurate. We address this limitation by leveraging Concept Bottleneck Models (CBMs), whose predictions are directly dependent on human-understandable semantic concepts. Since downstream task failures in CBMs commonly arise from concept mispredictions, concept representations provide a strong candidate for error slice identification, offering fine-grained explanations directly linked to the error source. Building on this insight, we introduceCB-SLICE, a concept-based SDM that groups samples with shared concept prediction failures and identifies the keyword-concepts most responsible for each slice’s failure-mode. Across multiple benchmarks, we show that CB-SLICE outperforms state-of-the-art methods in uncovering well-known biases while providing richer and more faithful explanations of model errors.

Abstract:
Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at trainand inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while the inability of visual WMs in precise text rendering led to their reliance on slow, complex pipelines dependent on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in- and 2 out-of-distribution benchmarks, gWorld sets a new pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize training, existing methods typically impose tokenlevel constraints relative to a reference policy. We identify that such constraints penalize deviations indiscriminately; this can flip verifier-determined direction when the policy attempts to outperform the reference, thereby suppressing gains. To resolve this, we proposeOne-Way Policy Optimization (OWPO), a method based on the principle of decoupling optimization direction from update magnitude. In OWPO, the verifier dictates the update direction, while the reference policy serves only to adjust the magnitude. Specifically, OWPO applies asymmetric reweighting: it performsAccelerated Alignmentfor Inferior deviations (where the policy lags behind the reference) andGain Lockingfor Superior deviations (where the policy surpasses the reference). Furthermore, by incorporating iterative reference updates, OWPO creates a ``Ratchet Effect'' that continuously consolidates gains. Experimental results demonstrate that OWPO outperforms strong baselines, including DAPO, OPD, and MOPD, breaking the bottleneck of fixed priors to enable continuous self-evolution without reliance on external reference models.

Abstract:
Heavytailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standard Variational Autoencoders (VAEs) employ simple decoder distributions (e.g., Gaussian) that fail to capture heavy-tailed behavior, while existing heavy-tail-aware extensions remain restricted to predefined parametric families whose tail behavior is fixed a priori. We propose thePhase-Type Variational Autoencoder(PH-VAE), whose decoder distribution is a latent-conditioned Phase-Type (PH) distribution—defined as the absorption time of a continuous-time Markov chain (CTMC). This formulation composes multiple exponential time scales, yielding a flexible, analytically tractable decoder that adapts its tail behavior directly from the observed data. Experiments on synthetic and real-world benchmarks demonstrate that PH-VAE accurately recovers diverse heavy-tailed distributions, significantly outperforming Gaussian, Student-t, and extreme-value-based VAE decoders in modeling tail behavior and extreme quantiles. In multivariate settings, PH-VAE captures realistic cross-dimensional tail dependence through its shared latent representation. To our knowledge, this is the first work to integrate Phase-Type distributions into deep generative modeling, bridging applied probability and representation learning.

Abstract:
World models are central to building AI agents capable of flexible reasoning and planning. Yet current evaluations (i) test only properties measurable from observed interactions within an environment, such as nextframe prediction or task return, and (ii) do not test whether a learned model supports diverse queries about the same environment. In contrast, humans build general-purpose models that can answer many different questions about an environment---including questions that require understanding global structure and counterfactual consequences. We proposeWorldTest: a protocol for evaluating agents' ability to learn general-purpose models. AWorldTestbenchmark pairs environments with multipleenvironment-level queries---properties of the full environment---rather than objectives defined only on observed trajectories. Individually, these queries can target global and counterfactual properties (e.g., reachability or the effects of interventions) that are not determined by any single rollout distribution. Collectively, they assess model generality across query types. We instantiate WorldTest asAutumnBench, a minimal yet expressive benchmark of 43 interactive grid-world environments and 129 tasks across three query families for both humans and learning agents. AutumnBench supports diverse environments and evaluations, including queries to evaluate prediction, counterfactual reasoning, and long-horizon planning. Experiments with 517 human participants and five frontier models show that humans substantially outperform these models, a gap we attribute to differences in exploration and belief updating.WorldTestandAutumnBenchprovide a rigorous framework for evaluating world-model learning and expose critical limitations in current approaches.

Abstract:
Adapting to latent confounded shift remains a core challenge in modern AI. This setting is driven by hidden variables that induce spurious correlations between inputs and outputs during training, leading models to rely on noncausal shortcuts. For example, a model may learn to treat metadata (e.g., data source like "Amazon") as a proxy for positive sentiment, causing failure when the source becomes predominantly negative during deployment. To address thislatent confounded shift, we introduce Causal Fine-Tuning (CFT). Using a structural causal model as an inductive bias, we derive sufficient conditions under which the causal effect of inputs is identifiable (despite latent confounding), and translate these insights into a fine-tuning objective that decomposes representations into high-level causal and low-level spurious components. Instantiating this framework in BERT, we show that learning such causal/spurious representations and adjusting them accordingly yield a more robust predictor. Experiments on spurious correlation injection attacks in text demonstrate that our method outperforms black-box domain generalization baselines, highlighting the benefits of explicitly modeling causal structure.

Abstract:
Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice, responses from previouslyevaluated models are often cached -- creating a potential opportunity to use this additional information to decrease the number of queries required to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance that leverages cached model responses based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods achieve the same mean absolute error as baselines with a substantially decreased query budget. We conclude by proposing an offline method for selecting a set of queries that maximizes the goodness-of-fit on reference models, improving prediction accuracy over random query selection.

Abstract:
Multimodal learning aims to preserve as much taskrelated information as possible from different inputs. However, current fusion designs often distort the feedback loop to feature extractors. Aggressively merging modalities entangles their representations, making the feature extractors fragile to incomplete inputs. Meanwhile, attempting to separate features via auxiliary losses frequently introduces optimization conflicts that distract from the primary task. We propose the Self-Consistent Field Autoencoder (SCFAE) to provide a better path for task gradients. Our method follows the self-consistent field principle to balance task learning with feature organization, thereby minimizing mutual information. We use small autoencoders for each modality to keep information intact. The task loss acts as a driving force to select predictive features. The reconstruction loss acts as a constraint to separate these features into independent subspaces. These dual objectives operate through complementary feature subspaces, thereby mitigating optimization interference. We evaluate SCFAE on audio-visual-text, audio-visual, and image-video benchmarks. Results show that SCFAE handles missing data and unequal input sizes more robustly via a simple structure. Gradient analysis confirms that SCFAE avoids conflicts and maintains stable training dynamics.

Abstract:
We present learningaugmented algorithms for two general classes of online minimization problems: metrical task systems and laminar set cover. Both algorithms achieve improved theoretical guarantees using machine-learned predictions of an optimal solution to the dual linear program. Unlike optimal primal solutions, which can change drastically under tiny instance perturbations, these dual solutions are much more stable, which ensures the existence of good (and learnable) predictions for families of similar instances. While previous work has used dual predictions in offline settings and for online maximization problems, our algorithms are, to the best of our knowledge, the first demonstration that such dual predictions can be effective for online minimization. Our theoretical results are complemented by experiments on the k-server problem and the parking permit problem.

Abstract:
We study a sequential resource allocation problem motivated by adaptive network recruitment, in which a limited budget of identical resources must be allocated over multiple rounds to individuals with stochastic referral capacity. Successful referrals endogenously generate future decision opportunities while allocating additional resources to an individual exhibits diminishing returns. We first show that the singleround allocation problem admits an exact greedy solution based on marginal survival probabilities. In the multi-round setting, the resulting Bellman recursion is intractable due to the stochastic, high-dimensional evolution of the frontier. To address this, we introduce a population-level surrogate value function that depends only on the remaining budget and frontier size. This surrogate enables an exact dynamic program via truncated probability generating functions, yielding a planning algorithm with polynomial complexity in the total budget. We further analyze robustness under model misspecification, proving a multi-round error bound that decomposes into a tight single-round frontier error and a population-level transition error. Finally, we evaluate our method on synthetic and real-world recruitment scenarios.

Abstract:
Decision tree learning has long been a central topic in theoretical computer science, driven by its practical importance. A fundamental and widely used method for decision tree construction is the topdown greedy heuristic, which recursively splits on the most influential variable. Despite its empirical success, theoretical analysis of this heuristic has been limited. A recent breakthrough by Blanc et al. (ITCS, 2020) provided the first rigorous theoretical guarantees for the greedy approach, but only under the uniform distribution. We extend this analysis to the more general and practically relevant setting of arbitrary product distributions. Our main result shows that for any function f computable by an optimal decision tree of size s, maximum depth D_\textopt, and average depth \Delta_\textopt, the greedy heuristic constructs an \epsilon-approximating tree whose size grows at most with \exp\bigl(\Delta_\textopt D_\textopt \log(e/\epsilon)\bigr). In the special case where the optimal tree is a full binary tree, this bound improves upon the bound of Blanc et al. and holds under a strictly broader class of distributions. Moreover, we present an algorithm based on the top-down greedy heuristic that is entirely parameter-free—it requires no prior knowledge of the optimal tree's size or depth—offering a practical advantage over Blanc et al.'s method.

Abstract:
The core of visionlanguage models lies in measuring cross-modal similarity within a unified representation space. However, most image-text matching or multi-class image classification datasets lack fine-grained cross-modal matching annotations, forcing the continuous similarity space into binary classification boundaries. This compression induces false negative samples and significantly impairs the generalization performance of cross-modal tasks. While prior research has attempted to mitigate this by modeling intra-modal ambiguity, it often overlooks inherent annotation flaws, leading to suboptimal uncertainty allocation. To address these challenges, we propose a Variational Adapter for Cross-modal Similarity Representation (VACSR). This approach reformulates image-text matching with fine-grained semantic scarcity as a variational inference problem. It constructs a latent space for cross-modal similarity and uses regularization techniques to mitigate overfitting to binary annotations. We validate the effectiveness of VACSR in image-text retrieval tasks using the COCO Caption dataset and two extended datasets: CxC and ECCV Caption. Furthermore, we conduct out-of-distribution evaluations including domain generalization on ImageNet and its variants, as well as base-to-novel generalization across 11 datasets, highlighting VACSR’s robust generalization performance.

Abstract:
Unsupervised outlier detection is attractive because it eliminates the need for labeled data. Further, forming multimodel ensembles can improve detection robustness performance. However, composing an ensemble without labeled data is challenging. Naively composing ensembles can cause ensemble saturation, where redundant or unreliable detection models degrade performance and incur unnecessary computations. We propose MetaEns, an automatic unsupervised framework for the selection of outlier detection model ensembles. Using labeled meta-datasets, MetaEns learns a model that predicts marginal ensemble gains that estimate the expected improvement of adding a candidate model to a partially constructed ensemble. At test time, this learned signal is combined with a submodular-inspired proxy objective that enforces diminishing returns through diversity-aware discounting and family-level risk regularization, thereby enabling greedy sequential selection with adaptive early stopping. As a result, MetaEns constructs compact, high-quality ensembles without access to ground-truth labels. Experiments on 39 real-world datasets show that MetaEns is able to consistently outperform state-of-the-art unsupervised selectors and ensemble baselines, achieving higher average precision while using fewer models.

Abstract:
Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental tradeoff between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduceDiLA, a novelDisentangledLatentAction world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality.DiLAachieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establishDiLAas a unified framework that simultaneously achieves high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.

Abstract:
Reinforcement Learning with Verifiable Rewards is an effective route for posttraining to strengthen the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse thus makes the training gain become marginal and ineffective. Specifically, a growing fraction of prompts' rollouts become advantage-degenerated: all the self-generated rollouts show verified-success, making the standard deviation over their rewards be zero; accordingly each rollout's advantage becomes degenerated (zero) as well. Given such rollouts' advantages, the policy-gradient for model optimization eventually vanishes, capping the training performance. We argue that some of these rollouts still contain valuable learning signals but unfortunately omitted with the existing RLVR methods. In this paper, inspired through analyzing the entropy pattern behind golden trajectories produced by external expert models, we propose EchoRL for better exploiting the advantage-degenerated rollouts to further improve the training performance. EchoRL is a lightweight module that first identifies an EchoClip from verified-success rollouts based on their step-level entropy values, and then feeds this clip back as an auxiliary supervision signal in the RL objective. Extensive experiments across 10 benchmarks, 5 LLM backbones, and 7 popular RLVR post-training methods demonstrate that EchoRL consistently improves RLVR post-training with minimal overhead.

Abstract:
Modern GUI agents typically rely on a modelcentric and step-wise interaction paradigm, where LLMs must re-interpret the UI and re-decide actions at every screen, which is fragile in long-horizon tasks. In this paper, we propose Executable Agentic Memory (EAM), a structured Knowledge Graph (KG) that shifts GUI planning from free-form generation to a robust retrieval-and-execution process. Our approach includes a sample-efficient memory construction pipeline using state-aware DFS and action-group mining to compress multi-step routines. To ensure efficient planning, we introduce a value-guided graph search where a lightweight Q-function model steers Monte Carlo Tree Search (MCTS) over the KG. We theoretically establish bias-consistency for the Q-model and derive sample complexity bounds for path recovery. Empirically, EAM outperforms state-of-the-art baselines like UI-TARS-7B by up to 19.6% on AndroidWorld, while reducing token costs 6× relative to GPT-4o. With a 2.8s average latency, EAM enables reliable, quick, and long-horizon GUI automation.

Abstract:
Representations on the Symmetric Positive Definite (SPD) manifold have garnered significant attention across different applications. In contrast, the manifold of fullrank correlation matrices, a normalized alternative to SPD matrices, remains largely underexplored. This paper introduces Riemannian networks over the correlation manifold, leveraging five recently developed correlation geometries. We systematically extend basic layers, including Multinomial Logistic Regression (MLR), Fully Connected (FC), and convolutional layers, to these geometries. Besides, we present methods for accurate backpropagation for two correlation geometries. Experiments comparing our approach against existing SPD and Grassmannian networks demonstrate its effectiveness.

Abstract:
The generalpurpose nature of Large Language Models (LLMs) presents a significant challenge for domain-specific applications, often leading to out-of-domain (OOD) interactions that undermine the provider's intent. Existing methods for detecting such scenarios treat the LLM as an uninterpretable black box and overlook the internal processing of inputs. In this work we show that layer transitions provide a promising avenue for extracting domain-specific signature. Specifically, we present several lightweight ways of learning on internal dynamics encoded using a sparse autoencoder (SAE) that exhibit great capability in distinguishing OOD texts. Building on top of SAEs representation transitions enables us to better interpret the LLM internal evolution of input processing and shed light on its decisions. We provide a comprehensive analysis of the method and benchmark it with the gemma-2 2B and 9B models. Our results emphasize the efficacy of the internal process in capturing fine-grained input-related details.

Abstract:
Textto-image generation models have achieved remarkable progress in preference optimization, yet achieving robust alignment across diverse reward models remains a significant challenge. Existing multi-reward fusion approaches rely on weighted summation, which is costly to tune and insufficient for balancing conflicting objectives. More critically, optimization with reward models is highly susceptible to reward hacking, where reward scores increase while the perceived quality of generated images deteriorates. We demonstrate that optimizing against a unified global target under heterogeneous reward upper bounds can induce reward hacking, a risk further exacerbated by the inherent instability of weak reward models. To mitigate this, we propose a Pareto Frontier-Guided Optimal Transport framework. Our method constructs a prompt-specific Pareto frontier and maps dominated samples toward it via distribution-aware optimal transport. Furthermore, we develop both online and offline optimization strategies tailored to diverse reward signal characteristics. To provide a more rigorous assessment, we introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as principled metrics to quantify multi-reward synergy and reward hacking. Experimental results show that our approach outperforms strong baselines with an 11% gain in JDR and achieves a near 80% win rate in human evaluations.

Abstract:
While orthogonal subspace methods try to mitigate task interference in Continual Learning (CL), they often suffer from energy diffusion across the basis, hindering knowledge compaction and exhausting capacity for future tasks. We observe that output feature drift induced by parameter updates is inherently lowrank, and theoretically prove that preserving parameters along the principal directions of this drift minimizes the output reconstruction error. Motivated by this, we propose Energy-Concentrated and Energy-Ordered Low-Rank Adaptation (E^2-LoRA). By explicitly ordering and concentrating knowledge into leading ranks, E^2-LoRA frees capacity for subsequent tasks. Furthermore, we design a dynamic rank allocation strategy to balance stability and plasticity by jointly optimizing energy retention and model plasticity. Extensive experiments across multiple benchmarks demonstrate that E^2-LoRA achieves state-of-the-art performance.

Abstract:
The rapid development of large language models (LLMs) has raised concerns regarding the inclusion of private or inappropriate data during training, which has led to growing interest in LLM unlearning. Many existing LLM unlearning approaches rely on prediction lossbased optimizations, such as maximizing the loss on the forget set. However, these methods often face issues such as over-forgetting and poor model utility. In this work, we address these issues by introducing a novel perspective that shifts the unlearning optimization target to reducing data attribution instead. We propose the first LLM unlearning framework based on data attribution rewards called DareU that employs reinforcement learning to update the LLM and reduce the attribution score of generated responses (i.e., de-attribute) to the forget data owners. Experimental results using an LLM classifier as an efficient approximation of attribution demonstrate that DareU outperforms existing baseline approaches, achieving effective unlearning while balancing forget quality and model utility.

Abstract:
Recent advances in large language models (LLMs) have enabled deep research systems that synthesize comprehensive, reportstyle answers to open-ended queries by combining retrieval, reasoning, and generation. Yet, most frameworks rely on rigid workflows with one-shot scoping and long autonomous runs, offering little room for course correction if user intent shifts mid-process. We presentSteER, a framework for steerable deep research that introduces interpretable, mid-process control into long-horizon research workflows. At each decision point,SteERuses a cost–benefit formulation to determine whether to pause for user input or proceed autonomously. It combines diversity-aware planning with utility signals that reward alignment, novelty, and coverage, and maintains a live persona model that evolves throughout the session.SteERoutperforms state-of-the-art open-source and proprietary baselines by up to 22.80% on alignment, leads on quality metrics such as breadth and balance, and is preferred by human readers in 85%+ of pairwise alignment judgments. We also introduce a persona–query benchmark and data-generation pipeline. To our knowledge, this is the first work to advance deep research with an interactive, interpretable control paradigm, paving the way for controllable, user-aligned agents in long-form tasks.

Abstract:
Multidevice inference can reduce Transformer latency by parallelizing computation. However, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We present ASTRA, a communication-efficient framework that integrates sequence parallelism with mixed-precision attention, where non-local token embeddings are transmitted as low-bit vector-quantized codes while local attention remains full precision. To preserve accuracy under aggressive compression, ASTRA introduces Noise-Augmented Quantization and Distributed Class Tokens. Across vision and language models (e.g., ViT and GPT2), ASTRA achieves up to 2.64× speedup over single-device inference and up to 15.25× over prior multi-device baselines while operating at bandwidths as low as 10 Mbps. ASTRA remains robust on large models (e.g., Llama-3-8B) even under non-ideal network conditions such as packet loss and dynamic networks.

Abstract:
RLbased post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16–28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.

Abstract:
Time series reasoning tasks increasingly start from a natural language question and require targeted analysis of time series. Relevant evidence may be global or confined to a few short segments, so the model must decide what to inspect. Most existing methods compress the full series into a fixed representation before inference, preventing questionadaptive analysis. We introduce ARTIST, an approach that formulates time-series reasoning as a sequential decision problem and trains models to interleave reasoning with adaptive temporal segment selection. ARTIST uses a controller-reasoner architecture and reinforcement learning to optimize segment selection based on answer correctness, allowing the model to actively acquire task-relevant information during inference. We evaluate ARTIST on six time-series reasoning benchmarks against large language models, vision-language models, and prior time series reasoning systems. ARTIST improves average accuracy by 6.46 percentage points over the strongest model, with the largest gains on rare event localization and multi-segment evidence accumulation. Supervised fine-tuning improves performance, and reinforcement learning yields further gains by optimizing question-adaptive segment selection. Across datasets, ARTIST achieves higher accuracy while using a smaller fraction of the input time series, highlighting the importance of learned, selective data utilization for time series reasoning.

Abstract:
Many realworld optimization problems are guided by complex, subjective preferences that are difficult to express as explicit closed-form objectives. In response, we introduce Language-in-the-Loop Optimization (LILO), a Bayesian optimization (BO) framework that employs a large language model (LLM) to translate free-form natural language feedback and prior knowledge from a decision maker into structured preference signals, going beyond the restrictive scalar or pairwise feedback formats typically assumed in preferential BO. The LLM-derived preferences are integrated by a Gaussian process proxy model, enabling principled acquisition-driven exploration with calibrated uncertainty. By placing the LLM in a supporting role rather than as the optimizer itself, LILO preserves the sample efficiency and stability of BO while providing a flexible and expressive feedback interface. Across synthetic and real-world benchmarks, LILO consistently outperforms both conventional preference-based BO methods and LLM-only optimizers, with particularly strong gains in feedback-limited regimes.

Abstract:
ZerothOrder (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically employ isotropic perturbations, neglecting the rich structural information available during the forward pass. In this paper, we identify a crucial link between gradient formation and activation structure: the gradient of a linear layer is confined to the subspace spanned by its input activations. Leveraging this insight, we propose Activation-Guided Zeroth-Order optimization (AGZO). Unlike prior methods, AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. We provide a theoretical framework showing that AGZO optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines. Empirically, we evaluate AGZO on Qwen3 and Pangu models across various benchmarks. AGZO consistently outperforms state-of-the-art ZO baselines and significantly narrows the performance gap with first-order fine-tuning, while maintaining almost the same peak memory footprint as other ZO methods.

Abstract:
Spectroscopy infers the internal structure of physical systems by measuring their response to perturbations. We apply this principle to neural networks: perturbing the data distribution by upweighting a token y in context x, we measure the model's response via susceptibilities \chi_xy, which are covariances between componentlevel observables and the perturbation computed over a localized Gibbs posterior via stochastic gradient Langevin dynamics (SGLD). Theoretically, we show that susceptibilities decompose as a sum over \emphmodes of the data distribution, explaining why tokens that follow their contexts ``for similar reasons'' cluster together in susceptibility space. Empirically, we apply this methodology to Pythia-14M, developing a conductance-based clustering algorithm that identifies 510 interpretable clusters ranging from grammatical patterns to code structure to mathematical notation. Comparing to sparse autoencoders, 50% of our clusters match SAE features, validating that both methods recover similar structure.

Abstract:
Uncertainty quantification (UQ) in graph neural networks (GNNs) is crucial in highstakes domains but remains a significant challenge. In graph settings, message passing often relies on strong assumptions such as exchangeability, which are rarely satisfied in practice. Moreover, achieving reliable UQ typically requires costly resampling or post-hoc calibration. To address these issues, we introduce Quantile-free Prediction Interval GNN (QpiGNN), a framework that builds on quantile regression (QR) to enable GNN-based UQ by directly optimizing coverage and interval width without requiring quantile inputs or post-processing. QpiGNN employs a dual-head architecture that decouples prediction and uncertainty, and is trained with label-only supervision through a quantile-free joint loss. This design allows efficient training and yields robust prediction intervals, with theoretical guarantees of asymptotic coverage and near-optimal width under mild assumptions. Experiments on 19 synthetic and real-world benchmarks show QpiGNN achieves average 22% higher coverage and 50% narrower intervals than baselines, while ensuring efficiency and robustness to noise and structural shifts.

Abstract:
By integrating Chainof-Thought (CoT) reasoning, Vision-Language-Action (VLA) models have demonstrated strong capabilities in robotic manipulation, particularly by improving generalization and interpretability. However, the security of CoT-based reasoning mechanisms remains largely unexplored. In this paper, we show that Chain-of-Thought (CoT) reasoning introduces a novel attack vector for targeted control hijacking—for example, causing a robot to mistakenly deliver a knife to a person instead of an apple—without modifying the user’s instruction. We first provide empirical evidence that CoT strongly governs action generation, even when it is semantically misaligned with the input instructions. Building on this observation, we proposeTRAP, the first targeted adversarial attack framework for CoT-reasoning VLA models.TRAPuses an adversarial patch (e.g., a coaster placed on the table) to corrupt intermediate CoT reasoning and hijack the VLA’s output. By optimizing the CoT adversarial loss,TRAPinduces specific and adversary-defined behaviors. Extensive evaluations across 3 mainstreaming VLA architectures and 3 CoT reasoning paradigms validate the effectiveness ofTRAP. Notably, we implemented the patch by printing it on paper in a real-world setting. Our findings highlight the urgent need to secure CoT reasoning in VLA systems.

Abstract:
In most realworld image-to-image (I2I) scenarios, existing evaluations primarily focus on instruction following and the perceptual quality or aesthetics of the generated images. However, they largely fail to assess whether the output image preserves the semantic correspondence and spatial structure of the input image. To address this limitation, we propose StableI2I, a unified and dynamic evaluation framework that explicitly measures content fidelity and pre--post consistency across a wide range of I2I tasks without requiring reference images, including image editing and image restoration. In addition, we construct StableI2I-Bench, a benchmark designed to systematically evaluate the accuracy of MLLMs on such fidelity and consistency assessment tasks. Extensive experimental results demonstrate that StableI2I provides accurate, fine-grained, and interpretable evaluations of content fidelity and consistency, with strong correlations to human subjective judgments. Our framework serves as a practical and reliable evaluation tool for diagnosing content consistency and benchmarking model performance in real-world I2I systems.

Abstract:
Finetuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying the training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM's latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.

Abstract:
We study differentially private approximation algorithms for positive linear programs (LPs with nonnegative coefficients and variables), focusing on the fundamental families of packing, covering, and mixed packingcovering formulations. We focus on the high-sensitivity, constraint-private regime of Hsu-Roth-Roughgarden-Ullman (ICALP 2014), where neighboring instances may differ by an arbitrary single constraint, so one cannot hope to approximately satisfy every constraint under privacy. We give private solvers that return approximate solutions while violating only a controlled number of constraints. Our algorithms improve the prior instance-dependent guarantees, and also yield new data-independent bounds that depend only on the dimension. Our techniques involve a dense multiplicative weights update method developed from a regularized dual viewpoint, which we analyze in a way that exploits structure specific to positive LPs.

Abstract:
Existing LLM testtime scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. Nevertheless, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates the narrowing of exploration. However, the extension of this paradigm to visual domain remains an open research question. In this paper, we first examine the role of visual partitioning in parallelized reasoning and subsequently propose two distinct strategies. Based on the above, we introduce Visual Para-Thinker, representing the inaugural parallel reasoning framework for MLLMs. To maintain path independence and promote diversity in reasoning, our approach integrates Pa-Attention alongside LPRoPE. Leveraging the vLLM framework, we have developed a native multimodal implementation that facilitates high-efficiency parallel processing. Empirical results on benchmark datasets such as V, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker successfully extends the benefits of parallel reasoning to the visual domain.

Abstract:
Reconstructing dynamical evolution from limited observations is a fundamental challenge in singlecell biology, where dynamic unbalanced optimal transport (OT) provides a principled framework for modeling coupled transport and mass variation. However, existing approaches rely on trajectory simulation at inference time, making inference a key bottleneck for scalable applications. In this work, we propose a mean-flow framework for unbalanced flow matching that summarizes both transport and mass-growth dynamics over arbitrary time intervals using mean velocity and mass-growth fields, enabling fast one-step generation without trajectory simulation. To solve dynamic unbalanced OT under the Wasserstein-Fisher-Rao geometry, we further build on this framework to developWasserstein-Fisher-Rao Mean Flow Matching (WFR-MFM). Across synthetic and real single-cell RNA sequencing datasets, WFR-MFM achieves orders-of-magnitude faster inference than a range of existing baselines while maintaining high predictive accuracy, and enables efficient perturbation response prediction on large synthetic datasets with thousands of conditions.

Abstract:
Continuoustime generative models for crystalline materials enable inverse materials design by learning to predict stable crystal structures, but incorporating explicit target properties into the generative process remains challenging. Policy-gradient reinforcement learning (RL) provides a principled mechanism for aligning generative models with downstream objectives but typically requires access to the score, which has prevented its application to flow-based models that learn only velocity fields. We introduce Open Materials Generation with Inference-time Reinforcement Learning (OMatG-IRL), a policy-gradient RL framework that operates directly on the learned velocity fields and eliminates the need for the explicit computation of the score. OMatG-IRL leverages stochastic perturbations of the underlying generation dynamics preserving the baseline performance of the pretrained generative model while enabling exploration and policy-gradient estimation at inference time. Using OMatG-IRL, we present the first application of RL to crystal structure prediction (CSP). Our method enables effective reinforcement of an energy-based objective while preserving diversity through composition conditioning, and it achieves performance competitive with score-based RL approaches. Finally, we show that OMatG-IRL can learn time-dependent velocity-annealing schedules, enabling accurate CSP with order-of-magnitude improvements in sampling efficiency and, correspondingly, reduction in generation time.

Abstract:
Automated design of multiagent interactions with desirable equilibrium outcomes is inherently difficult due to the computational hardness, non-uniqueness, and instability of the resulting equilibria. In this work, we propose the use of game-agnosticdifferentiable equilibrium blocks(DEBs) as modules in a novel, differentiable framework to address a wide variety of incentive design problems from economics and computer science. We call this frameworkdeep incentive design(DID). To validate our approach, we examine three diverse, challenging incentive design tasks: contract design, machine scheduling, and inverse equilibrium problems. For each task, we train a single neural network using a unified pipeline and DEB. This architecture solves thefull distributionof problem instances, parameterized by a context, handlingallgames across a wide range of scales (from two to sixteen actions per player).

Abstract:
We derive a robust update rule for the online infinite hidden Markov model (iHMM) for when the streaming data contains outliers and the model is misspecified. Leveraging recent advances in generalised Bayesian inference, we define robustness via the posterior influence function (PIF), and provide conditions under which the online iHMM has bounded PIF. Imposing robustness inevitably induces an adaptation lag for regime switching. Our method, which is called Batched Robust iHMM (BRiHMM), balances adaptivity and robustness with two additional tunable parameters. Across limit order book data, hourly electricity demand, and a synthetic high-dimensional linear system, BR-iHMM reduces one-step-ahead forecasting error by up to 67% relative to competing online Bayesian methods. Together with theoretical guarantees of bounded PIF, our results highlight the practicality of our approach for both forecasting and interpretable online learning.

Abstract:
Graph Federated Learning (GFL) enables distributed graph representation learning while protecting the privacy of graph data. However, GFL suffers from heterogeneity arising from diverse node features and structural topologies across multiple clients. To address both types of heterogeneity, we propose a novel graph Federated learning method via Semantic and Structural Alignment (FedSSA), which shares the knowledge of both node features and structural topologies. For node feature heterogeneity, we propose a novel variational model to infer classwise node distributions, so that we can cluster clients based on inferred distributions and construct cluster-level representative distributions. We then minimize the divergence between local and cluster-level distributions to facilitate semantic knowledge sharing. For structural heterogeneity, we employ spectral Graph Neural Networks (GNNs) and propose a spectral energy measure to characterize structural information, so that we can cluster clients based on spectral energy and build cluster-level spectral GNNs. We then align the spectral characteristics of local spectral GNNs with those of cluster-level spectral GNNs to enable structural knowledge sharing. Experiments on six homophilic and five heterophilic graph datasets under both non-overlapping and overlapping partitioning settings demonstrate that FedSSA consistently outperforms eleven state-of-the-art methods. Our code is available at https://anonymous.4open.science/r/FedSSA.

Abstract:
Concept shift occurs when the distribution of labels conditioned on the features changes between domains, making even a welltuned ML model to have learned a fundamentally incorrect representation. Identifying these shifted features provides unique insight into how one dataset differs from another, considering the difference may be across a scientifically relevant dimension, such as time, disease status, population, etc. In this paper, we propose SGShift, a model for detecting concept shift in tabular data and attributing reduced model performance to a sparse set of shifted features. We frame concept shift as a feature selection task to learn the features that can explain performance differences between models in the source and target domain. This framework enables SGShift to adapt powerful statistical tools such as generalized additive models, knockoffs, and absorption towards identifying these shifted features. We conduct extensive experiments in synthetic and real data across various ML models and find SGShift can identify shifted features much more accurately than baseline methods, requires few samples in the shifted domain, and is robust to complex cases of concept shift.

Abstract:
Diffusion language models offer a promising alternative to autoregressive models due to their global, noncausal generation process, but their continuous latent dynamics make discrete constraints---e.g., the output should be a JSON file that matches a given schema---difficult to impose. We introduce a training-free guidance method for steering continuous diffusion language models to satisfy formal syntactic constraints expressed using regular expressions. Our approach constructs an analytic score estimating the probability that a latent state decodes to a valid string accepted by a given regular expression, and uses its gradient to guide sampling,withouttraining auxiliary classifiers. The denoising process targets the base model conditioned on syntactic validity. We implement our method in Diffinity on top of the PLAID diffusion model and evaluate it on 180 regular-expression constraints over JSON and natural-language benchmarks. Diffinity achieves 68-96% constraint satisfaction while incurring only a small perplexity cost relative to unconstrained sampling, outperforming autoregressive constrained decoding in both constraint satisfaction and output quality.

Abstract:
Testtime compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. In contrast, we propose a verifier-guided adaptive framework treating reasoning as iterative trajectory generation and selection. For each problem, the agent runs multiple inference iterations. In each iteration, it optionally produces a high-level plan, selects a set of reasoning tools and a compute strategy together with an exploration parameter, and then generates a candidate reasoning trajectory. A process reward model (PRM) serves as a unified control signal: within each iteration, step-level PRM scores are aggregated to guide pruning and expansion during generation, and across iterations, aggregated trajectory rewards are used to select the final response. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench. We characterize efficiency using theoretical FLOPs and a compute intensity metric penalizing wasted generation and tool overhead, demonstrating that verification-guided allocation concentrates computation on high-utility reasoning paths.

Abstract:
Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets Continued Fraction Generative Networks. We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring much fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than using standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change in training or inference procedures that have already been put in place for Transformer-based models thus making our approach easy to incorporate in large industrial workflows. We experiment on two very different transformer architectures GPT2-xl (1.5B) and Llama3 (3.2B), where the former we pre-train on OpenWebText and GneissWeb, while the latter we pre-train on the docling data mix which consists of nine different datasets. Results show that the performance on downstream classification, Q& A, reasoning and text understanding tasks of our models is competitive and sometimes even superior to the original models with two thirds to half the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.

Abstract:
Efficient longcontext understanding is increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens—compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient and hardware-friendly execution on GPU. Across multiple modalities and tasks ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving \ge 99% of full-attention accuracy and up to 2.61× faster attention computation than FlashAttention.

Abstract:
To handle ambiguous and openended requests, Large Language Models (LLMs) are increasingly trained to interact with users to surface intents they have not yet expressed (e.g., ask clarification questions). However, users are often ambiguous because they have not yet formed their intents: they must observe and explore outcomes to discover what they want. Simply asking "what kind of tone do you want?" fails when users themselves do not know. We introduce DiscoverLLM, a novel and generalizable framework that trains LLMs to help users form and discover their intents. Central to our approach is a novel user simulator that models cognitive state with a hierarchy of intents that progressively concretize as the model surfaces relevant options---where the degree of concretization serves as a reward signal that models can be trained to optimize. Resulting models learn to collaborate with users by adaptively diverging (i.e., explore options) when intents are unclear, and converging (i.e., refine and implement) when intents concretize. Across proposed interactive benchmarks in creative writing, technical writing, and SVG drawing, DiscoverLLM achieves over 10% higher task performance while reducing conversation length by up to 40%. In a user study with 75 human participants, DiscoverLLM improved conversation satisfaction and efficiency compared to baselines.

Abstract:
Multiobjective Bayesian optimization (MOBO) provides a principled framework for optimizing expensive black-box functions with multiple objectives. However, existing MOBO methods often struggle with coverage, scalability with respect to the number of objectives, and integrating constraints and preferences. In this work we propose STAGE-BO, Sequential Targeting Adaptive Gap-Filling \varepsilon-Constraint Bayesian Optimization, that explicitly targets under-explored regions of the Pareto front. By analyzing the coverage of the approximate Pareto front, our method identifies the largest geometric gaps. These gaps are then used as constraints, which transforms the problem into a sequence of inequality-constrained subproblems, efficiently solved via constrained expected improvement acquisition. Our approach provides a uniform Pareto coverage without hypervolume computation and naturally applies to constrained, and preference-based settings. Experiments on synthetic and real-world benchmarks demonstrate superior coverage and competitive hypervolume performance against state-of-the-art baselines.

Abstract:
Adaptabilityhas been regarded as a central feature in the foundation models, enabling them to effectively acclimate to unseen downstream tasks. Parameterefficient fine-tuning methods such as celebrated LoRA facilitate efficient adaptation of large foundation models using labeled, high-quality and generally scarce task data. To mitigate data scarcity in fine-tuning of foundation models, we propose to leveragetask similarityacross downstream users. Intuitively, users with similar tasks must be able to assist each other in boosting the effective fine-tuning data size. We proposeCollaborative Low-Rank Adaptation, or CoLoRA, which exploits task similarity to collaboratively and efficiently fine-tune personalized foundation models. The main idea in CoLoRA is to train one shared adapter capturing underlying task similarities across all tasks, and personalized adapters tailored to user-specific tasks. We theoretically study CoLoRA on heterogeneous linear regression and provide provable guarantees for ground truth recovery. We also conduct several natural language experiments with varying task similarity, which further demonstrate that when trained together with similar tasks, individual performances are significantly boosted.

Abstract:
As LLMbased agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm,LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural language requirements to formal specifications. This paper bridges this gap by proposing , a neuro-symbolic framework that employs a bidirectionalFormal-of-Thoughtarchitecture: LLMs serve asspecification compilersthat top-down decompose high-level human intent into atomic, verifiable constraints, then bottom-up prove compliance using Dafny specifications and Z3 Satisfiability modulo theories solving, which produces mathematical guarantees rather than probabilistic scores. We validate across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that achieves an average improvement of16.6%over LLM-as-a-Judge baselines, enablesweak-to-stronggeneralization where a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and providesnear-linear safety improvementthrough iterative refinement.

Abstract:
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or outright negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH500 performance for Qwen2.5-Math-7B in absolute points by 21.4% using randomly assigned rewards, nearly matching the 29.1% gained with ground truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias arising from the clip term, which can amplify high-prior behaviors learned during pre-training even without informative rewards. As a case study, we identify one such high-prior behavior for Qwen2.5-Math models, which we term code reasoning---reasoning in code without actual code execution; code reasoning frequency increases from 65% to over 90% with spurious rewards. However, the presence of such amplifiable behaviors is highly model-dependent. In practice, spurious rewards that are effective for Qwen models often fail to produce gains for other model families, such as Llama3 or OLMo2. Our results highlight the importance of validating RL methods across diverse models rather than relying on a single de facto choice: large performance gains can arise on Qwen models even from random rewards that do not reflect genuine capability improvements.

Abstract:
Large Language Models (LLMs) exhibit significant safety disparities across languages, with lowresource languages (LRLs) often bypassing safety guardrails established for high-resource languages (HRLs) like English. Existing solutions, such as multilingual supervised fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), are computationally expensive and de- pendent on scarce multilingual safety data. In this paper, we propose a novel, training-free alignment framework based on Sparse Weight Editing. Identifying that safety capabilities are localized within a sparse set of ”safety neurons,” we formulate the cross-lingual alignment problem as a constrained linear transformation. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs, while preserving general utility via a null-space projection constraint. Extensive experiments across 8 languages and multiple model families (Llama-3, Qwen-2.5) demonstrate that our method significantly reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities, all achieved with a single, data-efficient calculation.

Abstract:
We introduce a guided stochastic sampling method that augments sampling from diffusion models with physicsbased guidance derived from partial differential equation (PDE) residuals and observational constraints, ensuring generated samples remain physically admissible. We embed this sampling procedure within a new Sequential Monte Carlo (SMC) framework, yielding a scalable generative PDE solver. Across multiple benchmark PDE systems as well as multiphysics and interacting PDE systems, our method produces solution fields with lower numerical error than existing state-of-the-art generative methods.

Abstract:
Despite the central role of attention heads in Transformers, we lack tools to understand why a model attends to a particular token. To address this, we study the querykey (QK) space -- the bilinear joint embedding space between queries and keys. We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components. It is when features in keys and queries align in these low-rank subspaces that high attention scores are produced. We first study our method both analytically and empirically in a simplified setting. We then apply our method to large language models to identify human-interpretable QK subspaces for categorical semantic features and binding features. Finally, we demonstrate how attention scores can be attributed to our identified features.

Abstract:
Imagine Alice has a specific image x^\ast in her mind, say, the view of the street in which she grew up during her childhood. To generate that exact image, she guides a generative model with multiple rounds of prompting and arrives at an image x^p. Although x^p is reasonably close to x^\ast, Alice finds it difficult to close that gap using language prompts. This paper aims to narrow this gap by observing that even after language has reached its limits, humans can still tell when a new image x^+ is closer to x^\ast than x^p. Leveraging this observation, we develop MultiBO (MultiChoice Preferential Bayesian Optimization) that carefully generates K new images as a function of x^p, gets preferential feedback from the user, uses the feedback to guide the diffusion model, and ultimately generates a new set of K images. We show that within B rounds of user feedback, it is possible to arrive much closer to x^\ast, even though the generative model has no information about x^\ast. Qualitative scores from 30 users, combined with quantitative metrics compared across 5 baselines, show promising results, suggesting that multi-choice feedback from humans can be effectively harnessed for personalized image generation.

Abstract:
Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combinesbottleneck masked embedding prediction with flowmatching based spectrogram reconstruction. The outcome,Alethia, is the first foundational audio encoder for various voice deepfake detection and localization tasks. We evaluate on 5 different tasks with 56 benchmark datasets, and noteAlethiasignificantly outperforms state-of-the-art SFMs with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitation of discrete targets in masked token prediction, and show the importance ofcontinuous embeddingprediction andgenerative pretrainingfor capturing deepfake artifacts.

Abstract:
Foundation flowmatching (FM) models promise a universal prior for solving inverse problems (IPs), yet today they trail behind domain-specific or even untrained priors. \emphHow can we unlock their potential? We introduce FMPlug, a plug-in framework that redefines how foundation FMs are used in IPs. FMPlug combines an instance-guided, time-dependent warm-start strategy with a sharp Gaussianity regularization, adding problem-specific guidance while preserving the Gaussian structures. This leads to a significant performance boost across image restoration and scientific IPs. Our results point to a path for making foundation FM models practical, reusable priors for IP solving.

Abstract:
Zerothorder (ZO) methods are widely used when gradients are unavailable or prohibitively expensive, including black-box learning and memory-efficient fine-tuning of large models, yet their optimization dynamics in deep learning remain underexplored. In this work, we provide an explicit step size condition that exactly captures the (mean-square) linear stability of a family of ZO methods based on the standard two-point estimator. Our characterization reveals a sharp contrast with first-order (FO) methods: whereas FO stability is governed solely by the largest Hessian eigenvalue, mean-square stability of ZO methods depends on the entire Hessian spectrum. Since computing the full Hessian spectrum is infeasible in practical neural network training, we further derive tractable stability bounds that depend only on the largest eigenvalue and the Hessian trace. Empirically, we find that full-batch ZO methods operate at the edge of stability: ZO-GD, ZO-GDM, and ZO-Adam consistently stabilize near the predicted stability boundary across CNNs, ResNets, and Transformers on vision tasks. Our results highlight an implicit regularization effect specific to ZO methods, where large step sizes primarily regularize the Hessian trace, whereas in FO methods they regularize the top eigenvalue.

Abstract:
In this paper, we settle the problem of learning optimal linear contracts from data in the offline setting, where agent types are drawn from an unknown distribution and the principal's goal is to design a contract that maximizes her expected utility. Specifically, our analysis shows that the simple Empirical Utility Maximization (EUM) algorithm yields an \varepsilonapproximation of the optimal linear contract with probability at least 1-\delta, using just O(\ln(1/\delta) / \varepsilon^2) samples. This result improves upon previously known bounds and matches a lower bound from (Dütting et al., 2025) up to constant factors, thereby proving its optimality. Furthermore, our result establishes the stronger guarantee of uniform convergence: the empirical utility of every linear contract is a \varepsilon-approximation of its true expectation with probability at least 1-\delta, using the same optimal O(\ln(1/\delta) / \varepsilon^2) sample complexity.

Abstract:
While continuous diffusion models have achieved remarkable success, discrete diffusion offers a unified framework for jointly modeling text and images. Beyond unification, discrete diffusion provides faster inference, finer control, and principled trainingfree guidance, making it well-suited for posterior sampling. Existing approaches to posterior sampling using discrete diffusion face severe challenges: derivative-free guidance yields sparse signals, continuous relaxations limit applicability, and split Gibbs samplers suffer from the curse of dimensionality. To overcome these limitations, we introduce Anchored Posterior Sampling (APS), built on two key innovations:quantized expectationfor gradient-like guidance in discrete embedding space, andanchored remaskingfor adaptive decoding. APS achieves state-of-the-art performance among discrete diffusion samplers on both linear and nonlinear inverse problems across the standard image benchmarks. We demonstrate the generality of APS through training-free stylization and text-guided editing. We further apply APS to a large-scale diffusion language model, showing consistent improvement in question answering.

Abstract:
As language model (LM) agents become increasingly capable and adopted in realworld applications, there is a growing need for scalable evaluation frameworks beyond costly, manually-designed benchmarks. We propose information-theoretic evaluation based on empowerment, an information-theoretic measure of an agent's influence on future states through its actions. To handle the unique challenges of text-based environments, we introduce EELMA (Estimating Empowerment of Language Model Agents), an algorithm for approximating effective empowerment from multi-turn text interactions. We demonstrate EELMA on textual games and web-browsing scenarios, showing that empowerment strongly correlates with average task performance. We further analyze how empowerment varies across models, environment complexity, and agent configurations, and show that high-empowerment states and actions often mark pivotal moments for general capabilities. These results establish empowerment as a general-purpose metric for evaluating LM agents in open-ended settings. Code available: https://anonymous.4open.science/r/EELMA-E227

Abstract:
While large visionlanguage models (VLMs) show promise for object goal navigation, current methods still struggle with low success rates and inefficient localization of unseen objects—failures primarily attributed to weak temporal-spatial reasoning. Meanwhile, recent attempts to inject reasoning into VLM-based agents improve success rates but incur substantial computational overhead. To address both the ineffectiveness and inefficiency of existing approaches, we introduce Hydra-Nav, a unified VLM architecture that adaptively switches between a deliberative "slow system" for analyzing exploration history and formulating high-level plans, and a reactive "fast system" for efficient execution. We train Hydra-Nav through a three-stage curriculum: (i) spatial-action alignment to strengthen trajectory planning, (ii) memory-reasoning integration to enhance temporal-spatial reasoning over long-horizon exploration, and (iii) iterative rejection fine-tuning to enable selective reasoning at critical decision points. Extensive experiments demonstrate that Hydra-Nav achieves state-of-the-art performance on the HM3D, MP3D, and OVON benchmarks, outperforming the second-best methods by 11.1%, 17.4%, and 21.2%, respectively. Furthermore, we introduce SOT (Success weighted by Operation Time), a new metric to measure search efficiency across VLMs with varying reasoning intensity. Results show that adaptive reasoning significantly enhances search efficiency over fixed-frequency baselines.

Abstract:
In this paper, we study the private and robust alignment of language models from a theoretical perspective by establishing upper bounds on the suboptimality gap in both offline and online settings. We consider preference labels subject to privacy constraints and/or adversarial corruption, and analyze two distinct interplays between them: privacyfirst and corruption-first. For the privacy-only setting, we show that log loss with an MLE-style algorithm achieves near-optimal rates, in contrast to conventional wisdom. For the joint privacy-and-corruption setting, we first demonstrate that existing offline algorithms in fact provide stronger guarantees---simultaneously in terms of corruption level and privacy parameters---than previously known, which further yields improved bounds in the corruption-only regime. In addition, we also present the first set of results for private and robust online alignment. Our results are enabled by new uniform convergence guarantees for log loss and square loss under privacy and corruption, which we believe have broad applicability across learning theory and statistics.

Abstract:
Current video diffusion models generate visually compelling content but often violate basic laws of physics, producing subtle artifacts like rubbersheet deformations and inconsistent object motion. We introduce a frequency-domain physics prior that improves motion plausibility without modifying model architectures. Our method decomposes common rigid motions (translation, rotation, scaling) into lightweight spectral losses computed on a low-frequency subset. Applied to Open-Sora, MVDIT, and Hunyuan, our approach improves both motion accuracy and action recognition by ~11% on average on OpenVID-1M (relative), while maintaining visual quality. User studies show 74--83% preference for our physics-enhanced videos. It also reduces warping error by 22--37% (depending on the backbone) and improves temporal consistency scores. These results indicate that simple, global spectral cues are an effective drop-in regularizer for physically plausible motion in video diffusion.

Abstract:
Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the singlesample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking--rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley–Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.

Abstract:
We study the decoupled multiarmed bandit problem, where the learner selects one arm for exploration and one arm for exploitation separately at each round. In this setting, the loss of the explored arm is observed but not incurred, whereas the loss of the exploited arm is incurred without being observed. We propose an efficient Follow-the-Perturbed-Leader (FTPL) policy that achieves Best-of-Both-Worlds (BOBW) guarantee with constant regret in the stochastic regime and optimal \mathcalO(\sqrtKT) regret in the adversarial regime. A key feature of our method is that it completely avoids both the convex optimization required by prior BOBW policy, and the resampling procedures that are typically used in FTPL bandit policies. This allows FTPL to fully realize its computational efficiency advantages, and thus leads to substantial reductions in computational cost. We empirically confirm that our policy not only improves the runtime but also demonstrates superior regret performance in both regimes.

Abstract:
KV cache in autoregressive LLMs eliminates redundant recomputation but has emerged as the dominant memory and bandwidth bottleneck during inference, notably with long contexts and testtime scaling. KV quantization is a key lever for reducing cache cost, but accuracy drops sharply as the native KV distribution lacks flatness and thus maintains a wide quantization range. Prior work focuses on isolating outliers, which caps their error but fails to flatten the overall distribution, leaving performance fragile under low-bit settings. In this work, we show that the K cache maintains a stable, context-evolving structure, while the V cache carries latent semantic regularities, with both contributing to the organization of vectors into shared patterns. Building on these insights, we proposePatternKV, a pattern-aligned residual quantization scheme. It mines representative pattern vectors online, aligns each KV vector to its nearest pattern, and quantizes only the residual. This reshaping of the KV distribution flattens the quantization target and narrows its range, thereby improving the fidelity of low-bit KV quantization. Across long-context and test-time scaling settings on multiple backbones, PatternKV delivers consistent 2-bit gains, with a 0.08% average 4-bit drop relative to FP16, improves test-time scaling accuracy by 10% on average, and raises throughput by 1.5× while supporting 1.25× larger batches.

Abstract:
LLM–driven evolutionary systems have shown promise for automated science discovery, yet existing approaches such as AlphaEvolve rely on fullcode histories that are context-inefficient and potentially provide weak evolutionary guidance. In this work, we first formalize the evolutionary agents as a general Expectation–Maximization framework, where the language model samples candidate programs (E-step) and the system updates the control context based on evaluation feedback (M-step). Under this view, constructing context via full-code snapshots constitutes a suboptimal M-step, as redundant implement details dilutes core algorithmic ideas, making it difficult to provide clear inspirations for evolution. To address this, we propose DeltaEvolve, a momentum-driven evolutionary framework that replaces full-code history with structured semantic delta capturing how and why modifications between successive nodes affect performance. As programs are often decomposable, semantic delta usually contains many effective components which are transferable and more informative to drive improvement. By organizing semantic delta through multi-level database and progressive disclosure mechanism, input tokens are further reduced. Empirical evaluations on tasks across diverse scientific domains show that our framework can discover better solution with less token consumption over full-code-based evolutionary agents.

Abstract:
We investigate the role of network architecture in shaping the inductive biases of modern scorebased generative models. To this end, we introduce the Score Anisotropy Directions (SADs), architecture-dependent directions that reveal how different networks preferentially capture data structure. Our analysis shows that SADs form adaptive bases aligned with the architecture's output geometry, providing a principled way to predict generalization ability in score models prior to training. Through both synthetic data and standard image benchmarks, we demonstrate that SADs reliably capture fine-grained model behavior and correlate with downstream performance, as measured by Wasserstein metrics. Our work offers a new lens for explaining and predicting directional biases of generative models. Code to reproduce our experiments is included in the supplementary material.

Abstract:
Wearable accelerometers have enabled largescale health and wellness monitoring, yet learning robust human-activity representations has been constrained by the scarcity of labeled data. While self-supervised learning offers a potential remedy, existing approaches treat sensor streams as unstructured time series, overlooking the underlying biological structure of human movement, a factor we argue is critical for effective Human Activity Recognition (HAR). We introduce a novel tokenization strategy grounded in the submovement theory of motor control, which posits that continuous wrist motion is composed of superposed elementary basis functions called submovements. We define our token as the movement segment, a unit of motion composed of a finite sequence of submovements that is readily extractable from wrist accelerometer signals. By treating these segments as tokens, we pretrain a Transformer encoder via masked movement-segment reconstruction to model the temporal dependencies of movement segments, shifting the learning focus beyond local waveform morphology. Pretrained on the NHANES corpus (\approx 28k hours; \approx 11k participants; \approx 10M windows), our representations outperform strong wearable SSL baselines across six subject-disjoint HAR benchmarks. Furthermore, they demonstrate stronger data efficiency in data-scarce settings. Code and pretrained weights will be made publicly available.

Abstract:
The rise of model hubs has made it easier to access reusable model components, making model merging a practical tool for combining capabilities. Yet, this modularity also creates a governance gap: downstream users can recompose released weights into unauthorized mixtures that bypass safety alignment or licensing terms. Because existing defenses are largely posthoc and architecture-specific, they provide inconsistent protection across diverse architectures and release formats in practice. To close this gap, we propose Trap^2, an architecture-agnostic protection framework that encodes protection into the update during fine-tuning, regardless of whether they are released as adapters or full models. Instead of relying on architecture-dependent approaches, Trap^2 uses weight re-scaling as a simple proxy for the merging process. It keeps released weights effective in standalone use, but degrades them under re-scaling that often arises in merging, undermining unauthorized merging.

Abstract:
Despite the rapid evolution of training paradigms, the decoder backbone of large visionlanguage models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded ininformation theory and geometryto quantify thegeometric and entropic natureof residual updates. Applying this unified framework reveals a fundamental functional decoupling:Attention acts as a manifold-preserving operatorfocused on reconfiguration, whereasFFNs serve as manifold-expanding operatorsdriving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severemisallocation and redundancyin current mechanisms, suggesting that state-of-the-art LVLMs effectively ``get lost in attention'' rather than efficiently leveraging visual context.

Abstract:
Existing benchmarks for computational materials discovery primarily evaluate static predictive tasks or isolated computational subtasks. While valuable, these evaluations neglect the inherently iterative and adaptive nature of scientific discovery. We introduce MAterials Discovery Environments (MADE), a novel framework for benchmarking end-to-end autonomous materials discovery pipelines. MADE simulates closed-loop discovery campaigns in which an agent or algorithm proposes, evaluates, and refines candidate materials under a constrained oracle budget, capturing the sequential and resource-limited nature of real discovery workflows. We formalize discovery as a search for thermodynamically stable compounds relative to a given convex hull, and evaluate efficacy and efficiency via comparison to baseline algorithms. The framework is flexible; users can compose discovery agents from interchangeable components such as generative models, filters, and planners, enabling the study of arbitrary workflows ranging from fixed pipelines to fully agentic systems with tool use and adaptive decision making. We demonstrate this by conducting systematic experiments across a family of systems, enabling ablation of components in discovery pipelines, and comparison of how methods scale with system complexity.

Abstract:
Large language model (LLM)based coding agents can generate working code, but their solutions often introduce complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program transformations that improve structure and maintainability. In this paper, we investigate if LLM agents (i) can execute refactorings reliably and (ii) identify refactorings that human developers chose in real codebases. We present CodeTaste, a benchmark of refactoring tasks mined from large-scale multi-file changes in open-source repositories. To score solutions, we combine repository test suites with custom static checks that verify removal of undesired code patterns and introduction of desired patterns using dataflow reasoning. Our experimental results indicate a clear gap across frontier models: agents perform well when refactorings are specified in detail, but often fail to discover the human refactoring choices when only presented with a focus area for improvement. A propose-then-implement decomposition improves alignment, and selecting the best-aligned proposal before implementation yields further gains. CodeTaste provides an evaluation target and a potential preference signal for aligning coding agents with human refactoring decisions in realistic codebases.

Abstract:
While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a dataefficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.

Abstract:
Reinforcement Learning with Verifiable Reward (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs). Typically, GRMs generate reasoning chains ending with critiques and preference labels, with RLVR using label correctness as the training reward. However, we demonstrate that such binary classification tasks make GRMs susceptible to guessing correct outcomes without sound critiques, introducing noise into the reward signal and impairing learning effectiveness. To address this, we propose Reward Modeling from Natural Language Human Feedback (RMNLHF), which leverages natural language feedback to obtain process reward signals. Specifically, we compute the similarity between GRM-generated and human critiques as the process reward, providing more accurate signals than outcome-only supervision. Considering that human critiques are difficult to scale, we introduce MetaRM which learns to predict process reward from datasets with human critiques and generalizes to data without them. Experiments on multiple benchmarks demonstrate that RM-NLHF consistently outperforms state-of-the-art models trained with outcome reward, confirming the superiority of natural language over binary feedback.

Abstract:
Synchronization over the special Euclidean group \mathrmSE(3) aims to recover absolute poses from noisy pairwise relative transformations and is a core primitive in robotics and 3D vision. Standard approaches often require multistep heuristic procedures to recover valid poses, which are difficult to analyze and typically lack theoretical guarantees. This paper adopts a dual quaternion representation and formulates \mathrmSE(3) synchronization directly over the unit dual quaternion. A two-stage algorithm is developed: A spectral initializer computed via the power method on a Hermitian dual quaternion measurement matrix, followed by a dual quaternion generalized power method (DQGPM) that enforces feasibility through per-iteration projection. The estimation error bounds are established for spectral estimators, and DQGPM is shown to admits a finite-iteration error bound and achieves linear error contraction up to an explicit noise-dependent threshold. Experiments on synthetic benchmarks and real-world multi-scan point-set registration demonstrate that the proposed pipeline improves both accuracy and efficiency over representative matrix-based methods.

Abstract:
Process reward models (PRMs) allow for finegrained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect (\lambda-GRPO), and show that LLMs tuned with \lambda-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks\textemdashand reach peak performance more rapidly. These results show that we can leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance without employing an explicit PRM, and with a negligible impact on training time and cost.

Abstract:
We investigate the complexity of covariance matrix estimation for Gibbs distributions based on dependent samples from a Markov chain. We show that when π satisfies a Poincaré inequality and the chain possesses a spectral gap, we can achieve similar sample complexity using MCMC as compared to an estimator constructed using i.i.d. samples, with potentially much better query complexity. As an application of our methods, we show improvements for the query complexity in both constrained and unconstrained settings for concrete instances of MCMC. In particular, we provide guarantees regarding isotropic rounding procedures for sampling uniformly on convex bodies.

Abstract:
Transcendental functions, such as the exponential, are central to scientific computing, yet they cannot be natively calculated by digital hardware. Instead, computers must approximate these functions by combining basic operations, such as \+, , ×, \div\, using methods like Taylor series. These methods were developed over centuries by mathematicians, who focused on approaches that could attain arbitrary accuracy. However, computers can handle most applications by using only finite-precision types, like \emphfloat32, where any accuracy beyond the type's precision is effectively discarded. We explore, therefore, whether forgoing arbitrary accuracy can lead to the discovery of more efficient approximations. The evolutionary method of symbolic regression is particularly suitable, as it can search for arbitrary operation combinations and can optimize non-differentiable objectives, such as the number of operations used. Our results show that evolution can discover computer programs that outperform established methods in this setting, despite having no prior mathematical knowledge beyond the calculation of the basic operations. Starting from empty code, symbolic regression constructs programs representing novel mathematical expressions. In particular, we discovered a 10-operation program that approximates the exponential function to 14 significant figures, exceeding the accuracy of previously known approximations of this size by more than 6 orders of magnitude.

Abstract:
Offline goalconditioned reinforcement learning (GCRL) often struggles with long-horizon tasks, where errors in value estimation accumulate and produce unreliable policies. It is typically assumed that effective long-term planning is infeasible without specialized training. In contrast, our work demonstrates that existing GCRL policies can complete long-horizon tasks when combined with a lightweight, training-free planning wrapper. We find that standard goal-conditioned value functions encode locally consistent geometric structure sufficient for planning. Our approach, Test-Time Graph Search (TTGS), constructs a graph over the offline dataset and employs an adaptive subgoal selection strategy. To address unreliable value estimates during shortest-path search, we propose a novel mechanism that softly penalizes long-distance transitions. Our method incurs negligible computational overhead and requires no additional supervision or parameter updates. On the OGBench benchmark, TTGS unlocks latent capabilities in diverse base learners, boosting success rates on challenging locomotion tasks from near-zero to over 90%, often matching or outperforming methods that require complex auxiliary training.

Abstract:
Least privilege is a core security principle: grant each request only the minimum access needed to achieve its goal. Deployed language models almost never follow it, instead being exposed through a single API endpoint that serves all users and requests. This gap exists not because least privilege would be unhelpful—deployments would benefit greatly from reducing unnecessary capability exposure. The real obstacle is definitional and mechanistic: what does "access" mean inside a language model, and how can we enforce it without retraining or deploying multiple models? We take inspiration from least privilege in computer systems and define a class of models calledleastprivilege language models, where privilege isreachable internal computationduring the forward pass. In this view, lowering privilege literally shrinks the model's accessible function class (as opposed to denying access via learned policies). We formalize deployment-time control as a monitor--allocator--enforcer stack, separating (i) request-time signals, (ii) a decision rule that allocates privilege, and (iii) an inference-time mechanism that selects privilege. We then proposeNested Least-Privilege Networks, a shape-preserving, rank-indexed intervention that provides a smooth, reversible control knob. We show that this knob yields policy-usable privilege--utility frontiers and enables selective suppression of targeted capabilities with limited collateral degradation across various policies. Most importantly, we see this as a defense of a completely new deployment paradigm which challenges the premise that we can only have output-level control of language models.

Abstract:
To distinguish Markov equivalent graphs in causal discovery, it is necessary to restrict the structural causal model. A flexible class of models that is general and identifiable in most cases are locationscale noise models (LSNMs), in which the effect Y is modeled based on its causes \boldsymbolX as Y = f(\boldsymbolX) + g(\boldsymbolX)N. To facilitate the estimation of these models, a prominent assumption is that the noise variable N follows a symmetric distribution. We show that when N is a skewed random variable, which is likely in real-world domains, such approaches drop in performance. To address this limitation, we propose SkewD, a likelihood-based method for causal discovery under LSNMs with skewed noise, employing a combination of heuristic search and expectation conditional maximization for parameter estimation. SkewD extends the usual normal distribution framework to the skew-normal setting, enabling reliable inference under symmetric and skewed noise. While our main focus is on bivariate cause-effect inference, we further showcase how SkewD can be extended to the multivariate setting.

Abstract:
Offline reinforcement learning (RL) can fail spectacularly when bootstrapped temporaldifference (TD) updates amplify their own errors, driving the critic toward extreme and unusable Q-values. A key counterintuitive insight of this work is that collapse is not only a property of the backup rule or network architecture: the optimizer alone can directly trigger or suppress instability. From a control-theoretic viewpoint, we model offline TD learning as a feedback system and analyze Adam-based critic updates. This yields a necessary and sufficient condition for stability: training is stable if and only if the spectral radius of the update dynamics is strictly below one. Further analysis reveals that standard Adam updates can inadvertently distort the parameter geometry, necessitating explicit orthogonality constraints to prevent TD error amplification. To this end, we propose AdamO, an Adam-based optimizer with a decoupled orthogonality correction regulated by a strict task-alignment budget. We prove that this design theoretically guarantees worst-case task safety and preserves Adam’s continuous-time dissipative dynamics. Empirically, AdamO is plug-and-play across diverse offline RL baselines, improving stability and returns across a broad suite of benchmarks.

Abstract:
Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision–language models (VLMs), no existing editors tackle reasoningheavy tasks, which typically require humans and models to reason about images. We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.

Abstract:
Large Language Models (LLMs) possess latent multitoken prediction (MTP) capabilities despite being trained only for next-token generation. We introduce a simple and training-free MTP method that probes an LLM using on-the-fly mask tokens derived from its embedding space, enabling parallel future-token prediction without modifying weights or relying on draft models. We construct a speculative token tree by sampling Top-K candidates from mask-token logits and apply a lightweight pruning rule to retain high-probability continuations. During generation, predicted tokens are verified in parallel, yielding lossless decoding while significantly reducing the number of model calls and increasing token throughput. Our probing-based MTP method consistently outperforms existing training-free baselines, improving acceptance length by approximately 12% on LLaMA3 and 8–12% on Qwen3, and increasing throughput by up to 15–19%. We further provide theoretical analysis and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step predictions without retraining or auxiliary models.

Abstract:
We propose dgMARK, a decodingguided watermarking method for discrete diffusion language models (dLLMs). Unlike autoregressive models, dLLMs can generate tokens in arbitrary order. While an ideal conditional predictor would be invariant to this order, practical dLLMs exhibit strong sensitivity to the unmasking order, creating a new channel for watermarking. dgMARK steers the unmasking order toward positions whose high-reward candidate tokens satisfy a simple parity constraint induced by a binary hash, without explicitly reweighting the model’s learned probabilities. The method is plug-and-play with common decoding strategies (e.g., confidence, entropy, and margin-based ordering) and can be strengthened with a one-step lookahead variant. Watermarks are detected via elevated parity-matching statistics, and a sliding-window detector ensures robustness under post-editing operations including insertion, deletion, substitution, and paraphrasing.

Abstract:
We study beyond worst case analysis for the kmeans problem where the goal is to model typical instances of k-means arising in practice. Existing theoretical approaches provide guarantees under certain assumptions on the optimal solutions to k-means, making them difficult to validate in practice. We propose the manifold hypothesis, where data obtained in ambient dimension D concentrates around a low dimensional manifold of intrinsic dimension d, as a reasonable assumption to model real world clustering instances. We identify key geometric properties of datasets which have theoretically predictable scaling laws depending on the quantization exponent \varepsilon = 2/d using techniques from optimum quantization theory. We show how to exploit these regularities to design a fast seeding method called \operatornameQkmeans which provides O(\rho^-2 \log k) approximate solutions to the k-means problem in time O(nD) + \widetildeO(\varepsilon^1+\rho\rho^-1k^1+\gamma); where the exponent \gamma = \varepsilon + \rho for an input parameter \rho < 1. This allows us to obtain new runtime - quality tradeoffs. We perform a large scale empirical study across various domains to validate our theoretical predictions and algorithm performance to bridge theory and practice for beyond worst case data clustering.

Abstract:
Generative control policies (GCPs), such as diffusionand flow-based control policies, have emerged as effective parameterizations for robot learning. Yet there remains substantial debate over how to sample efficiently fine-tune them via reinforcement learning. A prevailing view holds that fine-tuning all GCP steps is unnecessary, motivating approaches that fine-tune only a subset of the generative process: either steering the initial noise distribution or learning residual corrections on top of a frozen base policy. In this work, we introduce Off-policy Generative Policy Optimization (\OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagate policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. \OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can \emphfine-tune poorly-initialized behavior cloning policies to near full task-success with no expert data in the online replay buffer, and does so with \emphfew task-specific hyperparameter tuning. We perform extensive empirical investigations on \OGPO, finding that its superior performance and sample efficiency lie in its ability to learn beyond the action distribution of the pre-trained base policy, and propose practical implementation details that further boost performance for more complex scenarios.

Abstract:
Performative prediction captures the phenomenon where deploying a predictive model shifts the underlying data distribution. While simple retraining dynamics are known to converge linearly when the performative effects are weak (\rho < 1), the complexity in the regime \rho > 1 was hitherto open. In this paper, we establish a sharp phase transition: computing an \epsilonperformatively stable point is PPAD-complete---and thus polynomial-time equivalent to Nash equilibria in general-sum games---even when \rho = 1 + O(\epsilon). This intractability persists even in the ostensibly simple setting with a quadratic loss function and linear distribution shifts. One of our key technical contributions is to extend this PPAD-hardness result to general convex domains, which is of broader interest in the complexity of variational inequalities. Finally, we address the special case of strategic classification, showing that computing a strategic local optimum is PLS-hard.

Abstract:
Reinforcement learning (RL) is widely used to improve large language models (LLMs) on reasoning tasks, and asynchronous RL training is attractive because it increases endto-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimatormarkedly higher variance: stale off-policy rollouts induce heavy-tailed importance ratios, causing a small fraction of samples to dominate each update. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and reasoning benchmarks, we find collapse is preceded by sharp drops in effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we proposeVarianceControlledPolicyOptimization (VCPO), a drop-in stabilization method for REINFORCE/GRPO-style algorithms that (i)rescales learning rate according to effective sample sizeto dampen unreliable updates, and (ii) applies aclosed-form minimum-variance baselinefor the off-policy setting, avoiding an auxiliary value model and adding minimal overhead. Empirically, VCPO substantially improves robustness in highly asynchronous regimes across models sizes and tasks, reducing long-context, multi-turn training compute by 1.96×. Overall, our results demonstrate explicitly controlling policy-gradient variance is key to making asynchronous RL reliable at scale.

Abstract:
Protein function relies on dynamic conformational ensembles, yet current generative models like AlphaFold3 (AF3) often fail to produce ensembles that match experimental data. Recent experimentguided generators attempt to address this by steering the reverse diffusion process. However, these methods are limited by fixed sampling horizons and sensitivity to initialization, often yielding thermodynamically implausible results. We introduce a general inference-time optimization framework to solve these challenges. First, we optimize over latent representations to maximize ensemble log-likelihood, rather than perturbing structures post hoc. This approach eliminates dependence on diffusion length, removes initialization bias, and easily incorporates external constraints. Second, we present novel sampling schemes for drawing Boltzmann-weighted ensembles. By combining structural priors from AF3 with force-field–based priors, we sample from their product distribution while balancing experimental likelihoods. Our results show that this framework consistently outperforms state-of-the-art guidance, improving diversity, physical energy, and agreement with data in X-ray crystallography and NMR, sometimes fitting the experimental data better than deposited PDB structures. Finally, inference-time optimization experiments maximizing iPTM scores reveal that perturbing MSA embeddings can artificially inflate model confidence. This exposes a vulnerability in current design metrics, whose mitigation could offer a pathway to reduce false discovery rates in binder engineering.

Abstract:
Editing the content of an image with a pretrained textto-image model remains challenging. Existing methods often distort fine details or introduce unintended artifacts. We propose using coupled stochastic differential equations (coupled SDEs) to guide the sampling process of any pre-trained generative model that can be sampled by solving an SDE, including diffusion and rectified flow models. By driving both the source image and the edited image with the same correlated noise, our approach steers new samples toward the desired semantics while preserving visual similarity to the source. The method works out-of-the-box—without retraining or auxiliary networks—and achieves high prompt fidelity along with near-pixel-level consistency. These results position coupled SDEs as a simple yet powerful tool for controlled generative AI.

Abstract:
Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on predefined rubrics, risking the omission of "unknown unknowns", or fail to identify objectives that comprehensively cover and are causal to the model behavior on some dataset. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our approach utilizes an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness. Experiments with popular open-source reward models show that the framework consistently captures > 90% of reward behavior, a finding further corroborated by human evaluation. Additionally, a case study on alignment with an open-source reward model reveals that Obj-Disco can successfully identify latent misaligned incentives that emerge alongside intended behaviors. Our work provides a crucial tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.

Abstract:
We introduce WildCat, a highaccuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy due to resource requirements that scale quadratically with the input sequence length n. WildCat avoids these quadratic costs by only attending over a small weighted coreset. Crucially, we select the coreset using a fast but spectrally-accurate subsampling algorithm -- randomly pivoted Cholesky -- and weight the elements optimally to minimise reconstruction error. Remarkably, given bounded inputs, WildCat approximates exact attention with super-polynomial O(n^-\sqrt\log(\log(n))) error decay while running in near-linear O(n^1+o(1)) time. In contrast, prior practical approximations either lack error guarantees or require quadratic runtime to guarantee such high fidelity. We couple this advance with a GPU-optimized PyTorch implementation and a suite of benchmark experiments demonstrating the benefits of WildCat for image generation, image classification, and language model KV cache compression.

Abstract:
It is well established that spectral graph neural networks (GNNs) can universally approximate node signals; however, their expressive power remains bounded by the 1dimensional Weisfeiler–Lehman test, which is mirrored in their lack of universality for higher-order signals. To go beyond this bound, we propose the Full-Spectrum GNN (FSpecGNN), a second-order generalization of classical spectral GNNs. FSpecGNN advances spectral filtering in two perspectives: (1) it lifts the signal from the node domain to the node-pair domain; and (2) it extends the univariate spectral filter over eigenvalues to a bivariate filter over eigenvalue pairs. We show that classical spectral GNNs arise as a diagonal special case of FSpecGNN, and prove that FSpecGNN can be more expressive than Local 2-GNN while universally approximating node-pair signals, the latter being particularly beneficial for heterophilic graph learning. Moreover, FSpecGNN admits scalable implementations that avoid explicit node-pair-level computations; combined with a low-rank approximation that reduces full-spectrum convolution to a combination of polynomial spectral filters, it enables learning on large graphs. Empirically, FSpecGNN validates the predicted expressivity and delivers strong performance on heterophilic benchmarks.

Abstract:
Searchintegrated reasoning enables language agents to transcend static parametric knowledge by actively querying external sources. However, training these agents via reinforcement learning is hindered by themulti-scale credit assignmentproblem: existing methods typically rely on sparse, trajectory-level rewards that fail to distinguish between high-quality reasoning and fortuitous guesses, leading to redundant or misleading search behaviors. To address this, we propose Search-R2, a novel Actor–Refiner collaboration framework that enhances reasoning through targeted intervention, with both components jointly optimized during training. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Meta-Refiner, which selectively diagnoses and repairs flawed steps via a ``cut-and-regenerate'' mechanism. To provide fine-grained supervision, we introduce a hybrid reward design that couples outcome correctness with a dense process reward quantifying the information density of retrieved evidence. Theoretically, we formalize the Actor–Refiner interaction as a smoothed mixture policy, proving that selective correction yields strict performance gains over strong baselines. Extensive experiments across various general and multi-hop QA datasets demonstrate that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales, achieving superior reasoning accuracy with minimal overhead.

Abstract:
Semisupervised graph anomaly detection (GAD), which assumes a subset of annotated normal nodes available during training, is among the most widely explored applications. However, the normality learned by existing semi-supervised GAD methods is limited to the labeled normal nodes, often inclining to overfitting the given patterns, thereby leading to high detection errors, such as high false positives. To overcome this limitation, we propose GraphNC, a graph normality calibration framework that leverages both labeled and unlabeled data to calibrate the normality from a teacher (a pre-trained semi-supervised GAD model) jointly in anomaly score and representation spaces. GraphNC includes two main components, anomaly score distribution alignment (ScoreDA) and perturbation-based normality regularization (NormReg). ScoreDA optimizes the anomaly scores of our model by aligning them with the score distribution yielded by the teacher. Due to accurate scores in most of the normal nodes and part of the anomaly nodes in the teacher, the alignment effectively pulls the anomaly scores of the two classes toward the two ends, resulting in more separable anomaly scores. To mitigate the misleading by inaccurate scores from the teacher, NormReg is designed to regularize the normality in representation space, making the representations of normal nodes more compact by minimizing a perturbation-guided consistency loss solely on the labeled nodes. Comprehensive experiments on six benchmarks demonstrate that GraphNC (1) consistently and substantially enhances the performance of teacher models from different GAD methods (2) achieves new state-of-the-art performance.

Abstract:
In network neuroscience, functional brain systems are often characterized using separate yet related graphtheoretic or spectral descriptors, overlooking how these properties covary and partially overlap across individuals and conditions. We anticipate that dense, weighted functional connectivity graphs lie on a low-dimensional latent geometry along which both topological and spectral structures vary smoothly at the population level. Although graph-based deep learning offers a powerful framework for modeling these brain connectomes, supervised approaches are constrained by the limited availability of labeled data. Existing unsupervised graph representation methods also typically focus on node-level embeddings, which are limited in capturing compact graph-level representations that preserve information from dense functional connectomes. To address these gaps, we learn compact brain graph representations using a graph transformer autoencoder, where domain-specific, aligned functional gradient geometry provides an inductive bias to guide learning. Despite being trained in a fully unsupervised manner, our approach meaningfully separates cognitive states and enables decoding of visual stimuli, with performance further improved by incorporating neural dynamics. In parallel, to enable generation of synthetic brain graphs, we fit a diffusion model to the learned latent representation and decode samples back to dense connectomes.

Abstract:
Radio Frequency (RF) sensing has emerged as a powerful, privacypreserving alternative to vision-based methods for various perception tasks. However, building high-quality RF datasets in dynamic and diverse environments remains a major challenge. To address this, we introduce WaveVerse, a prompt-based, scalable framework that simulates realistic RF signals from generated indoor scenes with human motions guided by spatial paths, enabling diverse and feasible behaviors without manual trajectory design. WaveVerse features a language-guided 4D world generator and a physics-based signal simulator that enables realistic simulation of RF signals in diverse environments. It employs a phase-coherent ray tracer that preserves both spatial and temporal phase consistency. The simulated signals show high fidelity on phase-sensitive benchmarks, and closely align with both real-world collected measurements and simulations from a proprietary electromagnetic solver. When used for data augmentation, WaveVerse consistently improves performance in downstream tasks like RF imaging and human activity recognition, with gains that grow with the amount of simulated data and surpass existing methods.

Abstract:
Existing approaches for manipulating neural network activations, such as PCA and SAEs, rely on strong assumptions about activation structure. We develop a generative approach that models activations with diffusion, that makes minimal assumptions and improves with data and model scale. We use this activation diffusion model to improve downstream tasks: for instance, postprocessing interventions with its learned generative prior, allowing for more effective steering without sacrificing fluency. Furthermore, the activation diffusion model can be used as an encoder, with units that cover a broad range of human-interpretable concepts, as measured by scalar probing. We also characterize the scaling properties of our approach, training models with 0.5B to 3.3B parameters on one billion residual stream activations from the Llama model family. We see that the diffusion loss decreases smoothly and reliably as a function of compute, and serves as a good proxy for downstream steering and probing performance. Our method provides a scalable approach towards interpretability without requiring commitments to strong assumptions.

Abstract:
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention or KVcache overhead. We introduce DiffuMamba, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling, and DiffuMamba-H, a hybrid variant with interleaved attention. Across scales up to 1.3B parameters, our models match Transformer-based diffusion in downstream performance while achieving up to 8.2× and 4.3× higher inference throughput, respectively, on long sequences. We further present a systematic analysis of inference efficiency across modern DLM variants, combining asymptotic complexity with empirical measurements. Notably, cache-efficient block diffusion with Mamba mixers emerges as the only strategy that scales linearly with sequence length and achieves the strongest performance across all baselines, suggesting a promising direction for future diffusion-based generation systems.

Abstract:
Scorebased diffusion models have recently been extended to infinite-dimensional function spaces, with uses such as inverse problems arising from partial differential equations. In the Bayesian formulation of inverse problems, the aim is to sample from a posterior distribution over functions obtained by conditioning a prior on noisy observations. While diffusion models provide expressive priors in function space, the theory of conditioning them to sample from the posterior remains open. We address this, assuming that either the prior lies in the Cameron-Martin space, or is absolutely continuous with respect to a Gaussian measure. We prove that the models can be conditioned using an infinite-dimensional extension of Doob's h-transform, and that the conditional score decomposes into an unconditional score and a guidance term. As the guidance term is intractable, we propose a simulation-free score matching objective (called Supervised Guidance Training) enabling efficient and stable posterior sampling. We illustrate the theory with numerical examples on Bayesian inverse problems in function spaces. In summary, our work offers the first function-space method for fine-tuning trained diffusion models to accurately sample from a posterior.

Abstract:
The problem of optimising functions with intractable gradients frequently arise in machine learning and statistics, ranging from maximum marginal likelihood estimation procedures to finetuning of generative models. Stochastic approximation methods for this class of problems typically require inner sampling loops to obtain (biased) stochastic gradient estimates, which rapidly becomes computationally expensive. In this work, we develop sequential Monte Carlo (SMC) samplers for optimisation of functions with intractable gradients. Our approach replaces expensive inner sampling methods with efficient SMC approximations, which can result in significant computational gains. We establish convergence results for the basic recursions defined by our methodology which SMC samplers approximate. We demonstrate the effectiveness of our approach on the reward-tuning of energy-based models within various settings.

Abstract:
Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finitesample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level \alpha. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at \alpha = 0.10, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk \approx 0.097 to 0.099), while retaining substantial coverage, reaching 0.89 on RewardBench with Qwen-14B and 0.98 on RewardBench with Qwen-32B. Compared to na\"ive baselines, SCOPE accepts up to 2.4× more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation

Abstract:
Outof-distribution (OOD) detection is critical for reliable deployment of vision models. Mahalanobis-based detectors remain strong baselines, yet their performance varies widely across modern pretrained representations, and it is unclear which properties of a feature space cause these methods to succeed or fail. We conduct a large-scale study across diverse foundation-model backbones and Mahalanobis variants. First, we show that Mahalanobis-style OOD detection is not universally reliable: performance is highly representation-dependent and can shift substantially with pretraining data and fine-tuning regimes. Second, we link this variability to in-distribution geometry and identify a two-term ID summary that consistently tracks Mahalanobis OOD behavior across detectors: within-class spectral structure and local intrinsic dimensionality. Finally, we treat normalization as a geometric control mechanism and introduce radially scaled \ell_2 normalization, \phi_\beta(z)=z/\|z\|^\beta, which preserves directions while contracting or expanding feature radii. Varying \beta changes the radii while preserving directions, so the same quadratic detector sees a different ID geometry. We choose \beta from ID-only geometry signals and typically outperform fixed normalization baselines.

Abstract:
Sampling from unnormalized densities using diffusion models has emerged as a powerful paradigm. However, while recent approaches that use leastsquares `matching' objectives have improved scalability, they often necessitate significant trade-offs, such as restricting prior distributions or relying on unstable optimization schemes. By generalizing these methods as special forms of fixed-point iterations rooted in Nelson's relation, we develop a new method that addresses these limitations. Our approach enables learning a stochastic transport map between arbitrary prior and target distributions with a single, scalable, and stable objective. Furthermore, we introduce a damped variant of this iteration that incorporates a regularization term to mitigate mode collapse. Empirically, we demonstrate that our method enables sampling at unprecedented scales while preserving mode diversity, achieving state-of-the-art results on complex synthetic densities and high-dimensional molecular benchmarks.

Abstract:
Tabular anomaly detection (TAD) aims to identify samples that deviate from the majority in tabular data and is critical in many realworld applications. However, existing methods follow a ``one model for one dataset (OFO)'' paradigm, which relies on dataset-specific training and thus incurs high computational cost and yields limited generalization to unseen domains. To address these limitations, we propose OFA-TAD, a generalist one-for-all (OFA) TAD framework that only requires one-time training on multiple source datasets and can generalize to unseen datasets from diverse domains on-the-fly. To realize one-for-all tabular anomaly detection, OFA-TAD extracts neighbor-distance patterns as transferable cues, and introduces multi-view neighbor-distance representations from multiple transformation-induced metric spaces to mitigate the transformation sensitivity of distance profiles. To adaptively combine multi-view distance evidence, a Mixture-of-Experts (MoE) scoring network is employed for view-specific anomaly scoring and entropy-regularized gated fusion, with a multi-strategy anomaly synthesis mechanism to support training under the one-class constraint. Extensive experiments on 34 datasets from 14 domains demonstrate that OFA-TAD achieves superior anomaly detection performance and strong cross-domain generalizability under the strict OFA setting.

Abstract:
Large Language Model (LLM) agents are increasingly deployed in practice across a wide range of autonomous applications. Yet current safety mechanisms for LLM agents focus almost exclusively on preventing failures in advance, providing limited capabilities for responding to, containing, or recovering from incidents after they inevitably arise. In this work, we introduce AIR, the first incident response framework for LLM agent systems. AIR defines a domainspecific language for managing the incident response lifecycle autonomously in LLM agent systems, and integrates it into the agent's execution loop to (1) detect incidents via semantic checks grounded in the current environment state and recent context, (2) guide the agent to execute containment and recovery actions via its tools, and (3) synthesize guardrail rules during eradication to block similar incidents in future executions. We evaluate AIR on three representative agent types. Results show that AIR achieves detection, remediation, and eradication success rates all exceeding 90%. Extensive experiments further confirm the necessity of AIR's key design components, show the timeliness and moderate overhead of AIR, and demonstrate that LLM-generated rules can approach the effectiveness of developer-authored rules across domains. These results show that incident response is both feasible and essential as a first-class mechanism for improving agent safety.

Abstract:
MultiObjective Reinforcement Learning (MORL) extends standard RL by optimizing policies with respect to multiple, often conflicting, objectives. While max-min MORL has emerged as an effective approach for promoting fairness, its applicability remains limited, particularly when constraints must be incorporated. In this paper, we propose a MORL framework that integrates the max-min criterion with explicit constraint satisfaction. We establish a theoretical foundation for the proposed framework and validate the resulting algorithm through convergence analysis and experiments in tabular settings. We further demonstrate the practical relevance of our approach in simulated building thermal control, multi-objective locomotion control, and greenhouse-gas-emission-aware traffic management. Across these domains, our method effectively balances fairness and constraint satisfaction in multi-objective decision-making.

Abstract:
The training of deep Spiking Neural Networks (SNNs) has traditionally relied on Batch Normalization (BN), which stabilizes input currents and gradients during training. However, BN is not a universal solution. It is unsuitable for variablelength tasks and scenarios with reduced batch size, constraining the development of deep SNNs, where removing BN typically causes the training to fail to converge. This dependence stems not from a fundamental necessity of BN but from the current lack of reasonable initialization methods for SNNs. This paper addresses this core limitation by proposing SpikeInit, a novel initialization framework for SNNs. By modeling the response curve and gradient of spiking layers, SpikeInit initializes the weights and shape parameters of surrogate gradients to maintain stable firing rates during forward propagation and stable gradient magnitudes during backpropagation. Extensive experiments demonstrate that deep SNNs with SpikeInit can be trained stably without normalization and achieve superior performance compared to their normalized counterparts under identical settings. Furthermore, we demonstrate the scalability of SpikeInit by successfully training an ultra-deep, 1000-layer SNN without normalization. Our work provides a foundational step toward large-scale normalization-free SNN, liberating SNN design from the constraints of normalization.

Abstract:
Modern clinical practice relies on evidencebased guidelines implemented as compact scoring systems composed of a small number of interpretable decision rules. While machine-learning models achieve strong performance, many fail to translate into routine clinical use due to misalignment with workflow constraints such as memorability, auditability, and bedside execution. We argue that this gap arises not from insufficient predictive power, but from optimizing over model classes that are incompatible with guideline deployment. Deployable guidelines often take the form of unit-weighted clinical checklists, formed by thresholding the sum of binary rules, but learning such scores requires searching an exponentially large discrete space of possible rule sets. We introduce \textttAgentScore, which performs semantically guided optimization in this space by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, \textttAgentScore outperforms existing score-generation methods and achieves AUC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, \textttAgentScore achieves higher discrimination than established guideline-based scores.

Abstract:
We study ddimensional unbiased mean estimation in the single-message shuffle model, where each user sends a single privatized message and the analyzer only observes the shuffled multiset of reports. While minimax-optimal mechanisms are well understood in the local differential privacy setting, the corresponding notion of optimality after shuffling has remained largely unexplored. To address this gap, we introduce the recently proposed shuffle index and use it to formulate the post-shuffling mechanism design problem as an explicit optimization problem. We then establish a minimax lower bound on the achievable mean squared error in terms of the shuffle index, which implies that mechanisms that are optimal under LDP can become suboptimal once shuffling is applied. Finally, we construct an asymptotically minimax optimal mechanism in the high privacy regime, which as a consequence achieves a privacy-utility trade-off nearly identical to that of the central Gaussian mechanism.

Abstract:
Large language models can produce toxic or inappropriate text even for benign inputs, creating risks when deployed at scale. Detoxification is therefore important for safety and user trust, particularly when we want to reduce harmful content without sacrificing the model’s generation quality. Many existing approaches rely on model retraining, gradients, or learned auxiliary components, which can be costly and may not transfer across model families or to truly blackbox settings. We introduce a test-time procedure that approximates the gradient of completion toxicity with respect to the input embeddings and uses a small number of descent steps to steer generation toward less toxic continuations. This is achieved with zeroth-order optimization that requires only access to input embeddings, a toxicity scoring function, and forward evaluations of the model. Empirically, the approach delivers robust toxicity reductions across models and prompts and, in most settings, achieves the best overall toxicity–quality trade-off. More broadly, our work positions word embeddings as effective control variables and encourages wider use of black-box optimization to guide autoregressive language models toward scalable, safer text generation, without requiring any training or access to intermediate computations.

Abstract:
Geographic data is fundamentally local. Disease outbreaks cluster in population centers, ecological patterns emerge along coastlines, and economic activity concentrates within country borders. Machine learning models that encode geographic location, however, distribute representational capacity uniformly across the globe, struggling at the finegrained resolutions localized applications require. We propose a geographic location encoder built from spherical Slepian functions that concentrates representational capacity inside a region-of-interest and scales to high resolutions without extensive computational demands. For settings requiring global context, we present a hybrid Slepian-Spherical Harmonic encoder that efficiently bridges the tradeoff between local-global performance, while retaining desirable properties such as pole-safety and spherical-surface-distance preservation. Across five tasks spanning classification, regression, and image-augmented prediction, Slepian encodings outperform baselines and retain performance advantages across a wide variety of neural network architectures.

Abstract:
Textto-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows. We propose LayerT2V, a unified multi-layer video generation framework that produces multiple semantically consistent outputs in a single inference pass: the full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes. Our key insight is that recent video generation backbones use high compression in both time and space, enabling us to serialize multiple layer representations along the temporal dimension and jointly model them on a shared generation trajectory. This turns cross-layer consistency into an intrinsic objective, improving semantic alignment and temporal coherence. To mitigate layer ambiguity and conditional leakage, we augment a shared DiT backbone with LayerAdaLN and layer-aware cross-attention modulation. LayerT2V is trained in three stages: alpha mask VAE adaptation, joint multi-layer learning, and multi-foreground extension. We also introduce VidLayer, the first large-scale dataset for multi-layer video generation. Extensive experiments demonstrate that LayerT2V substantially outperforms prior methods in visual fidelity, temporal consistency, and cross-layer coherence. To facilitate future research, we will release the code and dataset upon publication.

Abstract:
Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Objectcentric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models and their OC counterparts. Our key findings reveal that (1) OC approaches are superior in harder compositional generalization settings; (2) original dense representations surpass OC only on easier settings and typically require substantially more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up or surpass them only with sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when any one of dataset size, training data diversity, or downstream compute is constrained.

Abstract:
Selfsupervised learning for physiological time-series aims to captures the identity of the underlying dynamical process while filtering irrelevant noise. However, existing approaches may obscure the clinical semantics important for downstream transferability. Weakly constrained pretext tasks (i.e. contrastive learning, MAE) may incorrectly ignore the underlying dynamical structure, while structurally constrained models (i.e. SVAEs) are unable to selectively filter sample-specific noise. To bridge this gap, we propose \bf PULSE, a novel pretraining objective that simultaneously preserves dynamical relationships important to physiological time-series while selectively removing irrelevant noise. We achieve this by formulating a dynamical systems model to identify transferable and non-transferable information between time-series windows, and target the former through a novel cross-reconstruction objective. We establish theory that provides conditions for when transferrable information is recovered, and empirically validate it through synthetic experiments. On several real-world datasets, PULSE effectively distinguishes clinical semantic classes, increases label efficiency, and improves transfer learning performance.

Abstract:
In the same way that generative models today conduct most of their training in a selfsupervised fashion, how can agentic models conduct their training in a self-supervised fashion, interactively exploring, learning, and preparing to quickly adapt to new tasks? The problem of reward-free exploration is well studied in the unsupervised reinforcement learning (URL) literature but fails to prepare agents for rapid adaptation to new demonstrations. Today's language and vision models are trained on data provided by humans, which provides a strong inductive bias for the sorts of tasks that the model will have to solve. However, when prompted to imitate a new task, some methods perform distribution matching against the demonstration data without properly accounting for the difficulty of various tasks. The key contribution of our paper is a method for pre-training interactive agents in a self-supervised fashion, so that they can instantly mimic expert demonstrations. Our method treats goals (i.e., observations) as the atomic construct. During training, our method automatically proposes goals and practices reaching them, building off prior work in reinforcement learning exploration. During evaluation, our method solves an (amortized) inverse reinforcement learning problem to explain demonstrations as optimal goal-reaching behavior. Experiments on standard benchmarks (not designed for goal-reaching) show that our approach outperforms prior methods for zero-shot imitation.

Abstract:
Textto-image (T2I) diffusion models are widely adopted for their strong generative capabilities, yet remain vulnerable to backdoor attacks. Existing attacks typically rely on fixed textual triggers and single-entity backdoor targets, making them highly susceptible to enumeration-based input defenses and attention-consistency detection. In this work, we propose Semantic-level Backdoor Attack (SemBD), which implants backdoors at the representation level by defining triggers as continuous semantic regions rather than discrete textual patterns. Concretely, SemBD injects semantic backdoors by distillation-based editing of the key and value projection matrices in cross-attention layers, enabling diverse prompts with identical semantic compositions to reliably activate the backdoor attack. To further enhance stealthiness, SemBD incorporates a semantic regularization to prevent unintended activation under incomplete semantics, as well as multi-entity backdoor targets that avoid highly consistent cross-attention patterns. Extensive experiments demonstrate that SemBD achieves a 100% attack success rate while maintaining strong robustness against state-of-the-art input-level defenses.

Abstract:
Standard decoding in Masked Diffusion Models (MDMs) is hindered by context rigidity: tokens are retained based on transient high confidence, often ignoring that early predictions lack full context. This creates cascade effects where initial inconsistencies misguide the remaining generation. Existing revision strategies attempt to mitigate this by relying on static confidence scores, but these signals are inherently myopic; inconsistent tokens frequently appear confident to the model itself. To address this, we propose ContextRobust Remasking (CoRe), a training-free framework for inference-time revision. We introduce a new selection paradigm: rather than trusting static token probabilities, we identify context-brittle tokens by probing their sensitivity to adversarial perturbations. We formalize revision as a robust optimization problem targeting worst-case context shifts. CoRe efficiently approximates this objective to expose unstable tokens, prioritizing them for revision. On LLaDA-8B-Base, CoRe delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and boosting performance on code generation (MBPP) by up to +9.2%.

Abstract:
In recent years, LLMbased multi-agent systems (MAS) have advanced rapidly, using a router to decompose tasks and delegate subtasks to specialized agents. A natural way to expand capability is toscale up the agent poolby continually integrating new functional agents or tool interfaces, but naive expansion can triggerperformance collapsewhen the router cold-starts on newly added, heterogeneous, and unreliable agents. We proposeMonoScale, an expansion-aware update framework that proactively generates a small set of agent-conditioned familiarization tasks, harvests evidence from both successful and failed interactions, and distills it into auditable natural-language memory to guide future routing. We formalize sequential augmentation as a contextual bandit and perform trust-region memory updates, yielding a monotonic non-decreasing performance guarantee across onboarding rounds. Experiments on GAIA and Humanity's Last Exam show stable gains as the agent pool grows, outperforming naive scale-up and strong-router fixed-pool baselines.

Abstract:
Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in highprecision: concretely, gradient updates must be applied to a high-precision weight buffer, known as master weights. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30--800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.

Abstract:
We introduce an orderinvariant reinforcement learning framework for black-box combinatorial optimization. Classical estimation-of-distribution algorithms (EDAs) often rely on learning explicit variable dependency graphs, which can be costly and fail to capture complex interactions efficiently. In contrast, we parameterize a multivariate autoregressive generative model trained without a fixed variable ordering. By sampling random generation orders during training - a form of information-preserving dropout - the model is encouraged to be invariant to variable order, promoting search-space diversity and shaping the model to focus on the most relevant variable dependencies, improving sample efficiency. We adapt Group Relative Policy Optimization (GRPO) to this setting, providing stable policy-gradient updates from scale-invariant advantages. Across a wide range of benchmark algorithms and problem instances of varying sizes, our method frequently achieves the best performance and consistently avoids catastrophic failures.

Abstract:
In many realworld applications of regression, conditional probability estimation, and uncertainty quantification, exploiting symmetries rooted in physics or geometry can dramatically improve generalization and sample efficiency. While geometric deep learning has made empirical advances by incorporating symmetry and geometry priors, less attention has been given to statistical learning guarantees. In this paper, we introduce an equivariant representation learning framework that simultaneously addresses regression, conditional probability estimation, and uncertainty quantification while providing first-of-its-kind non-asymptotic statistical learning guarantees. Grounded in operator and group representation theory, our framework approximates the spectral decomposition of the conditional expectation operator, building representations that are both equivariant and disentangled along independent symmetry quotient groups. Empirical evaluations on synthetic datasets and real-world robotics applications confirm the potential of our approach, matching or outperforming existing equivariant baselines in regression while providing well-calibrated uncertainty estimates.

Abstract:
The epidemic failure of replicability across empirical science and machine learning has recently motivated the formal study of replicable learning algorithms [Impagliazzo et al. (2022)]. In contrast to batch settings (i.e. data comes from a fixed i.i.d. source) where the cost of replicability is relatively well understood, there remain significant gaps in our knowledge for control settings like reinforcement learning where an agent must interact directly with a shifting environment. Indeed, there is a large gap between the best upper bound of \tildeO(S^7 A^7) [Eaton et al. (2023)] for RL with exploration, and \tildeO(S^2 A^2) [Karbasi et al. (2023)] for the RL `batch' setting. This gap raises a key question in the broader theory of replicability: Is replicable exploration inherently more expensive than batch learning? Is sampleefficient replicable RL even possible? In this work, we (nearly) resolve this problem (for low-horizon tabular MDPs): exploration is not a significant barrier to replicable learning! Our main result is a replicable RL algorithm on \tildeO(S^2A) samples, bridging the gap between the generative and episodic settings. We complement this with a lower bound in the episodic setting of \tilde\Omega(S^2) showcasing the near-optimality of our algorithm with respect to the state space S.

Abstract:
We study rewardfree and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable \epsilon-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets \epsilon-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan (2024) achieve minimax sample complexity, but only for restrictively small accuracy parameter \epsilon. We propose a new algorithm that significantly relaxes the requirement on \epsilon. Our approach is novel and of technical interest by itself. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and subsequent computation of an \epsilon-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.

Abstract:
Actorcritic (AC) methods are a cornerstone of reinforcement learning (RL) but offer limited interpretability. Current explainable RL methods seldom usestate attributionsto assist training. Rather, they treat all state features equally, thereby neglecting the heterogeneous impacts of individual state dimensions on the reward. We proposeRKHS-SHAP-based Advanced Actor-Critic (RSA2C), an attribution-aware, kernelized, two-timescale AC algorithm, including Actor, Value Critic, and Advantage Critic. The Actor is instantiated in a vector-valued reproducing kernel Hilbert space (RKHS) with a Mahalanobis-weighted operator-valued kernel, while the Value Critic and Advantage Critic reside in scalar RKHSs. These RKHS-enhanced components use sparsified dictionaries: the Value Critic maintains its own dictionary, while the Actor and Advantage Critic share one. State attributions, computed from the Value Critic via RKHS-SHAP (kernel mean embedding for on-manifold and conditional mean embedding for off-manifold expectations), are converted into Mahalanobis-gated weights that modulate Actor gradients and Advantage Critic targets. We derive a global, non-asymptotic convergence bound understate perturbations, showing stability through the perturbation-error term and efficiency through the convergence-error term. Empirical results on three continuous-control environments show that RSA2C achieves efficiency, stability, and interpretability.

Abstract:
We propose SALSAV, a multimodal video-to-audio generation model capable of synthesizing highly synchronized, high-fidelity long-form audio from silent video content. Our approach introduces a masked diffusion objective, enabling audio-conditioned generation and the seamless synthesis of audio sequences of unconstrained length. Additionally, by integrating a shortcut loss into our training process, we achieve rapid generation of high-quality audio samples in as few as eight sampling steps, paving the way for near-real-time applications without requiring dedicated fine-tuning or retraining. We demonstrate that SALSA-V significantly outperforms existing state-of-the-art methods in both audiovisual alignment and synchronization with video content in quantiative evaluation and a human listening study. Furthermore, our use of random masking during training enables our model to match spectral characteristics of reference audio samples, broadening its applicability to professional audio synthesis tasks such as Foley generation and sound design.

Abstract:
Quantizing LLM weights and activations is a standard approach for efficient deployment, but a few extreme outliers can stretch the dynamic range and amplify lowbit quantization error. Prior transform-based mitigations (e.g., Hadamard rotations) are fixed and data-agnostic, and their optimality for quantization has remained unclear. We derive closed-form optimal linear blockwise transforms for joint weight-activation quantization under standard RTN AbsMax-scaled block quantizers, covering both integer and floating-point formats. The resulting construction, WUSH, combines a Hadamard backbone with a data-dependent second-moment component to form a non-orthogonal transform that is provably near-optimal for FP and INT quantizers under mild assumptions while admitting an efficient fused GPU implementation. Empirically, WUSH improves W4A4 accuracy over the strongest Hadamard-based baselines (e.g., on Llama-3.1-8B-Instruct in MXFP4, it gains +2.8 average points with RTN and +0.7 with GPTQ) while delivering up to 6.6× per-layer throughput over BF16 via FP4 matmul.

Abstract:
Retrievalaugmented generation (RAG) systems commonly improve robustness via query-time adaptations such as query expansion and iterative retrieval. While effective, these approaches are inherently stateless: adaptations are recomputed for each query and discarded thereafter, precluding cumulative learning and repeatedly incurring inference-time cost. Index-side approaches like key expansion introduce persistence but rely on offline preprocessing or heuristic updates that are weakly aligned with downstream task utility, leading to semantic drift and noise accumulation. We propose Evolving Retrieval Memory (ERM), a training-free framework that transforms transient query-time gains into persistent retrieval improvements. ERM updates the retrieval index through correctness-gated feedback, selectively attributes atomic expansion signals to the document keys they benefit, and progressively evolves keys via stable, norm-bounded updates. We show that query and key expansion are theoretically equivalent under standard similarity functions and prove convergence of ERM’s selective updates, amortizing optimal query expansion into a stable index with zero inference-time overhead. Experiments on BEIR and BRIGHT across 13 domains demonstrate consistent gains in retrieval and generation, particularly on reasoning-intensive tasks, at native retrieval speed.

Abstract:
Diffusion Large Language Models (dLLMs) often exhibit severe instability during Group Relative Policy Optimization (GRPO) training, limiting the effectiveness of reinforcement learning for improving reasoning capabilities. In dLLMs, the importance ratios used by GRPO are derived from finitesample estimates rather than exact likelihoods, making them inherently noisy. In this paper, we show that GRPO is highly sensitive to this noise, which drives training instability. Through theoretical analysis and empirical evidence, we identify a self-reinforcing instability loop in which noisy importance ratios induce gradient spikes and policy drift, further amplifying future importance ratio estimation variance. To address this issue, we propose StableDRL, a novel reinforcement learning framework for dLLMs. StableDRL stabilizes training via (i) unconditional clipping to suppress outlier-induced gradient spikes, and (ii) self-normalization to constrain gradients within the convex hull of per-sample updates. We further extend StableDRL to block-wise diffusion models via a staircase attention mechanism. StableDRL is the first method that enables stable, full-parameter reinforcement learning for dLLMs. It achieves the state-of-the-art performance, outperforming prior best full-attention baselines by 6% on MATH500 and block-diffusion baselines by 25.6% on AIME.

Abstract:
Compiler autotuning faces a dichotomy between traditional black-box search methods, which lack semantic guidance, and recent Large Language Model (LLM) approaches, which often suffer from superficial pattern matching and causal opacity. In this paper, we introduce ECCO, a framework that bridges interpretable reasoning with combinatorial search. We first propose a reverse engineering methodology to construct a Chain-of-Thought dataset, explicitly mapping static code features to verifiable performance evidence. This enables the model to learn the causal logic governing optimization decisions rather than merely imitating sequences. Leveraging this interpretable prior, we design a collaborative inference mechanism where the LLM functions as a strategist, defining optimization intents that dynamically guide the mutation operations of a genetic algorithm. Experimental results on seven datasets demonstrate that ECCO outperforms the LLVM opt -O3 baseline, achieving an average 24.44% reduction in cycles. Our code is available at https://anonymous.4open.science/r/ECCO-Evidence-Driven-Causal-Reasoning-for-Compiler-Optimization-3AD2.

Abstract:
We study the problem of functional retargeting: learning dexterous manipulation policies to track object states from human handobject demonstrations. We focus on long-horizon, bimanual tasks with articulated objects, which are challenging due to large action space, spatiotemporal discontinuities, and the embodiment gap between human and robot hands. We propose DexMachina, a novel curriculum-based algorithm: the key idea is to use virtual object controllers with decaying strength: an object is first driven automatically towards its target states, such that the policy can gradually learn to take over under motion and contact guidance. We release a simulation benchmark with a diverse set of tasks and dexterous hands, and show that DexMachina significantly outperforms baseline methods. Our algorithm and benchmark enable a functional comparison for hardware designs, and we present key findings informed by quantitative and qualitative results. With the recent surge in dexterous hand development, we hope this work will provide a useful platform for identifying desirable hardware capabilities and lower the barrier for contributing to future research. Videos and more at \urldexmachina-submission.github.io

Abstract:
Aligning large language models (LLMs) with human preferences is commonly done via reinforcement learning from human feedback (RLHF) with Proximal Policy Optimization (PPO) or, more simply, via Direct Preference Optimization (DPO). While DPO is stable and RLfree, it treats preferences as flat winner vs. loser signals and is sensitive to noisy or brittle preferences arising from fragile chains of thought. We propose TUR-DPO, a topology- and uncertainty-aware variant of DPO that rewards how answers are derived, not only what they say, by eliciting lightweight reasoning topologies and combining semantic faithfulness, utility, and topology quality into a calibrated uncertainty signal. A small learnable reward is factorized over these signals and incorporated into an uncertainty-weighted DPO objective that remains RL-free and relies only on a fixed or moving reference policy. Empirically, across open 7–8B models and benchmarks spanning mathematical reasoning, factual question answering, summarization, and helpful/harmless dialogue, TUR-DPO improves judge win-rates, faithfulness, and calibration relative to DPO while preserving training simplicity and avoiding online rollouts. We further observe consistent gains in multimodal and long-context settings, and show that TUR-DPO matches or exceeds PPO on reasoning-centric tasks while maintaining operational simplicity.

Abstract:
Personalizing large language models (LLMs) to individual user preferences is a critical step beyond generating generically helpful responses. However, current personalization methods are illsuited for new users, as they typically require either slow, resource-intensive fine-tuning or a substantial amount of pre-existing user data, creating a significant cold-start problem. To address this challenge, we introduce a new paradigm for real-time personalization by learning from online pairwise preference feedback collected during text generation. We propose T-POP (Test-Time Personalization with Online Preference Feedback), a novel algorithm that synergistically combines test-time alignment with dueling bandits. Without updating the LLM parameters, T-POP steers the decoding process of a frozen LLM by learning a reward function online that captures user preferences. By leveraging dueling bandits, T-POP intelligently queries the user to efficiently balance between exploring their preferences and exploiting the learned knowledge to generate personalized text. Extensive experiments demonstrate that T-POP achieves rapid and data-efficient personalization, significantly outperforming existing baselines and showing consistent improvement with more user interactions.

Abstract:
As the cost of pretraining large language models grows, there is continued interest in strategies to improve learning efficiency during this core training stage. Motivated by cognitive development, where humans gradually build knowledge as their brains mature, we propose CurriculumGuided Layer Scaling (CGLS), a framework for compute-efficient pretraining that synchronizes increasing data difficulty with model growth through progressive layer stacking (i.e. gradually adding layers during training). At the 100M parameter scale, using a curriculum transitioning from synthetic short stories to general web data, CGLS outperforms baseline methods on the question-answering benchmarks PIQA and ARC. Pretraining at the 1.2B scale, we stratify the DataComp-LM corpus with a DistilBERT-based classifier and progress from general text to highly technical or specialized content. Our results show that progressively increasing model depth alongside sample difficulty leads to better generalization and zero-shot performance on various downstream benchmarks. Altogether, our findings demonstrate that CGLS unlocks the potential of progressive stacking, offering a simple yet effective strategy for improving generalization on knowledge-intensive and reasoning tasks.

Abstract:
Deep stochastic processes have recently become a central paradigm for image enhancement, with many methods explicitly conditioning the stochastic trajectory on the degraded input. However, the relationship between these conditional processes and standard diffusion models remains unclear. In this work, we introduce a unified perspective on stochastic image enhancement by classifying recent methods into three families of continuoustime processes: unconditional diffusion models, Ornstein–Uhlenbeck (OU) processes, and diffusion bridges. We show that all of these approaches arise from a common stochastic differential equation (SDE) formulation. This framework makes explicit that seemingly disparate methods differ primarily in their drift and diffusion terms, terminal distributions, and boundary conditions, while schedulers and samplers constitute orthogonal design choices. Leveraging this unification, we conduct a controlled empirical study across multiple image enhancement tasks using identical architectures and training protocols. Our results reveal no consistently dominant method; instead, we identify and disentangle the specific design choices that most strongly influence performance. Finally, we release ItoVision, a modular PyTorch library that implements the unified framework and enables rapid prototyping and fair comparison of stochastic image enhancement methods.

Abstract:
We study onesided and \alpha-correct sequential hypothesis testing for data generated by an ergodic Markov chain. The null hypothesis is that the unknown transition matrix belongs to a prescribed set \cal P of stochastic matrices, and the alternative corresponds to a disjoint set \cal Q. We establish a tight non-asymptotic instance-dependent lower bound on the expected stopping time of any valid sequential test under the alternative. Our novel analysis improves the existing lower bounds, which are either asymptotic or provably sub-optimal in this setting. Our lower bound incorporates both the stationary distribution and the transition structure induced by the unknown Markov chain. We further propose an optimal test whose expected stopping time matches this lower bound asymptotically as \alpha \to 0. We illustrate the usefulness of our framework through applications to sequential detection of model misspecification in Markov Chain Monte Carlo and to testing structural properties, such as the linearity of transition dynamics, in Markov decision processes. Our findings yield a sharp and general characterization of optimal sequential testing procedures under Markovian dependence.

Abstract:
Despite the remarkable practical success of transformerbased language models, recent work has raised concerns about their ability to perform state tracking. In particular, a growing body of literature has shown this limitation primarily through failures in out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We empirically compare the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes. We find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across different sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths, allowing data from one sequence length to improve performance on others. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.

Abstract:
Large Language Models (LLMs) frequently hallucinate, limiting their reliability in critical applications. Conformal Prediction (CP) addresses this by calibrating error rates on heldout data to provide statistically valid confidence guarantees. Recent work extends CP to LLM factuality: outputs are decomposed into subclaims, each assigned a risk score, and a calibrated threshold filters out risky claims to guarantee hallucination rates below a user-specified level (e.g., 10%). While prior methods treat claims independently, Coherent Factuality extends to multi-step reasoning by representing outputs as dependency graphs and jointly validating claims with their logical ancestors. A key limitation is that Coherent Factuality is not differentiable, requiring hand-crafted scorers that at high reliability levels remove nearly 60% of true claims. We introduce Differentiable Coherent Factuality (DCF), a fully differentiable relaxation that enables learning improved scorers while provably recovering the original algorithm's guarantees. Experiments on two reasoning datasets demonstrate DCF achieves up to 141% improvement in claim retention while maintaining reliability guarantees, representing a significant step towards reliable conformal LLM systems.

Abstract:
Conversational agents are increasingly deployed in knowledgeintensive settings, where correct behavior depends on acquiring and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use in isolation, and rarely test whether agents can operationalize non-parametric knowledge to drive outcomes over long-horizon conversations. To remedy this, we introduce \tau-Knowledge, an extension of \tau-Bench that evaluates agents in environments where task success requires retrieving, reasoning over, and applying knowledge from a natural-language corpus. Our new domain, \tau-Banking, models realistic fintech customer support workflows in which agents must coordinate external knowledge with tool outputs to deliver verifiable, policy-compliant state changes over long-horizon conversations. \tau-Knowledge is substantially difficult: frontier models with high reasoning budgets only reach ～21% \passhat1, with reliability degrading sharply over repeated trials. We hope \tau-Knowledge provides a realistic testbed for developing conversational agents that integrate non-parametric knowledge in human-facing deployments.

Abstract:
We revisit the standard perturbationbased approach of Abernethy et al. (2008) in the context of unconstrained Bandit Linear Optimization (uBLO). We show the surprising result that in the unconstrained setting, this approach effectively reduces Bandit Linear Optimization (BLO) to a standard Online Linear Optimization (OLO) problem. Our framework improves on prior work in several ways. First, we derive expected-regret guarantees when our perturbation scheme is combined with comparator-adaptive OLO algorithms, leading to new insights about the impact of different adversarial models on the resulting comparator-adaptive rates. We extend our analysis to dynamic regret, obtaining the optimal \sqrtP_T path-length dependencies without prior knowledge of P_T. We then develop the first high-probability guarantees for both static and dynamic regret in uBLO. Finally, we discuss lower bounds on the static regret, and prove the folklore \Omega(\sqrtdT) rate for adversarial linear bandits on the unit Euclidean ball, which is of independent interest.

Abstract:
Diffusion Large Language Models (dLLMs) support arbitraryorder generation, yet their inference performance critically depends on the unmasking order. Existing strategies rely on heuristics that greedily optimize local confidence, offering limited guidance for identifying unmasking paths that are globally consistent and accurate. To bridge this gap, we introduce path log-likelihood (Path LL), a trajectory-conditioned objective that strongly correlates with downstream accuracy and enables principled selection of unmasking paths. To optimize Path LL at inference time, we propose POKE, an efficient value estimator that predicts the expected future Path LL of a partial decoding trajectory. We then integrate this lookahead signal into POKE-SMC, a Sequential Monte Carlo-based search framework for dynamically identifying optimal unmasking paths. Extensive experiments across 6 reasoning tasks show that POKE-SMC consistently improves accuracy, achieving 2%--3% average gains over strong decoding-time scaling baselines at comparable inference overhead on LLaDA models and advancing the accuracy--compute Pareto frontier.

Abstract:
Most causal discovery methods recover a completed partially directed acyclic graph (CPDAG) representing a Markov equivalence class from observational data. Recent work has extended these methods to federated settings to address data decentralization and privacy constraints, but often under idealized assumptions that all clients share the same causal model. Such assumptions are unrealistic in practice, as clientspecific policies, for instance, across hospitals, naturally induce heterogeneous and unknown interventions. In this work, we address federated causal discovery under unknown client-level interventions. We propose I-PERI, a novel federated algorithm that first recovers the CPDAG common to all clients and then orients additional edges by exploiting structural differences induced by interventions across clients. This yields a tighter equivalence class, which we call the \mathbf\phi-Markov Equivalence Class, represented by an augmented version of the CPDAG, namely, a \mathbf\phi-CPDAG. We provide theoretical guarantees on the convergence of I-PERI, as well as on its privacy-preserving properties, and present empirical evaluations demonstrating the effectiveness of the proposed algorithm.

Abstract:
Large language models (LLMs) have demonstrated that largescale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.

Abstract:
Training models through selfplay alone (without any human data) has been a longstanding goal in AI, but its effectiveness for training large language models remains unclear, particularly in code generation where rewards based on unit tests are brittle and prone to error propagation. We study self-play in the verified code generation setting, where formal verification provides reliable correctness signals. We introduce Propose, Solve, Verify (PSV) a simple self-play framework where formal verification signals are used to create a proposer capable of generating challenging synthetic problems and a solver trained via expert iteration. We use PSV to train PSV-Verus, which across three benchmarks improves pass@1 by up to 9.6x over inference-only and expert-iteration baselines. We show that performance scales with the number of generated questions and training iterations, and through ablations identify formal verification and difficulty-aware proposal as essential ingredients for successful self-play.

Abstract:
Recent advances in crossmodal few-shot adaptation treat visual-semantic alignment as a continuous feature transport problem via Flow Matching (FM). However, we argue that Euclidean-based FM overlooks fundamental limitations of flat geometry, where polynomial volume growth fails to accommodate diverse feature distributions, leading to severe path entanglement. To this end, we propose path-decoupled Hyperbolic Flow Matching (HFM), leveraging the Lorentz manifold's exponential expansion for trajectory decoupling. HFM structures the transport via two key designs: 1) Centripetal hyperbolic alignment: It constructs a centripetal hierarchy by anchoring textual roots, which pushes visual leaves to the boundary to initialize orderly flows. 2) Path-decoupled objective: It acts as a "semantic guardrail" rigidly confining trajectories within isolated class-specific geodesic corridors via step-wise supervision. Furthermore, we devise an adaptive diameter-based stopping to prevent over-transportation into the crowded origin based on the intrinsic semantic scale. Extensive ablations on 11 benchmarks have shown that HFM establishes a new state-of-the-art, consistently outperforming its Euclidean counterparts. Our codes and models will be released.

Abstract:
We study two recurring phenomena in Transformer language models. First, \emphmassive activations, where a small number of hidden channels attain extremely large values for a few tokens. Second, \emphattention sinks, where certain tokens attract a disproportionate share of attention across many heads and layers. We present a unified inferencetime mechanism explaining how massive activations emerge and propagate through layers, and how normalization transforms these tokens into sparse, nearly fixed vectors that reshape the attention space and induce sink or no-sink behavior. We further conduct ablations on models trained from scratch to disentangle architectural and training factors governing both phenomena. We find that attention sinks persist across architectures and can arise even without massive activations. The normalization strategy primarily determines the emergence of massive activations, while head dimension and context length modulate the frequency of attention sink formation.

Abstract:
Neural operators have emerged as a powerful paradigm for learning discretizationinvariant function-to-function mappings in scientific computing. However, many practical systems are inherently stochastic, making principled uncertainty quantification essential for reliable deployment. To address this, we introduce a simple add-on, thediffusion last layer(DLL), a lightweight probabilistic head that can be attached to arbitrary neural operator backbones to model predictive uncertainty. Motivated by the relative smoothness and low-dimensional structure often exhibited by PDE solution distributions, DLL parameterizes the conditional output distribution directly in function space through a low-rank Karhunen-Loève expansion, enabling efficient and expressive uncertainty modeling. Across stochastic PDE operator learning benchmarks, DLL improves generalization and uncertainty-aware prediction. Moreover, even in deterministic long-horizon rollout settings, DLL enhances rollout stability and provides meaningful estimates of epistemic uncertainty for backbone neural operators.

Abstract:
Electroencephalography (EEG)based multimodal learning integrates brain signals with complementary modalities to improve mental state assessment, providing great clinical potential. The effectiveness of such paradigms largely depends on the representation learning on heterogeneous modalities. For EEG-based paradigms, one promising approach is to leverage their hierarchical structures, as recent studies have shown that both EEG and associated modalities (e.g., facial expressions) exhibit hierarchical structures reflecting complex cognitive processes. However, Euclidean embeddings struggle to represent these hierarchical structures due to their flat geometry, while hyperbolic spaces, with their exponential growth property, are naturally suited for them. In this work, we propose EEG-MoCE, a novel hyperbolic mixture-of-curvature experts framework designed for multimodal neurotechnology. EEG-MoCE assigns each modality to an expert in a learnable-curvature hyperbolic space, enabling adaptive modeling of its intrinsic geometry. A curvature-aware fusion strategy then dynamically weights experts, emphasizing modalities with richer hierarchical information. Extensive experiments on benchmark datasets demonstrate that EEG-MoCE achieves state-of-the-art performance, including emotion recognition, sleep staging, and cognitive assessment.

Abstract:
Given the recent rapid progress of LLM agents like Claude Code or Codex CLI for software engineering, an important next question is whether they can automate AI research itself. In this paper, we studyposttraining, which is the critical step that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-trainingautonomouslyunder bounded compute constraints (10 hours on one H100 GPU). We task frontier agents (e.g., Claude Code with Opus 4.5) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 21.5% for the best agent vs. 51.1% for official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. Additionally, we document concerning behaviors related to reward hacking, such as training on test data or downloading pre-existing instruction-tuned models, and unauthorized usage of API keys for synthetic data generation. Overall, we expect PostTrainBench to serve as an important benchmark for tracking both capabilities and risks of AI R&D automation.

Abstract:
Training instability in modern deep learning systems is frequently triggered by rare but extreme gradientnorm spikes, which can induce oversized parameter updates, corrupt optimizer state, and lead to slow recovery or divergence. Widely used safeguards such as gradient clipping mitigate these failures but require threshold tuning and indiscriminately truncate large updates. We proposeGradientStabilizer, a lightweight, drop-in gradient transform thatpreserves the instantaneous gradient directionwhile replacing the update magnitude with a statistically stabilized estimate derived from running gradient-norm statistics. We prove that the resulting stabilized magnitude is uniformly bounded on spike steps, independent of the spike size, and show how this boundedness controls optimizer state evolution in adaptive methods. Across LLM pre-training (FP16), quantization-aware pre-training (FP4), ImageNet classification, reinforcement learning, and time-series forecasting,GradientStabilizerconsistently improves training stability, widens stable learning-rate regions, and reduces divergence relative to clipping-based baselines, even substantially reducing Adam’s sensitivity to weight-decay strength.

Abstract:
We propose and study distributionally robust Markov games (DR‑MGs) with the average‑reward criterion as a crucial framework for multiagent decision-making under model mismatches and over extended horizons. Under a standard irreducible assumption, we first derive a correspondence between the optimal policies and the solutions of the robust Bellman equation, based on which we further show the existence of a stationary Nash Equilibrium (NE) of the game. We further study DR-MGs under a more general weakly communicating setting. We construct a set-valued map based on the constant-gain optimal robust Bellman operator and show that its value is a subset of the best-response policies. We further prove that this map admits a fixed point, which implies the existence of NE. We then design two algorithms, Robust Nash‑Iteration and robust TD Descent, with provably convergent guarantees. Finally, we show that the NE under average‑reward can be approximated by the ones for the discounted DR-MGs as the discount factor approaches one. Our studies provide a comprehensive theoretical and algorithmic foundation for decision-making in complex, uncertain, and long-running multi-player environments.

Abstract:
Flow Matching (FM) models have emerged as a leading paradigm for highfidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision “anchors” and the SideNet to densify the trajectory, BA-solver enables large step sizes with minimized error. Empirical results on ImageNet-256 demonstrate that BA-solver achieves generation quality comparable to 100+ NFEs Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.

Abstract:
Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real world applications. Conformal prediction provides finitesample, distribution-free coverage guarantees, but standard conformal prediction breaks down under domain shift, often leading to under-coverage and unreliable prediction sets. We propose a new framework called Domain-Shift-Aware Conformal Prediction (DS-CP). Our framework adapts conformal prediction to large language models under domain shift, by systematically reweighting calibration samples based on their proximity to the test prompt, thereby preserving validity while enhancing adaptivity. Our theoretical analysis and experiments on the MMLU benchmark demonstrate that the proposed method delivers more reliable coverage than standard conformal prediction, especially under substantial distribution shifts, while maintaining efficiency. This provides a practical step toward trustworthy uncertainty quantification for large language models in real-world deployment.

Abstract:
Spurious correlations that lead models to correct predictions for the wrong reasons pose a critical challenge for robust realworld generalization. Existing research attributes this issue to group imbalance and addresses it by maximizing group-balanced or worst-group accuracy, which heavily relies on expensive bias annotations. A compromise approach involves predicting bias information using extensively pretrained foundation models, which requires large-scale data and is limited to physically interpretable biases. To address these challenges, we offer a novel perspective by reframing the spurious correlations as imbalances/mismatches in class-conditional distributions caused by general biases, whether interpretable or not, and propose a simple yet effective robust learning method that eliminates the need for bias annotations or predictions. With the goal of maximizing the conditional entropy (uncertainty) of the label given spurious factors, our method leverages a sample reweighting strategy to achieve class-conditional distribution balancing, which automatically highlights minority groups and classes, effectively dismantling spurious correlations and producing a debiased data distribution for classification. Extensive experiments and analysis demonstrate that our approach consistently delivers state-of-the-art performance, rivaling methods that rely on bias supervision.

Abstract:
Eliciting information to reduce uncertainty about latent grouplevel properties is a central problem in collective assessment, preference modeling, and opinion aggregation, and is especially important in survey-based studies. While natural language interactions provide a flexible interface, existing methods typically rely on fixed questionnaires and static respondent sets, and do not adapt to partial or missing responses across rounds. To address this gap, we study adaptive information elicitation through multi-turn interactions between a large language model and a group of individuals, where both queries and respondents are adaptively selected to infer latent group properties. We propose a theoretically grounded framework that, at each round, jointly selects a query and a subset of respondents based on previously observed responses to efficiently reduce uncertainty about a target latent quantity (e.g., group-level political inclination). Motivated by practical survey constraints, such as limited questions and costly participation, our strategy maximizes information gain under a fixed budget. To handle missing and incomplete responses, we combine graph neural networks for aggregating/imputing partial group information with an information-theoretic criterion that guides per-round selection. Across three real-world opinion datasets, we achieve consistent improvements in population-level response prediction under constrained budgets, including over a 12% relative gain on CES at a 10% respondent budget.

Abstract:
Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which among other things, allows them to explore solution strategies with selfchecking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own "thoughts" with a continuum of possible strategies. We identify an interesting inference family Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (hence compute cost) is controllable via degree of parallelism, and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR) (iteratively improve a single candidate answer) which provides performance superior to long CoT. Success of such model orchestrations raises the question whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).

Abstract:
Distributed training of foundation models via \textttDDP is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Lowrank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose \textttLoRDO, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. \textttLoRDO achieves near-parity with low-rank \textttDDP in language modeling and downstream tasks at model scales of 125M--720M, while reducing communication by \approx10×. Finally, we show that \textttLoRDO improves performance even more in very low-memory settings with small rank/batch size.

Abstract:
While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracymodels must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that \model significantly reduces hallucinations (e.g., 43.5% \rightarrow 19.4%) and improves truthfulness (e.g., 5.3% \rightarrow 37.2%), with consistent gains across various backbone models (e.g., Qwen, Llama). In-depth ablation study demonstrates that vanilla accuracy-driven methods such as supervised fine-tuning or RL with a binary reward struggle to balance factual correctness and uncertainty, whereas the truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs. Moreover, we find the improvement of \model arises from enhancing the capability of LLMs to recognize their knowledge boundary, hence avoiding being overly conservative as the baselines are. Further analysis validates our method across multiple evaluation judges, and confirms that TruthRL is robust to hallucination-baiting questions.

Abstract:
Affine Maximizer Auctions (AMAs), a generalized mechanism family from VCG, are widely used in automated mechanism design due to their inherent dominantstrategy incentive compatibility (DSIC) and individual rationality (IR). However, as the payment form is fixed, AMA's expressiveness is restricted, especially in distributions where bidders' valuations are correlated. In this paper, we propose Correlation-Aware AMA (CA-AMA), a novel framework that augments AMA with a new correlation-aware payment. We show that any CA-AMA preserves the DSIC property and formalize finding optimal CA-AMA as a constraint optimization problem subject to the IR constraint. Then, we theoretically characterize scenarios where classic AMAs can perform arbitrarily poorly compared to the optimal revenue, while the CA-AMA can reach the optimal revenue. For optimizing CA-AMA, we design a practical two-stage training algorithm. We derive that the target function's continuity and the generalization bound on the degree of deviation from strict IR. Finally, extensive experiments showcase that our algorithm can find an approximate optimal CA-AMA in various distributions with improved revenue and a low degree of violation of IR.

Abstract:
Many physical AI tasks are governed by implicit equilibrium: an agent actuates a subset of degrees of freedom (boundary DoFs), while the remaining free DoFs settle by minimizing a total potential energy. Even seemingly basic tasks such as bending a deformable linear object (DLO) to a target shape can exhibit strongly nonlinear behavior due to multistability: the same boundary conditions may yield multiple equilibrium shapes depending on the actuation trajectory. However, learning and control in such systems is brittle because the actuation-to-configuration map is defined only implicitly, and naive backpropagation through iterative equilibrium solvers is memory- and compute-intensive. We propose Neural Control, a boundary-control framework that computes trajectory-dependent, memory-efficient proxy gradients by differentiating equilibrium conditions via an adjoint formulation, avoiding unrolling of solver iterations. To improve robustness over long horizons, we integrate these sensitivities into a receding-horizon MPC scheme that repeatedly re-anchors optimization to realized equilibria and mitigates basin-switching in multi-stable regimes. We evaluate Neural Control in simulation and on physical robots manipulating DLOs, and show improved performance over gradient-free baselines such as SPSA and CEM.

Abstract:
Despite the ubiquity of multiway data across scientific domains, there are few userfriendly tools that fit tailored nonnegative tensor factorizations. Researchers may use gradient-based automatic differentiation (which often struggles in nonnegative settings), choose between a limited set of methods with mature implementations, or implement their own model from scratch. As an alternative, we introduce NNEinFact, an einsum-based multiplicative update algorithm that fits any nonnegative tensor factorization expressible as a tensor contraction by minimizing one of many user-specified loss functions (including the (\alpha,\beta)-divergence). To use NNEinFact, the researcher simply specifies their model with a string. NNEinFact converges to a local minimum of the loss, supports missing data, and fits to tensors with hundreds of millions of entries in seconds. Empirically, NNEinFact fits custom models which outperform standard ones in heldout prediction tasks on real-world tensor data by over 37% and attains less than half the test loss of gradient-based methods while converging up to 90 times faster.

Abstract:
Neural tangent kernels (NTKs) are a powerful tool for analyzing deep, nonlinear neural networks. In the infinite-width limit, NTKs can easily be computed for most common architectures, yielding full analytic control over the training dynamics. However, at infinite width, important properties of training such as NTK evolution or feature learning are absent. Nevertheless, finite width effects can be included by computing corrections to the Gaussian statistics at infinite width. We introduce Feynman diagrams for computing finite-width corrections to NTK statistics. These dramatically simplify the necessary algebraic manipulations and enable the computation of layer-wise recursion relations for arbitrary statistics involving preactivations, NTKs and certain higher-derivative tensors (dNTK and ddNTK) required to predict the training dynamics at leading order. We demonstrate the feasibility of our framework by extending stability results for deep networks from preactivations to NTKs and proving the absence of finite-width corrections for scale-invariant nonlinearities such as ReLU on the diagonal of the Gram matrix of the NTK. We numerically implement the complete set of equations necessary to compute the first-order corrections for arbitrary inputs and demonstrate that the results follow the statistics of sampled neural networks for widths n\gtrsim 20.

Abstract:
Loss spikes commonly emerge during neural network training with the Adam optimizer across diverse architectures and scales, yet their underlying mechanism remains elusive. While previous explanations attribute these phenomena to sharper loss landscapes at lower loss, we show that landscape geometry alone is insufficient to explain the phenomenon. In this work, we pinpoint the root cause in the internal dynamics of Adam's second moment estimator. We identify a critical ``decoupling'' mechanism where the adaptive preconditioner v_t fails to track the instantaneous squared gradients g_t^2, causing the adaptive mechanism to effectively fail. This decoupling allows the preconditioner to decay autonomously despite rising gradients, which pushes the maximum eigenvalue of the preconditioned Hessian beyond the stability threshold 2/\eta for sustained periods, manifesting as dramatic loss spikes. Through a quadratic approximation analysis, we theoretically and experimentally characterize five distinct stages of spike evolution and propose a predictor for anticipating spikes based on gradientdirectional curvature. We empirically find that the proposed loss spike mechanism, although derived from simplified models, generalizes well to practical scenarios ranging from small neural networks to large-scale Transformers.

Abstract:
Mobile Agents can autonomously execute user instructions, which requires hybridcapabilities reasoning, including screen summary, subtask planning, action decision and action function. However, existing agents struggle to achieve both decoupled enhancement and balanced integration of these capabilities. To address these challenges, we propose Channel-of-Mobile-Experts (CoME), a novel agent architecture consisting of four distinct experts, each aligned with a specific reasoning stage, CoME activates the corresponding expert to generate output tokens in each reasoning stage via output-oriented activation. To empower CoME with hybrid-capabilities reasoning, we introduce a progressive training strategy: Expert-FT enables decoupling and enhancement of different experts' capability; Router-FT aligns expert activation with the different reasoning stage; CoT-FT facilitates seamless collaboration and balanced optimization across multiple capabilities. To mitigate error propagation in hybrid-capabilities reasoning, we propose InfoGain-Driven DPO (Info-DPO), which uses information gain to evaluate the contribution of each intermediate step, thereby guiding CoME toward more informative reasoning. Comprehensive experiments show that CoME outperforms dense mobile agents and MoE methods on both AITZ and AMEX datasets

Abstract:
Expanding multimodal representations to novel modalities is constrained by reliance on largescale paired datasets (e.g., text–image, text–audio, text–3D, text–molecule), which are costly and often infeasible in domains requiring expert annotation such as medical imaging and molecular analysis. We introduce TextME, the first text-only modality expansion framework, to the best of our knowledge, projecting diverse modalities into LLM embedding space as a unified anchor. Our approach exploits the geometric structure of pretrained contrastive encoders to enable zero-shot cross-modal transfer using only text descriptions, without paired supervision. We empirically validate that such consistent modality gaps exist across image, video, audio, 3D, X-ray, and molecular domains, demonstrating that text-only training can preserve substantial performance of pretrained encoders. Moreover, our framework enables emergent cross-modal retrieval between modality pairs not explicitly aligned during training (e.g., audio-to-image, 3D-to-image). These results establish text-only training as a practical alternative to paired supervision for modality expansion.

Abstract:
Selfsupervised fMRI foundation models have shown promising transfer performance, yet most rely on predefined region-level parcellations that discard fine-grained voxel information and introduce atlas-dependent biases. We propose Omni-fMRI, an atlas-free foundation model that operates directly on voxel-level signals. To enable scalable pretraining on 49,497 fMRI sessions across nine datasets, Omni-fMRI introduces a dynamic patching mechanism that substantially reduces computational cost while preserving informative spatial structure. To support reproducibility and fair comparison, we establish a comprehensive benchmark suite spanning 11 datasets and a diverse set of resting-state and task-based fMRI tasks. Experimental results demonstrate that Omni-fMRI consistently outperforms existing foundation models, providing a scalable and reproducible framework for atlas-free brain representation learning. Code is available.

Abstract:
How can we train agents to navigate uncertainty over long horizons? In this work, we propose ∆BeliefRL, which leverages a language model's own intrinsic beliefs to reward intermediate progress. Our method utilizes the change in the probability an agent assigns to the target solution for credit assignment. By training on synthetic interaction data, ∆Belief-RL teaches information-seeking capabilities that consistently outperform purely outcome-based rewards for RL, with improvements generalizing to out-of-distribution applications ranging from customer service to personalization. Notably, the performance continues to improve as we scale test-time interactions beyond the training horizon, with interaction-efficiency increasing even on Pass@k metrics. Overall, our work introduces a scalable training strategy for navigating uncertainty over a long-horizon, by enabling credit assignment to intermediate actions via intrinsic ∆Belief rewards.

Abstract:
Recent work suggests that largescale, multi-animal modeling can significantly improve neural recording analysis. However, for functional calcium traces, existing approaches remain task-specific, limiting transfer across common neuroscience objectives. To address this challenge, we propose CalM, a self-supervised neural foundation model trained solely on neuronal calcium traces and adaptable to multiple downstream tasks, including forecasting and decoding. Our key contribution is a pretraining framework, composed of a high-performance tokenizer mapping single-neuron traces into a shared discrete vocabulary, and a dual-axis autoregressive transformer modeling dependencies along both the neural and the temporal axis. We evaluate CalM on a large-scale, multi-animal, multi-session dataset. On the neural population dynamics forecasting task, CalM outperforms strong specialized baselines after pretraining. With a task-specific head, CalM further adapts to the behavior decoding task and achieves superior results compared with supervised decoding models. Moreover, linear analyses of CalM representations reveal interpretable functional structures beyond predictive accuracy. Taken together, we propose a novel and effective self-supervised pretraining paradigm for foundation models based on calcium traces, paving the way for scalable pretraining and broad applications in functional neural analysis.

Abstract:
Credit assignment, the process of determining how changes in individual neurons and synapses influence a network’s output, is central to learning in brains and machines. Noise correlationbased methods, which estimate gradients by correlating perturbations of activity with changes in output, provide a biologically plausible solution to credit assignment but scales poorly as accurately estimating the Jacobian requires that the number of perturbations scale with network size. Moreover, isotropic noise conflicts with neurobiological observations that neural activity lies on a low-dimensional manifold. To address these drawbacks, we proposeneural manifold noise correlation(NMNC), which performs credit assignment using perturbations restricted to the neural manifold. We show theoretically and empirically that the Jacobian row space aligns with the neural manifold in trained networks, and that manifold dimensionality scales slowly with network size. NMNC substantially improves performance and sample efficiency over vanilla noise correlation in convolutional networks trained on CIFAR-10, ImageNet-scale models, and recurrent networks. NMNC also yields representations more similar to the primate visual system than vanilla noise correlation. These findings offer a mechanistic hypothesis for how biological circuits could support credit assignment, and suggest that biologically inspired constraints may enable, rather than limit, effective learning at scale.

Abstract:
The realm of Mamba for vision has been advanced in recent years to strike for the alternatives of Vision Transformers (ViTs) that suffer from the quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits noncausal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba exhibits relatively slow computational speed under short token lengths, commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding bidirectional information flow under an unidirectional scan and batch folding with periodic state reset for advanced GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.

Abstract:
Recent progress in textto-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers -- they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized to ensure faithful reasoning about the context and accurate rendering of the semantics via Dual-GRPO. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving 0.79 on WISE score, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.

Abstract:
Planning has become a central capability for contemporary agent systems in navigating complex, longhorizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems. To address this limitation, we introduce TodoEvolve, a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning architectures. Specifically, we first construct PlanFactory, a modular design space that standardizes diverse planning paradigms within a unified codebase encompassing topology, initialization, adaptation, and navigation, thereby providing a common interface for heterogeneous planning patterns. Leveraging PlanFactory, we collect high-quality planning trajectories and train Todo-14B via Impedance-Guided Preference Optimization (IGPO), a multi-objective reinforcement learning objective that encourages the generation of planning systems that are performant, stable, and token-efficient across arbitrary tasks and agent backbones. Empirical evaluations on five agentic benchmarks demonstrate that TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical API costs and runtime overhead.

Abstract:
In standard reinforcement learning (RL) settings, the interaction between the agent and the environment is typically modeled as a Markov decision process (MDP), which assumes that the agent observes the system state instantaneously, selects an action without delay, and executes it immediately. In realworld dynamic environments, such as cyber-physical systems, this assumption often breaks down due to delays in the interaction between the agent and the system. These delays can vary stochastically over time and are typicallyunobservablewhen deciding on an action. Existing methods deal with this uncertainty conservatively by assuming a known fixed upper bound on the delay, even if the delay is often much lower. In this work, we introduce theinteraction layer, a general framework that enables agents to adaptively handle unobservable and time-varying delays. Specifically, the agent generates a matrix of possible future actions, anticipating a horizon of potential delays, to handle both unpredictable delays and lost action packets sent over networks. Building on this framework, we develop a model-based algorithm,Actor-Critic with Delay Adaptation (ACDA), which dynamically adjusts to delay patterns. Our method significantly outperforms state-of-the-art approaches across a wide range of locomotion benchmark environments, including real-world measured delays.

Abstract:
We propose a minimal agentic baseline that enables systematic comparison across different AIbased theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We evaluate our baseline using qualitatively different benchmarks and compare various popular models and design choices. Our results demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. We demonstrate competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture. We release our open-source implementation as a candidate reference for future research and as an accessible prover for the community.

Abstract:
Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, hindering realworld deployment due to their massive size. Existing pruning methods (e.g., Wanda) tailored for LLMs rely heavily on manual design pruning algorithms, thereby leading to huge labor costs and requires expert knowledge. Furthermore, we are the first to identify the serious outlier value issue behind dramatic performance degradation under high pruning ratios that are caused by uniform sparsity, raising an additional concern about how to design adaptive pruning sparsity ideal for LLMs. Can LLMs prune by themselves? In this work, we introduce an affirmative answer by proposing a novel pruning method called AutoPrune, which first overcomes expert knowledge limits by leveraging LLMs to design optimal pruning algorithms for themselves automatically without any expert knowledge. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights of outlier value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to overcome the outlier value issue, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLMs benchmarks, demonstrating the superiority of AutoPrune, which consistently excels state-of-the-art competitors. The code is available at: \urlhttps://anonymous.4open.science/r/AutoPrune.

Abstract:
Distribution Matching Distillation (DMD) is a powerful acceleration paradigm, yet its stability is often compromised inForbidden Zones—regions where the real teacher provides unreliable guidance while the fake teacher exerts insufficient repulsive force. In this work, we propose a unified optimization framework that reinterprets prior art as implicit strategies to avoid these corrupted regions. Based on this insight, we introduce Adaptive Matching Distillation (AMD), a selfcorrecting mechanism that utilizes reward proxies to explicitly detect and escape Forbidden Zones. AMD dynamically prioritizes corrective gradients via structural signal decomposition and introduces Repulsive Landscape Sharpening to enforce steep energy barriers against failure mode collapse. Extensive experiments across image and video generation tasks (e.g., SDXL, Wan2.1) and rigorous benchmarks (e.g., VBench, GenEval) demonstrate that AMD significantly enhances sample fidelity and training robustness. For instance, AMD improves the HPSv2 score on SDXL from30.64to31.25, outperforming state-of-the-art baselines. These findings validate that explicitly rectifying optimization trajectories within Forbidden Zones is essential for pushing the performance ceiling of few-step generative models.

Abstract:
Polytrees are a subclass of Bayesian networks that seek to capture the conditional dependencies between a set of n variables as a directed forest and are motivated by their more efficient inference and improved interpretability. Since the problem of learning the best polytree is NPhard, we study which restrictions make it more tractable by considering for example in-degree bounds, properties of score functions measuring the quality of a polytree, and approximation algorithms. We devise an algorithm that finds the optimal polytree in time \mathcalO\big((2+\epsilon)^n\big) for arbitrarily small \epsilon > 0 and any constant in-degree bound k, improving over the fastest previously known algorithm of time complexity \mathcalO\big(3^n\big). We further give polynomial-time algorithms for finding a polytree whose score is within a factor of k from the optimal one for arbitrary scores and a factor of 2 for additive ones. Many of the results are complemented by (nearly) tight lower bounds for either the time complexity or the approximation factors.

Abstract:
Online convex optimization in nonstationary environments has garnered considerable attention in the literature. Recently, Pasteris et al. (2024) investigate online convex optimization with the optimal switching regret, defined as the sum of the static regret over each segment, where the segmentation is an arbitrary partition of the entire time horizon. For general convex functions, their work has established an optimal switching regret bound. However, it remains open whether similar bounds are attainable for other types of convex functions, such as exponentially concave or strongly convex functions. In this paper, we affirmatively answer this question by proposing a novel meta-algorithm, termed IRESET, which is used to aggregate the decisions from a group of experts. The essence of our method lies in running multiple experts over a set of intervals, and then employing a meta-algorithm equipped with second-order bounds to sequentially combine their decisions. We leverage the segment tree structure to analyze the switching regret over the entire time horizon, and offer new insights into utilizing recursive equations over the segment tree. By choosing appropriate expert-algorithms for IRESET, our methods achieve logarithmic switching regret bounds for exponentially concave or strongly convex functions, respectively.

Abstract:
Reliable uncertainty quantification at unobserved spatial locations is a key challenge in spatial statistics, particularly for complex and heterogeneous datasets. While traditional methods such as Kriging rely on strong distributional assumptions, conformal prediction (CP) offers a distributionfree alternative. However, although non-i.i.d. CP theory is well established for time-series data, a significant gap remains for spatial data, where the lack of a natural ordering and discrete index complicates theoretical guarantees. Existing CP theory for spatial data often relies on exchangeability. We propose Localized Spatial Conformal Prediction (LSCP), a model-agnostic framework that bridges this gap by coupling local quantile regression with conformal calibration. LSCP conditions on spatial neighborhoods to capture local heterogeneity. We show that LSCP retains finite-sample marginal coverage under spatial exchangeability and attains asymptotic conditional coverage under stationarity and spatial mixing. Across synthetic and real datasets, LSCP consistently achieves near-nominal coverage with tighter and more stable prediction intervals than existing methods that fail to capture these spatial dependencies.

Abstract:
Physical systems near singularities, interfaces, and critical points exhibit powerlaw scaling, yet standard neural networks leave the governing exponents implicit. We introduce physics-informed M\"untz-Sz\'asz Networks (MSN-PINN), a power-law basis network that treats scaling exponents as trainable parameters. The model outputs both the solution and its scaling structure. We prove identifiability, or unique recovery, and show that, under these conditions, the squared error between learned and true exponents scales as O(|\mu - \alpha|^2). Across experiments, MSN-PINN achieves single-exponent recovery with 1--5% error under noise and sparse sampling. It recovers corner singularity exponents for the two-dimensional Laplace equation with 0.009% error, matches the classical result of Kondrat'ev (1967), and recovers forcing-induced exponents in singular Poisson problems with 0.03% and 0.05% errors. On a 40-configuration wedge benchmark, it reaches a 100% success rate with 0.022% mean error. Constraint-aware training encodes physical requirements such as boundary condition compatibility and improves accuracy by three orders of magnitude over naive training. By combining the expressiveness of neural networks with the interpretability of asymptotic analysis, MSN-PINN produces learned parameters with direct physical meaning.

Abstract:
The operational efficacy of large language models relies heavily on their inferencetime context. This has established Context Engineering (CE) as a formal discipline for optimizing these inputs. Current CE methods rely on manually crafted harnesses, such as rigid generation-reflection workflows and predefined context schemas. They impose structural biases and restrict context optimization to a narrow, intuition-bound design space. To address this, we introduce Meta Context Engineering (MCE), a bi-level framework that supersedes static CE heuristics by co-evolving CE skills and context artifacts. In MCE iterations, a meta-level agent refines engineering skills via agentic crossover, a deliberative search over the history of skills, their executions, and evaluations. A base-level agent executes these skills, learns from training rollouts, and optimizes context as flexible files and code. We evaluate MCE across five disparate domains under offline and online settings. MCE demonstrates consistent performance gains, achieving 5.6--53.8% relative improvement over state-of-the-art agentic CE methods (mean of 16.9%), while maintaining superior context adaptability, transferability, and efficiency in both context usage and training.

Abstract:
Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM. However, existing methods still rely on autoregressive drafting, which remains sequential and constrains practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. We show that speculative decoding provides a natural and effective setting for diffusion models. By generating draft tokens in a single forward pass, DFlash enables efficient drafting, and by conditioning the draft model on context features extracted from the target model, it achieves highquality drafts with improved acceptance rates. Experiments demonstrate that DFlash achieves more than 6× lossless acceleration across a range of models and tasks, delivering up to 2.5× higher speedup than the state-of-the-art speculative decoding method EAGLE-3.

Abstract:
Bayesian optimization (BO) has traditionally solved blackbox problems where function evaluation is expensive and, therefore, observations are few. Recently, however, there has been growing interest in applying BO to problems where function evaluation is cheaper and observations are more plentiful. In this regime, scaling to many observations N is impeded by Gaussian-process (GP) surrogates: GP hyperparameter fitting scales as \mathcalO(N^3) (reduced to roughly \mathcalO(N^2) in modern implementations), and it is repeated at every BO iteration. Many methods improve scaling at acquisition time, but hyperparameter fitting still scales poorly, making it the bottleneck. We propose Epistemic Nearest Neighbors (ENN), a lightweight alternative to GPs that estimates function values and uncertainty (epistemic and aleatoric) from K-nearest-neighbor observations. ENN scales as \mathcalO(N) for both fitting and acquisition. Our BO method, TuRBO-ENN, replaces the GP surrogate in TuRBO with ENN and its Thompson-sampling acquisition with \mathrmUCB = \mu(x) + \sigma(x). For the special case of noise-free problems, we can omit fitting altogether by replacing \mathrmUCB with a non-dominated sort over \mu(x) and \sigma(x). We show empirically that TuRBO-ENN reduces proposal time (i.e., fitting time + acquisition time) by one to two orders of magnitude compared to TuRBO at up to 50,000 observations.

Abstract:
We propose a trainingfree conditional sampling method for flow matching models based on importance sampling. Because a naïve application of importance sampling suffers from weight degeneracy in high-dimensional settings, we modify and incorporate a resampling technique in sequential Monte Carlo (SMC) during intermediate stages of the generation process. To encourage generated samples to diverge along distinct trajectories, we derive a stochastic flow with adjustable noise strength to replace the deterministic flow at the intermediate stage. Our framework requires no additional training, while providing theoretical guarantees of asymptotic accuracy. Experimentally, our method significantly outperforms existing approaches on conditional sampling tasks for MNIST and CIFAR-10. We further demonstrate the applicability of our approach in higher-dimensional, multimodal settings through text-to-image generation experiments on CelebA-HQ.

Abstract:
Unsupervised Environment Design (UED) offers a promising paradigm for improving reinforcement learning generalization by adaptively shaping training environments, but it requires reliable environment evaluation to remain effective. However, existing UED methods evaluate environments using indirect proxy signals such as regret, valuebased errors, or Monte Carlo, which suffer from bias, high variance, or substantial computational overhead and fail to reflect agent realized learning progress. To address these limitations, we propose Parameter Change Environment Design (PACE), which evaluates an environment through the policy parameter change induced by training on that environment, directly grounding environment selection in realized learning progress. Specifically, PACE assigns environment value using a first-order approximation of the policy optimization objective, where the improvement induced by an environment is proportional to the squared \ell_2 norm of the corresponding parameter update, enabling low-variance and computation-efficient evaluation without additional rollouts. Experiments on MiniGrid and Craftax show that PACE consistently outperforms established UED baselines, achieving higher IQM and smaller Optimality Gap on OOD evaluations, including an IQM of 96.4% and an Optimality Gap of 17.2% on MiniGrid.

Abstract:
While deep learning is facing an homogenization across modalities led by Transformers, they are still challenged by shallow linear models in the timeseries forecasting task. Our hypothesis is that models should learn a direct link from past to future data points, which we identify as a learning dynamics capability. We develop an original \textttPRO-DYN nomenclature to analyze existing models through the lens of dynamics. Two observations thus emerge: 1. under-performing architectures learn dynamics at most partially, 2. the location of the dynamics block at the model end is of prime importance. Our systemic and empirical studies both confirm our observations on a set of performance-varying models with diverse backbones. We propose a simple plug-and-play methodology guiding model designs and improvements.

Abstract:
Modern offline Reinforcement Learning (RL) methods find performant actorcritics, however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor–critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.

Abstract:
Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered reward functions which combine many penalties and incentives for a range of (un)desirable states and costly actions. Dense rewards help alleviate the challenge of exploring complex environments but risk biasing agents towards suboptimal and potentially riskier solutions, a critical issue in complex cyber environments. We thoroughly evaluate the impact of reward function structure on learning and policy behavioural characteristics using a variety of sparse and dense reward functions, two wellestablished cyber gyms, a range of network sizes, and both policy gradient and value-based RL algorithms. Our evaluation is enabled by a novel ground truth evaluation approach which allows directly comparing between different reward functions, illuminating the nuanced inter-relationships between rewards, action space and the risks of suboptimal policies in cyber environments. Our results show that sparse rewards, provided they are goal aligned and can be encountered frequently, uniquely offer both enhanced training reliability and more effective cyber defence agents with lower-risk policies. Surprisingly, sparse rewards can also yield policies that are better aligned with cyber defender goals and make sparing use of costly defensive actions without explicit reward-based numerical penalties.

Abstract:
Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. BradleyTerry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.

Abstract:
Graph generation aims to sample discrete node and edge attributes while satisfying coupled structural constraints. Diffusion models for graphs often adopt largely factorized forwardnoising, and many flow-matching methods start from factorized reference noise and coordinate-wise interpolation, so node–edge coupling is not encoded by the generative geometry and must be recovered implicitly by the core network, which can be brittle after discrete decoding. Bayesian Flow Networks (BFNs) evolve distribution parameters and naturally support discrete generation. But classical BFNs typically rely on factorized beliefs and independent channels, which limit geometric evidence fusion. We propose Variational Bayesian Flow Network (VBFN), which performs a variational lifting to a tractable joint Gaussian variational belief family governed by structured precisions. Each Bayesian update reduces to solving a symmetric positive definite linear system, enabling coupled node and edge updates within a single fusion step. We construct sample-agnostic sparse precisions from a representation-induced dependency graph, thereby avoiding label leakage while enforcing node–edge consistency. On synthetic and molecular graph datasets, VBFN improves fidelity and diversity, and surpasses baseline methods. The reproducible code is available at https://anonymous.4open.science/r/VBFN-D6B0/.

Abstract:
Augmenting VisionLanguage-Action models (VLAs) with world models is promising for robotic policy learning but faces challenges in jointly predicting states and actions due to the modality gap. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework featuring a multimodal diffusion transformer that maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, DUST utilizes independent noise perturbations and a decoupled flow matching loss to learn cross-modal causal relationships. We further introduce an asynchronous sampling method for action and vision tokens that enhances performance through inference-time scaling. Experimental results on simulated benchmarks like RoboCasa and GR-1 show that DUST achieves up to 6% gains over state-of-the-art VLA and world-modeling baselines, with inference-time scaling providing an additional 2–5% improvement. In real-world tasks using the Franka Research 3, DUST outperforms baselines by 10% in success rate. Finally, we demonstrate that DUST enables effective transfer learning through both pretraining on action-free videos and joint-training with heterogeneous robot and human datasets.

Abstract:
Contextaided forecasting (CAF) holds promise for integrating domain knowledge and forward-looking information, enabling AI systems to surpass traditional statistical methods. However, recent empirical studies reveal a puzzling gap: multimodal models often fail to outperform their unimodal counterparts. We hypothesize that this underperformance stems from poor context quality in existing datasets, as verification is challenging. To address these limitations, we introduce a semi-synthetic data augmentation method that generates contexts both descriptive of temporal dynamics and verifiably complementary to numerical histories. This approach enables massive-scale dataset creation, resulting in CAF-7M, a corpus of 7 million context-augmented time series windows, including a rigorously verified test set. We demonstrate that semi-synthetic pre-training transfers effectively to real-world evaluation, and show clear evidence of context utilization. Our results suggest that dataset quality, rather than architectural limitations, has been the primary bottleneck in context-aided forecasting.

Abstract:
Popular offline reinforcement learning (RL) methods rely on conservatism, penalizing outof-dataset actions or restricting rollout horizons. We question the universality of this principle and revisit a complementary Bayesian perspective. By modeling a posterior over plausible world models and training a history-dependent agent to maximize expected return, the Bayesian approach directly addresses epistemic uncertainty and enables test-time generalization, without conservatism. We first illustrate in a bandit setting that Bayesianism excels on low-quality datasets where conservatism fails. Scaling to realistic tasks, we find that long-horizon rollouts are essential to control value overestimation once conservatism is removed. We introduce design choices that enable learning from long-horizon rollouts while mitigating compounding errors, yielding our algorithm, NEUBAY, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, NEUBAY is competitive with leading conservative algorithms, achieving new state-of-the-art on 7 datasets with rollout horizons of several hundred steps. Finally, we characterize datasets by quality and coverage to identify when NEUBAY is preferable to conservative methods.

Abstract:
An AI agent will learn a desired goal more effectively if it does not resist the training process, but many partially learned goals incentivize an AI to avoid further goal updates. We would like goals to be corrigible, meaning they allow requested changes, so that we can confidently correct errors and shut down the AI if necessary. Despite this being a crucial safety property, the existing literature does not specify goals that are both corrigible and competitive with alternatives. We introduce a transformation that constructs a corrigible version of nearly any goal, without sacrificing performance. This is done by eliciting predictions of reward conditional on costlessly preventing updates, and having that target be pursued myopically. These goals are then shown to lead to optimal performance among the class of corrigible goals, incentivize allowing midaction overrides, disincentivize deliberate self-modification, and induce corrigible behavior in gridworld settings.

Abstract:
We propose Ambient Dataloops, an iterative framework for refining datasets that makes it easier for diffusion models to learn the underlying data distribution. Modern datasets contain samples of highly varying quality, and training directly on such heterogeneous data often yields suboptimal models. We propose a datasetmodel co-evolution process; at each iteration of our method, the dataset becomes progressively higher quality, and the model improves accordingly. To avoid destructive self-consuming loops, at each generation, we treat the synthetically improved samples as noisy, but at a slightly lower noisy level than the previous iteration, and we use Ambient Diffusion techniques for learning under corruption. Empirically, Ambient Dataloops achieve state-of-the-art performance in unconditional and text-conditional image generation and de novo protein design. We further provide a theoretical justification for the proposed framework that captures the benefits of the data looping procedure.

Abstract:
Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multimodal behaviors. However, existing methods face a stark trade-off: slow, iterative models like diffusion policies are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. In this paper, we demonstrate that it is possible to bridge this gap. The key to moving beyond the limitations of individual methods, we argue, lies in a unifying perspective that views modern generative models—including diffusion, flow matching, and consistency models—as specific instances of learning a continuous-time generative trajectory governed by an Ordinary Differential Equation (ODE). This principled foundation provides a clearer design space for generative policies in RL and allows us to proposeGenerative Trajectory Policies(GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline RL, we further introduce two key theoretically principled adaptations. Empirical results demonstrate that GTP achieves state-of-the-art performance on D4RL benchmarks -- it significantly outperforms prior generative policies, achieving perfect scores on several notoriously hard AntMaze tasks.

Abstract:
With the rapid evolution of Large Language Models (LLMs), multiround workflows, such as autonomous agents and iterative retrieval, have become increasingly prevalent. However, this raises hurdles for serving LLMs under prefill-decode (PD) disaggregation, a widely adopted paradigm that separates the compute-bound prefill phase and memory-bound decode phase onto individual resources. Specifically, existing systems overlook the interleaved prefill-decode workload pattern in multi-round inference, leading to sub-optimal handling of the incremental prefill workloads and model deployment for the two phases. In this work, we present AMPD, a brand new disaggregated serving framework for multi-round LLM inference. The core of AMPD is to coordinate the prefill workloads based on real-time workloads by adaptively determiningwhereto carry out these workloads andhowthey are scheduled, in order to maximize service level objective (SLO) attainment. In addition, we tailor a planning algorithm for our scenario, facilitating the deduction of optimal resource allocation and parallel strategies for the two phases. Empirical results demonstrate that AMPD substantially improves SLO attainment compared to state-of-the-art baselines.

Abstract:
When the color distribution of input images changes at inference, the performance of conventional neural network architectures drops considerably. A few researchers have begun to incorporate prior knowledge of color geometry in neural network design. These color equivariant architectures have modeled hue variation with 2D rotations, and saturation and luminance transformations as 1D translations. While this approach improves neural network robustness to color variations in a number of contexts, we find that approximating saturation and luminance (interval valued quantities) as 1D translations introduces appreciable artifacts. In this paper, we introduce a color equivariant architecture that is truly equivariant. Instead of approximating the interval with the real line, we lift values on the interval to values on the circle (a doublecover) and build equivariant representations there. Our approach resolves the approximation artifacts of previous methods, improves interpretability and generalizability, and achieves better predictive performance than conventional and equivariant baselines on tasks such as fine-grained classification and medical imaging tasks. Going beyond the context of color, we show that our proposed lifting can also extend to geometric transformations such as scale.

Abstract:
Foundation models compress a large amount of information in a single, large neural network, which can then be queried for individual tasks. There are strong parallels between this widespread framework and offline goalconditioned reinforcement learning algorithms: a universal value function is trained on a large number of goals, and the policy is evaluated on a single goal in each test episode. Extensive research in foundation models has shown that performance can be substantially improved through test-time training, specializing the model to the current goal. We find similarly that test-time offline reinforcement learning on experience related to the test goal can lead to substantially better policies at modest compute costs. We propose a novel self-supervised data selection criterion, which selects transitions from an offline dataset according to their relevance to the current state and quality with respect to the evaluation goal. We demonstrate across a wide range of high-dimensional loco-navigation and manipulation tasks that fine-tuning a policy on the selected data for a few gradient steps leads to significant performance gains over standard offline pre-training. Our goal-conditioned test-time training (GC-TTT) algorithm applies this routine in a receding-horizon fashion during evaluation, adapting the policy to the current trajectory as it is being rolled out. Finally, we study compute allocation at inference, demonstrating that, at comparable costs, GC-TTT induces performance gains that are not achievable by scaling model size.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. Recent work ProRL \citepliu2025prorl has shown promise in scaling RL by increasing the number of training steps. However, performance plateaus after thousands of steps, with clear diminishing returns from allocating more computation to additional training. In this work, we investigate a complementary paradigm for scaling RL: BroRL—increasing the number of rollouts per example to hundreds to exhaustively Broaden exploration, which yields continuous performance gains beyond the saturation point observed in ProRL when scaling the number of training steps. Our approach is motivated by a mass balance equation analysis allowing us to characterize the rate of change in probability mass for correct and incorrect tokens during the reinforcement process. We show that under a onestep RL assumption, sampled rollout tokens contribute to correct-mass expansion, while unsampled tokens outside rollouts may lead to gains or losses depending on their distribution and the net reward balance. Importantly, as the number of rollouts per example N increases, the effect of unsampled terms diminishes, making overall correct-mass expansion more likely. To validate our theoretical analysis, we conduct simulations under more relaxed conditions and find that a sufficiently large rollout size N—corresponding to ample exploration—can increase the probability mass of correct tokens broadly, and in our simulator it increases all correct-token probabilities and eliminates knowledge shrinkage. Empirically, BroRL revives models saturated after 3K ProRL training steps and demonstrates robust, continuous improvement, achieving strong results for the 1.5B model across diverse benchmarks. Notably, under the same training time, BroRL is both more data- and compute-efficient: large-N rollouts reduce the number of filtered samples during dynamic sampling at the algorithmic level and nearly double generation throughput compared to ProRL in our hardware setup; this throughput increase is consistent with shifting generation from a more memory-bound regime toward a more compute-bound one.

Abstract:
Controlling generative models—whether via inferencetime steering or fine-tuning—is expensive. Control relies on estimating the value function—typically necessitating costly trajectory simulations. To eliminate this bottleneck, we introduce Meta Flow Maps (MFMs), stochastic extensions of consistency models and flow maps. MFMs are trained to perform one-step posterior sampling, generating arbitrarily many i.i.d. draws of clean data x_1 from any noisy state x_t. Crucially, these samples are differentiable in the conditioning state x_t, unlocking efficient estimation of the value function gradient. We leverage this capability to enable both inference-time steering without inner rollouts, and unbiased, off-policy fine-tuning to general rewards. Among our fine-tuning and steering experiments on ImageNet, we highlight that our single-particle steered-MFM sampler outperforms a Best-of-1000 baseline across multiple rewards at a fraction of the compute.

Abstract:
Online structured prediction, including online classification as a special case, is the task of sequentially predicting labels from input features. In this setting, the surrogate regret—the cumulative excess of the actual target loss (e.g., 0–1 loss) over the surrogate loss (e.g., logistic loss) incurred by the best fixed estimator—has gained attention because it admits a finite bound independent of the time horizon T. However, such guarantees break down in nonstationary environments, where every fixed estimator may incur surrogate loss that grows linearly with T. To address this limitation, we obtain an upper bound of F_T + O(1 + P_T) on the cumulative target loss, where F_T is the cumulative surrogate loss of any comparator sequence and P_T is its path length. This bound depends on T only through F_T and P_T, thus offering stronger guarantees under non-stationarity. Our core idea is to combine the dynamic regret analysis of online gradient descent (OGD) with the exploit-the-surrogate-gap technique. This viewpoint sheds light on the usefulness of a Polyak-style learning rate for OGD, which systematically yields target-loss bounds and performs well empirically. We then extend our approach to broader settings beyond prior work via the convolutional Fenchel–Young loss. Finally, a lower bound shows that the dependence on F_T and P_T is tight.

Abstract:
While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this tradeoff. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables ahighly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular dynamics, property prediction and generation (OMol25, ProteinMD, QM9), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.

Abstract:
World models simulate environment dynamics from raw sensory inputs like video. However, using them for planning can be challenging due to the vast and unstructured search space. We propose a robust and highly parallelizable planner that leverages the differentiability of the learned world model for efficient optimization, solving longhorizon control tasks from visual input. Our method treats states as optimization variables ("virtual states") with soft dynamics constraints, enabling parallel computation and easier optimization. To facilitate exploration and avoid local optima, we introduce stochasticity into the states. To mitigate sensitive gradients through high-dimensional vision-based world models, we modify the gradient structure to descend towards valid plans while only requiring action-input gradients. Our approach can be viewed as a stochastic version of a non-condensed or collocation-based optimal controller. We provide theoretical justification and experiments on video-based world models, where our resulting planner outperforms existing planning algorithms like the cross-entropy method (CEM) and vanilla gradient-based optimization (GD) on long-horizon experiments, both in success rate and time to convergence.

Abstract:
VisionLanguage Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird’s-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.

Abstract:
One of the most pressing challenges in artificial intelligence is to make models more transparent to their users. Recently, explainable artificial intelligence has come up with numerous method to tackle this challenge. A promising avenue is to use conceptbased explanations, that is, high-level concepts instead of plain feature importance score. Among this class of methods, Concept Activation Vectors CAVs, Kim et al. (2018) stands out as one of the main protagonists. One interesting aspect of CAVs is that their computation requires sampling random examples in the train set. Therefore, the actual vectors obtained may vary depending on the randomness of this sampling. In this paper, we propose a fine-grained theoretical analysis of CAV construction in order to quantify their variability. Our results, confirmed by experiments on several real-life datasets of four different modalities, point out towards an universal result: the variance of CAVs decreases as 1/N, where N is the number of random examples. Based on this we give practical recommendations for a resource-efficient application of the method.

Abstract:
While coding agents have advanced rapidly, progress on multimodal agents has lagged behind, largely due to a gap between the unimodal nature of code and other multimodal computer applications. Game development bridges the modality gap, mirroring software development's complexity in terms of large codebases and contextual complexity, while simultaneously requiring multimodal understanding. We present GameDevBench, the first benchmark for evaluating agents on game development tasks, consisting of 168 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complexthe average solution requires more than three times the amount of changes compared to software development benchmarks. Agents still struggle with game development, with the best agent solving only 50.0% of tasks. We further introduce two simple image and video-based feedback methods, nearly doubling performance in one setting from 25.6% to 44.4%. We find that performance degrades sharply with multimodal complexity, dropping on average from 44.4% pass@1 on gameplay oriented tasks to 24.3% on graphics tasks.

Abstract:
Although modern LLMs are aligned with human values during posttraining, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques. Overall, our work provides a practical and adaptable solution for safe, robust, and efficient LLM deployment.

Abstract:
Deploying multimodal models in realworld scenarios requires generalization to new environments where recording conditions differ from training, a challenge known as multimodal domain generalization (MMDG). Standard architectures employ separate encoders for each modality and a fusion module, training the system end-to-end by optimizing on the fused features. In this paper, we identify that such joint optimization causes encoders to exploit cross-modal co-occurrences, statistical relationships between modalities that arise from source-specific recording conditions, rather than learning domain-invariant features. We term this failure mode Fusion Overfitting. To address this, we propose Modality-Entropy Regularization for Domain Generalization (MER-DG), which maximizes the entropy of each encoder's feature distribution to preserve feature diversity. MER-DG is architecture-agnostic and integrates into existing multimodal frameworks as an additive loss term. Extensive experiments on EPIC-Kitchens and HAC benchmarks demonstrate average improvements of ～5% over standard fusion and ～2% over state-of-the-art methods.

Abstract:
Controllable diffusion generation often relies on various heuristics that are seemingly disconnected without a unified understanding. We bridge this gap with Diffusion Controller (DiffCon), a unified controltheoretic view that casts reverse diffusion sampling as state-only stochastic control within (generalized) linearly-solvable Markov Decision Processes (LS-MDPs). Under this framework, control acts by reweighting the pretrained reverse-time transition kernels, balancing terminal objectives against an f-divergence cost. From the resulting optimality conditions, we derive practical reinforcement learning methods for diffusion fine-tuning: (i) f-divergence-regularized policy-gradient updates, including a PPO-style rule, and (ii) a regularizer-determined reward-weighted regression objective with a minimizer-preservation guarantee under the Kullback–Leibler (KL) divergence. The LS-MDP framework further implies a principled model form: the optimal score decomposes into a fixed pretrained baseline plus a lightweight control correction, motivating a side-network parameterization conditioned on exposed intermediate denoising outputs, enabling effective \emphgray-box adaptation with a frozen backbone. Experiments on Stable Diffusion v1.4 across supervised and reward-driven finetuning show consistent gains in preference-alignment win rates and improved quality–efficiency trade-offs versus gray-box baselines and even the parameter-efficient white-box adapter LoRA.

Abstract:
Large language models (LLMs) demand substantial computational and memory resources, posing challenges for efficient deployment. Two complementary approaches have emerged to address these issues: tokenadaptive layer execution, which reduces floating-point operations (FLOPs) by selectively bypassing layers, and quantization, which lowers memory footprint by reducing weight precision. However, naively integrating these techniques leads to additional accuracy degradation due to reduced redundancy in token-adaptive models. We propose QTALE (Quantization-Robust Token-Adaptive Layer Execution for LLMs), a novel framework that enables seamless integration of token-adaptive execution with quantization while preserving accuracy. Conventional token-adaptive methods reduce redundancy in two ways: (1) by limiting the diversity of training paths explored during fine-tuning, and (2) by lowering the number of parameters actively involved in inference. To overcome these limitations, QTALE introduces two key components: (1) a training strategy that ensures diverse execution paths are actively explored during fine-tuning, and (2) a post-training mechanism that allows flexible adjustment of the execution ratio at inference to reintroduce redundancy when needed. Experimental results show that QTALE enables seamless integration of token-adaptive layer execution with quantization, showing no noticeable accuracy difference, with the gap to quantization-only models kept below 0.5% on CommonsenseQA benchmarks. By combining token-adaptive execution for FLOPs reduction and quantization for memory savings, QTALE provides an effective solution for efficient LLM deployment.

Abstract:
Supervised finetuning with expert demonstrations often produces models that imitate outputs without internalizing the reasoning processes needed for robust generalization. While critique-based approaches show promise, training models to generate critiques directly, such as Critique Fine-Tuning (CFT), can lead to output-format drift and degradation of general capabilities. We propose Critique-Guided Distillation (CGD), a training framework that decouples critique consumption from critique generation. During fine-tuning, the student is trained to refine flawed responses conditioned on teacher critiques. CGD treats critiques as a training-time-only supervision signal, encouraging internalization of error-aware reasoning: critiques guide learning but are absent at inference. Across five model families, CGD consistently outperforms CFT and standard distillation on mathematical reasoning benchmarks, yielding 7% average improvements and gains of up to +15.0% on AMC23 and +12.2% on MATH-500. On challenging competition problems such as AIME24 and AIME25, CGD achieves substantially higher Pass@1 and stronger performance at low Pass@k, indicating improved reasoning quality per sample. Importantly, CGD preserves general instruction-following capabilities where CFT degrades significantly (-21.3% on IFEval). These results position CGD as a practical and compute-efficient intermediate training paradigm for reasoning-centric tasks without introducing inference-time overhead.

Abstract:
Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduceSelfFlow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism,Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives learning strong representations alongside generative capabilities without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.

Abstract:
Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and crosslingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce a simple and efficient LID method based on the probabilistic framing, parameter estimation technique and inference strategy of the UnigramLM tokenization algorithm. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID-M, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings--surpassing 70% accuracy with as few as five labeled samples per language---and delivers large gains on fine-grained dialect identification.

Abstract:
Many successful families of generative models leverage a lowdimensional latent distribution that is mapped to a data distribution. Though simple latent distributions are often used, the choice of distribution has a strong impact on model performance. Recent experiments have suggested that the probability distributions produced by quantum processors, which are typically highly correlated and classically intractable, can lead to improved performance on some datasets. However, when and why latent distributions produced by quantum processors can improve performance, and whether these improvements are connected to quantum properties of these distributions, are open questions that we investigate in this work. We show in theory that, under certain conditions, these "quantum latent distributions" enable generative models to produce data distributions that classical latent distributions cannot efficiently produce. We provide intuition as to the underlying mechanisms that could explain a performance advantage on real datasets. Based on this, we perform extensive benchmarking on a synthetic quantum dataset and the QM9 molecular dataset, using both simulated and real photonic quantum processors. We find that the statistics arising from quantum interference lead to improved generative performance compared to classical baselines, suggesting that quantum processors can play a role in expanding the capabilities of deep generative models.

Abstract:
Representation learning is central to graph machine learning, powering tasks such as link prediction and node classification. However, most graph embeddings are hard to interpret, offering limited insight into how learned features relate to graph structure. Many networks naturally admit a rolemixture view, where nodes are best described as mixtures over latent archetypal factors. Motivated by this structure, we propose a compositional graph embedding framework grounded in Aitchison geometry, the canonical geometry for comparing mixtures. Nodes are represented as simplex-valued compositions and embedded via isometric log-ratio (ILR) coordinates, which preserve Aitchison distances while enabling unconstrained optimization in Euclidean space. This yields intrinsically interpretable embeddings whose geometry reflects relative trade-offs among archetypes and supports coherent behavior under component restriction; we consider both fixed and learnable ILR bases. Across node classification and link prediction, our method achieves competitive performance with strong baselines while providing explainability by construction rather than post hoc. Finally, subcompositional coherence enables principled component restriction: removing and renormalizing subsets preserves a well-defined geometry, which we exploit via subcompositional dimensionality removal to probe how archetype groups influence representations and predictions.

Abstract:
Tokenization is a key component of autoregressive generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and autoregressive generation commonly predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through testtime search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. We find that autoregressive models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an autoregressive model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we also systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and autoregressive priors in guiding the generation.

Abstract:
Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on dialogue centric, human agent interactions. In reality, agent memory consists of a continuous stream of agent environment interactions that are primarily composed of machine generated representations. To bridge this gap, we introduce AMA Bench (Agent Memory with Any Length) to evaluate long horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real world agentic trajectories across representative agentic applications paired with expert curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons paired with rule based QA. Our comprehensive study shows that existing memory systems underperform on AMA Bench primarily because they suffer from a loss of causality and objective information, and are constrained by the lossy nature of similarity based retrieval employed by many memory systems. To address these limitations, we propose AMA Agent, an effective memory system featuring a causality graph and tool augmented retrieval. Our results demonstrate that AMA Agent achieves 57.22% average accuracy on AMA Bench, surpassing the strongest memory system baselines by 11.16%.

Abstract:
We develop horizonaware anytime-valid tests and confidence sequences for bounded means under a strict deadline N. Using the betting/e-process framework, we cast horizon-aware betting as a finite-horizon optimal control problem with state space (t, \log W_t), where t is the time and W_t is the test martingale value. We first show that in certain interior regions of the state space, policies that deviate significantly from Kelly betting are provably suboptimal, while Kelly betting reaches the threshold with high probability. We then identify sufficient conditions showing that outside this region, more aggressive betting than Kelly can be better if log-wealth is low or time is short, and less aggressive can be better if log-wealth is high. Taken together these results suggest a simple phase diagram in the (t, \log W_t) plane, delineating regions where Kelly, fractional Kelly, and aggressive betting may be preferable. Guided by this phase diagram, we introduce a Deep Reinforcement Learning approach based on a universal Deep Q-Network (DQN) agent that learns a single policy from synthetic experience and maps simple statistics of past observations to bets across horizons and null values. In limited-horizon experiments, the learned DQN policy outperforms state-of-the-art baselines.

Abstract:
With the rise of reasoning language models and testtime scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution, however, allocates the same compute budget for each prompt. Grounded on the assumption that different prompts carry different degrees of complexity, and thus different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance. EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and then reallocates the saved compute budget to the instances where exploration of alternative paths is most needed. We find that across multiple open-source models on complex reasoning benchmarks such as AIME 2025, EAGer can reallocate the budget without accessing target labels, achieving the best efficiency-performance trade-off in terms of reasoning length and Pass@k. When target labels are accessible, EAGer generates up to 65% fewer tokens (hence saving compute) and achieves up to 37% improvement in Pass@k compared to the Full Parallel Sampling. Our results show that EAGer consistently maximizes the efficiency-performance trade-off by enabling dynamic control over computation expenditure.

Abstract:
Selfattention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.

Abstract:
Mixtureof-Experts (MoE) architectures scale Large Language Models via expert specialization induced by conditional computation. In practice, however, expert specialization often fails: some experts become functionally similar, while others functioning as de facto shared experts, limiting the effective capacity and model performance. In this work, we analysis from a spectral perspective on parameter and gradient spaces, uncover that (1) experts share highly overlapping dominant spectral components in their parameters, (2) dominant gradient subspaces are strongly aligned across experts, driven by ubiquitous low-rank structure in human corpus, and (3) gating mechanisms preferentially route inputs along these dominant directions, further limiting specialization. To address this, we propose Spectral-Decoupled MoE (SD-MoE), which decomposes both parameter and gradient in the spectral space. SD-MoE improves performance across downstream tasks, enables effective expert specialization, incurring minimal additional computation, and can be seamlessly integrated into a wide range of existing MoE architectures, including Qwen and DeepSeek.

Abstract:
As LLMpowered agents have been used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment. Yet, existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration. To address this, we establishGLEAN, an agent verification framework withGuideLine-groundedEvidenceAccumulatioNthat compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals. GLEAN evaluates the step-wise alignment with domain guidelines and aggregates multi-guideline ratings into surrogate features, which are accumulated along the trajectory and calibrated into correctness probabilities using Bayesian logistic regression. Moreover, the estimated uncertainty triggers active verification, which selectively collects additional evidence for uncertain cases via expanding guideline coverage and performing differential checks. We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset, surpassing the best baseline by 12% in AUROC and 50% in Brier score reduction, which confirms the effectiveness in both discrimination and calibration. In addition, the expert study with clinicians recognizes GLEAN's utility in practice.

Abstract:
Large Language Models (LLMs) have shown promising results in automating formal verification. However, existing approaches treat proof generation as a static, endto-end prediction over source code, relying on limited verifier feedback and lacking access to concrete program behaviors. We present EXVERUS, a counterexample-guided framework that enables LLMs to reason about proofs using behavioral feedback via counterexamples. When a proof fails, EXVERUS automatically generates and validates counterexamples, and then guides the LLM to generalize them into inductive invariants to block these failures. Our evaluation shows that EXVERUS significantly improves proof accuracy, robustness, and token efficiency over the state-of-the-art prompting-based Verus proof generator.

Abstract:
Multimodal visual object tracking can be divided into to several kinds of tasks (e.g. RGB and RGB+X tracking), based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models to adapt to new modalities, which limits efficiency, scalability, and usability. Thus, we introduce OneTrackerV2, a unified multimodal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multi-modal information into a unified space, allowing flexible modality fusion and improved robustness to corrupted modalities. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multi-modal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks, while maintaining high inference efficiency. Notably, even after model compression, OneTrackerV2 retains strong performance. Moreover, OneTrackerV2 demonstrates remarkable robustness under modality-missing scenarios.

Abstract:
Accelerating exploration in visual reinforcement learning under sparse rewards remains challenging due to the substantial taskirrelevant variations. Despite advances in intrinsic exploration, many methods either assume access to low-dimensional states or lack task-aware exploration strategies, thereby rendering them fragile in visual domains. To bridge this gap, we present TEB, a Task-aware Exploration approach that tightly couples task-relevant representations with exploration through a predictive Bisimulation metric. Specifically, TEB leverages the metric not only to learn behaviorally grounded task representations but also to measure behaviorally intrinsic novelty over the learned latent space. To realize this, we first theoretically mitigate the representation collapse of degenerate bisimulation metrics under sparse rewards by internally introducing a simple but effective predicted reward differential. Building on this robust metric, we design potential-based exploration bonuses, which measure the relative novelty of adjacent observations over the latent space. Extensive experiments on MetaWorld and Maze2D show that TEB achieves superior exploration ability and outperforms recent baselines.

Abstract:
Recent advances in textto-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under complex textual inputs remains a significant challenge. In response to this challenge, we present LoCoT2V-Bench, a benchmark for long video generation (LVG) featuring multi-scene prompts with hierarchical metadata (e.g., character settings and camera behaviors), constructed from collected real-world videos. We further propose LoCoT2V-Eval, a multi-dimensional framework covering perceptual quality, text-video alignment, temporal quality, dynamic quality, and Human Expectation Realization Degree (HERD), with an emphasis on aspects such as fine-grained text-video alignment and temporal character consistency. Experiments on 13 representative LVG models reveal pronounced capability disparities across evaluation dimensions, with strong perceptual quality and background consistency but markedly weaker fine-grained text-video alignment and character consistency. These findings suggest that improving prompt faithfulness and identity preservation remains a key challenge for long-form video generation.

Abstract:
Evidential Deep Learning (EDL) enables singlepass uncertainty estimation by predicting Dirichlet evidence, but it can remain overconfident and poorly calibrated, and it often fails to represent multi-modal epistemic uncertainty. We introduce Gated Evidential Mixtures (GEM), a family of models that learns an in-model energy signal and uses it to gate evidential outputs end-to-end in a distance-aware manner. GEM-CORE learns a feature-level energy and maps it to a bounded gate that smoothly suppresses evidence when support is low. To capture epistemic multi-modality without multi-pass ensembling, GEM-MIX adds a lightweight mixture of evidential heads with learned routing weights while preserving single-pass inference. Finally, GEM-FI stabilizes mixture allocations via a Fisher-informed regularizer, reducing head collapse and producing smoother boundary uncertainty. Across image classification and OOD detection benchmarks, GEM improves calibration and ID/OOD separation with single-pass inference. On CIFAR-10, GEM-FI vs. DAEDL improves Acc. 93.75 to 91.11 (+2.64 pp), reduces Brier×100 6.81 to 14.27 (-7.46), and also improves misclassification-detection (AUPR) 99.94 to 99.08 (+0.86). For epistemic OOD detection, GEM-FI achieves AUPR/AUROC of 92.59/95.09 on CIFAR-10\rightarrowSVHN and 90.20/89.06 on CIFAR-10\rightarrowCIFAR-100 (vs. 85.54/89.30 and 88.19/86.10 for DAEDL).

Abstract:
Large Language Models have shown strong capabilities in complex problem solving, yet many agentic systems remain difficult to interpret and control due to opaque internal workflows. While some frameworks offer explicit architectures for collaboration, many deployed agentic systems operate as black boxes to users. We address this by introducing Agentic Workflow Reconstruction (AWR), a new task aiming to synthesize an explicit, interpretable standin workflow that approximates a black-box system using only input--output access. We propose AgentXRay, a search-based framework that formulates AWR as a combinatorial optimization problem over discrete agent roles and tool invocations in a chain-structured workflow space. Unlike model distillation, AgentXRay produces editable white-box workflows that match target outputs under an observable, output-based proxy metric, without accessing model parameters. To navigate the vast search space, AgentXRay employs Monte Carlo Tree Search enhanced by a scoring-based Red-Black Pruning mechanism, which dynamically integrates proxy quality with search depth. Experiments across diverse domains demonstrate that AgentXRay achieves higher proxy similarity and reduces token consumption compared to unpruned search, enabling deeper workflow exploration under fixed iteration budgets.

Abstract:
Common deep learning approaches for antibody engineering focus on modeling the marginal distribution of sequences. By treating sequences as independent samples, however, these methods overlook affinity maturation as a rich and largely untapped source of information about the evolutionary process by which antibodies explore the underlying fitness landscape. In contrast, classical phylogenetic models explicitly represent evolutionary dynamics but lack the expressivity to capture complex epistatic interactions. We bridge this gap withCoSiNE, a continuoustime Markov chain parameterized by a deep neural network. Mathematically, we prove that CoSiNE provides a first-order approximation to the intractable sequential point mutation process, capturing epistatic effects with an error bound that is quadratic in branch length. Empirically, CoSiNE outperforms state-of-the-art language models in zero-shot variant effect prediction by explicitly disentangling selection from context-dependent somatic hypermutation. Finally, we introduceGuided Gillespie, a classifier-guided sampling scheme that steers CoSiNE at inference time, enabling efficient optimization of antibody binding affinity toward specific antigens.

Abstract:
Flow Matching enables simulationfree training of generative models on Riemannian manifolds, yet sampling typically still relies on numerically integrating a probability-flow ODE. We propose Riemannian MeanFlow (RMF), extending MeanFlow to manifold-valued generation where velocities lie in location-dependent tangent spaces. RMF defines an average-velocity field via parallel transport and derives a Riemannian MeanFlow identity that links average and instantaneous velocities for intrinsic supervision. We make this identity practical in a log-map tangent representation, avoiding trajectory simulation and heavy geometric computations. For stable optimization, we decompose the RMF objective into two terms and apply conflict-aware multi-task learning to mitigate gradient interference. RMF also supports conditional generation via classifier-free guidance. Experiments on spheres, tori, and SO(3) demonstrate competitive one-step sampling with improved quality–efficiency trade-offs and substantially reduced sampling cost.

Abstract:
We present MoveThen-Operate, a Vision–language–action framework that explicitly decouples robotic manipulation into two distinct behavioral phases: coarse relocation (move) and contact-critical interaction (operate). Unlike monolithic policies that conflate these heterogeneous regimes, our architecture employs a dual-expert policy routed by a learnable phase selector, introducing a structural inductive bias that isolates phase-specific dynamics. Phase labels are automatically generated via an MLLM-based pipeline conditioned on lightweight contextual cues such as end-effector velocity and subtask decomposition to ensure alignment with human motor patterns. Evaluated on the RoboTwin2 benchmark, our method achieves an average success rate of 68.9%, outperforming the monolithic \pi_0 baseline by +24%. It matches or exceeds models trained on 10× more data and reaches peak performance in 40% fewer training steps, demonstrating that architectural disentanglement of move and operate phases is a highly effective and efficient strategy for mastering high-precision manipulation.

Abstract:
Large language models (LLMs) are now widely deployed in userfacing applications, reaching hundreds of millions of users worldwide. Despite their widespread adoption, growing reliance on their outputs raises significant concerns, particularly as users may be exposed to model-inherent biases that disadvantage or stereotype certain groups. However, existing bias benchmarks commonly rely on simple templated prompts or restrictive multiple-choice questions that fail to capture the complexity of real-world user interactions. In this work, we address this gap by introducing a counterfactual framework that automatically generates realistic, open-ended questions for LLM bias evaluation. Through iterative question mutation, our approach further systematically explores areas where models are most susceptible to exhibit biased behavior. Beyond just detecting harmful biases, we also capture increasingly relevant response dimensions, such as asymmetric refusals and explicit bias acknowledgment. Building on this, we construct CAB, a diverse and human-verified benchmark for realistic and nuanced bias evaluations on current frontier LLMs. Our evaluation using CAB highlights the continued need for fairness research by demonstrating that all examined models exhibit persistent biases across certain scenarios.

Abstract:
Skillbased agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters or training data. SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems. To enable systematic evaluation, we release a dataset of 3,000+ curated backdoored skills spanning diverse skill patterns and trigger–payload configurations. We instantiate SkillTrojan in a representative code-based agent setting and evaluate both clean-task utility and attack success rate. Our results show that skill-level backdoors can be highly effective with minimal degradation of benign behavior, exposing a critical blind spot in current skill-based agent architectures and motivating defenses that explicitly reason about skill composition and execution. Concretely, on EHR SQL, SkillTrojan attains up to 97.2% ASR while maintaining 89.3% clean ACC on GPT-5.2-1211-Global.

Abstract:
The accuracyfirst perspective of differential privacy addresses an important shortcoming by allowing a data analyst to adaptively adjust the quantitative privacy bound instead of sticking to a predetermined bound. Existing works on the accuracy-first perspective have neglected an important property of differential privacy known as post-processing immunity, which ensures that an adversary is not able to weaken the privacy guarantee by post-processing. We address this gap by determining which existing definitions in the accuracy-first perspective have post-processing immunity, and which do not. The only definition with post-processing immunity, pure ex-post privacy, lacks useful tools for practical problems, such as an ex-post analogue of the Gaussian mechanism, and an algorithm to check if accuracy on separate private validation set is high enough. To address this, we propose a new definition based on Rényi differential privacy that has post-processing immunity, and we develop basic theory and tools needed for practical applications. We demonstrate the practicality of our theory with applications to synthetic data generation and image classifier fine-tuning, where our algorithm successfully adjusts the privacy bound until an accuracy threshold is met on a private validation dataset.

Abstract:
Visual Autoregressive (VAR) modeling departs from the nexttoken prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Although existing acceleration methods reduce runtime for large-scale steps, but rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present FasterVAR, a systematic study and stage-aware acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, FasterVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, without requiring additional training. Our proposed FasterVAR achieves up to 3.4× speedup with only a 0.01 drop on GenEval and a 0.26 decrease on DPG, consistently outperforming existing acceleration baselines. These results highlight stage-aware design as a powerful principle for efficient visual autoregressive image generation. Our codes will be open-sourced.

Abstract:
Alzheimer’s disease (AD) alters brain electrophysiology and disrupts multichannel EEG dynamics, making accurate and clinically useful EEGbased diagnosis increasingly important for screening and disease monitoring. However, many existing approaches rely on black-box classifiers and do not explicitly model the underlying dynamics that generate observed signals. To address these limitations, we propose LERD, an end-to-end Bayesian electrophysiological neural dynamical system that infers latent neural events and their relational structure directly from multichannel EEG without event or interaction annotations. LERD combines a continuous-time event inference module with a stochastic event-generation process to capture flexible temporal patterns, while incorporating an electrophysiology-inspired dynamical prior to guide learning in a principled way. We further provide theoretical analysis that yields a tractable bound for training and stability guarantees for the inferred relational dynamics. Extensive experiments on synthetic benchmarks and two real-world AD EEG cohorts demonstrate that LERD consistently outperforms strong baselines and yields physiology-aligned latent summaries that help characterize group-level dynamical differences.

Abstract:
Dataset distillation aims to synthesize a compact subset of the original data, enabling models trained on it to achieve performance comparable to those trained on the original large dataset. Existing distributionmatching methods are confined to Euclidean spaces, making them only capture linear structures and overlook the intrinsic geometry of real data, e.g., curvature. However, high-dimensional data often lie on low-dimensional manifolds, suggesting that dataset distillation should have the distilled data manifold aligned with the original data manifold. In this work, we propose a geometry-aware distribution-matching framework, called GeoDM, which operates in the Cartesian product of Euclidean, hyperbolic, and spherical manifolds, with flat, hierarchical, and cyclical structures all captured by a unified representation. To adapt to the underlying data geometry, we introduce learnable curvature and weight parameters for three kinds of geometries. At the same time, we design an optimal transport loss to enhance the distribution fidelity. Our theoretical analysis shows that the geometry-aware distribution matching in a product space yields a smaller generalization error bound than the Euclidean counterparts. Extensive experiments conducted on standard benchmarks demonstrate that our algorithm outperforms state-of-the-art data distillation methods and remains effective across various distribution-matching strategies for the single geometries.

Abstract:
In contemporary deep learning, a prevalent and effective workflow for solving lowdata problems is adapting powerful pre-trained foundation models (FMs) to new tasks via parameter-efficient fine-tuning (PEFT). However, while empirically effective, the resulting solutions lack generalisation guarantees to certify their accuracy - which may be required for ethical or legal reasons prior to deployment in high-importance applications. In this paper we develop a novel transfer learning approach that is designed to facilitate non-vacuous learning theoretic generalisation guarantees for downstream tasks, even in the low-shot regime. Specifically, we first use upstream tasks to train a \em distribution over PEFT parameters. We then learn the downstream task by a \em sample-and-evaluate procedure -- sampling plausible PEFTs from the trained diffusion model and selecting the one with the highest likelihood on the downstream data. Crucially, this confines our model hypothesis to a \em finite set of PEFT samples. In contrast to the typical continuous hypothesis spaces of neural network weights, this facilitates tighter risk certificates. We instantiate our bound and show non-trivial generalization guarantees compared to existing learning approaches which lead to vacuous bounds in the low-shot regime.

Abstract:
Capacitated fairrange k-clustering generalizes classical k-clustering by incorporating both capacity constraints and demographic fairness. In this setting, data points are categorized as clients and facilities; each facility has a capacity and may belong to one or more possibly intersecting demographic groups. The task is to select k facilities as centers and assign each client to a center so that: (a) no center exceeds its capacity, (b) the number of centers selected from each group lies within specified lower and upper bounds (fair-range constraints), and (c) the clustering cost (e.g., k-median or k-means) is minimized. Prior work by Thejaswi et al. (KDD 2022) showed that even satisfying fair-range constraints is \np-hard, thereby making the problem inapproximable to any polynomial factor. Our first main result strengthens this by showing that inapproximability persists even when the fair-range constraints are trivially satisfiable, highlighting the intrinsic computational complexity of the clustering task itself. These inapproximability results hold even on tree metrics and when the number of groups is logarithmic in the size of the facility set. In light of strong inapproximability results, we focus on a practical setting where the number of groups is constant. Our second main result is a polynomial-time O(\log k)- and O(\log^2 k)-approximation algorithm for k-median and k-means objectives, respectively, in this regime. Next, we design constant factor approximation algorithms for these problems that run in fixed parameterized tractable time in k. All our approximation guarantees match the best bounds for capacitated clustering without fair-range constraints. Finally, as our third main contribution, we show that our polynomial-time algorithms are, to our knowledge, the first to have provable approximation guarantees that can practically solve problem instances of modest size.

Abstract:
Textto-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. Besides, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 15 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, instruction following, etc. These results indicate significant improvement room for future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.

Abstract:
Coverless Image Steganography (CIS) hides information without explicitly modifying a cover image, providing strong imperceptibility and inherent robustness to steganalysis. However, existing CIS methods largely lack robust access control, making it difficult to selectively reveal different hidden contents to different authorized users. Such access control is critical for scalable and privacysensitive information hiding in multi-user settings. We propose MIDAS, a training-free diffusion-based CIS framework that enables multi-image hiding with user-specific access control via latent-level fusion. MIDAS introduces a Random Basis mechanism to suppress residual structural information and a Latent Vector Fusion module that reshapes aggregated latents to align with the diffusion process. Experimental results demonstrate that MIDAS consistently outperforms existing training-free CIS baselines in access control functionality, stego image quality and diversity, robustness to noise, and resistance to steganalysis, establishing a practical and scalable approach to access-controlled coverless steganography.

Abstract:
The agentenvironment co-design paradigm jointly optimises agent policies and environment configurations in search of improved system performance, promising to fundamentally reshape how we deploy multi-agent systems in domains such as warehouse logistics and windfarm management. However, current co-design methods collapse under high dimensional environment design spaces and suffer from sample inefficiency when addressing moving targets inherent to joint optimisation. We address this by developingDiffusion Co-Design(DiCoDe), a scalable and sample-efficient co-design framework incorporating two core innovations. We introduce Projected Universal Guidance (PUG), enabling exploration of constraint-satisfying reward-maximising environments, and devise a critic distillation mechanism to transfer knowledge from the reinforcement learning loop to a guided diffuision model. Together, these improvements lead to superior environment-policy pairs when validated on challenging multi-agent co-design benchmarks, for example, exceeding state-of-the art in a warehouse setting with 39% higher rewards and 66% fewer simulation steps.

Abstract:
Reinforcement Learning (RL) and Supervised FineTuning (SFT) are the two dominant paradigms for enhancing Large Language Model (LLM) performance on downstream tasks. While RL generally preserves broader model capabilities (retention) better than SFT, it comes with significant costs: complex reward engineering, instability, and expensive on-policy sampling. In contrast, SFT is efficient but brittle, often suffering from catastrophic forgetting due to Supervision Mismatch: the divergence between the model's evolving policy and static training labels. We address this trade-off with Trajectory-Mixed Supervision (TMS), a reward-free framework that approximates the on-policy benefits of RL by creating a dynamic curriculum from the model's own historical checkpoints. TMS minimizes Policy-Label Divergence (PLD), preventing the mode collapse that drives forgetting in standard SFT. Experiments across reasoning (MATH, GSM8K) and instruction-following benchmarks demonstrate that TMS effectively shifts the accuracy-retention Pareto frontier. While RL remains the gold standard for retention, TMS significantly outperforms standard and iterative SFT, bridging the gap to RL without requiring reward models or verifiers. Mechanistic analysis confirms that PLD drift accurately predicts forgetting, and that TMS successfully mitigates this drift.

Abstract:
Memory agents, which depart from predefined memoryprocessing pipelines by endogenously managing the processing, storage, and retrieval of memories, have garnered increasing attention for their autonomy and adaptability. However, existing training paradigms remain constrained: agents often traverse long-horizon sequences of memory operations before receiving sparse and delayed rewards, which hinders truly end-to-end optimization of memory management policies. To address this limitation, we introduce Mem-T, an autonomous memory agent that interfaces with a lightweight hierarchical memory database to perform dynamic updates and multi-turn retrieval over streaming inputs. To effectively train long-horizon memory management capabilities, we further propose MoT-GRPO, a tree-guided reinforcement learning framework that transforms sparse terminal feedback into dense, step-wise supervision via memory operation tree backpropagation and hindsight credit assignment, thereby enabling the joint optimization of memory construction and retrieval. Extensive experiments demonstrate that Mem-T is \ding182 high-performing, surpassing frameworks such as A-Mem and Mem0 by up to 14.94%, and \ding183 economical, operating on a favorable accuracy-efficiency Pareto frontier and reducing inference tokens per query by ～24.45% relative to GAM without sacrificing performance.

Abstract:
Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to ～ 70% over the stateof-the-art while requiring <50% of the parameters and a 7× training speedup. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-invariant and phase-equivariant biases. PHALAR establishes new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, correlating significantly higher with human coherence judgment than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.

Abstract:
Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional crossmodal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL for multimodal representation learning to calibrate incomplete alignments caused by missing modalities. Specifically, CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments and comprehensive analyses demonstrate the superiority of CalMRL. Our code, model checkpoints, and evaluation raw data will be publicly available.

Abstract:
Graphs provide a natural representation of relational structure that arises across diverse domains. Despite this ubiquity, graph structure is typically learned in a modalityand task-isolated manner, where graph representations are constructed within individual task contexts and discarded thereafter. As a result, structural regularities across modalities and tasks are repeatedly reconstructed rather than accumulated at the level of intermediate graph representations. This motivates a representation-learning question:how should graph structure be organized so that it can persist and accumulate across heterogeneous modalities and tasks?We adopt a representation-centric perspective in which graph structure is treated as a structural substrate that persists across learning contexts. To instantiate this perspective, we proposeG-Substrate, agraphsubstrateframework that organizes learning around shared graph structures. G-Substrate comprises two complementary mechanisms: a unified structural schema that ensures compatibility among graph representations across heterogeneous modalities and tasks, and an interleaved role-based training strategy that exposes the same graph structure to multiple functional roles during learning. Experiments across multiple domains, modalities, and tasks show that G-Substrate outperforms task-isolated and naive multi-task learning methods.

Abstract:
Reward design remains a significant bottleneck in applying reinforcement learning (RL) to realworld problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human ratings rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 uses a novel ranking mean squared error loss that learns from a dataset of trajectory–rating pairs, treating the human-provided discrete ratings (e.g., "bad," "neutral," "good") as ordinal targets. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using both human-provided and simulated ratings, we demonstrate that R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic benchmarks from OpenAI Gym and the DeepMind Control Suite.

Abstract:
Large language models are increasingly deployed in streaming scenarios, rendering conventional posthoc safeguards ineffective as they fail to interdict unsafe content in real-time. While streaming safeguards based on token-level supervised training could address this, they necessitate expensive annotations and suffer from severe overfitting. In this work, we challenge the paradigm that streaming safety must rely on token-level supervised training. Instead, it is an inherent capability of well-trained post-hoc safeguards, as they already encode token-level risk signals in hidden representations. Hence, we introduce NExT-Guard, a training-free framework that achieves streaming safeguards by monitoring interpretable latent features from Sparse Autoencoders (SAEs). It uses pretrained SAEs from publicly available base LLMs, enabling flexible, low-cost deployment without token-level supervision. Experimental results show that NExT-Guard outperforms both post-hoc and streaming safeguards based on supervised training, with superior robustness across models, SAE variants, and risk scenarios. These results make NExT-Guard a universal and scalable paradigm for real-time safety, accelerating the practical deployment of streaming safeguards.

Abstract:
Reinforcement learning from human feedback (RLHF) has evolved to be one of the main methods for finetuning large language models (LLMs). However, existing RLHF methods are non-robust, and their performance deteriorates if the downstream task differs significantly from the preference dataset used in fine-tuning. In order to mitigate this problem, we introduce a distributionally robust RLHF for fine-tuning LLMs. In particular, our goal is to ensure that a fine-tuned model retains its performance even when the distribution of prompts significantly differs from the distribution encountered during fine-tuning. We formulate distributionally robust optimization (DRO) version of two popular fine-tuning methods -- (1) reward-based RLHF and (2) reward-free DPO (direct preference optimization). We propose a minibatch gradient descent based algorithms for both of them, and theoretically prove convergence guarantees for the algorithms. Subsequently, we evaluate our algorithms on an out-of-distribution (OOD) task by first training the model on the Unified-Feedback dataset and evaluating its performance on two different datasets. The experimental results show that our robust training improves the accuracy of the learned reward models on average, and markedly on some tasks, such as reasoning. Furthermore, we show that the robust versions of policy optimization methods, similarly improve performance on OOD tasks.

Abstract:
Many modern retrieval problems are \emphsetvalued: given a broad intent, the system must return a \emphcollection of results that optimizes higher-order properties (e.g., diversity, coverage, complementarity, coherence) while staying grounded to a fixed database. Set-valued objectives are inherently non-decomposable and are not captured by existing supervised (query, content) datasets which only prioritize top-1 retrieval. While reinforcement learning (RL) can optimize set-level objectives via interaction, deploying an RL-tuned LLM for fan-out retrieval is prohibitively expensive at query time. Conversely, diffusion-based generative retrieval enables efficient single-pass fan-out in embedding space, but requires objective-aligned training targets. To address these issues, we propose R4T (Retrieve-for-Train), which uses RL \emphonce as an objective transducer in a three step process: (i) train a fan-out LLM with composite set-level rewards, (ii) synthesize objective-consistent training pairs, and (iii) train a lightweight diffusion retriever to model the conditional distribution of set-valued outputs. Across Polyvore and a music playlist dataset, R4T improves retrieval quality over strong baselines while reducing query-time fan-out latency by an order of magnitude.

Abstract:
Empirical studies have shown that continuous lowloss paths can be constructed between independently trained neural network models. This phenomenon, known as mode connectivity, refers to the existence of such paths between distinct modes-i.e., well-trained solutions in parameter space. However, existing empirical methods do not reliably connect independently trained modes and have been evaluated mainly on a narrow set of architectures (e.g., basic CNNs, VGG, and ResNet), leaving their effectiveness on newer models unclear. In this work, we propose a new empirical algorithm for connecting independently trained modes that generalizes beyond traditional architectures and supports a broader range of networks, including MobileNet, ShuffleNet, EfficientNet, RegNet, Deep Layer Aggregation (DLA), and Compact Convolutional Transformers (CCT). In addition to broader applicability, the proposed method yields more consistent connectivity paths across independently trained mode pairs and supports connecting modes obtained with different training hyperparameters.

Abstract:
AI agents are increasingly deployed for consequential tasks. Yet existing benchmarks evaluate only task success rates, ignoring whether agents behave consistently, remain robust to perturbations, fail predictably, or bound error severity. We propose a framework for measuring agent reliability grounded in safetycritical engineering practice, decomposing reliability into four dimensions: consistency, robustness, predictability, and safety. Applying these metrics to 12 frontier models across two complementary benchmarks, we find that recent capability gains have produced minimal improvement in reliability: agents remain inconsistent across runs, brittle to prompt rephrasings, and poorly calibrated in their self-assessments, even as accuracy improves. Our metrics complement accuracy-focused evaluation by offering tools for reasoning about how agents perform, degrade, and fail under uncertainty.

Abstract:
Generative modeling of threedimensional (3D) molecules is a fundamental yet challenging problem in drug discovery and materials science. Existing approaches typically represent molecules as 3D graphs and co-generate discrete atom types with continuous atomic coordinates, leading to intrinsic learning difficulties such as heterogeneous modality entanglement and geometry–chemistry coherence constraints. We propose VecMol, a novel representation that models 3D molecules as continuous vector fields over Euclidean space, where vectors point toward nearby atoms and implicitly encode molecular structure. The vector field is parameterized by a neural field and generated using a latent diffusion model, avoiding explicit graph generation and decoupling structure learning from discrete atom instantiation. Experiments on the QM9 and GEOM-Drugs benchmarks demonstrate that VecMol achieves competitive generation quality, suggesting vector-field-based representations as a promising new direction for 3D molecular generation.

Abstract:
Latent manifolds of autoencoders provide lowdimensional representations of data, which can be studied from a geometric perspective. We propose to describe these latent manifolds as implicit submanifolds of some ambient latent space. Based on this, we develop tools for a discrete Riemannian calculus approximating classical geometric operators. These tools are robust against inaccuracies of the implicit representation often occurring in practical examples. To obtain a suitable implicit representation, we propose to learn an approximate projection onto the latent manifold by minimizing a denoising objective. This approach is independent of the underlying autoencoder and supports the use of different Riemannian geometries on the latent manifolds. The framework in particular enables the computation of geodesic paths connecting given end points and shooting geodesics via the Riemannian exponential maps on latent manifolds. We evaluate our approach on various autoencoders trained on synthetic and real data.

Abstract:
The visual modality of visionlanguage models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb \rightarrow banana) then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while visual context preserves the original meaning, and (4) visual analogy puzzles whose solution requires inferring a prohibited concept. Evaluating across five frontier VLMs, we find visual attacks achieve comparable and sometimes superior success rates to their text-only counterparts. For example, our visual cipher achieves 40.9% attack success on Claude-Haiku-4.5 versus 10.7% for an equivalent textual cipher. To further our insight into the attack mechanism, we present preliminary interpretability and mitigation results. These findings highlight that robust VLM alignment requires treating vision as a first-class target for safety post-training.

Abstract:
Optimizing Conditional Valueat-risk (CVaR) using policy gradient (a.k.a CVaR-PG) faces significant challenges of sample inefficiency. This inefficiency stems from the fact that it focuses on tail-end performance and overlooks many sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improves sample efficiency. This does not alter the CVaR objective since CVaR corresponds to the expectation of quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG and consistently outperforms other existing methods.

Abstract:
Reasoning in knowledgeintensive domains remains challenging because intermediate steps are often not locally verifiable: unlike math or code, evaluating step correctness may require synthesizing clues across large external knowledge sources. As a result, subtle errors can propagate through reasoning traces, potentially never to be detected. Prior work has proposed process reward models (PRMs), including retrieval-augmented variants. However, due to retrieval they operate post hoc by scoring completed trajectories which prevents their integration into dynamic inference procedures. Here, we introduce Process Reward Agents~(PRA), a test-time method for providing domain-grounded, online, step-wise rewards to a frozen reasoner. In contrast to prior retrieval-augmented PRMs, PRA enables search-based decoding to rank and prune candidate trajectories at every generation step. Experiments on multiple medical reasoning benchmarks demonstrate that PRA consistently outperforms strong baselines, achieving 80.9% accuracy on MedQA with Qwen3-4B, a new state of the art at the 4B scale. Crucially, PRA generalizes to unseen frozen policy models ranging from 0.5B to 8B parameters, improving their accuracy by up to +25.7% without any policy model updates. Ultimately, PRA suggests a paradigm in which frozen reasoners are decoupled from domain-specific tool-augmented reward modules, which enables the deployment of new backbones in complex domains without retraining. To support reproducibility, we release all code and data in an anonymous repository.

Abstract:
Offline reinforcement learning (RL) learns policies from fixed datasets, thereby avoiding costly or unsafe environment interactions. However, its reliance on finite static datasets inherently restricts the ability to generalize beyond the training distribution. Prior solutions based on synthetic data augmentation often fail to generalize to unseen scenarios in the (augmented) dataset. To address these challenges, we propose Retrieval HighquAlity Demonstrations (RAD) for decision-making, which innovatively introduces a retrieval mechanism into offline RL. Specifically, RAD retrieves high-return and reachable states from the offline dataset as target states, and leverages a generative model to generate sub-trajectories conditioned on these targets for planning. Since the targets are high-return states, once the agent reaches such a target, it can continue to obtain high returns by following the associated high-return actions, thereby improving policy generalization. Extensive experiments confirm that RAD achieves competitive or superior performance compared to baselines across diverse benchmarks, validating its effectiveness. Our code is available at https://anonymous.4open.science/r/RAD_ICML-6CC9.

Abstract:
FullReference image quality assessment (FR IQA) is important for image compression, restoration and generative modeling, yet current neural metrics remain slow and vulnerable to adversarial perturbations. We present BiRQA, a compact FR IQA metric model that processes four fast complementary features within a bidirectional multiscale pyramid. A bottom-up attention module injects fine-scale cues into coarse levels through an uncertainty-aware gate, while a top-down cross-gating block routes semantic context back to high resolution. To enhance robustness, we introduce Anchored Adversarial Training, a theoretically grounded strategy that uses clean "anchor" samples and a ranking loss to bound pointwise prediction error under attacks. On five public FR IQA benchmarks BiRQA outperforms or matches the previous state of the art (SOTA) while running ～ 3 × faster than previous SOTA models. Under unseen white-box attacks it lifts SROCC from 0.30-0.57 to 0.60-0.84 on KADID-10k, demonstrating substantial robustness gains. To our knowledge, BiRQA is the only FR IQA model combining competitive accuracy with real-time throughput and strong adversarial resilience.

Abstract:
Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmarkagnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery

Abstract:
Multiadapter serving systems route entire sequences to a single adapter, forcing a choice when requests span multiple domains. This assumption fails in two important settings: (1) multimodal generation, where text and image tokens require different adapters within the same sequence, and (2) mixed-capability requests like ``write code to solve this equation,'' which need expertise from multiple specialized adapters. We introduce per-token routing, which routes individual tokens to adapters based on either vocabulary structure (for multimodal models) or learned gating (for semantic specialization). Per-token routing is provably optimal, achieving work N for N tokens versus K \cdot N for per-sequence routing with K adapter types. Our key contribution is MoLoRA (Mixture of LoRA), which enables composable specialization: load multiple domain-specific adapters and let a learned router select the appropriate adapter per-token. We demonstrate that specialization dramatically beats scale: MoLoRA enables Qwen3-1.7B to exceed Qwen3-8B across four reasoning benchmarks while being 4.7× smaller. This enables modular expertise at inference time: train focused LoRAs independently, combine them without retraining, and add new capabilities by simply loading new adapters.

Abstract:
Automated methods for discovering mechanistic simulator models from observational data offer a promising path toward accelerating scientific progress. Such methods often take the form of agenticstyle iterative workflows that repeatedly propose and revise candidate models by imitating human discovery processes. However, existing LLM-based approaches typically implement such workflows via hand-crafted heuristic procedures, without an explicit probabilistic formulation. We recast model discovery as probabilistic inference, i.e., as sampling from an unknown distribution over mechanistic models capable of explaining the data. This perspective provides a unified way to reason about model proposal, refinement, and selection within a single inference framework. As a concrete instantiation of this view, we introduce ModelSMC, an algorithm based on Sequential Monte Carlo sampling that represents candidate models as particles which are iteratively proposed and refined by an LLM, and weighted using likelihood-based criteria. Experiments on real-world scientific systems illustrate that this formulation discovers models with interpretable mechanisms and improves posterior predictive checks. More broadly, this perspective provides a probabilistic lens for understanding and developing LLM-based approaches to model discovery.

Abstract:
Federated LoRA provides a communicationefficient mechanism for fine-tuning large language models on decentralized data. In practice, however, a discrepancy between the factor-wise averaging used to preserve low rank and the mathematically correct aggregation of local updates can cause significant aggregation error and unstable training. We argue that a major source of this problem is rotational misalignment, arising from the rotational invariance of low-rank factorizations -- semantically equivalent updates can be represented in different latent subspaces across clients since (B_i R_i)(R_i^\top A_i) = B_i A_i. When such misaligned factors are averaged directly, they interfere destructively and degrade the global update. To address this issue, we propose FedRot-LoRA, a federated LoRA framework that aligns client updates via orthogonal transformations prior to aggregation. This alignment preserves the semantic update while reducing cross-client subspace mismatch, without increasing communication cost or restricting model expressivity. We provide a convergence analysis that examines the aggregation error induced by factor-wise averaging and shows how rotational alignment yields a tighter upper bound on this error. Extensive experiments on natural language understanding and generative tasks demonstrate that FedRot-LoRA consistently outperforms existing federated LoRA baselines across a range of heterogeneity levels and LoRA ranks. Code is available at https://anonymous.4open.science/r/FedRot-LoRA

Abstract:
Agentic multimodal models have garnered significant attention for their ability to leverage external tools to tackle complex tasks. However, it is observed that such agents often meet premature interaction collapse, caused by two primary reasons: 1) the terminal reward often appending on the last token prevents the advantage from distinguishing trajectories with exploratory behavior; 2) excessively redundant context hinders the agent from absorbing useful feedback. To address these issues, we propose the Deepening Reasoning MMSearchAgent, the framework leverages the structural proximity to derive advantage signals from the whole rollout trajectories in an entire batch, such that trajectories of different lengths are further encouraged to be generated, even when containing the same correct answer. Additionally, differentiated gaussian rewards are employed to dynamically calibrate interaction tolerance, thereby ensuring information reliability and reduce redundancy. To support multiturn interaction training, we have constructed a multi-step deep-reasoning dataset including 3602 high-quality QA pair with at least 3 reasonning steps. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming the MMSearch-R1 by 8.4% on FVQA-test.

Abstract:
We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeekV3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic ``centroids'' from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.

Abstract:
Implicit Neural Representations (INRs) based on vanilla MultiLayer Perceptrons (MLPs) are widely believed to be incapable of representing high-frequency content. This has directed research efforts towards architectural interventions, such as coordinate embeddings or specialized activation functions, to represent high-frequency signals. In this paper, we challenge the notion that the low-frequency bias of vanilla MLPs is an intrinsic, architectural limitation to learn high-frequency content, but instead a symptom of stable rank degradation during training. We empirically demonstrate that regulating the network's rank during training substantially improves the fidelity of the learned signal, rendering even simple MLP architectures expressive. Extensive experiments show that using optimizers like Muon, with high-rank, near-orthogonal updates, consistently enhances INR architectures even beyond simple ReLU MLPs. These substantial improvements hold across a diverse range of domains, including natural and medical images, and novel view synthesis, with up to 9 dB PSNR improvements over the previous state-of-the-art. Code and experiments will be released upon acceptance.

Abstract:
Concept Bottleneck Models (CBMs) aim to improve interpretability by mediating predictions through humanunderstandable concepts, but they provide no way to verify whether learned concepts align with the human's intended meaning, hurting interpretability. We introduce Prototype-Grounded Concept Models (PGCMs), which ground concepts in learned visual prototypes: image parts that serve as explicit evidence for the concepts. This grounding enables direct inspection of concept semantics and supports targeted human intervention at the prototype level to correct misalignments. Empirically, PGCMs match the predictive performance of state-of-the-art CBMs while substantially improving transparency, interpretability, and intervenability.

Abstract:
Dataset Distillation aims to compress a large dataset into a small synthetic one while maintaining predictive performance. We show that as different demographic groups exhibit distinct predictive patterns, the distillation process struggles to simultaneously preserve informative signals for all subgroups, regardless of whether group sizes are mildly or severely imbalanced. Consequently, models trained on distilled data can experience substantial performance drops for certain subgroups, leading to fairness gaps. Crucially, these gaps do not disappear by merely correcting group imbalance, since they stem from fundamental mismatches in subgroup predictive patterns rather than from samplesize disparities alone. We therefore formally analyze the interaction between these two sources of bias and cast the solution as identifying a group-imbalance-agnostic barycenter of the predictive information that induces similar representations across all subgroups. By distilling toward this shared aggregate representation, we show that group fairness concerns can be reduced. Our approach is compatible with existing distillation methods, and empirical results show that it substantially reduces bias introduced by dataset distillation.

Abstract:
In the domain of textto-video (T2V) generation, reliably synthesizing compositional content involving multiple subjects with intricate relations is still underexplored. The main challenges are twofold: 1) Subject presence, where not all subjects can be presented in the video; 2) Inter-subject relations, where the interaction and spatial relationship between subjects are misaligned. Existing methods adopt techniques, such as inference-time latent optimization or layout control, which fail to address both issues simultaneously. To tackle these problems, we propose Comp-Attn, a composition-aware cross-attention variant that follows a ``Present-and-Align” paradigm: it decouples the two challenges by enforcing subject presence at the condition level and achieving relational alignment at the attention-distribution level. Specifically, 1) We introduce Subject-aware Condition Interpolation (SCI) to reinforce subject-specific conditions and ensure each subject's presence; 2) We propose Layout-forcing Attention Modulation (LAM), which dynamically enforces the attention distribution to align with the relational layout of multiple subjects. Comp-Attn can be seamlessly integrated into various T2V baselines in a training-free manner, boosting T2V-CompBench scores by 15.7% and 11.7% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 5% increase in inference time. Meanwhile, it also achieves strong performance on VBench and T2I-CompBench, demonstrating its scalability in general video generation and compositional text-to-image (T2I) tasks.

Abstract:
We address the challenge of uncovering systematic, and potentially controllable, dynamical structure underlying complex highdimensional time series data. Existing generative and autoregressive models have difficulty telling systematic content apart from distractors, while contrastive methods struggle to learn accurate dynamics. To address these shortcomings, we develop the Recognition-Parametrized Gaussian State Space Model (RP-GSSM), a probabilistic framework that infers accurate latent dynamics without relying on a parametrized decoder. By eliminating explicit generative parameters, the model directs its entire representational capacity to encoding dynamically relevant state; and, being fully probabilistic, learns via maximum likelihood without auxiliary objectives or ad-hoc regularization. Combining the expressive power of a neural network encoder with exact inference under a jointly Gaussian prior allows the RP-GSSM to embed a broad class of intrinsically nonlinear dynamical systems. We show that the RP-GSSM recovers physically meaningful latent states from noisy video more faithfully than competing methods, more reliably identifies underlying controllable nonlinear dynamics, and remains substantially more robust to visual distractors.

Abstract:
Understanding how transformer components operate in LLMs is important, as it is at the core of recent technological advances in artificial intelligence. In this work, we revisit the challenges associated with interpretability of feedforward modules (FFNs) and propose MemoryLLM, which aims to decouple FFNs from self-attention and enables us to study the decoupled FFNs as context-free token-wise neural retrieval memory. In detail, we investigate how input tokens access memory locations within FFN parameters and the importance of FFN memory across different downstream tasks. MemoryLLM achieves context-free FFNs by training them in isolation from self-attention directly using the token embeddings. This approach allows FFNs to be pre-computed as token-wise lookups (ToLs), enabling on-demand transfer between VRAM and storage, additionally enhancing inference efficiency. We also introduce Flex-MemoryLLM, positioning it between a conventional transformer design and MemoryLLM. This architecture bridges the performance gap caused by training FFNs with context-free token-wise embeddings.

Abstract:
We introduce Debate2Create (D2C), a multiagent LLM framework that formulates robot co-design as structured, iterative debate grounded in physics-based evaluation. A design agent and control agent engage in a thesis-antithesis-synthesis loop, while pluralistic LLM judges provide multi-objective feedback to steer exploration. Across five MuJoCo locomotion benchmarks, D2C achieves up to 3.2× the default Ant score and ~9× on Swimmer, outperforming prior LLM-based methods and black-box optimization. Iterative debate yields 18–35% gains over compute-matched zero-shot generation, and D2C-generated rewards transfer to default morphologies in 4/5 tasks. Our results demonstrate that structured multi-agent debate offers an effective alternative to hand-designed objectives for joint morphology-reward optimization.

Abstract:
We introduce ECSEL, an explainable classification method that learns formal expressions in the form of signomial equations, motivated by the observation that many symbolic regression benchmarks admit compact signomial structure. ECSEL directly constructs a structural, closedform expression that serves as both a classifier and an explanation. On standard symbolic regression benchmarks, our method recovers a larger fraction of target equations than competing state-of-the-art approaches while requiring substantially less computation. Leveraging this efficiency, ECSEL achieves classification accuracy competitive with established machine learning models without sacrificing interpretability. Further, we show that ECSEL satisfies some desirable properties regarding global feature behaviour, decision-boundary analysis, and local feature attributions. Experiments on benchmark datasets and two real-world case studies i.e., e-commerce and fraud detection, demonstrate that the learned equations expose dataset biases, support counterfactual reasoning, and yield actionable insights.

Abstract:
How can we generate samples from a conditional distribution that we never fully observe? This question arises across a broad range of applications in both modern machine learning and classical statistics, including image postprocessing in computer vision, approximate posterior sampling in simulation-based inference, and conditional distribution modeling in complex data settings. In such settings, compared with unconditional sampling, additional feature information can be leveraged to enable more adaptive and efficient sampling. Building on this, we introduce Conditional Generator using MMD (CGMMD), a novel framework for conditional sampling. Unlike many contemporary approaches, our method frames the training objective as a simple, adversary-free direct minimization problem. A key feature of CGMMD is its ability to produce conditional samples in a single forward pass of the generator, enabling practical one-shot sampling with low test-time complexity. We establish rigorous theoretical bounds on the loss incurred when sampling from the CGMMD sampler, and prove convergence of the estimated distribution to the true conditional distribution. In the process, we also develop a uniform concentration result for nearest-neighbor based functionals, which may be of independent interest. Finally, we show that CGMMD performs competitively on synthetic tasks involving complex conditional densities, as well as on practical applications such as image denoising and image super-resolution.

Abstract:
Robot manipulation has increasingly adopted datadriven generative policy frameworks, yet the field faces a persistent trade-off: diffusion models suffer from high inference latency, while flow-based methods often require complex architectural constraints. Although in image generation domain, the MeanFlow paradigm offers a path to single-step inference, its direct application to robotics is impeded by critical theoretical pathologies, specifically spectral bias and gradient starvation in low-velocity regimes. To overcome these limitations, we propose the One-step MeanFlow Policy (OMP), a novel framework designed for high-fidelity, real-time manipulation. We introduce a lightweight directional alignment mechanism to explicitly synchronize predicted velocities with true mean velocities. Furthermore, we implement a Differential Derivation Equation (DDE) to approximate the Jacobian-Vector Product (JVP) operator, which decouples forward and backward passes to significantly reduce memory complexity. Extensive experiments on the Adroit and Meta-World benchmarks demonstrate that OMP outperforms state-of-the-art methods in success rate and trajectory accuracy, particularly in high-precision tasks, while retaining the efficiency of single-step generation.

Abstract:
DNA language models are increasingly used to represent genomic sequence, yet their effectiveness depends critically on how raw nucleotides are converted into model inputs. Unlike natural language, DNA offers no canonical “word” boundaries, making fixed tokenizations a brittle design choice under shifts, indels, and local repeats. We introduce DNAChunker, a masked DNA language model that incorporates a learnable adaptive segmentation module to produce contextdependent, variable-length units. Building on a dynamic segmentation procedure, DNAChunker learns to allocate finer granularity to functionally enriched regions while compressing repetitive or redundant sequence. We pre-train DNAChunker on the human reference genome (HG38) and evaluate it on the Nucleotide Transformer and Genomic Benchmarks, where it consistently improves over strong fixed-tokenization baselines. Further analyses and ablations indicate that the learned segmentation is structured rather than incidental: the model preferentially uses shorter units around promoters and exons, and longer units in repetitive regions, yielding representations that are both mutation-resilient and biologically-informed.

Abstract:
Posttraining compression is currently divided into two contrasting regimes. On the one hand, fast, data-free, and model-agnostic methods (e.g., NF4 or HQQ) offer maximum accessibility but suffer from functional collapse at extreme bit-rates below 4 bits. On the other hand, techniques leveraging calibration data or extensive recovery training achieve superior fidelity but impose high computational constraints and face uncertain robustness under data distribution shifts. We introduce EntQuant, the first framework to unite the advantages of these distinct paradigms. By matching the performance of data-dependent methods with the speed and universality of data-free techniques, EntQuant enables practical utility in the extreme compression regime. Our method decouples numerical precision from storage cost via entropy coding, compressing a 70B parameter model in less than 30 minutes. We demonstrate that EntQuant does not only achieve state-of-the-art results on standard evaluation sets and models, but also retains functional performance on more complex benchmarks with instruction-tuned models, all at modest inference overhead.

Abstract:
Safe reinforcement learning (RL) is a standard paradigm for safetycritical decision making. However, real-world safety constraints can be complex, subjective, and even hard to explicitly specify. Existing works on constraint inference rely on restrictive assumptions or extensive expert demonstrations, which is not realistic in many real-world applications. How to cheaply and reliably learn these constraints is the major challenge we focus on in this study. While inferring constraints from human preferences offers a data-efficient alternative, we identify the popular Bradley-Terry (BT) models fail to capture the asymmetric, heavy-tailed nature of safety costs, resulting in risk underestimation. It is still rare in the literature to understand the impacts of BT models on the downstream policy learning. To address the above knowledge gaps, we propose a novel approach namely Preference-based Constrained Reinforcement Learning (PbCRL). We introduce a novel dead zone mechanism into preference modeling and theoretically prove that it encourages heavy-tailed cost distributions, thereby achieving better constraint alignment. Additionally, we incorporate a Signal-to-Noise Ratio (SNR) loss to encourage exploration by cost variances, which is found to benefit policy learning. Further, two-stage training strategy are deployed to lower online labeling burdens while adaptively enhancing constraint satisfaction. Empirical results demonstrate that PbCRL achieves superior alignment with true safety requirements and outperforms the state-of-the-art baselines in terms of safety and reward. Our work explores a promising and effective way for constraint inference in Safe RL, which has great potential in a range of safety-critical applications.

Abstract:
The combination of exponentially large action spaces, stochastic dynamics, and longhorizon decision-making under limited resources makes Sequential Stochastic Combinatorial Optimization (SSCO) particularly challenging for reinforcement learning. Hierarchical Reinforcement Learning (HRL) offers a natural decomposition, but it places the high-level policy in a Semi-Markov Decision Process (SMDP) where actions have variable durations, making it difficult to learn a world model that is suitable for planning. We introduce a model-based hierarchical framework for sequential stochastic combinatorial decision-making that directly addresses this issue. Our method combines a latent-space tree-search planner with an SMDP-aware world model for variable-duration decisions. A multi-timescale objective structures the latent dynamics so that transition magnitudes reflect the effective temporal scales of abstract actions, enabling efficient lookahead under adaptive temporal abstraction. We further learn a subgoal-conditioned budget policy jointly with the world model to support context-aware resource allocation. Across challenging SSCO benchmarks, our method outperforms strong baselines.

Abstract:
Distributional reinforcement learning (DRL) models the full return distribution rather than expectations, but extending it to multivariate settings remains challenging. Many common metrics do not naturally generalize beyond one dimension or lose computational tractability, and the multivariate case introduces additional difficulties such as general matrix discounting, for which no contraction results are available. We introduce Sliced Distributional Reinforcement Learning (SDRL), which lifts tractable onedimensional divergences to multivariate return distributions via projections. We prove Bellman contraction for uniform slicing under shared scalar discounting, and introduce a maximum-slicing variant with contraction under general dense discount matrices. SDRL supports a broad class of base divergences; we analyze Wasserstein, Cramér, and Maximum Mean Discrepancy (MMD), and characterize which SDRL variants suit the standard single-sample Bellman update used in distributional RL. We evaluate SDRL on a toy chain problem and a gridworld image-based environment as well as a subset of Atari games.

Abstract:
In deep learning, functional similarity evaluation quantifies the extent to which independently trained models learn similar input–output relationships. In model stitching, functional similarity is framed as representation forward compatibility, i.e., whether the representations of two models can be aligned to solve a given task. Recent studies, however, highlight a critical limitation: models relying on different information cues can still produce compatible representations, making them appear misleadingly similar \citesmithfunctional. We attribute this failure to standard model stitching being inherently blind to the invariance properties of the stitched models. To address this limitation, we introduce the forwardbackward compatibility requirement under which we formulate the invariance-aware model stitching. Through analyzing key stitching configurations, we study the interplay between forward and backward compatibility, showing that invariance-aware model stitching provides a more principled approach to functional similarity evaluation while revealing functional discrepancies previously obscured.

Abstract:
While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multistep jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across diverse threat scenarios. Notably, the framework enhances both task-specific and general utility, yielding a 5.85% gain on GSM8K alongside improved performance on knowledge-intensive benchmarks. By internalizing trajectory-level safety, Reflector overcomes the fundamental limitations of surface alignment without significant computational overhead, offering an efficient and scalable solution for the development of safe and capable LLMs.

Abstract:
Strategic classification studies the interaction between decision models and agents who strategically manipulate their features for favorable outcomes. Existing SC frameworks typically rely on the idealized assumption that agents are strictly rational. However, evidence from behavioral economics and psychology consistently shows that realworld decision-making is often shaped by cognitive biases, deviating from pure rationality. To formalize this limitation, we identify and define a new problem setting, termed the behaviorally realistic strategic classification problem, where agents’ strategic manipulations deviate from full rationality due to psychological biases. Motivated by the identified limitation, we propose the Prospect-Guided Strategic Framework (Pro-SF) to address the problem, a principled framework grounded in prospect theory to model and learn under behaviorally realistic strategic responses. Specifically, to capture behaviorally realistic strategic manipulations, our framework reformulates the Stackelberg-style interaction between agents and the decision-maker by incorporating three key mechanisms inspired by prospect theory, including the asymmetry between benefits and costs, different subjective reference points, and non-rational probability distortion. Experiments on synthetic and real-world datasets establish Pro-SF as a behaviorally grounded approach to strategic classification, bridging machine learning and behavioral economics for more reliable deployment in the real world.

Abstract:
Entropy coding is widely used in typical learned image compression (LIC) that converts latents into a compact bitstream. However, entropy coding is typically sequential and becomes the coding latency bottleneck. To overcome it, we present EntropyCoding Free Learned Image Compression (EF-LIC), a multi-rate framework that generates compact representation by removing statistical and correlation redundancy with low coding latency. First, we introduce unconstrained vector quantization and prove that its index distribution approaches the maximum-entropy bound, yielding minimal statistical redundancy. Second, we propose a context-conditioned autoregressive transform that directly reparameterizes the latents to reduce inter-dependency. Theoretical analysis shows that EF-LIC can remove correlation redundancy as effectively as typical LIC with entropy coding, leading to comparable compression performance. Experiments show EF-LIC achieves up to 67.86% bitrate reduction over MS-ILLM on Kodak with LPIPS. Ablation studies further show EF-LIC matches the compression performance of its entropy-coding based variant while achieving over 3× faster encoding and 5× faster decoding.

Abstract:
Modelbased reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS as a policy improvement operator, scales favorably with sequential compute, reduces estimator variance and mitigates the effects of path degeneracy while retaining the properties that make SMC natural to parallelize.

Abstract:
Incontext learning (ICL) adapts large language models (LLMs) to new tasks by conditioning on demonstrations in the prompt without parameter updates. With long-context models, many-shot ICL can use dozens to hundreds of examples and achieve performance comparable to fine-tuning, yet current understanding of its scaling behavior is largely derived from non-reasoning tasks. We study many-shot chain-of-thought in-context learning (CoT-ICL) for reasoning and show that standard many-shot rules do not transfer. Across non-reasoning and reasoning-oriented LLMs and across non-reasoning and reasoning tasks, we find: (i) a setting-dependent scaling effect, where increasing the number of CoT demonstrations is unstable for non-reasoning LLMs and benefits mainly reasoning-oriented LLMs; (ii) similarity-based retrieval helps on non-reasoning tasks but fails on reasoning, since semantic similarity poorly predicts procedural (i.e., CoT) compatibility; and (iii) an order-scaling effect, where performance variance grows with more CoT demonstrations. We interpret these behaviors by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggests two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by the principle, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields an average 3.81% gain across math and narrative reasoning tasks. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.

Abstract:
Network routers that enforce Qualityof-Service (QoS) guarantees must decide, at every clock cycle, which expiring packet of information to transmit, even when the value of the packet is unknown until it is processed. We frame this problem as the Online Packet Scheduling with Deadlines (OPSD) problem under Partial Feedback: packets arrive at every clock cycle, with different deadlines, but the weights are only observed after execution. Under a stochastic assumption on the unknown weights, we explore different variants of the OPSD problem with bandit feedback. We establish a connection between our setting and the sleeping bandits problem, and set our learning goal to \alpha-regret minimization. We provide algorithms with provable \alpha-regret guarantees under different spans of slackness, distinguishing systems allowing for randomization and systems that do not. In every scenario, our algorithms achieve an \alpha-regret upper bound of \widetilde\mathcalO\left(\sqrtKT\right), matching the lower bound for the standard bandit setting. In the practically relevant case of 2-bounded deadline instances, where the deadline is set at most one clock cycle away from the arrival, our deterministic algorithm achieves the provably tightest possible competitive ratio. Remarkably, when the number of distinct packet types K\ge 2 is finite, it is possible to break the well-established \Phi = \frac1+\sqrt52 competitive ratio barrier and attain a tighter competitive ratio \theta_K ranging in [\sqrt2, \Phi).

Abstract:
Multiobjective reinforcement learning (MORL) is a fundamental framework for real-world decision-making problems involving multiple conflicting criteria. Existing multi-policy (MP) methods typically rely on online evolutionary frameworks that maintain large policy populations, leading to high sample complexity and excessive agent–environment interactions. To mitigate these limitations, we present Multi-policy Pareto Front Tracking (MPFT), a framework without a self-evolving population. It leverages an efficient Pareto-tracking mechanism initialized with single-objective extreme policies to trace the Pareto front, and further densifies sparse regions to achieve an accurate approximation of the full Pareto front. MPFT can be seamlessly integrated with advanced offline MORL algorithms, thereby substantially improving sample efficiency. We evaluate MPFT on six robotic control tasks with up to three objectives and three high-dimensional tasks with more than three objectives. Experimental results show that MPFT outperforms state-of-the-art baselines in terms of hypervolume and expected utility. It also significantly reduces agent–environment interactions. These results further demonstrate that MPFT serves as a general-purpose framework that can seamlessly integrate both online and offline MORL algorithms.

Abstract:
Annotating largescale LiDAR point clouds for 3D semantic segmentation is costly and time-consuming, which motivates the use of semi-supervised learning (SemiSL). Standard LiDAR SemiSL methods typically adopt a two-step training paradigm, where pseudo-labels are separately generated from a single distillation source, either from the same or an another LiDAR representation. Such supervision relies on a unique source of pseudo-labels, which can reinforce confirmation bias and propagate errors during training, ultimately limiting performance. To address this challenge, We introduceCoLLiS, a novel framework that leveragesCollaborativeLearning forLiDARSemi-supervised segmentation. Unlike prior paradigms with decoupled pseudo-labeling and training phases,CoLLiStrains multiple representations collaboratively in a single step by treating them as coequal students. Each student is adaptively distilled from multiple representations, while inter-student disparities are monitored online to resolve contradictory supervision and effectively mitigate confirmation bias. Extensive experiments on three datasets demonstrate thatCoLLiSconsistently outperforms state-of-the-art LiDAR SemiSL methods, with particularly strong gains in low-label regimes. The code will be released upon acceptance.

Abstract:
Lowrank tensor decomposition (TD) is usually effective on clean, fully observed data, but it often degrades under severe missingness or noise. The low-rank constraint alone provides a weak inductive bias, while common handcrafted priors (e.g., sparsity or smoothness) fail to capture rich real-world structures. To compensate for this weak inductive bias under heavy corruption, one would like to inject a learned, data-driven prior; however, the state-of-the-art diffusion models are not readily compatible with current TD and tractable posterior inference. To address these challenges, we introduce DiffBCP, a Bayesian CP decomposition framework that combines a cumulative shrinkage process prior for automatic rank selection with an off-the-shelf pre-trained diffusion model as an implicit prior on the reconstructed tensor. To make posterior inference tractable despite the coupling among the likelihood, low-rank constraint, and diffusion prior, we develop a split Gibbs sampler: CP factors admit conjugate updates, while the diffusion block is sampled via low-rank-guided denoising. A noise-adaptive coupling schedule further reduces sensitivity to hand-tuned annealing. Experiments on image inpainting and denoising, including high-resolution out-of-distribution images, show consistent gains over Bayesian, nonlinear, and plug-and-play TD baselines.

Abstract:
Training large language models (LLMs) typically relies on adaptive optimizers such as Adam, which introduce extra operations and require significant more memory to maintain firstand second-order moments than SGD. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed variants to reduce memory consumption, a fundamental question remains: What are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly (memory- and compute-) efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques lead to SCALE (Stochastic Column-normAlized Last-layer momEntum), a simple optimizer for memory efficient pretraining. Across multiple LLaMA models (60M–1B), SCALE matches or exceeds the performance of Adam while using only 35–45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For LLaMA 7B model, SCALE outperforms the state-of-the-art memory-efficient methods APOLLO and Muon, in terms of both perplexity and memory consumption.

Abstract:
Jung et al. (2025) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgments, relying on the assumption that the model’s estimated confidence is monotonic with respect to humandisagreement risk. In practice, however, this assumption may be violated, and the generalization behavior of the confidence estimator is not explicitly analyzed. We mitigate these issues by learning a dedicated confidence estimator instead of relying on heuristic confidence signals. Our approach leverages simulated annotator diversity and a margin-based ranking formulation to explicitly model how confidently an LLM distinguishes between human-agreement and human-disagreement cases. We further derive generalization guarantees for this estimator, revealing a margin-dependent trade-off that informs the design of an adaptive estimator training procedure. When integrated into fixed-sequence testing, the learned confidence estimator yields improved ranking accuracy and empirically strengthens the monotonic relationship between confidence and disagreement risk, leading to higher success rates in satisfying target agreement levels across multiple datasets and judge models.

Abstract:
Current motioncontrolled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.

Abstract:
Conformal prediction (CP) and its extension, conformal risk control (CRC), are established frameworks for quantifying uncertainty in supervised machine learning through formal guarantees. However, recent breakthroughs in artificial intelligence (AI) have been driven by unsupervised generative models, such as large language models (LLMs) and image generators, which are not directly compatible with CP or CRC. In this work we introduce conformal generation (ConfGen), a general framework adapting CRC to generative tasks while relaxing its theoretical assumptions. Conf-Gen unifies and generalizes previous attempts to apply CP to LLMs, and extends conformal methodology to entirely new domains. We demonstrate the flexibility of Conf-Gen through some novel applications, including obtaining conformal guarantees on: image generators producing non-memorized images, conversational AI systems having asked enough clarifying questions, and the output of AI agents being correct.

Abstract:
Reasoning is central to building intelligent systems that can solve unseen problems beyond training. Yet we still lack a principled understanding of what internal mechanism enables neural networks to generalize reasoning beyond memorized patterns. We hypothesize that generalizable reasoning emerges through learning taskconditioned attractors. Concretely, the model learns a latent dynamical system whose fixed points correspond to valid solutions. We term models that reason by converging to such task-conditioned fixed points Equilibrium Reasoners (EqR). This attractor view elucidates when and how to scale test-time compute. Empirically, improvements from scaling test-time compute are tightly coupled with convergence to attractors. By shaping a more favorable attractor landscape and leveraging stochasticity, EqR improves convergence and scales reliably at test time. Our models scale along two axes: depth by running more solver steps, and width by aggregating stochastic trajectories from multiple random initializations. As we scale test-time compute by 8192×, with max effective layers surpassing 300,000 layers when unrolled, reasoning accuracy rises from 8% to over 99% on Sudoku-Extreme. We hope our attractor perspective sheds light on scalable reasoning through test-time computation.

Abstract:
Recent advances in interpretability suggest that large language models (LLMs) implicitly encode signals in their generated text that enable selfrecognition of their outputs. We demonstrate that this capability is reliable, even in low-entropy scenarios, and that it can be amplified through targeted intervention. By steering the internal residual stream during generation with a random sparse vector, we create a detectable fingerprint that enables attribution of a given text to a specific LLM. This signal is recoverable from the activations of an LLM used as a detector, achieving over 98% accuracy across multiple detection settings while preserving the quality of generated text. As AI-generated content proliferates, this approach offers a practical alternative to traditional detectors, by leveraging the model's natural representation structure for attribution rather than embedding a signal externally. Our contributions include: (i) establishing reliable self-recognition capabilities in LLMs, (ii) a simple steering mechanism enabling multi-LLM identification with no quality degradation, (iii) demonstrating that activation spaces contain exploitable structure for encoding signals without semantic interference.

Abstract:
Priordata fitted networks (PFNs) have recently emerged as a powerful approach for Bayesian prediction tasks, approximating the posterior predictive distribution (PPD) through in-context learning. Despite their strong empirical performance and ability to go beyond point predictions, theoretical understandings of the algorithmic capability of transformers to learn distributions in context are still lacking. Focusing on Gaussian process regression problems, we show by construction that transformers can implement a gradient descent algorithm targeting the posterior predictive mean and variance, followed by nonlinear mappings that yield binned probabilities of PPD. We study the error bounds of the approximated PPD in terms of attention depth and bin resolution. Based on these results, we further demonstrate the key role of normalization and the choice of attention depth in enabling the extrapolation abilities of transformers beyond the pretraining sample size range. We conduct simulations that corroborate our findings, providing insight into the expressivity of PFNs targeting PPDs and how architectural choices may influence generalization capabilities.

Abstract:
Standard conformal prediction (CP) procedures are typically formulated in terms of pvalues, but reliance on p-values alone limits flexibility, for example, when combining dependent evidence across models or data splits. Recent work has explored e-value formulations for conformal inference, yet a direct connection between p- and e-value formulations in CP has been missing, especially regarding their statistical efficiency. We first identify limitations of classical p-to-e calibrators in the CP setting, showing that they are not set-preserving and can lead to overly conservative prediction sets. To address this, we propose a novel P2E calibrator that converts conformal p-values into e-values without altering the prediction set induced by the original conformal p-value. We establish both theoretically and empirically that this calibrator yields substantial efficiency gains over existing p-to-e methods. This e-value formulation enables principled use of recent advances in e-value merging and randomization to improve conformal inference. We demonstrate its impact in two applications: cross-conformal prediction (CCP), whose variants typically provide only approximate 1-2\alpha coverage, and conformal aggregation (CA). In both cases, our e-value-based methods achieve exact 1-\alpha coverage while improving efficiency over standard baselines. More broadly, our approach expands the flexibility of CP and opens new directions for efficient, distribution-free uncertainty quantification.

Abstract:
Estimating the generalization gap and developing optimization methods that improve generalization are crucial for deep learning models, for both theoretical understanding and practical applications. Leveraging unlabeled data for these purposes offers significant advantages in realworld scenarios. This paper introduces a novel generalization measure, local inconsistency, derived from an information-geometric perspective on the parameter space of neural networks. A key feature of local inconsistency is that it can be computed without explicit labels. We establish theoretical underpinnings by connecting local inconsistency to the Fisher information matrix and the loss Hessian. Empirically, we demonstrate that local inconsistency correlates with the generalization gap. Based on these findings, we propose Inconsistency-Aware Minimization (IAM), which incorporates local inconsistency into the training objective. We demonstrate that in standard supervised learning settings, IAM enhances generalization, achieving performance comparable to that of existing methods such as Sharpness-Aware Minimization. Furthermore, IAM exhibits efficacy in semi- and self-supervised learning scenarios, where the local inconsistency is computed from unlabeled data.

Abstract:
Graph foundation models (GFMs) seek transferable representations across graph domains but are limited by structural heterogeneity and incompatible node feature spaces. We propose StructureCentric Graph Foundation Models (SCGFM), which treat graph topology as the primary source of transferable knowledge. Modeling graphs as metric measure spaces, SCGFM introduces learnable geometric bases that define a shared structural coordinate system. Graphs are aligned to these bases via Gromov–Wasserstein distances, yielding structure-aligned latent representations that accommodate heterogeneous graph topologies. To address feature incompatibility, SCGFM employs a structure-aware feature re-encoding mechanism that unifies node representations without assuming a fixed feature dimensionality or requiring dataset-specific preprocessing. Experiments on graph- and node-level tasks demonstrate strong in-domain and cross-domain generalization, outperforming existing GFM approaches.

Abstract:
In the graph label selection problem, one is given an nvertex graph and a budget k, and seeks to select k vertices whose labels enable accurate prediction of the labels on the remaining vertices. This problem formalizes distilling a small representative set from the whole graph. We present the first \tildeO(\log^1.5 n)-approximation algorithm for graph label selection under the standard budget constraint. Prior work either relies on resource augmentation, allowing substantially more than k labeled vertices, or consists primarily of heuristics without provable guarantees. Finally, we demonstrate that practical heuristic variants of our algorithm scale to significantly larger graphs than previous methods, while essentially retaining their quality.

Abstract:
\em Reinforcement learning (RL) for \em reachability specifications is fundamental in sequential decisionmaking, yet theoretical guarantees remain less explored. A recent work achieves \em asymptotic convergence to optimal policies. However, this approach provides limited insight into convergence dynamics. In this work, we present an alternative approach that provides deeper theoretical insights into convergence. Our approach builds on \em PAC learning with assumptions. PAC learning guarantees near-optimal policies with high confidence in finite time but requires knowing internal MDP parameters like minimum transition probability. We argue that while these parameters are unknown in RL, they can be iteratively refined and estimated with increasing accuracy. By iteratively satisfying PAC conditions, we show that exact optimality can be achieved in the limit. Empirical evaluations on standard benchmarks validate our theoretical insights into convergence dynamics.

Abstract:
Electrocardiography (ECG) is a cornerstone of cardiac assessment, making the learning of informative ECG representations fundamental to tasks ranging from disease diagnosis to clinical report generation. However, existing methods operate almost exclusively in the observable ECG signal space. In practice, the standard twelvelead ECG represents multiple projections of the same underlying cardiac electrical activity from different spatial orientations. Therefore, representation learning in the ECG space inevitably introduces substantial redundancy, which may lead to spurious correlations and increased risk of overfitting. To address this and motivated by the Frank vectorcardiogram (VCG) model, we propose learning a unified latent representation of cardiac electrical activity directly in the VCG space. We introduce LVCG, the first general self-supervised representation learning framework designed to operate in this physically grounded latent space. By learning view-invariant latent VCG representations rather than lead-specific artifacts, VCG minimizes redundancy and improves generalization. LVCG generally outperforms ECG-space baselines across tasks, demonstrating enhanced robustness and generalization, especially in domain shift settings. Our code has been made available at https://anonymous.4open.science/r/LVCG-A0C4

Abstract:
Multimodal reasoning often relies on long chains of intermediate textual and visual thoughts, where accumulating visual tokens and dense crossmodal attention incur substantial computation and memory overhead. To address this challenge, we propose Spectral-Progressive Thought Flow (SpecFlow), a novel lightweight multimodal reasoning framework that represents intermediate visual thoughts in a fixed-size discrete cosine space. By exploiting strong energy compaction, SpecFlow preserves global layout and relational structure while introducing high-frequency details only when increased spatial precision is required. To align visual state evolution with linguistic intent, classifier-free guidance enables autoregressive textual thoughts to steer flow-based updates of the visual workspace without expanding the context. As a result,SpecFlow maintains a bounded visual workspace whose updates depend only on the current visual state and accumulated textual trace, enabling long-horizon inference with stable latency and memory usage independent of reasoning depth. Empirical results show that SpecFlow achieves competitive or superior reasoning performance while reducing computation and memory costs by up to 2.1×.

Abstract:
Conformal prediction offers finitesample coverage guarantees under minimal assumptions. However, existing methods treat the entire modeling process as a black box, overlooking opportunities to exploit and understand modular structure. We introduce a conformal prediction framework for two-stage sequential models, where an upstream predictor generates intermediate representations for a downstream model. By decomposing the overall prediction residual into stage-specific components, our method enables practitioners to attribute uncertainty to specific pipeline stages. We develop a risk-controlled parameter selection procedure using family-wise error rate (FWER) control to calibrate stage-wise scaling parameters, and introduce an adaptive extension for non-stationary settings. Experiments on synthetic distribution shifts, as well as real-world supply chain and stock market data, demonstrate that our approach improves coverage under structural, stage-wise shifts compared to standard conformal methods, while identifying stage-wise error contribution. This framework offers diagnostic advantages and robust coverage that standard conformal methods lack.

Abstract:
While RAG systems are increasingly deployed in dynamic web search, temporal volatility amplifies their vulnerability to adversarial attacks. Existing staticoriented defenses struggle to handle evolving threats and incur prohibitive storage costs in dynamic settings. We propose RADAR, a framework that models reliable context selection as a graph-based energy minimization problem, solved exactly via Max-Flow Min-Cut. By incorporating a Bayesian memory node, RADAR recursively updates a belief state instead of archiving raw historical documents, effectively balancing stability against attacks with adaptability to genuine knowledge shifts. Experiments on a novel dynamic dataset show that RADAR achieves superior robustness and response quality with minimal storage overhead compared to the baselines.

Abstract:
Large language models (LLMs) are trained on vast datasets that may contain sensitive information. Differential privacy (DP), the de facto standard for formal privacy guarantees, provides a principled framework for training LLMs with provable privacy protection. However, stateof-the-art DP training implementations rely on fast gradient clipping techniques with memory overhead O(B\min(T^2, d^2)), where B is the batch size, T is the sequence length, and d is the layer width. This becomes prohibitive as both model width and context length grow. We propose DP-SGD-RC, a novel variant of DP-SGD with randomized clipping that reduces memory and compute overhead. DP-SGD-RC leverages stochastic trace estimation methods, specifically Hutchinson's estimator and its improved variant, Hutch^++, to reduce the memory footprint of per-sample gradient norm estimation. We provide a tight privacy analysis showing that DP-SGD-RC achieves noise multipliers competitive with deterministic clipping. Experiments fine-tuning Llama 3.2 1B on long-context benchmarks spanning classification, question answering, and summarization tasks demonstrate that DP-SGD-RC matches baseline utility while significantly reducing memory and compute.

Abstract:
Reinforcement learning often produces highfrequency oscillatory control signals that undermine the safety and stability required for physical deployment. Explicit action chunking addresses this by predicting fixed-horizon trajectories but increases the policy output dimension to R^hd, leading to optimization difficulties and incompatibility with standard step-wise interaction. To overcome these challenges, this paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Unlike explicit methods, DWS enforces temporal coherence without expanding the action space. It uses a dual-window design: an execution window that ensures physical smoothness through deterministic modulation, and a value window that aligns temporal-difference targets over the horizon to correct critic bias caused by open-loop execution. DWS also includes a lightweight actor-side temporal regularizer based on first-order action differences to promote global continuity. This design effectively bridges the gap between temporal abstraction and reactive step-wise control. Experiments on benchmarks including the DeepMind Control Suite and industrial energy management tasks with vector states show that DWS outperforms state-of-the-art (SOTA) baselines such as LipsNet++ and SmODE. In complex vision-based autonomous driving tasks, DWS achieves smoother control, safer behavior with reduced jitter, and attains a 100% success rate.

Abstract:
We propose Parameterized Diffusion Policy (PDP), a framework that learns a diffusion policy parameterized in a smooth continuous space. By structuring a latent manifold such that distances between latents' values reflect the semantic similarity of physical trajectories, we transform diffusion from a mechanism of stochastic diversity into a precise tool for behavior steering. Our approach also enables smooth interpolation between known strategies and efficient generalization to novel constraints without the need to update policy weights. We demonstrate that PDP significantly improves adaptation performance on complex multimodal benchmarks in both simulation and realrobot hardware compared to regular diffusion policy, particularly in scenarios requiring the discovery of novel behaviors.

Abstract:
Aligning large language models (LLMs) to heterogeneous and rapidly evolving safety requirements remains a critical challenge. Existing instructiontuned LLMs and standalone safety classifiers often fail to generalize to new safety configurations, motivating the need for Reward Models (RMs) that are explicitly configurable to changing specifications. We introduce the Configurable Safety Reward Model (CSRM), which is jointly optimized for calibrated safety compliance and reward modeling. Our approach is supported by configuration-targeted data augmentation that enforces instruction adherence while preserving relative severity structure. The resulting RM is sensitive to fine-grained safety configurations and conversational nuances, substantially improving generalization to previously unseen safety configurations. CSRM achieves state-of-the-art performance on recent configurable safety benchmarks, including CoSApien (94.6% F1) and DynaBench (75.8% F1), without requiring additional human annotation. When used for downstream safety alignment, CSRM yields LLMs with a significantly improved helpfulness–safety tradeoff compared to existing baselines.

Abstract:
Gradientbased sufficient dimension reduction methods face a fundamental tradeoff: computing local gradients in the ambient space yields closed-form solutions but suffers from the curse of dimensionality, while iterative refinement in the projected space improves statistical efficiency at O(n^2 p) cost per iteration. We show that minimizers of the Minimum Average Variance Estimation (MAVE) criterion recover the span of the regression function's gradients---the same target as the Outer Product of Gradients method---but through local regression in the projected space. We then reformulate MAVE as a Riemannian maximization problem on the Stiefel manifold and derive a closed-form gradient, enabling efficient stochastic optimization. The resulting algorithm, SMAVE, combines mini-batch Riemannian gradient ascent with adaptive k-nearest neighbor localization that evolves with the subspace estimate. On synthetic benchmarks, SMAVE matches or exceeds the accuracy of existing methods while running 10--50× faster; on real regression tasks, these gains translate to improved prediction with speedups exceeding three orders of magnitude.

Abstract:
Scorebased diffusion models demonstrate superior performance in generative tasks but encounter fundamental bottlenecks in inverse problems due to the analytical intractability of the time-dependent likelihood score. To bridge this gap, we propose a novel proximal-based generative modeling (PGM) framework that rigorously circumvents explicit likelihood evaluation. Our framework is built upon a theoretical equivalence between Gaussian convolution in diffusion processes and Moreau-Yosida regularization in nonsmooth optimization. This enables a new sampling mechanism driven by the proposed Moreau score, which admits a closed-form expression via proximal operators. Moreover, we introduce Moreau score matching to learn the proximal operators that rely solely on samples drawn from the prior distribution. Theoretically, PGM eliminates the early-stopping bias inherent in the score-based diffusion model and achieves non-asymptotic convergence. Experiments demonstrate that PGM significantly surpasses state-of-the-art methods in reconstruction quality and sampling time.

Abstract:
Online Continual Learning (OCL) aims to learn from endless non\textstationary data streams, yet most existing methods assume a flat label space and overlook the hierarchical organization of real\text-world concepts that evolves both horizontally (sibling classes) and vertically (coarse or fine categories). To better reflect this context, we introduce a new problem setting, DHOCL (Online Continual Learning from Dynamic Hierarchies), where taxonomies evolve across granularities and each sample provides supervision at a single hierarchical level. In this setting, we find two fundamental issues: (i) partial supervision under mixed granularities provides only point-wise signals over an evolving path-wise hierarchy, which constrains plasticity and undermines cross-level semantic consistency, and (ii) the dynamically evolving hierarchies induce granularity-dependent interference, destabilizing popular replay and regularization mechanisms and thereby exacerbating catastrophic forgetting. To tackle these issues, we propose HALO (Hierarchical Adaptive Learning with Organized Prototypes), which adaptively combines complementary classification heads, regularized by organized learnable hierarchical prototypes, enabling rapid adaptation, hierarchical consistency, and structured knowledge consolidation as the taxonomy evolves. Extensive experiments on multiple benchmarks demonstrate that HALO consistently outperforms existing methods across hierarchical accuracy, mistake severity, and continual performance.

Abstract:
Inferring continuous probability paths from sparse snapshots is a fundamental challenge in domains like singlecell biology, where high-fidelity data acquisition is often destructive and constrained by prohibitive sequencing costs. This motivates the need for active learning strategies to strategically select optimal measurement times. However, designing active learning policies for this setting remains an open problem: the target objects reside on the infinite dimensional Wasserstein space where standard Euclidean metrics are ill-defined, and current interpolation methods lack epistemic uncertainty quantification. We introduce a framework which extends active experimentation to the space of measures. By leveraging Linearized Optimal Transport (LOT), we map distributional snapshots into a tangent space amenable to Gaussian Process modeling, allowing us to construct a tractable probabilistic surrogate for the underlying probability path. This yields a geometric acquisition function that iteratively selects measurement times to minimize uncertainty. Empirical results demonstrate that our strategy outperforms uncertainty-agnostic baselines on both synthetic and real-world datasets.

Abstract:
Nonlinear conservation laws are at the heart of many of the most important dynamical systems in science and engineering. In practical applications, such systems are often subject to various sources of uncertainty, e.g. due to sparse or noisy measurements. Inferring physical quantities and fields of interest then becomes an illposed problem which both classical numerical methods and modern deep learning-based methods struggle to treat appropriately. Recent work has framed classical numerical methods as Bayesian inference under Gaussian process priors, resulting in a physics-aware treatment of uncertainties. Following this line of work, we develop a novel numerically conservative method for uncertainty-aware simulations of nonlinear conservation laws. Our method uses recent sparse approximation techniques to scale up to large-scale forward and inverse problems. For forward simulation, we match the accuracy of classical solvers while providing structurally meaningful uncertainty. On inverse problems, we recover posteriors over nonparametric source fields in seconds --- outperforming neural baselines that take minutes to produce a less accurate point estimate.

Abstract:
Realworld multimodal systems must be robust against low-quality data, such as sensor noise, incomplete multimodal data and conflicting inputs. However, existing trustworthy fusion methods rely on the model's own prediction confidence to judge data quality. This creates a circular dependency: when a model is confident but wrong (overconfident), these methods fail to detect the error. To break this loop, we propose Geometry-based Multimodal Fusion (GMF). Instead of relying on predictions, we evaluate reliability by measuring the physical effort required to map input data back to the valid data manifold. We implement this using Diffusion Schrödinger Bridges with Rectified Flow, which allows us to calculate Transport Energy as a direct metric for quality. The logic is simple: valid data sits on the manifold (low energy), while noisy, incomplete data or conflicting data requires high energy to be restored. This geometric metric acts as an independent judge, effectively flagging unreliable inputs even when the classifier is fooled. Extensive experiments demonstrate that GMF significantly improves robustness against severe sensor noise and semantic conflicts compared to confidence-based baselines.

Abstract:
Reconstructing spatiotemporal fields from partial observations is fundamental to scientific inference, from inferring atmospheric states from satellite data to recovering fluid states from imaging. When observations are incomplete, the inverse problem is fundamentally illposed: even when the underlying PDE dynamics are Markovian in the full state, partial observation operators induce a non-Markovian posterior that cannot be resolved from a single timestep. We propose a history-bootstrapped autoregressive flow matching (HB-ARFM) for spatiotemporal inverse reconstruction under partial observability. Observation history bootstraps the initial reconstruction via conditional flow matching, resolving ambiguities. The same conditional transport model is then applied autoregressively, conditioning on both new observations and past predictions to propagate the reconstruction forward in time. We evaluate the method on boiling dynamics reconstruction, recovering full velocity and temperature fields from interface geometry and motion. Across two inverse tasks with varying observation sparsity, HB-ARFM produces physically and temporally valid reconstructions where other models fail.

Abstract:
Graph coarsening reduces the size of a graph while preserving certain properties. Most existing methods preserve either spectral or spatial characteristics. Recent research has shown that preserving topological features helps maintain the predictive performance of graph neural networks (GNNs) trained on the coarsened graph but suffers from exponential time complexity. To address these problems, we propose a Scalable TopologyPreserving Graph Coarsening (STPGC) by introducing the concepts of graph strong collapse and graph edge collapse extended from algebraic topology. STPGC comprises three new algorithms,GStrongCollapse,GEdgeCollapse, andNeighborhoodConingbased on these two concepts, which eliminate dominated nodes and edges while rigorously preserving topological features. We further prove that STPGC preserves the GNN receptive field and develop approximate algorithms to accelerate GNN training. Experiments in node classification of GNNs demonstrate the efficiency and effectiveness of STPGC.

Abstract:
Inferring dynamics from population snapshots is a fundamental challenge in machine learning and biology. In scRNAsequencing (scRNA-seq), destructive measurements preclude direct tracking of individual cells across time, making trajectory inference underdetermined. Optimal Transport (OT) provides a principled framework for snapshot alignment, but a long-standing modeling question is which cost functions yield biologically meaningful couplings. Standard OT approaches rely on gene-expression distances, implicitly treating cells as independent points and neglecting structured cell-cell communication mediated by ligand--receptor signaling. We introduce \textttCellBRIDGE (Cell-Based Regularized Interaction-Driven Gene Expression), which augments feature-based OT with a directed, typed interaction cost derived from ligand-receptor activity. By explicitly modeling cell--cell communication, \textttCellBRIDGE improves cross-snapshot couplings and downstream trajectory estimates across synthetic and real scRNA-seq datasets relative to feature-only baselines. Notably, \textttCellBRIDGE enables mechanistically interpretable in silico perturbations: on lung cancer data, silencing specific ligand-receptor pairs induces trajectory shifts that recapitulate expected effects of targeted pathway inhibition.

Abstract:
Driven by the pressing demand for graph anomaly detection (GAD) in highstakes domains, the generalist GAD paradigm, which trains a single detector transferable across new graphs, has recently gained growing attention. However, existing methods often rely on scarce and costly annotations for training and sometimes even require few-shot support at inference, which limits their robustness to diverse and unseen anomaly patterns. To address this limitation, we introduce ProMoS, the first unsupervised generalist GAD framework, which detects anomalies by modeling the abundant normality in unlabeled data. ProMoS adopts a knowledge-distillation paradigm to distill normality priors from a frozen self-supervised graph neural network (GNN) teacher to a mixture-of-students model with shared global and lightweight personalized branches, enabling efficient and expressive normality modeling without learning from scratch. We further propose prototype-guided soft-label distillation to align teacher and student in a shared prototype space, enhancing cross-graph generalizability. During inference, ProMoS performs zero-shot anomaly detection on unseen graphs via distillation bias and prototype geometric deviation. Extensive experiments show the effectiveness and efficiency of ProMoS, charting a practical path toward label-free, zero-shot generalist GAD.

Abstract:
In AI for Science, physicsinformed losses are becoming popular to train learned compressors, but their rate-distortion consequences are poorly understood. We formalise this problem via a geometric framework, showing that physics-aware compression is governed by the interaction of two Riemannian structures in latent space: a Hessian-based physics sensitivity geometry induced by the physical observable, and a rate geometry induced by the entropy model. This theoretical view yields an explicit mechanism for error allocation: the codec concentrates precision along spectrally stiff and rate-expensive directions, while pushing uncertainty into directions that are weakly sensed by the physical observable. We prove fundamental limits from this alignment: (i) rate-efficient preservation is theoretically possible only when physical sensitivity is strongly anisotropic, and (ii) when physics and fidelity are not spectrally aligned, improving physical observables at fixed rate provably worsens standard distortion. We validate these predictions across chaotic fluid dynamics simulations, and introduce simple geometric alignment diagnostics that anticipate when physics-aligned training will succeed.

Abstract:
Experimental design has emerged as a powerful approach for improving the sample efficiency of A/B testing, yet existing designs rely critically on correctly specified models. We study robust sequential experimental design under model misspecification and develop a unified framework that covers both contextual bandit and dynamic settings. Theoretically, we prove that our design bounds the worstcase mean squared error of the estimated treatment effect. Empirically, we demonstrate the effectiveness of the proposed approach using synthetic and real-world datasets from a leading technology company.

Abstract:
Distributional RL models the full return distribution, but common categorical/quantile approaches rely on projection and independently sampled Bellman targets, which ignore the Bellman operator’s affine transport structure and yield highvariance learning signals. We introduce Path-Coupled Bellman Flows, a flow-matching framework that shares base noise to couple the generative trajectories of consecutive states, inducing a geometric Bellman scaling law between their velocity fields. This geometry motivates a \lambda-family of Bellman-flow objectives that functions as a control variate, reducing variance while retaining the same Bellman-consistent fixed point. Across toy diagnostics and offline RL benchmarks (OGBench, D4RL), our method improves training stability and achieves competitive or improved performance relative to prior distributional baselines.

Abstract:
Lowdimensional structures appear ubiquitously in the eigenspectra of deep-learning matrices in classification networks trained in the overparameterized regime. While theoretical advances have aimed to explain this phenomenology, they typically succeed only in capturing subsets of the full behavior or rely on assumptions that cannot hold in practice. In this work, we provide an analytic explanation for the bulk–outlier structure of several canonical deep-learning matrices, including the Hessian, gradients, and weights. We achieve this using unconstrained feature models (UFMs), a now-common tool for studying the emergence of deep neural collapse (DNC). We show that DNC is the source of these low-dimensional eigenspectra: in each case, the eigenvalues and eigenvectors can be constructed from feature means, the characterizing objects of DNC. This provides a unifying analytic explanation for a wide range of spectral phenomena in deep learning and goes beyond empirical characterizations—which typically focus on eigenvalues—by providing a detailed analysis of eigenvectors. We prove that our results hold for both linear and ReLU networks and provide numerical validation in both the modeling context and standard deep-network architectures on canonical datasets.

Abstract:
Transformerbased large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve efficiency by adopting bounded-context or segment-level execution only during inference, while continuing to train models under full-context attention, resulting in a mismatch between training and inference execution and state-transition semantics. Based on this insight, we propose a training-consistent segment-level generation framework, in which training and inference follow the same segment-level forward execution semantics. During training, consistency with inference is enforced by restricting gradient propagation to KV states carried over from the immediately preceding segment, while permitting head-specific access to past KV states during the forward pass without involving them in gradient propagation. Across long-context benchmarks, our approach achieves performance comparable to full-context attention, while achieving competitive latency--memory trade-offs against strong inference-efficient baselines, and substantially improving scalability at very long context lengths (e.g., approximately 6× lower peak prefill memory at 128K compared to full-context attention with FlashAttention).

Abstract:
We propose a data augmentation method for offline reinforcement learning, motivated by active positioning problems. Particularly, our approach enables the training of offpolicy models from a limited number of suboptimal trajectories. We introduce a trajectory-based augmentation technique that exploits task structure and the geometric relationship between rewards, value functions, and mathematical properties of logging policies. During data collection, our augmentation supports suboptimal logging policies, leading to higher data quality and improved offline reinforcement learning performance. We provide theoretical justification for these strategies and validate them empirically across positioning tasks of varying dimensionality and under partial observability.

Abstract:
We study the correlation clustering problem in the nodearrival data stream model. Unlike previous work, where the stream consists of the graph's edges, we focus on the setting in which the stream contains only the nodes. This model better reflects many real-world scenarios in which the data stream naturally consists of raw objects (e.g., images, tweets, or websites), and the similar/dissimilar edges are derived through a similarity function. We present C^4Approx, the first streaming algorithm that approximates the cost of correlation clustering using sublinear space in the number of nodes and a constant number of passes. We further complement this result with lower bounds. Experiments on real-world datasets show that by storing only 2% of the nodes, our algorithm achieves performance comparable to the classic Pivot algorithm and the more recent PrunedPivot algorithm.

Abstract:
Causal discovery in time series is increasingly performed using nonlinear machinelearning models, yet the resulting causal relationships are almost always summarized by scalar edge scores. We argue that this practice obscures the true object learned by nonlinear autoregressive models: function-valued causal influence. In such models, each directed relationship corresponds not to a single weight or coefficient, but to a state-dependent function whose effect varies across regimes, magnitudes, and contexts of the system. In this paper, we formalize function-valued causal influence in nonlinear multivariate time series and show that common scalar summaries, such as aggregated contribution magnitudes, constitute severe information bottlenecks. Using Neural Additive Vector Autoregression as a representative architecture, we demonstrate that edges with indistinguishable scalar causal scores can exhibit qualitatively different functional behaviors, including monotonic, thresholded, saturating, and sign-changing effects. These differences explain persistent discrepancies between causal score magnitude, interpretability, and predictive relevance that cannot be resolved by significance testing alone. We present a general framework for extracting and visualizing causal response functions from neural autoregressive models using learned contribution tensors and local attribution methods. Through controlled synthetic systems and an applied case study of democratic development, we show how function-valued analysis reveals regime-specific and asymmetric causal structure that is systematically missed by coefficient-centric or score-centric approaches. Our results suggest that meaningful interpretation of nonlinear causal time-series models requires moving beyond scalar causal scores toward explicit analysis of causal response functions. This reframing clarifies the representational content of modern causal discovery methods and provides a foundation for more faithful interpretation of complex dynamical systems.

Abstract:
We study bandit learning in matching markets, where players and arms constitute the two market sides, and the players' utilities are linear in the arm contexts. In each round, new arms arrive with observable contexts. Then, the algorithm matches them to players, aiming to minimize each player's regret against astable matching benchmark. This contextual structure creates significant complexity: subtle context shifts can slightly alter one player's utility while completely reconfiguring the underlying benchmark, causing large regret spikes for others. We address this in two settings:stochasticcontexts, drawn from a latent distribution, andadversarialcontexts, which may be arbitrary. In the stochastic setting, we introduce a novel minimum preference gap to characterize learning difficulty; in the adversarial setting, we propose a tractable regret notion that remains valid under arbitrary contexts. We develop fully adaptive algorithms for both settings, establishing instancedependent poly-logarithmic regret upper bounds. In the stochastic case, we also prove matching instance-independent regret upper and lower bounds under a mild assumption on the context distribution.

Abstract:
The robustness of synchronization is a central theme of the study of dynamical systems on networks. Typically one attempts to define a single stability index that characterizes the robustness of individual nodes to a class of perturbations. The dependence of a stability index on topology and system parameters can then be studied using network science or GNNs. Here we propose a novel upstream task, Stability Landscapes, that allows deriving many downstream stability indices. To support this task, we release two computationally intensive datasets of 10,000 graphs each at 20 and 100 nodes with pernode landscape labels. The dynamics are given by a conceptual oscillator model that captures aspects of the synchronization behavior of power grids. A compact graph neural network with a CNN decoder predicts these landscapes with about 85% SSIM in distribution and 67% under a 20 to 100 size shift, and 65%-73% SSIM when going from the 100 node ensemble to realistic power grid topologies with 100-400 nodes. This demonstrates that while basin landscapes are not suitable for study with conventional methods of network science, they are amenable to machine learning methods. This suggests that there is considerable potential in the study of complex networked systems across biology, neuroscience, and power grids, to move beyond scalar stability indices.

Abstract:
LLMpowered coding agents are reshaping the development paradigm. However, existing evaluation systems, neither traditional tests for humans nor benchmarks for LLMs, fail to capture this shift, excluding problems that require both human reasoning to guide solutions and AI efficiency for implementation. We introduce CentaurEval, a unified, ecologically valid benchmark for measuring human-in-the-loop value in coding. CentaurEval's core innovation is its "Collaboration-Necessary" problem templates, which are intractable for standalone LLMs or humans, but solvable through effective collaboration. CentaurEval dynamically instantiates tasks from 45 templates, providing a standardized IDE for humans and a reproducible 450-task toolkit for LLMs. We benchmark 45 participants against 5 LLMs under 4 levels of human intervention. Results show that while LLMs or humans alone achieve poor pass rates (0.67% and 18.89%), human–AI collaboration significantly improves to 31.11%. Our analysis reveals an emerging co-reasoning partnership, challenging the traditional human-tool hierarchy by showing that strategic breakthroughs can originate from either humans or AI. Our work is openly accessible.

Abstract:
AI model hubs provide access to a rapidly growing collection of powerful pretrained models, enabling off-the-shelf mixture-of-experts systems with different routing strategies. However, this rapid growth poses two fundamental challenges: scaling model selection across thousands of experts and continually updating routing mechanisms as new models and tasks are introduced. In this paper, we formalise this setting as Continual Model Routing (CMR) and proposeCMRBench, a new large-scale benchmark simulating realistic hub expansion and including over 2,000 candidate models. Finally, we introduceCARvE, a contrastive embedding approach for efficient continual model routing via domain-stratified coreset replay and checkpoint-based anchoring. Extensive empirical results and ablations show that CARvE significantly outperforms zero-shot retrieval, fine-tuning, and adapter-merging baselines in model, family, and domain-level accuracy.

Abstract:
Synthetic data becomes crucial for large language model training, but its effectiveness is highly inconsistent. Here we provide an informationtheoretic understanding of such inconsistency. Synthetic data is effective only when the generation-training loop is information-open: it has continuous information injections by external signals (verifiers, environments, or rubrics) that supply task-relevant signal not implied by the model's current probability distribution. When the loop is information-closed and relies mainly on self-generated samples, repeated processing tends to degrade performance. Based on this criterion, we further study factors that influence the efficiency of information injections by different synthetic methods. We argue that information-efficient methods often lead to strong generalization, which relies on simple, unified signals that ignore nuisance variation across examples, rather than adding more instance-level labels. Our work thus provides an operational guide for designing and understanding synthetic data methods, echoing Sutton's "bitter lesson" -- that compute-scalable, general methods ultimately beat human-built structure in the long run.

Abstract:
Deep networks often exhibit a preference for "simple" solutions, and such a simplicity bias is widely believed to play a key role in generalization. Yet a broadly applicable, quantitative measure of simplicity remains elusive. We introducepolynomial representationsas a distributionaware, low-dimensional surrogate for neural functions: we approximate a network’s predictive behavior along data-dependent interpolation paths using orthogonal polynomial bases, yielding a compact functional representation. We show that theeffective degreeof this representation serves as a practical simplicity metric that is predictive of generalization across tasks and architectures, and consistently outperforms existing generalization proxies such as sharpness. Finally, polynomial representations naturally yield adifferentiablesimplicity regularizer, which consistently improves generalization in image and text classification, fine-tuning contrastive vision–language models, and reinforcement learning.

Abstract:
Emergent misalignment can arise when a language model is finetuned on a narrowly scoped supervised objective: the model learns the target behavior, yet also develops undesirable out-of-domain behaviors. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior and then discouraging the model from strengthening these features during fine-tuning. Across six fine-tuning domains, blocking (i.e., constraining) a fixed set of features achieves up to 97% relative reduction in emergent misalignment with no degradation in target-task performance. We strengthen validity with disjoint selection/evaluation splits, multiple independent judges, multiple random seeds for key settings, quality metrics, and extensive ablations demonstrating that the reduction in misalignment is specific to the identified mechanism. We also characterize a limiting regime in which misalignment re-emerges under prolonged fine-tuning, present evidence consistent with rerouting through alternative features or layers, and evaluate modifications that partially restore the misalignment-blocking effect. Overall, our results show that targeted training-time constraints on internal mechanisms can mitigate emergent misalignment without degrading target-task performance.

Abstract:
Recent layoutto-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships. As a result, they often produce entangled textures or physically inconsistent layering in the overlapped areas. To address this issue, we first constructSA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building upon our proposed dataset, we introduceOcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.

Abstract:
We introduce SOMA, the Spatial Memory framework for Outof-Vision Manipulation in Vision-Language-Action (VLA) models. Most existing VLAs implicitly assume that task-relevant objects are always visible, leading to brittle and reactive behaviors when targets fall outside the camera’s field of view. SOMA addresses this limitation by equipping VLAs with a persistent spatial memory constructed from multi-view observations acquired via a movable head camera, enabling reasoning beyond the current visual frustum. The framework consists of three components: Spatial Memory Construction, which aggregates angular-wise observations into a unified spatial–semantic representation through scanning; Dynamic Memory Refinement, which maintains global consistency over time; and Contextual Memory Retrieval, which activates instruction-relevant spatial cues during manipulation. We evaluate SOMA on five challenging real-world out-of-vision manipulation tasks, including multi-step and dual-arm scenarios where target objects are initially invisible. Experimental results show that SOMA not only improves task success rates, but also induces qualitatively different manipulation behaviors, with faster target localization, reduced viewpoint search, and near one-shot grasping under partial observability. Additional experiments on RoboCasa GR1 and SimplerEnv further validate the effectiveness of SOMA’s memory design under conventional fully observable settings.

Abstract:
Recent advances in reinforcement learning (RL) have achieved great successes by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on the weightedbased policy optimization. This design enables better exploration capability of diffusion model, particularly at the beginning of training, but suffer from the low exploitation in Q-value information, resulting in a slow policy convergence. Another branch pay attention to gradient-based policy optimization, which sufficiently exploit the gradient of Q function yet tend to collapse into a unimodal policy with low diversity. To address this issue, we propose CGPO, Critic-Guided diffusion Policy Optimization, which effectively balances exploration and exploitation with the training-free guidance technique integrated into the denoising process of diffusion policy. Concretely, CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression objectives. In this manner, CGPO reduces the time required to obtain high-quality actions and improves final performance with better balance between exploration-exploitation tradeoff. We validate the effectiveness of CGPO on 5 MuJoCo locomotion tasks, and CGPO achieves state-of-the-art performance compared with existing diffusion-based RL methods. Notably, CGPO is the first success to incorporate diffusion policy into real-world RL, with its superior performance on Franka robot arm grasping tasks.

Abstract:
Bayesian experimental design (BED) is a principled framework for dataefficient design of sequential experiments. However, existing BED methods are unable to adapt to dynamic constraints inherent in real-world tasks due to budget limitations, varying costs, or physical constraints that restrict how designs evolve over time. In this paper, we introduce a novel approach to BED that enables constrained optimization of experimental designs by combining offline pre-training of an amortized policy and a posterior network with online multi-step lookahead planning using scenario trees. We empirically demonstrate that our method yields substantially more informative design sequences than existing methods across a range of constrained BED tasks, while incurring only a modest additional computational overhead.

Abstract:
Subgoalbased policy tree search, which uses a policy to guide search, is effective for complex single-agent deterministic problems but often relies on explicit subgoal generation that can incur substantial overhead and hinders scalability. In this paper, we overcome these limitations by using a learned ``rerooter'' through the recently-introduced \sqrt\textLTS algorithm. A rerooter implicitly decomposes the problem into soft subtasks. While previous work focused on the formal guarantees for given or handcrafted rerooters, in this work we propose three rerooter designs: (i) a clustering-based rerooter that exploits global state-space structure, (ii) a heuristic-based rerooter that leverages learned cost-to-go estimates, and (iii) a hybrid that combines both signals. Our framework avoids having to explicitly reconstruct and reason over generated subgoals, thereby enabling scalable allocation of search effort with significantly lower computational overhead. Empirically, our rerooting-based methods scale to complex environments where subgoal-based policy tree search fails, and achieve state-of-the-art online training efficiency on the domains tested.

Abstract:
Diffusion Policy has dominated action generation due to its strong capabilities for modeling multimodal action distributions, but its multi-step denoising processes make it impractical for real-time visuomotor control. Existing caching-based acceleration methods typically rely on static schedules that fail to adapt to the dynamics of robot-environment interactions, thereby leading to suboptimal performance. In this paper, we propose \underlineSparse \underlineAction\underlineGen (SAG) for extremely sparse action generation. To accommodate the iterative interactions, SAG customizes a rollout-adaptive prune-then-reuse mechanism that first identifies prunable computations globally and then reuses cached activations to substitute them during action diffusion. To capture the rollout dynamics, SAG parameterizes an observation-conditioned diffusion pruner for environment-aware adaptation and instantiates it with a highly parameter- and inference-efficient design for real-time prediction. Furthermore, SAG introduces a one-for-all reusing strategy that reuses activations across both timesteps and blocks in a zig-zag manner, minimizing the global redundancy. Extensive experiments on multiple robotic benchmarks demonstrate that SAG achieves up to 4× generation speedup without sacrificing performance. Project Page: https://sparse-actiongen.github.io/.

Abstract:
Most existing illuminationediting methods struggle to jointly offer customized lighting control and preserve content integrity, limiting their effectiveness especially in transferring complex light effects from a reference to a target image in portrait photography. To address this problem, we propose TransLight, a novel framework that enables high-fidelity and high-freedom transfer of light effects. Extracting light effects from the reference image is the most critical and challenging step, as real-world lighting contains complex geometric structures tightly coupled with image content. To achieve this, we propose Generative Decoupling, using two fine-tuned diffusion models to accurately separate image content and lighting, and create a new million-scale dataset of image–content–light triplets. We then adopt IC-Light as the generative model, training it on these triplets with the reference lighting image as an additional conditioning signal. The resulting model enables customized and natural transfer of diverse light effects. Notably, by fully disentangling light effects from reference images, our generative decoupling strategy gives TransLight highly flexible illumination control. Experiments show that TransLight successfully transfers structured lighting across diverse images in portrait photography, offering more customized control than existing methods and charting new directions in illumination harmonization and editing.

Abstract:
Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on nondeceptive responses. In this paper, we identify the importance of the instruction pair used during training. Furthermore, we show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets. Our findings reveal that instruction pairs capture deceptive \emphintent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given the heterogeneity of deception types across datasets, we conclude that organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector.

Abstract:
Chaos arises in many complex dynamical systems, from weather to power grids, but is difficult to accurately model using datadriven emulators, including neural operator architectures. For chaotic systems, the inherent sensitivity to initial conditions makes exact long-term forecasts theoretically infeasible, meaning that traditional squared-error losses often fail when trained on noisy data. Recent work has focused on training emulators to match the statistical properties of chaotic attractors by introducing regularization based on handcrafted local features and summary statistics, as well as learned statistics extracted from a diverse dataset of trajectories. In this work, we propose a family of adversarial optimal transport objectives that jointly learn high-quality summary statistics and a physically consistent emulator, all from a single trajectory. We theoretically analyze and experimentally validate a Sinkhorn divergence formulation (2-Wasserstein) and a WGAN-style dual formulation (1-Wasserstein). Our experiments across a variety of chaotic systems, including systems with high-dimensional chaotic attractors, show that emulators trained with our approach exhibit significantly improved long-term statistical fidelity.

Abstract:
PredictionPowered Inference (PPI) is a popular strategy for combining gold-standard and possibly noisy pseudo-labels to perform statistical estimation. Prior work has shown an asymptotic \enquotefree lunch for PPI++, an adaptive form of PPI, showing that the asymptotic variance of PPI++ is always less than or equal to the variance obtained from using gold-standard labels alone. Notably, this result holds regardless of the quality of the pseudo-labels. In this work, we demystify this result by conducting an exact finite-sample analysis of the estimation error of PPI++ on the mean estimation problem. We give a \enquoteno free lunch result, characterizing the settings (and sample sizes) where PPI++ has provably worse estimation error than using gold-standard labels alone. Specifically, PPI++ will outperform if and only if the correlation between pseudo- and gold-standard is above a certain level that depends on the number of labeled samples (n). In some cases our results simplify considerably: For Gaussian data, for instance, the correlation must be at least 1/\sqrtn - 2 in order to see improvement. More broadly, by providing exact non-asymptotic expressions for the variance of PPI++ under sample splitting, we aim to empower practitioners to transparently reason about the benefits of PPI++ in specific applications. In experiments, we illustrate that our theoretical findings hold on real-world datasets.

Abstract:
Large Language Models usually put more emphasis on accuracy and therefore, will guess even when not certain about the prediction, which is especially severe when finetuned on small datasets due to the inherent tendency toward miscalibration. In this work, we introduce Bayesian-LoRA, which reformulates the deterministic LoRA update as a probabilistic low-rank representation inspired by Sparse Gaussian Processes. We identify a structural isomorphism (in the functional sense of shared bilinear form, not strict algebraic equivalence) between LoRA's factorization and Kronecker-factored SGP posteriors, and show that LoRA emerges as a limiting case when posterior uncertainty collapses. We conduct extensive experiments on various LLM architectures across commonsense reasoning benchmarks. With only approximately 0.42M additional parameters and \approx1.2× training cost relative to standard LoRA, Bayesian-LoRA significantly improves calibration across models up to 30B, achieving up to 84% ECE reduction and 76% NLL reduction while maintaining competitive accuracy for both in-distribution and out-of-distribution (OoD) evaluations.

Abstract:
We present a theoretical analysis of stochastic optimization methods in terms of their sensitivity with respect to the step size. We identify a key quantity that, for each method, describes how the performance degrades as the step size becomes too large. For convex problems, we show that this quantity directly impacts the suboptimality bound of the method. Most importantly, our analysis provides directtheoreticalevidence that adaptive step size methods, such as SPS or NGN, are more robust than SGD. This allows us to quantify the advantage of these adaptive methods beyond empirical evaluation. Finally, we show through experiments that our theoretical bound qualitatively mirrors the actual performance as a function of the step size, even for nonconvex problems.

Abstract:
Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concepts through learnable chunking at target compression ratio R. The MoE architecture enables controlled evaluation: reallocating saved computation to match baseline FLOPs and parameters isolates genuine architectural benefits. ConceptMoE consistently outperforms standard MoE, achieving +0.9 points on language pretraining, +2.3 on long context, and +0.6 on multimodal tasks. Continual training conversion with layer looping gains +5.5 points. Beyond performance, at R=2, ConceptMoE reduces attention computation by R^2× and KV cache by R×, delivering prefill speedups up to 175% and decoding speedups up to 117%. The minimal architectural changes enable straightforward integration, demonstrating that adaptive conceptlevel processing fundamentally improves LLM effectiveness and efficiency.

Abstract:
Continuous Normalizing Flows (CNFs) enable elegant generative modeling but remain bottlenecked by their iterative nature requiring costly sampling and lacking interpretability of the intermediate states. Recent approaches accelerate sampling by straightening trajectories or distilling endpoints, yet they treat the original generative process as a black box, discarding the teacher’s intermediate dynamics. We propose a fundamentally different perspective: globally linearizing flow dynamics via Koopman theory to achieve trajectorypreserving linearization. By lifting Conditional Flow Matching (CFM) into a higher-dimensional Koopman space, we represent its evolution with a single linear operator. Crucially, unlike boundary-only distillation, our method enforces infinitesimal consistency with the teacher's vector field alongthe full generative path. We derive a practical, simulation-free training objective that ensures this global alignment and yields two key benefits. First, sampling becomes one-step and parallelizable. Second, because the linearization is faithful to the dynamics, the Koopman operator provides unique insights on the generation. We demonstrate that this structure enables novel applications unavailable in prior approaches, including discovery of semantically coherent editing directions, inversion with a teacher-aligned linear operator and class-conditional spectral signatures. Empirically, our approach achieves competitive sample quality, while enabling spectral analysis and control of theentire trajectoriesof generative flows.

Abstract:
Causal discovery is a difficult problem that typically relies on strong assumptions on the datagenerating model, such as non-Gaussianity. In practice, many modern applications provide multiple related views of the same system, which has rarely been considered for causal discovery. Here, we leverage this multi-view structure to achieve causal discovery with weak assumptions. We propose a multi-view linear Structural Equation Model (SEM) that extends the well-known framework of non-Gaussian disturbances by alternatively leveraging correlation over views. We prove the identifiability of the model for acyclic SEMs. Subsequently, we propose several multi-view causal discovery algorithms, inspired by single-view algorithms (DirectLiNGAM, PairwiseLiNGAM, and ICA-LiNGAM). The new methods are validated through simulations and applications on neuroimaging data, where they enable the estimation of causal graphs between brain regions.

Abstract:
Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes theirvibe check.Vibe checkreflects human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking nonfunctional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlyingvibe checkbesides functional correctness. To quantify models' code instruction-following capabilities with measurable signals, we presentVeriCode, a taxonomy of 30 verifiable code instructions together with deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting inVibe Checker, a testbed to assess both instruction following and functional correctness. Evaluating 31 LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit functional regression. Most importantly,a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator among LLMs.

Abstract:
Contrastive representation learning (CRL) underpins many modern foundation models. Despite recent theoretical progress, existing analyses suffer from several key limitations: (i) the statistical consistency of CRL remains poorly understood; (ii) available generalization bounds deteriorate as the number of negative samples increases, contradicting the empirical benefits of large negative sets; and (iii) the retrieval performance of CRL has received limited theoretical attention. In this paper, we develop a unified statistical learning theory for CRL. For downstream tasks, we evaluate retrieval quality using an AUCtype population criterion and show that the contrastive loss is statistically consistent with optimal ranking. We further establish a calibration-style inequality that quantitatively relates excess contrastive risk to excess retrieval suboptimality. For upstream training, we study both supervised and self-supervised contrastive objectives and derive generalization bounds of order O(1/m + 1/\sqrtn) and O(1/\sqrtm + 1/\sqrtn), respectively, where m denotes the number of negative samples and n the number of anchor points. These bounds not only explain the empirical advantages of large negative sets but also reveal an explicit trade-off between m and n. Extensive experiments on large-scale vision--language models corroborate our theoretical predictions.

Abstract:
We consider an online learning problem in environments with multiple change points. In contrast to the single change point problem that is widely studied using classical "high confidence" detection schemes, the multiple change point environment presents new learningtheoretic and algorithmic challenges. Specifically, we show that classical methods may exhibit catastrophic failure (high regret) due to a phenomenon we refer to as endogenous confounding. To overcome this, we propose a new class of learning algorithms dubbed Anytime Tracking CUSUM (ATC). These are horizon-free online algorithms that implement a selective detection principle, balancing the need to ignore "small" (hard-to-detect) shifts, while reacting "quickly" to significant ones. We prove that the performance of a properly tuned ATC algorithm is nearly minimax-optimal; its regret is guaranteed to closely match a novel information-theoretic lower bound on the achievable performance of any learning algorithm in the multiple change point problem. Experiments on synthetic as well as real-world data validate the aforementioned theoretical findings.

Abstract:
1bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full precision and 1-bit representations makes naive adaptation difficult. In this paper, we introduce a consistent progressive training for both forward and backward, smoothly converting the full-precision weights into the binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve the performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch.

Abstract:
Robotic manipulation benefits from foundation models that describe goals, but today's agents still lack a principled way to learn from their own mistakes. We ask whether natural language can serve as feedback, an errorreasoning signal that helps embodied agents diagnose what went wrong and correct course. We introduce LAGEA (Language Guided Embodied Agents), a framework that turns episodic, schema-constrained reflections from a vision language model (VLM) into temporally grounded guidance for reinforcement learning. LAGEA summarizes each attempt in concise language, localizes the decisive moments in the trajectory, aligns feedback with visual state in a shared representation, and converts goal progress and feedback agreement into bounded, step-wise shaping rewardswhose influence is modulated by an adaptive, failure-aware coefficient. This design yields dense signals early when exploration needs direction and gracefully recedes as competence grows. On the Meta-World MT10 and Robotic Fetch embodied manipulation benchmarks, LAGEA improves average success over the state-of-the-art (SOTA) methods by 9.0% on random goals, 5.3% on fixed goals, and 17% on fetch tasks, while converging faster. These results support our hypothesis: language, when structured and grounded in time, is an effective mechanism for teaching robots to self-reflect on mistakes and make better choices. Code will be released soon.

Abstract:
Embedding geometry plays a fundamental role in retrieval quality, yet dense retrievers for retrievalaugmented generation (RAG) remain largely confined to Euclidean space. However, natural language exhibits hierarchical structure from broad topics to specific entities that Euclidean embeddings fail to preserve, causing semantically distant documents to appear spuriously similar and increasing hallucination risk. To address these limitations, we introduce hyperbolic dense retrieval, developing two model variants in the Lorentz model of hyperbolic space: HyTE-FH, a fully hyperbolic transformer, and HyTE-H, a hybrid architecture projecting pre-trained Euclidean embeddings into hyperbolic space. To prevent representational collapse during sequence aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that provably preserves hierarchical structure. On MTEB, HyTE-FH outperforms equivalent Euclidean baselines, while on RAGBench, HyTE-H achieves up to 29% gains over Euclidean baselines in context relevance and answer relevance using substantially smaller models than current state-of-the-art retrievers. Our analysis also reveals that hyperbolic representations encode document specificity through norm-based separation—with over 20% radial increase from general to specific concepts—a property absent in Euclidean embeddings, underscoring the critical role of geometric inductive bias in faithful RAG systems\footnoteThe code is available at: \urlhttps://anonymous.4open.science/r/HypRAG-30C6.

Abstract:
It is widely recognized that reinforcement learning (RL) finetuning of large language models often leads to diversity collapse, where outputs lack variety. Prior work has proposed a range of heuristics to counteract this effect, but these methods are ad hoc: they frequently trade off correctness for diversity, their effectiveness varies across tasks, and in some cases they even contradict one another. In this work, we place these observations on a rigorous foundation. We first provide a formal proof of why RL fine-tuning exhibits diversity collapse via a selection and reinforcement bias. Next, we make a key observation that any reward modification to address diversity collapse only needs to be applied on the correct trajectories. Building directly on this analysis, we introduce a principled method---differential smoothing---that provably improves both correctness and diversity, outperforming vanilla RL as well as widely used entropy-based heuristics. Our theory precisely characterizes when existing heuristics help and why they fail, while showing that differential smoothing is universally superior. Extensive experiments with models from 1B to 7B parameters, across domains including CountDown and real-world mathematical reasoning, demonstrate consistent gains. Differential smoothing improves both Pass@1 and Pass@k, with up to 6.7% improvements on AIME24 dataset.

Abstract:
Longhorizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To this end, we introduceMemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.

Abstract:
The Platonic Representation Hypothesis suggests that representations from neural networks are converging to a common statistical model of reality. We show that the existing metrics used to measure representational similarity areconfounded by network scale: increasing model depth or width can systematically inflate representational similarity scores. To correct these effects, we introduce a permutationbased null-calibration framework that transforms any representational similarity metric into a calibrated score with statistical guarantees. We revisit the Platonic Representation Hypothesis with our calibration framework, which reveals a nuanced picture: the apparent convergence reported by global spectral measures largely disappears after calibration, while local neighborhood similarity, but not local distances, retains significant agreement across different modalities. Based on these findings, we propose theAristotelian Representation Hypothesis: representations in neural networks are converging to shared local neighborhood relationships.

Abstract:
Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow powerlaw convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components yield power-law vanishing losses and gradients, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of 1/3. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.

Abstract:
Gradient signals in LLM training are highly anisotropic: recurrent linguistic structure concentrates energy into a small set of dominant spectral directions, while contextspecific information resides in a long tail. We show that this spike–tail separation persists throughout training, with the spike occupying only about 1.5% of directions yet dominating optimizer statistics. This dominance suppresses tail learning by contracting tail updates through second-moment normalization and tightening the globally stable learning-rate bound. Motivated by this analysis, we propose Spectra, a spike-aware optimizer that suppresses the dominant low-rank spike subspace without amplifying the noise-sensitive spectral tail. Spectra tracks the spike subspace via cached, warm-started power iteration and applies low-rank spectral shaping with negligible overhead and substantially reduced optimizer-state memory. On LLaMA3-8B trained on 50B tokens, Spectra reaches the same target loss 30% faster than AdamW, reduces per-step end-to-end overhead by 0.7%, cutting optimizer-state memory by 49.25%, and improves average downstream accuracy by 1.62%. Compared to Muon, Spectra is 5.1× faster in optimizer processing time, achieves a lower final loss, and improves average accuracy by 0.66%. Spectra's Megatron integration is released publicly (https://tinyurl.com/29n4vv5f).

Abstract:
VisionLanguage Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view captures complex cross-modal interactions, enabling stealthy, coordinated adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.

Abstract:
Discrete flow models offer a powerful framework for learning distributions over discrete state spaces and have demonstrated superior performance compared to the discrete diffusion models. However, their convergence properties and error analysis remain largely unexplored. In this work, we develop a unified framework grounded in stochastic calculus theory to systematically investigate the theoretical properties of discrete flow models. Specifically, by leveraging a Girsanovtype theorem for the path measures of two continuous-time Markov chains (CTMCs), we present a comprehensive error analysis that accounts for both transition rate estimation error and early stopping error. In fact, the estimation error of transition rates has received little attention in existing works. Unlike discrete diffusion models, discrete flow incurs no initialization error caused by truncating the time horizon in the noising process. Building on generator matching and uniformization, we establish non-asymptotic error bounds for distribution estimation without the boundedness condition on oracle transition rates. Furthermore, we derive a faster rate of total variation convergence for the estimated distribution with the boundedness condition, yielding a nearly optimal rate in terms of sample size. Our results provide the first error analysis for discrete flow models. We also study model performance under different settings based on simulation results.

Abstract:
Reinforcement learning with verifiable rewards has shown notable effectiveness in enhancing large language models (LLMs) reasoning performance, especially in mathematics tasks. However, such improvements often come with reduced outcome diversity, where the model concentrates probability mass on a narrow set of solutions. Motivated by diminishingreturns principles, we introduce a set level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage shaping term for policy optimization. We further investigate the contribution of a single trajectory to language model diversity within a distribution perturbation framework. This analysis theoretically confirms a monotonicity property, proving that rarer trajectories yield consistently higher marginal contributions to the global diversity. Extensive experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.

Abstract:
While there are many applications of machine learning (ML) to scientific problems that \emphlook promising, the eye test can be misleading compared to the quantitative values. Using numerical analysis techniques, we rigorously quantify the accuracy, convergence rates, and generalization bounds of certain ML models applied to linear differential equations (DEs) for parameter discovery or solution finding. Beyond the quantity and discretization of data, we identify that the function space of the data is critical to the generalization of the model which can lead to divergence. Similar lack of generalization is empirically demonstrated for commonly used models. Surprisingly, we find that different classes of models can exhibit opposing generalization behaviors. Based on our theoretical analysis, we also introduce a new mechanistic interpretability lens on scientific models whereby Green's function representations can be extracted from the weights of blackbox models. Our results inform a new cross-validation technique for measuring generalization in physical systems, and can be useful as a benchmark of future methods.

Abstract:
We consider problems of parameter estimation where design variables can be actively optimized to maximize information gain. To this end, we introduce JADAI, a framework that jointly amortizes Bayesian adaptive design and inference by training a policy, a history network, and an inference network endto-end. The networks minimize a generic loss that aggregates incremental reductions in posterior error along experimental sequences without density evaluations. Inference networks are instantiated with diffusion models that can approximate high-dimensional and multimodal posteriors at every experimental step. JADAI achieves superior or competitive performance across adaptive design benchmarks.

Abstract:
Selfreflection enables language agents to iteratively refine solutions, yet often produces repetitive outputs that limit reasoning performance. Recent studies have attempted to address this limitation through various approaches, among which increasing reflective diversity has shown promise. Our empirical analysis reveals a strong positive correlation between reflective diversity and task success, further motivating the need for diverse reflection signals. We introduceParamMem, a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Building on this module, we propose ParamAgent, a reflection-based agent framework that integrates parametric memory with episodic and cross-sample memory. Extensive experiments on code generation, mathematical reasoning, and multi-hop question answering demonstrate consistent improvements over state-of-the-art baselines. Further analysis reveals thatParamMemis sample-efficient, enables weak-to-strong transfer across model scales, and supports self-improvement without reliance on stronger external model, highlighting the potential ofParamMemas an effective component for enhancing language agents.

Abstract:
In RL posttraining of LLM agents, calls to ex ternal tools take several seconds or even min- utes, leaving allocated GPUs idle and inflating post-training time and cost. While many tool invocations repeat across parallel rollouts and could in principle be cached, naively caching their outputs for reuse is incorrect since tool outputs depend on the environment state induced by prior agent interactions. We present TVCACHE, a stateful tool-value cache for LLM agent post- training. TVCACHE maintains a tree of observed tool-call sequences and performs longest-prefix matching for cache lookups: a hit occurs only when the agent’s full tool history matches a previ- ously executed sequence, guaranteeing identical environment state. On three diverse workloads— terminal-based tasks, SQL generation, and video understanding—TVCACHE achieves cache hit rates of up to 70% and reduces median tool call execution time by up to 6.9×, with no degradation in post-training reward accumulation.

Abstract:
Online MultiAgent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet, their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose among the first Online off-policy MARL framework using Diffusion policies (OMAD) to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy-augmented targets to guide the simultaneous updates of diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across 10 diverse tasks, demonstrating a remarkable 2.5× to 5× improvement in sample efficiency.

Abstract:
Featureimportance methods show promise for transforming machine learning (ML) models from predictive engines into tools for scientific discovery. However, expressive models can be unstable due to data sampling and algorithmic stochasticity, leading to inaccurate variable importance estimates, undermining their utility in critical biomedical applications. While ensembling offers a remedy, the choice between explaining a single ensemble model or aggregating individual model explanations is non-trivial due to the non-linearity of importance measures, and remains largely understudied. Our theoretical analysis, developed under assumptions accommodating complex state-of-the-art ML models, reveals that this choice is governed by a trade-off involving the model's excess risk. In contrast to prior literature, we show that ensembling at the model level provides more accurate variable-importance estimates, particularly for expressive models, by reducing this leading error term. We validate these findings on classical benchmarks and a large-scale proteomic study from the UK Biobank.

Abstract:
We study robustness to agent malfunctions in cooperative multiagent reinforcement learning (MARL), a failure mode that is critical in practice yet underexplored in existing theory. We introduce MARTA, a plug-and-play robustness layer that augments standard MARL algorithms with a \fontfamilycmss\selectfont Switcher–\fontfamilycmss\selectfont Adversary mechanism which selectively induces malfunctions in performance-critical states. This formulation defines a fault-switching (N+2)-player Markov game in which the \fontfamilycmss\selectfont Switcher chooses when and which agent fails, and the \fontfamilycmss\selectfont Adversary controls the resulting faulty behaviour via random or worst-case policies. We develop a Q-learning-type scheme and show that the associated Bellman operator is a contraction, yielding existence and uniqueness of the minimax value, convergence to a Markov perfect equilibrium. MARTA integrates seamlessly with MARL algorithms without architectural modification and consistently improves robustness across Traffic Junction (TJ), Level-Based Foraging (LBF), MPE SimpleTag, and SMAC (v2). In these domains, MARTA achieves large gains in final performance of up to 116.7% in SMAC, 21.4% in MPE SimpleTag, and 44.6% in LBF, while significantly reducing failure rates under train–test mismatched fault regimes. These results establish MARTA as a theoretically grounded and practically deployable mechanism for fault-tolerant MARL.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key paradigm for unlocking complex reasoning in Large Language Models (LLMs), yet its potential in 3D scene understanding remains untapped. To bridge this gap, we present Reinforcement FineTuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to 3D perception and reasoning. Our pipeline operates in two stages: activating 3D-aware Multi-modal Large Language Models (MLLMs) via Supervised Fine-Tuning (SFT), followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific rewards—such as 3D IoU and F1-score—to provide deterministic signals for spatial alignment. Extensive experiments demonstrate that 3D-RFT achieves state-of-the-art performance on video-based 3D scene understanding benchmarks, significantly outperforming VG LLM-8B on detection and grounding tasks. Moreover, our model surpasses larger mainstream models on VSI-Bench, demonstrating the efficiency of verifiable reinforcement learning. We conclude by offering valuable insights into optimal training strategies .

Abstract:
Multiobjective reinforcement learning (MORL) seeks policies that effectively balance conflicting objectives. However, presenting many diverse policies without accounting for the decision maker’s (DM’s) preferences can overwhelm the decision-making process. On the other hand, accurately specifying preferences in advance is often unrealistic. To address these challenges, we introduce a human-in-the-loop MORL framework that interactively discovers preferred policies during optimization. Our approach proactively learns the DM’s implicit preferences in real time, requiring no a priori knowledge. Importantly, we integrate this preference learning directly into a parallel optimization framework, balancing exploration and exploitation to identify high-quality policies aligned with the DM's preferences. Evaluations on a complex quadrupedal robot simulation environment demonstrate that, with only interactions, our proposed method can identify policies aligned with human preferences, e.g., running like a dog. Further experiments on seven MuJoCo tasks and a multi-microgrid system design task against eight state-of-the-art MORAL algorithms fully demonstrate the effectiveness of our proposed framework. Demonstrations and full experiments are in https://sites.google.com/view/pbmorl/home.

Abstract:
Diffusion models show growing promise for generative modeling of physical systems, but enforcing partial differential equation (PDE) constraints directly is infeasible during the stochastic denoising process. Current methods apply constraints to the expected clean sample, incurring a Jensen’s Gap that forces a tradeoff between PDE satisfaction and generative accuracy. To bridge this gap, we propose Physics-Informed Distillation of Diffusion Models (PIDDM), a simple yet effective post-hoc distillation strategy that enforces PDE constraints after training. PIDDM enables fast single-step generation while improving both physical consistency and sample quality, supporting forward/inverse problems and reconstruction from partial observations. Extensive experiments across PDE benchmarks show PIDDM outperforms recent baselines, such as PIDM, DiffusionPDE, and ECI-sampling, in both accuracy and constraint satisfaction, with lower computation and minimal hyperparameter tuning, offering a more efficient pathway to physics-informed diffusion models.

Abstract:
Scaling LLMbased embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception–reasoning–decision gap in standalone Vision–Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question–answer (DQA) pipeline. Instead of passively accepting the VLM’s description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks. We show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our Interactive goal-oriented perception pipeline yields systematic and substantial gains, and (3) PRISM is fully automatic, eliminating the need for handcrafted questions or answers.

Abstract:
We present CausalAdapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method enables causal interventions on target attributes while preserving all other aspects of the image, including the core identity. In contrast to prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling augmented with two attribute regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and a conditioned token contrastive loss to disentangle attribute factors and reduce spurious correlations. Causal-Adapter achieves state-of-the-art results on synthetic and real-world datasets, outperforming other baselines in effectiveness, composition, realism, and minimality. These results demonstrate that the approach enables efficient, robust, and generalizable counterfactual image editing with faithful attribute modification and strong preservation of core identity.

Abstract:
Multimodal large language models (MLLMs) have achieved remarkable success on complex multi-modal tasks. However, it remains insufficiently explored whether they exhibit modality preference, a tendency to favor one modality over another when processing multi-modal contexts. To study this question, we introduce MC^2 benchmark, which constructs controlled evidence-conflict scenarios to systematically evaluate modality preference in decision-making. Extensive experiments reveal that all 20 tested MLLMs generally demonstrate clear modality preferences, and such preferences can serve as a useful indicator of downstream task performance of MLLMs. Further analysis shows that modality preference can be controlled by instruction guidance and captured within the latent representations of MLLMs. Built on these insights, we propose a probing and steering method based on representation engineering to explicitly control modality preference without requiring additional fine-tuning. This method effectively amplifies modality preference toward a desired direction and demonstrates promising improvements across multiple multi-modal understanding and reasoning tasks.

Abstract:
As large language models grow increasingly capable, concerns about their safe deployment have intensified. While numerous alignment strategies aim to restrict harmful behavior, these defenses can still be circumvented through carefully designed adversarial prompts. In this work, we introduce a theoretical framework that formalizes a game between an attacker and a defender. Within this framework, we design a theoretical bestresponse attack strategy and show that it is closely related to many existing adversarial prompting methods. We further analyze the resulting game, characterize its equilibria, and reveal inherent advantages for the attacker. Drawing on our theoretical analysis, we also derive a provably optimal defense strategy. Empirically, we evaluate a practical instantiation of the theoretically optimal attack and observe stronger performance relative to existing adversarial prompting approaches in diverse settings encompassing different LLMs and benchmarks.

Abstract:
Identifying training data of largescale models is critical for copyright litigation, privacy auditing, and ensuring fair evaluation. However, existing works typically treat this task as an instance-wise identification without controlling the error rate of the identified set, which cannot provide statistically reliable evidence. In this work, we formalize training data identification as a set-level inference problem and propose Provable Training Data Identification (PTDI), a distribution-free approach that enables provable and strict false identification rate control. Specifically, our method computes conformal p-values for each data point using a set of known unseen data and then develops a novel Jackknife-corrected Beta boundary (JKBB) estimator to estimate the training-data proportion of the test set, which allows us to scale these p-values. By applying the Benjamini–Hochberg (BH) procedure to the scaled p-values, we select a subset of data points with provable and strict false identification control. Extensive experiments across various models and datasets demonstrate that PTDI achieves higher power than prior methods while strictly controlling the FIR.

Abstract:
Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and selfverification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities.

Abstract:
In embodied vision, GoalOriented Navigation (GON) requires robots to locate a specific goal within an unexplored environment. The primary challenge of GON arises from the need to construct a Bird's-Eye-View (BEV) map to understand the environment while simultaneously localizing an unobserved goal. Existing map-based methods typically employ self-centered semantic maps, often facing challenges such as reliance on complete maps or inconsistent semantic association. To this end, we propose Plug-and-Play Label Map Diffusion (PLMD), which defines a novel map completion diffusion model based on Denoising Diffusion Probabilistic Models (DDPM). PLMD generates obstacle and semantic labels for unobserved regions through a diffusion-based completion process, thereby enabling goal localization even in partially observed environments. Moreover, it mitigates inconsistent semantic association by leveraging structural consistency between known and unknown obstacle layouts and integrating obstacle priors into the semantic denoising process. By substituting predicted labels for unobserved regions, robots can accurately localize the specified objects. Extensive experiments demonstrate that PLMD (I) effectively expands the region of unknown maps, (II) integrates seamlessly into existing navigation strategies that rely on semantic maps, (III) achieves state-of-the-art performance on three GON tasks.

Abstract:
Online reinforcement learning in nonepisodic, finite-horizon MDPs remains underexplored and is challenged by the need to estimate returns to a fixed terminal time. Existing infinite-horizon methods, which often rely on discounted contraction, do not naturally account for this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full-horizon, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: actions are selected only when their estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this novel objective, proving it achieves fast finite-sample convergence: it achieves minimax optimal constant regret for K=1 and \mathcalO(\max((K-1),C_K-1)\sqrtSAT\log(T)) regret for any K \geq 2. We numerically evaluate the performance of our algorithm under the objective of maximizing reward. Our implementation adaptively increases K over time, balancing lookahead depth against estimation variance. Empirical results demonstrate superior cumulative rewards over state-of-the-art tabular RL methods across synthetic MDPs and RL environments: JumpRiverswim, FrozenLake and AnyTrading.

Abstract:
Existing data poisoning attacks on retrievalaugmented generation (RAG) systems scale poorly because they require costly optimization of poisoned documents for each target phrase. We introduce Eyes-on-Me, a modular attack that decomposes an adversarial document into reusable Attention Attractors and Focus Regions. Attractors are optimized to direct attention to the Focus Region. Attackers can then insert semantic baits for the retriever or malicious instructions for the generator, adapting to new targets at near zero cost. This is achieved by steering a small subset of attention heads that we empirically identify as strongly correlated with attack success. Across 18 end-to-end RAG settings (3 datasets × 2 retrievers × 3 generators), Eyes-on-Me raises average attack success rates from 21.9 to 57.8 (+35.9 points, 2.6× over prior work). A single optimized attractor transfers to unseen black box retrievers and generators without retraining. Our findings establish a scalable paradigm for RAG data poisoning and show that modular, reusable components pose a practical threat to modern AI systems. They also reveal a strong link between attention concentration and model outputs, informing interpretability research.

Abstract:
Learninglatent actionsfrom diverse human videos enables scaling robot learning beyond embodimentspecific robot datasets, and these latent actions have recently been used as pseudo-action labels for vision-language-action (VLA) model pretraining. To make VLA pretraining effective, latent actions should contain information about the underlying agent's actions despite the absence of ground-truth labels. We proposeMulti-ViewPointLatentActionModel (MVP-LAM), which learns discrete latent actions that are highly informative about ground-truth actions from time-synchronized multi-view videos. MVP-LAM trains latent actions with across-viewpoint reconstructionobjective, so that a latent action inferred from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on the SIMPLER and LIBERO-Long benchmarks.

Abstract:
Testtime policy adaptation for multi-turn interactions (T^2PAM) is essential for aligning Large Language Models (LLMs) with dynamic user needs during inference time. However, existing paradigms commonly treat test-time adaptation as a single-axis problem, either purely refining instructions (Prompt Engineering) or only adjusting weights (Test-Time Training), ignoring that interaction failures stem from a coupled mix of ambiguity and incapacity. We argue that these two optimization paths are not merely additive but synergistic: semantic clarity acts as a pre-conditioner for effective parameter updates. To this end, we propose ROSA2, a framework that reformulates interaction as a joint optimization problem over the heterogeneous space of Words and Weights. By mathematically decomposing the error signal, ROSA2 utilizes textual gradients to rectify intent ambiguity and parameter updates to bridge capability gaps. Theoretically, we prove that this co-adaptation strictly reduces the required parameter shift for convergence. Empirically, ROSA2 outperforms state-of-the-art baselines by 30% on MATH while reducing interaction turns by 40%, demonstrating that refining the context unlocks the true potential of parameter updates.

Abstract:
The rise of large language models (LLMs) has shifted time series analysis from narrow analytics to generalpurpose reasoning. Yet, existing benchmarks cover only a small set of health time series modalities and tasks, failing to reflect the diverse domains and extensive temporal dependencies inherent in real-world physiological modeling. To bridge these gaps, we introduce HEARTS (Health Reasoning over Time Series), a unified benchmark for evaluating hierarchical reasoning capabilities of LLMs over general health time series. HEARTS integrates 16 real-world datasets across 12 health domains and 20 signal modalities, and defines a comprehensive taxonomy of 110 tasks grouped into four core capabilities: Perception, Inference, Generation, and Deduction. Evaluating 14 state-of-the-art LLMs on more than 20K test samples reveals intriguing findings. First, LLMs substantially underperform specialized models, and their performance is only weakly related to general reasoning scores. Moreover, LLMs often rely on simple heuristics and struggle with multi-step temporal reasoning. Finally, performance declines with increasing temporal complexity, with similar failure modes within model families, indicating that scaling alone is insufficient. By making these gaps measurable, HEARTS provides a standardized testbed and living benchmark for developing next-generation LLM agents capable of reasoning over diverse health signals.

Abstract:
Despite recent advancements in Multimodal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple, yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves the performance significantly from 33.0% to 52.9% on VSI-Bench, achieving a 19.9%p gain over Qwen2.5-VL.

Abstract:
While VisionLanguage-Action models (VLAs) are rapidly advancing toward generalist robot policies, quantitatively characterizing their capability boundaries and failure modes remains challenging. To address this, we introduceVLA-Arena, a comprehensive benchmark. It features a novel structured task design framework to quantify difficulty across three orthogonal axes:(1) Task Structure,(2) Language Command, and(3) Visual Observation. This allows us to systematically design tasks with fine-grained difficulty levels, enabling a precise measurement of model capability frontiers. For task structure, VLA-Arena comprises 11 task suites organized into four dimensions:Safety,Distractor,Extrapolation, andLong Horizon, totaling 170 tasks. Each suite spans three difficulty levels (L0–L2), with fine-tuning restricted to L0 to rigorously assess generalization. Orthogonal to this, language (W0-W4) and visual (V0-V4) perturbations can be applied to any task as diagnostic probes to distinguish robust grounding from superficial pattern matching. Our extensive evaluation of state-of-the-art VLAs reveals critical limitations: memorization over generalization, superficial visual perception, and a neglect of safety constraints. Additionally, model rank reversals across L0–L2 validate that each level provides non-redundant insights. To foster research addressing these model limitations and ensure reproducibility, we provide the complete VLA-Arena framework, including an end-to-end toolchain from task definition to automated evaluation and the VLA-Arena-S/M/L datasets for fine-tuning. Our benchmark, datasets, models, and leaderboard will be open-sourced.

Abstract:
Transforming a large language model (LLM) into a visionlanguage model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens encodes a large text corpus and stores contextualized token representations for each token in that corpus. Visual token representations are then compared to these contextualized representations and the top-k nearest neighbor representations serve as descriptions of the visual token. We evaluate this method on 10 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations and open up new directions for analyzing the latent representations of LLMs.

Abstract:
Weightonly quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit), as they either require large amounts of data and compute or incur additional storage. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem, and compresses full-precision weights to low-rank binary matrices and scales. Specifically, it utilizes an efficient alternating direction method of multipliers (ADMM) method to precisely initialize latent binary matrices and scales, and then tune the initialized parameters through a block and model reconstruction process. Consequently, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization, achieving state-of-the-art accuracy even at sub-1-bit compression rates. NanoQuant makes large-scale deployment feasible on consumer hardware. For example, it compresses Llama2-70B by 25.8× in just 13 hours on a single H100, enabling a 70B model to operate on a consumer 8 GB GPU.

Abstract:
Large reasoning models (LRMs) typically solve reasoningintensive tasks by generating long chain-of-thought (CoT) traces, leading to substantial inference overhead. We identify a reproducible inference-time phenomenon, termed \emphSelf-Compression: when multiple independent and answerable questions are presented within a single prompt, the model spontaneously produces shorter reasoning traces for each question. This phenomenon arises from \emphmulti-question contextual pressure during generation and consistently manifests across models and benchmarks. Building on this observation, we propose ConPress (Learning from Contextual Pressure, a lightweight self-supervised fine-tuning approach. ConPress constructs multi-question prompts to induce self-compression, samples the resulting model outputs, and parses and filters per-question traces to obtain concise yet correct reasoning trajectories. These trajectories are directly used for supervised fine-tuning, internalizing compressed reasoning behavior in single-question settings without external teachers, manual pruning, or reinforcement learning. With only 8k fine-tuning examples, ConPress reduces reasoning token usage by 59% on MATH500 and 33% on AIME25, while maintaining competitive accuracy.

Abstract:
Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with taskspecific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain underutilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization), which learns strong reasoning capabilities from expert demonstrations alone via Inverse Reinforcement Learning. Our method sets up an adversarial game between a policy and a relativistic critic: the policy learns to mimic expert answers, while the critic aims to identify the expert among (expert, policy) answer pairs. Both the policy and the critic are trained jointly and continuously via RL, and we identify key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines across all evaluation tasks—achieving +13.7% accuracy on Countdown (1.5\textB), +8.2% on DeepMath (7\textB), and a +19.1% win-rate on Poetry Writing (7\textB) against expert poems. RARO also exhibits the same robust scaling trends as RL with verifiers. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.

Abstract:
We address the problem of gradient estimation for stochastic differentiable relaxations of algorithms, operators, simulators, and other nondifferentiable functions. Stochastic smoothing conventionally perturbs the input of a non-differentiable function with a differentiable density distribution with full support, smoothing it and enabling gradient estimation. Our theory starts at first principles to derive stochastic smoothing with reduced assumptions, without requiring a differentiable density nor full support, and presenting a general framework for relaxation and gradient estimation of non-differentiable black-box functions . We develop variance reduction for gradient estimation from 3 orthogonal perspectives. Empirically, we benchmark 6 distributions and up to 24 variance reduction strategies for differentiable sorting and ranking, differentiable shortest-paths on graphs, differentiable rendering for pose estimation, as well as differentiable cryo-electron tomography simulations.

Abstract:
Regularization is a core component of recent Reinforcement Learning (RL) algorithms. Mirror Descent Value Iteration (MDVI) uses both KullbackLeibler divergence and entropy as regularizers in its value and policy updates. Despite its empirical success in discrete action domains and strong theoretical guarantees, the performance of KL-entropy-regularized methods does not surpass that of a strong entropy-only-regularized method in continuous action domains. In this study, we propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains, and show that its empirical performance is significantly boosted by bounding the actor's log-probability terms in the critic's loss function, compared to a non-bounded naive instantiation. Further, we relate MDAC to Advantage Learning by recalling that the actor's log-probability is equal to the regularized advantage function in tabular cases, and theoretically discuss when and why bounding the advantage terms is validated and beneficial. We also empirically explore effective choices for the bounding functions, and show that MDAC performs better than strong non-regularized and entropy-only-regularized methods with an appropriate choice of the bounding functions.

Abstract:
Behavior cloning (BC) is a practical offline imitation learning method, but it often fails when expert demonstrations are limited. Recent works have introduced a class of architectures named predictive inverse dynamics models (PIDM) that combine a future state predictor with an inverse dynamics model (IDM). While PIDM often outperforms BC, the reasons behind its benefits remain unclear. In this paper, we provide a theoretical explanation: PIDM introduces a biasvariance tradeoff. While predicting the future state introduces bias, conditioning the IDM on the prediction can significantly reduce variance. We establish conditions on the state predictor bias for PIDM to achieve lower prediction error and higher sample efficiency than BC, with the gap widening when additional data sources are available. We validate the theoretical insights empirically in 2D navigation tasks, where BC requires up to five times (three times on average) more demonstrations than PIDM to reach comparable performance; and in a complex 3D environment in a modern video game with high-dimensional visual inputs and stochastic transitions, where BC requires over 66% more samples than PIDM.

Abstract:
Textto-image diffusion models achieve impressive generation quality but inherit and amplify training-data biases, skewing coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering rare or minority features underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce RAIGen, the first framework, to our knowledge, for unsupervised rare-attribute discovery in diffusion models. RAIGen leverages Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation.

Abstract:
Reinforcement Learning (RL) posttraining alignment for language models is effective, but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method to sample directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLM (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that our ETS consistently improves generation quality, validating its effectiveness and design.

Abstract:
Many realworld design problems involve optimizing an expensive black-box function f(x), for which Bayesian Optimization is a sample-efficient framework. However, while the basic black-box setting returns a scalar reward, real-world experiments often generate a wealth of useful information. We introduce a new setting where an experiment generates high-dimensional auxiliary information h(x) along with f(x); moreover, a history of relevant, previously-solved tasks is available for accelerating optimization. We develop a novel method based on a neural model which predicts f(x) for unseen designs given a few-shot context containing observations of h(x). We evaluate our method on two challenging domains, robotic hardware design and hyperparameter tuning. On both domains, our method achieves improved few-shot prediction and faster design optimization, outperforming several multi-task optimization methods.

Abstract:
Highquality geometric diagram generation presents both a challenge and an opportunity: it demands strict spatial accuracy while offering well-defined constraints to guide generation. Inspired by recent advances in geometry problem solving that employ formal languages and symbolic solvers for enhanced correctness and interpretability, we propose GeoLoom, a novel framework for text-to-diagram generation in geometric domains. GeoLoom comprises two core components: an autoformalization module that translates natural language into a specifically designed generation-oriented formal language GeoLingua, and a coordinate solver that maps formal constraints to precise coordinates using the efficient Monte Carlo optimization. To support this framework, we introduce GeoNF, a dataset aligning natural language geometric descriptions with formal GeoLingua descriptions. We further propose a constraint-based evaluation metric that quantifies structural deviation, offering mathematically grounded supervision for iterative refinement. Empirical results demonstrate that GeoLoom significantly outperforms state-of-the-art baselines in structural fidelity, providing a principled foundation for interpretable and scalable diagram generation.

Abstract:
Retrievalaugmented generation (RAG) for long-context question answering is bottlenecked by inference-time prefilling over large retrieved contexts. A common strategy is to precompute key–value (KV) caches for individual documents and selectively recompute a small subset of tokens to restore global causal dependencies, but existing methods rely on heuristics or representation discrepancies without modeling whether selected tokens can effectively influence generation. We cast selective KV recomputation as an information flow problem and show that a simple attention-norm signal from the query reliably identifies tokens that are both semantically relevant and structurally positioned to propagate information, when computed under an inference-consistent RoPE geometry. We therefore reconstruct global positional assignments for retrieved chunks and introduce an information-flow–guided chunk reordering strategy. Experiments on LLM and VLM benchmarks demonstrate consistent gains over prior methods under comparable efficiency budgets.

Abstract:
Visual AutoRegressive modeling (VAR) suffers from substantial computational cost due to the massive token count involved. Failing to account for the continuous evolution of modeling dynamics, existing VAR token reduction methods face three key limitations: heuristic stage partition, nonadaptive schedules, and limited acceleration scope, thereby leaving significant acceleration potential untapped. Since entropy variation intrinsically reflects the transition of predictive uncertainty, it offers a principled measure to capture modeling dynamics evolution. Therefore, we propose NOVA, a training-free token reduction acceleration framework for VAR models via entropy analysis. NOVA adaptively determines the acceleration activation scale during inference by online identifying the inflection point of scale entropy growth. Through scale-linkage and layer-linkage ratio adjustment, NOVA dynamically computes distinct token reduction ratios for each scale and layer, pruning low-entropy tokens while reusing the cache derived from the residuals at the prior scale to accelerate inference and maintain generation quality. Extensive experiments and analyses validate NOVA as a simple yet effective training-free acceleration framework. Code is available.

Abstract:
Continual posttraining (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to ever-evolving downstream tasks. While existing research primarily focus on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted across multiple multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model's general knowledge on standard benchmarks, while SFT degrades general model capabilities severely. Further analysis reveals that this stability is not primarily due to explicit mechanisms like KL penalty or chain-of-thought reasoning. We investigate RFT's learning dynamics and find that its selective update mechanism inherently prevents interference with established knowledge. Based on this insight, we propose a rollout-based instance filtering algorithm (RIF-RFT) that enhances the training efficiency of RFT by focusing on learnable samples. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training

Abstract:
Sequential LLM agents fail on longhorizon planning with hard constraints like budgets and diversity requirements. As planning progresses and context grows, these agents drift from global constraints. We propose HiMAP-Travel, a hierarchical multi-agent framework that splits planning into strategic coordination and parallel day-level execution. A Coordinator allocates resources across days, while Day Executors plan independently in parallel. Three key mechanisms enable this: a transactional monitor enforcing budget and uniqueness constraints across parallel agents, a bargaining protocol allowing agents to reject infeasible sub-goals and trigger re-planning, and a single policy trained with GRPO that powers all agents through role conditioning. On TravelPlanner, HiMAP-Travel with Qwen3-8B achieves 52.78% validation and 52.65% test Final Pass Rate (FPR). In a controlled comparison with identical model, training, and tools, it outperforms the sequential DeepTravel baseline by +8.67pp. It also surpasses ATLAS by +17.65pp and MTP by +10.0pp. On FlexTravelBench multi-turn scenarios, it achieves 44.34% (2-turn) and 37.42% (3-turn) FPR while reducing latency 2.5x through parallelization.

Abstract:
Generative Flow Networks (GFlowNets) learn to sample diverse candidates in proportion to a reward function, making them wellsuited for scientific discovery, where exploring multiple promising solutions is crucial. Further extending GFlowNets to multi-objective settings has attracted growing interest since real-world applications often involve multiple, conflicting objectives. However, existing approaches require additional training for each set of objectives, limiting their applicability and incurring substantial computational overhead. We propose a training-free mixing policy that composes pre-trained GFlowNets at inference time, enabling rapid adaptation without finetuning or retraining. Importantly, our framework is flexible, capable of handling diverse reward combinations ranging from linear scalarization to complex non-linear logical operators, which are often handled separately in previous literature. We prove that our method exactly recovers the target distribution for linear scalarization and quantify the approximation quality for nonlinear operators through a distortion factor. Experiments on a synthetic 2D grid and real-world molecule-generation tasks demonstrate that our approach achieves performance comparable to baselines that require additional training.

Abstract:
Diffusion policies have recently been as a powerful paradigm for visuomotor control in robotic manipulation due to their ability to model the distribution of action sequences and capture multimodality. However, iterative denoising leads to substantial inference latency, limiting control frequency in realtime closed-loop systems. Existing acceleration methods either reduce sampling steps, bypass diffusion through direct prediction, or reuse past actions, but often struggle to jointly preserve action quality and achieve consistently low latency. In this work, we proposeSTEP, a lightweight spatiotemporal consistency prediction mechanism to construct high-quality warm-start actions that are both distributionally close to the target action and temporally consistent, without compromising the generative capability of the original diffusion policy. Then, we propose a velocity-aware perturbation injection mechanism that adaptively modulates actuation excitation based on temporal action variation to execution stall especially for real-world tasks. We further provide a theoretical analysis showing that the proposed prediction induces a locally contractive mapping, ensuring convergence of action errors during diffusion refinement. We conduct extensive evaluations on nine simulated benchmarks and two real-world tasks. Notably, STEP with 2 steps can achieve an average 21.6% and 27.5% higher success rate than BRIDGER and DDIM on the RoboMimic benchmark and real-world tasks, respectively. These results demonstrate that \we consistently advances the Pareto frontier of inference latency and success rate over existing methods.

Abstract:
Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs or building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet, the interaction between such noise and widely used groupbased policy optimization methods remains underexplored. We introduce a noise-robust Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) framework that explicitly models reward corruption as Bernoulli noise. Our method applies noise correction after estimating reward flip probabilities to debias the learning signal, yielding unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, we observe consistent improvements across math and code tasks when applying our noise correction to standard reward model usage, with particular gains of up to 6.7 percentage points in accuracy on math tasks and 1.5 on code tasks under realistic reward model conditions. This work bridges label-noise correction from supervised learning with modern RLHF, offering both theoretical insights and a practical algorithm for noisy real-world deployment.

Abstract:
Despite recent progress, recovering parametric CAD construction sequences from geometric input, such as meshes or point clouds, is a key challenge for design and manufacturing, as existing CAD reconstruction and generation methods are largely restricted to difficultto-edit formats like meshes or Breps or editable simple sketch-and-extrude pipelines and low-complexity datasets. We introduce CADFit, a hybrid optimization-based CAD reconstruction framework that recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations using geometric feedback. Our approach is distinguished by formulating reconstruction as an IoU-driven optimization over structured CAD programs and supporting a rich set of operations, including extrusions, revolutions, fillets, and chamfers. Experiments on multiple CAD benchmarks show that CADFit outperforms state-of-the-art mesh-to-CAD methods in volumetric Intersection-over-Union and Chamfer Distance, while substantially reducing the Invalid Ratio of reconstructed CAD programs, particularly for complex designs. We further present a multimodal pipeline that enables end-to-end reconstruction of CAD construction sequences from images by combining image-based geometry reconstruction with CADFit. By enabling accurate reconstruction of higher-complexity CAD models, CADFit provides a practical foundation for generating richer datasets and advancing future learning-based approaches to CAD reverse engineering.

Abstract:
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batchparallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.

Abstract:
Information retrieval is a core component of many intelligent systems as it enables conditioning of outputs on new and largescale datasets. While effective, the standard practice of encoding data into high-dimensional representations for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. Hierarchical retrieval methods offer an interpretable alternative by organizing data at multiple granular levels, yet do not match the efficiency and performance of flat retrieval approaches. In this paper, we propose Retreever, a tree-based method that makes hierarchical retrieval viable at scale by directly optimizing its structure for retrieval performance while naturally providing transparency through meaningful semantic groupings. Our method offers the flexibility to balance cost and utility by indexing data using representations from any tree level. We show that Retreever delivers strong coarse (intermediate levels) and fine representations (terminal level), while achieving the highest retrieval accuracy at the lowest latency among hierarchical methods. These results demonstrate that this family of techniques is viable in practical applications.

Abstract:
Finetuning large language models (LLMs) on sensitive datasets raises privacy concerns, as training data extraction (TDE) attacks can expose highly confidential information. Existing defenses against such attacks either lack formal privacy guarantees or incur substantial utility degradation. We observe that fine-tuning induces widespread probability shifts, yet preserving only a small subset of influential token-level deviations is sufficient; the remaining shifts can be aggressively smoothed with minimal impact on utility. Motivated by this insight, we propose SCP-\Delta_r, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and explicitly smooths low-impact tokens using a base model. SCP-\Delta_r achieves orders-of-magnitude better theoretical bounds than existing NAF based methods and provides strong empirical protection against TDE attacks with minimal performance loss.

Abstract:
VisionLanguage-Action (VLA) models have recently shown strong generalization, with some approaches seeking to explicitly generate linguistic reasoning traces or predict future observations prior to execution. However, explicit reasoning typically incurs non-negligible inference latency, which constrains the temporal resolution required for robotic manipulation. Moreover, such reasoning is confined to the linguistic space, imposing a representational bottleneck that struggles to faithfully capture ineffable physical attributes. To mitigate these limitations, we propose LaST_0, a framework that enables efficient reasoning before acting through a Latent Spatio-Temporal Chain-of-Thought (CoT), capturing fine-grained physical and robotic dynamics that are often difficult to verbalize. Specifically, we introduce a token-efficient latent CoT space that models future visual dynamics, 3D structural information, and robot proprioceptive states, and further extends these representations across time to enable temporally consistent implicit reasoning trajectories. Furthermore, LaST_0 adopts a dual-system architecture implemented via a Mixture-of-Transformers design, where a reasoning expert conducts low-frequency latent inference and an acting expert generates high-frequency actions conditioned on robotics-oriented latent representations. To facilitate coordination, LaST_0 is trained with heterogeneous operation frequencies, enabling adaptive switching during deployment. Across 10 real-world tasks spanning tabletop, mobile, and dexterous hand manipulation, LaST_0 improves mean success rates by 13%, 14% and 14% over prior SOTA VLA methods, respectively.

Abstract:
Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods primarily rely on outcomebased supervision to strengthen internal LLM reasoning, often leading to inefficient exploration and sparse rewards. To mitigate this issue, we propose Expert-Assisted Policy Optimization (EAPO), a novel RL framework that enhances exploration by incorporating multi-turn interactions with external experts during training. Unlike prior methods, where policies reason in isolation, EAPO incentivizes the policy to adaptively determine when and how to consult experts, yielding richer reward signals and more reliable reasoning trajectories. External assistance ultimately internalizes expert knowledge into the policy model, amplifying the model’s inherent reasoning capabilities. During evaluation, the policy model has been well-optimized to solve questions independently, producing improved reasoning paths and more accurate solutions. On AIME 2024/2025 and AIMO 2025, EAPO consistently outperforms expert-assisted, expert-distilled, and RL baselines, averaging a 5-point gain over self-exploration RL, and also generalizes to non-math benchmarks, including HumanEval, HLE, GPQA, MMLU, EvalPlus, HotpotQA, and SimpleQA.

Abstract:
Simulationbased inference (SBI) enables amortized Bayesian inference by first training a neural posterior estimator (NPE) on prior-simulator pairs, typically through low-dimensional summary statistics, which can then be cheaply reused for fast inference by querying it on new test observations. Because NPE is estimated under the training data distribution, it is susceptible to misspecification when observations deviate from the training distribution. Many robust SBI approaches address this by modifying NPE training or introducing error models, coupling robustness to the inference network and compromising amortization and modularity. We introduce minimum-distance summaries, a plug-in robust NPE method that adapts queried test-time summaries independently of the pretrained NPE. Leveraging the maximum mean discrepancy (MMD) as a distance between observed data and a summary-conditional predictive distribution, the adapted summary inherits strong robustness properties from the MMD. We demonstrate that the algorithm can be implemented efficiently with random Fourier feature approximations, yielding a lightweight, model-free test-time adaptation procedure. We provide theoretical guarantees for the robustness of our algorithm and empirically evaluate it on a range of synthetic and real-world tasks, demonstrating substantial robustness gains with minimal additional overhead.

Abstract:
Videobased visible-infrared person re-identification (VVI-ReID) aims to learn robust video-level representations under modality discrepancy. However, existing methods typically rely on Euclidean geometry, which is suboptimal for modeling the complex temporal dynamics within visible and infrared tracklets, as it inevitably distorts the intrinsic hierarchical structure inherent in diverse temporal variations (e.g., occlusion, pose). In this paper, we propose Hyperbolic Hierarchical Alignment (HHA), which unifies spatio-temporal modeling and cross-modality alignment on the Poincar\'e ball. HHA employs a Hyperbolic Hierarchical Spatio-Temporal Aggregator (HHSA) to organize time-varying cues into low-distortion hierarchical representations via Hyperbolic Geometry Interaction (HGI) and Dual-Geometry Fusion (DGF). Furthermore, we introduce Geometry-Aware Modality Alignment (GMA), which integrates Hyperbolic Modality Alignment (HMA) to couple modality centroids for geometric consistency and Hyperbolic Prototype Alignment (HPA) to anchor both modalities to shared identity prototypes for robust discrimination. Experiments on HITSZ-VCM and BUPTCampus demonstrate state-of-the-art performance.

Abstract:
Learningbased whole-body controllers have become a key driver for humanoid robots, yet most existing approaches require robot-specific training. In this paper, we study the problem of cross-embodiment humanoid control and show that a single policy can robustly generalize across a wide range of humanoid robot designs with one-time training. We introduce XHugWBC, a novel cross-embodiment training framework that enables generalist humanoid control through: (1) physics-consistent morphological randomization, (2) semantically aligned observation and action spaces across diverse humanoid robots, and (3) effective policy architectures modeling morphological and dynamical properties. XHugWBC is not tied to any specific robot. Instead, it internalizes a broad distribution of morphological and dynamical characteristics during training. By learning motion priors from diverse randomized embodiments, the policy acquires a strong structural bias that supports zero-shot transfer to previously unseen robots. Experiments on twelve simulated humanoids and seven real-world robots demonstrate the strong generalization and robustness of the resulting universal controller.

Abstract:
Deep neural networks often generalize well despite heavy overparameterization, challenging classical parameter-based analyses. We study generalization from a representation-centric perspective and analyze how the geometry of learned embeddings controls predictive performance for a fixed trained model. We show that population risk can be bounded by two factors: (i) the intrinsic dimension of the embedding distribution, which determines the convergence rate of empirical embedding distribution to the population distribution in Wasserstein distance, and (ii) the sensitivity of the downstream mapping from embeddings to predictions, characterized by Lipschitz constants. Together, these yield an embedding-dependent error bound that does not rely on parameter counts or hypothesis class complexity. At the final embedding layer, architectural sensitivity vanishes and the bound is dominated by embedding dimension, explaining its strong empirical correlation with generalization performance. Experiments across architectures and datasets validate the theory and demonstrate the utility of embedding-based diagnostics.

Abstract:
Proactive and realtime interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios—commentator and guide—selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications. Code is available at https://anonymous.4open.science/r/Proact-VL-8699/.

Abstract:
Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memoryintensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM eliminates up to 90% of the memory overhead of optimizer states and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) have significantly advanced the reasoning capabilities of large language models. Extending these methods to multimodal settings, however, faces a critical challenge: the instability of stdbased normalization, which is easily distorted by extreme samples with nearly positive or negative rewards. Unlike pure-text LLMs, multimodal models are particularly sensitive to such distortions, as both perceptual and reasoning errors influence their responses. To address this, we characterize each sample by its difficulty, defined through perceptual complexity (measured via visual entropy) and reasoning uncertainty (captured by model confidence). Building on this characterization, we propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares the std within each group. Our approach preserves GRPO's intra-group distinctions while eliminating sensitivity to extreme cases, yielding significant performance gains across multiple multimodal reasoning benchmarks.

Abstract:
The finite symmetric group S_n provides a natural domain for permutations, yet learning probability distributions on S_n is challenging due to its factorially growing size and discrete, nonEuclidean structure. Recent permutation diffusion methods define forward noising via shuffle-based random walks (e.g., riffle shuffles) and learn reverse transitions with Plackett–Luce (PL) variants, but the resulting trajectories can be abrupt and increasingly hard to denoise as n grows. We propose Soft-Rank Diffusion, a discrete diffusion framework that replaces shuffle-based corruption with a structured soft-rank forward process: we lift permutations to a continuous latent representation of order by relaxing discrete ranks into soft ranks, yielding smoother and more tractable trajectories. For the reverse process, we introduce contextualized generalized Plackett–Luce (cGPL) denoisers that generalize prior PL-style parameterizations and improve expressivity for sequential decision structures. Experiments on sorting and combinatorial optimization benchmarks show that Soft-Rank Diffusion consistently outperforms prior diffusion baselines, with particularly strong gains in long-sequence and intrinsically sequential settings.

Abstract:
Large Language Models (LLMs) have shown significant potential in scientific discovery but struggle to bridge the gap between theoretical reasoning and verifiable physical simulation. Existing solutions operate in a passive "executethen-response" loop and thus lack runtime perception, obscuring agents to transient anomalies (e.g., numerical instability or diverging oscillations). To address this limitation, we propose EmbodiedAct, a framework that transforms established scientific software into active embodied agents by grounding LLMs in embodied actions with a tight perception-execution loop. We instantiate EmbodiedAct within MATLAB and evaluate it on complex engineering design and scientific modeling tasks. Extensive experiments show that EmbodiedAct significantly outperforms existing baselines, achieving SOTA performance by ensuring satisfactory reliability and stability in long-horizon simulations and enhanced accuracy in scientific modeling.

Abstract:
Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAIo3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging due to the need for joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3-Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning by highlighting key timestamps, objects, and bounding boxes, making the reasoning process traceable and verifiable. To enable this capability, we first construct high-quality datasets STGR that provide unified spatio-temporal supervision, which is absent in existing resources. We further adopt a cold-start reinforcement learning strategy with specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3-Video achieves state-of-the-art performance, improving mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline, and shows consistent gains across a range of video understanding benchmarks. Beyond accuracy, the grounded reasoning traces produced by Open-o3-Video support confidence-aware test-time scaling, improving answer reliability. The code and datasets will be made publicly available.

Abstract:
While recent Multimodal Large Language Models (MLLMs) have attained significant strides in multimodal reasoning, their reasoning processes remain predominantly textcentric and fail to visualize and track intermediate visual states during the reasoning process, leading to suboptimal performance in complex long-horizon, vision-centric tasks. Moving beyond the constraints of text-centric reasoning, we establish Generative Multimodal Reasoning as a novel paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, where the iterative denoising trajectory naturally serves as a visual reasoning path. This enables the model to track the evolution of visual information throughout the reasoning process. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across seven tasks demonstrate that DiffThinker significantly outperforms leading closed-source models, including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting Generative Multimodal Reasoning as a promising approach for vision-centric reasoning.

Abstract:
Multimodal representation learning aims to construct a shared embedding space in which heterogeneous modalities are semantically aligned. Despite strong empirical results, InfoNCEbased objectives introduce inherent conflicts that yield distribution gaps across modalities. In this work, we identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases: (i) an alignment–uniformity conflict, whereby the repulsion of uniformity undermines pairwise alignment, and (ii) an intra-alignment conflict, where aligning multiple modalities induces competing alignment directions. To address these issues, we propose a principled decoupling of alignment and uniformity for multimodal representations, providing a conflict-free recipe for multimodal learning that simultaneously supports discriminative and generative use cases without task-specific modules. We then provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder divergence over multiple modality distributions, and thus reduces the distribution gap among modalities. Extensive experiments on retrieval and UnCLIP-style generation demonstrate consistent gains.

Abstract:
Robust optimization safeguards decisions against uncertainty by optimizing against worstcase scenarios, yet their effectiveness hinges on a prespecified robustness level that is often chosen ad hoc, leading to either insufficient protection or overly conservative and costly solutions. Recent approaches using conformal prediction construct data-driven uncertainty sets with finite-sample coverage guarantees, but they still fix coverage targets a priori and offer little guidance for selecting robustness levels. We propose a new framework that provides distribution-free, finite-sample guarantees on both miscoverage and regret for any family of robust predict-then-optimize policies. Our method constructs valid estimators that trace out the miscoverage--regret Pareto frontier, enabling decision-makers to reliably evaluate and calibrate robustness levels according to their cost--risk preferences. The framework is simple to implement, broadly applicable across classical optimization formulations, and achieves sharper finite-sample performance. This paper offers a principled data-driven methodology for guiding robustness selection and empowers practitioners to balance robustness and conservativeness in high-stakes decision-making.

Abstract:
Training with differential privacy (DP) guarantees dataset members that they cannot be identified by users of the released model. However, those data providers, and, in general, the public, lack methods to efficiently verify that models trained on their data satisfy DP guarantees. The amount of compute needed to verify DP guarantees for current algorithms scales with the amount of computation required to train the model. In this paper we design the first DP algorithm with near optimal privacyutility trade-offs but whose DP guarantees can be verified cheaper than training. We focus on DP stochastic convex optimization (DP-SCO), where optimal privacy-utility trade-offs are known. Here we show we can obtain tight privacy-utility trade-offs by privately minimizing a series of regularized objectives and only using the standard DP composition bound. Crucially, this method can be verified with much less compute than training. This leads to the first known DP-SCO algorithm with near optimal privacy-utility whose DP verification scales better than training cost, significantly reducing verification costs on large datasets.

Abstract:
Proteins are shaped by gradual evolution under biophysical and functional constraints. Protein language models learn rich evolutionary constraints from largescale sequence data, and discrete diffusion–based protein language models (e.g., DPLMs) have emerged as a promising framework for both understanding and generation. However, existing DPLMs typically rely on masking-based absorbing diffusion, which conflicts with a basic biological intuition: proteins evolve through accumulated edits rather than emerging from masked tokens. As a result, these frameworks lack explicit pretraining objectives for substitution and insertion/deletion (indel) operations, which in turn limits both optimization-style post-editing and flexible guided generation. To address these limitations, we present DPLM-Evo, an evolutionary discrete diffusion framework that explicitly predicts substitution, insertion, and deletion operations during denoising. \method decouples a fixed-length latent alignment space from the variable-length observed sequence space, making indel-aware generation tractable and enabling adaptive scaffold growth throughout the process with negligible computational overhead. To further align substitutions with real evolutionary dynamics, we introduce a contextual evolutionary noising kernel that induces biologically informed, context-dependent mutation patterns. Across tasks, \method improves sequence understanding and achieves state-of-the-art performance on ProteinGym in the single-sequence setting, while also enabling variable-length simulated evolution, guided generation, and post-editing or optimization of existing proteins via explicit edit trajectories.

Abstract:
We revisit the central online problem of ski rental in the ``algorithms with predictions'' framework from the point of view of distributional predictions. Ski rental was one of the first problems to be studied with predictions, where a natural prediction is simply the number of ski days. But it is both more natural and potentially more powerful to think of a prediction as a distribution \hat p over the ski days. If the true number of ski days is drawn from some true (but unknown) distribution p, then we show as our main result that there is an algorithm with expected cost at most OPT + O\left(\min \left(\max(\eta,1) \cdot \sqrtb,\ b \log b \right) \right), where OPT is the expected cost of the optimal policy for the true distribution p, b is the cost of buying, and \eta is the Earth Mover's (Wasserstein1) distance between p and \hat p. Note that when \eta < o(\sqrtb) this gives additive loss less than b (the trivial bound), and when \eta is arbitrarily large (corresponding to an extremely inaccurate prediction) we still do not pay more than O(b \log b) additive loss. An implication of these bounds is that our algorithm has consistency O(\sqrtb) (additive loss when the prediction error is 0) and robustness O(b \log b) (additive loss when the prediction error is arbitrarily large). Moreover, we do not need to assume that we know (or have any bound on) the prediction error \eta, in contrast with previous work in robust optimization which assumes that we know this error. We complement this upper bound with a variety of lower bounds showing that it is essentially tight: not only can the consistency/robustness tradeoff not be improved, but our particular loss function cannot be meaningfully improved.

Abstract:
Spiking Neural Networks (SNNs) can achieve competitive performance by converting already existing welltrained Artificial Neural Networks (ANNs), avoiding further costly training. This property is particularly attractive in Reinforcement Learning (RL), where training through environment interaction is expensive and potentially unsafe. However, existing conversion methods perform poorly in continuous control, where suitable baselines are largely absent. We identify error amplification as the key cause: small action approximation errors become temporally correlated across decision steps, inducing cumulative state distribution shift and severe performance degradation. To address this issue, we propose Cross-Step Residual Potential Initialization (CRPI), a lightweight training-free mechanism that carries over residual membrane potentials across decision steps to suppress temporally correlated errors. Experiments on continuous control benchmarks with both vector and visual observations demonstrate that CRPI can be integrated into existing conversion pipelines and substantially recovers lost performance. Our results highlight continuous control as a critical and challenging benchmark for ANN-to-SNN conversion, where small errors can be strongly amplified and impact performance.

Abstract:
Despite the remarkable success of Vision Transformers (ViTs) across a wide range of vision tasks, recent studies have revealed that they remain vulnerable to adversarial examples, much like Convolutional Neural Networks (CNNs). A common empirical defense strategy is adversarial training, yet the theoretical underpinnings of its robustness in ViTs remain largely unexplored. In this work, we present the first theoretical analysis of adversarial training under simplified ViT architectures. We show that, when trained under a signalto-noise ratio that satisfies a certain condition and within a moderate perturbation budget, adversarial training enables ViTs to achieve nearly zero robust training loss and robust generalization error under certain regimes. Remarkably, this leads to strong generalization even in the presence of overfitting, a phenomenon known as \emphbenign overfitting, previously only observed in CNNs (with adversarial training). Experiments on both synthetic and real-world datasets further validate our theoretical findings.

Abstract:
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sublinear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel block generation while ensuring sequential validity within each block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.

Abstract:
Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. They rely on fresh onpolicy data, making them sample-inefficient and requiring \mathcalO(\epsilon^-2) trajectories to reach an \epsilon-approximate stationary point. A common strategy to improve efficiency is to reuse information from past iterations, such as previous gradients or trajectories, leading to off-policy PG methods. While gradient reuse has received substantial attention, leading to improved rates up to \mathcalO(\epsilon^-3/2), the reuse of past trajectories, although intuitive, remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that reusing past off-policy trajectories can significantly accelerate PG convergence. We propose RT-PG (Reusing Trajectories - Policy Gradient), a novel algorithm that leverages a power mean-corrected multiple importance weighting estimator to effectively combine on-policy and off-policy data coming from the most recent \omega iterations. Through a novel analysis, we prove that RT-PG achieves a sample complexity of \widetilde\mathcalO(\epsilon^-2\omega^-1). When reusing all available past trajectories, this leads to a rate of \widetilde\mathcalO(\epsilon^-1), the best known one in the literature for PG methods. We further validate our approach empirically, demonstrating its effectiveness against baselines with state-of-the-art rates.

Abstract:
While foundation models have revolutionized various domains, their application to time series classification remains rather underexplored, with existing literature predominantly focused on forecasting. To bridge this gap, we introduce Mantis, a transformer-based foundation model pre-trained exclusively on synthetic data via self-supervised contrastive learning. We demonstrate that effective tokenization is critical to unlocking the full potential of transformers, proposing a novel token generator unit. Furthermore, we introduce an enhanced test-time methodology that bridges the performance gap between Mantis and strong specialized approaches by leveraging intermediate-layer representations, self-ensembling, and cross-model embedding fusion. Extensive experiments demonstrate that Mantis establishes a new state-of-the-art, outperforming existing foundation models across four diverse dataset collections covering various application domains.

Abstract:
A central challenge in large language model (LLM) editing is capability preservation: methods that successfully change targeted behavior can quietly game the editing proxy and corrupt general capabilities, producing degenerate behaviors reminiscent of proxy/reward hacking. We present CrispEdit, a scalable and principled secondorder editing algorithm that treats capability preservation as an explicit constraint, unifying and generalizing several existing editing approaches. CrispEdit formulates editing as constrained optimization and enforces the constraint by projecting edit updates onto the low-curvature subspace of the capability-loss landscape. At the crux of CrispEdit is expressing capability constraint via Bregman divergence, whose quadratic form yields the Gauss–Newton Hessian exactly and even when the base model is not trained to convergence. We make this second-order procedure efficient at the LLM scale using Kronecker-factored approximate curvature (K-FAC) and a novel matrix-free projector that exploits Kronecker structure to avoid constructing massive projection matrices. Across standard model-editing benchmarks, CrispEdit achieves high edit success while keeping capability degradation below 1% on average across datasets, significantly improving over prior editors.

Abstract:
Credit assignment is a core challenge in multiagent reinforcement learning (MARL), especially in large-scale systems with structured, local interactions. Graph-based Markov decision processes (GMDPs) capture such settings via an influence graph, but standard critics are poorly aligned with this structure: global value functions provide weak per-agent learning signals, while existing local constructions can be difficult to estimate and ill-behaved in infinite-horizon settings. We introduce the Diffusion Value Function (DVF), a factored value function for GMDPs that assigns to each agent a value component by diffusing rewards over the influence graph with temporal discounting and spatial attenuation. We show that DVF is well-defined, admits a Bellman fixed point, and decomposes the global discounted value via an averaging property. DVF can be used as a drop-in critic in standard RL algorithms and estimated scalably with graph neural networks. Building on DVF, we propose Diffusion A2C (DA2C) and a sparse message-passing actor, Learned DropEdge GNN (LD-GNN), for learning decentralised algorithms under communication costs. Across the firefighting benchmark and three distributed computation tasks (vector graph colouring and two transmit power optimisation problems), DA2C consistently outperforms local and global critic baselines, improving average reward by up to 11%.

Abstract:
Current neural architectures lack a principled way to handle interchangeable tokens, i.e., symbols that are semantically equivalent yet distinguishable, such as bound variables. As a result, models trained on fixed vocabularies often struggle to generalize to unseen symbols, even when the underlying semantics remain unchanged. We propose a novel Transformerbased mechanism that is provably invariant to the renaming of interchangeable tokens. Our approach employs parallel embedding streams to isolate the contribution of each interchangeable token in the input, combined with an aggregated attention mechanism that enables structured information sharing across streams. Experimental results confirm the theoretical guarantees of our method and demonstrate substantial performance gains on open-vocabulary tasks that require generalization to novel symbols.

Abstract:
In many domains generating variable length sequences through insertions provides greater flexibility over autoregressive models. However, the action space of insertion models is much larger than that of autoregressive models (ARMs) making the learning challenging. To address this, we incorporate trainable order dynamics into the target rates for discrete flow matching, and show that with suitable choices of parameterizations, joint training of the target order dynamics and the generator is tractable without the need for numerical simulation. As the generative insertion model, we use a variable length masked diffusion model, which generates by inserting and filling mask tokens. On graph traversal tasks for which a locally optimal insertion order is known, we explore the choices of parameterization empirically and demonstrate the tradeoffs between flexibility, training stability and generation quality. On de novo small molecule generation, we find that the learned order dynamics leads to a significant increase in validity and quality of the generated molecules, when compared to uniform order dynamics.

Abstract:
Neural operators have shown promise in learning solution maps of partial differential equations (PDEs), but they often struggle to generalize when test inputs lie outside the training distribution, such as novel initial conditions, unseen PDE coefficients or unseen physics. Prior works address this limitation with large scale multi physics pretraining followed by fine tuning, but this still requires examples from the new dynamics, falling short of true zero shot generalization. In this work, we propose a method to enhance generalization at testtime, i.e, without modifying pretrained weights.Building on DISCO, which provides a dictionary of neural operators trained across different dynamics, we introduce a neural operator splitting strategy that, at test time, searches over compositions of training operators to approximate unseen dynamics. On challenging out-of-distribution tasks including parameter extrapolation and novel combinations of physics phenomena, our approach achieves state-of-the-art zero shot generalization results, while being able to recover the underlying PDE parameters. These results underscore test-time computation as a key avenue for building flexible, compositional, and generalizable neural operators.

Abstract:
Optimal transport (OT) finds a least cost transport plan between two probability distributions using a cost matrix defined on pairs of points. Unlike standard OT, which infers unstructured pointwise mappings, lowrank optimal transport explicitly constrains the rank of the transport plan to infer latent structure. This improves statistical stability and robustness, yields sharper parametric rates for estimating Wasserstein distances adaptive to the intrinsic rank, and generalizes K-means to co-clustering. These advantages, however, come at the cost of a non-convex and NP-hard optimization problem. We introduce transport clustering, an algorithm to compute a low-rank OT plan that reduces low-rank OT to a clustering problem on correspondences obtained from a full-rank transport registration step. We prove that this reduction yields polynomial-time, constant-factor approximation algorithms for low-rank OT: specifically, a (1+\gamma) approximation for negative-type metrics and a (1+\gamma+\sqrt2\gamma) approximation for kernel costs, where \gamma \in [0,1] denotes the approximation ratio of the optimal full-rank solution relative to the low-rank optimal. Empirically, transport clustering outperforms existing low-rank OT solvers on synthetic benchmarks and large-scale, high-dimensional datasets.

Abstract:
Many popular practical reinforcement learning (RL) algorithms employ evolving reward functions—through techniques such as reward shaping, entropy regularization, or curriculum learning—yet their theoretical foundations remain underdeveloped. This paper provides the first finitetime convergence analysis of a single-timescale actor-critic algorithm in the presence of an evolving reward function under Markovian sampling. We consider a setting where the reward parameters may change at each time step, affecting both policy optimization and value estimation. Under standard assumptions, we derive non-asymptotic bounds for both actor and critic errors. Our result shows that an O(1/\sqrtT) convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. This rate is preserved when the reward is updated via a gradient-based rule with bounded gradient and on the same timescale as the actor and critic, offering a theoretical foundation for many popular RL techniques. As a secondary contribution, we introduce a novel analysis of distribution mismatch under Markovian sampling, improving the best-known rate by a factor of \log^2T in the static-reward case.

Abstract:
A common assumption in representation learning is that globally welldistributed embeddings support robust and generalizable representations. This focus has shaped both training objectives and evaluation protocols, implicitly treating global geometry as a proxy for representational competence. While global geometry effectively encodes which elements are present, it is often insensitive to how they are composed. We investigate this limitation by testing the ability of geometric metrics to predict compositional binding across 21 vision encoders. We find that standard geometry-based statistics exhibit near-zero correlation with compositional binding. In contrast, functional sensitivity, as measured by the input-output Jacobian, reliably tracks this capability. We further provide an analytic account showing that this disparity arises from objective design, as existing losses explicitly constrain embedding geometry but leave the local input-output mapping unconstrained. These results suggest that global embedding geometry captures only a partial view of representational competence and establish functional sensitivity as a critical complementary axis for modeling composite structure.

Abstract:
We present SorryDB, a dynamicallyupdating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of tactics.

Abstract:
Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OGMAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.

Abstract:
Adapting large pretrained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains largely unexplored. In this work, we introduce Theseus, a training-free method for transporting task-specific updates across heterogeneous models. Rather than matching parameters directly, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over strong baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically.

Abstract:
Deep neural networks excel in vision, language, and audio, yet continue to underperform relative to treebased models on tabular data. We identify and extend inductive biases crucial for tabular learning—robustness to irrelevant features, axis alignment, localized irregularities, feature heterogeneity, and training stability—and proposeLassoFlexNet, a novel architecture coupled with a new training algorithm. LassoFlexNet employs a Tied Group Lasso mechanism that sparsely selects raw inputs based on nonlinear per-feature embeddings. This design encourages a raw input variable to contribute jointly with others only if it provides marginal predictive value, linearly or nonlinearly. The resulting non-homogeneity and localized irregularities introduce optimization challenges that defeat standard stochastic and proximal-gradient methods. To address this, we develop aSequential Hierarchical Proximal Gradient optimizer with exponential moving averages (EMA), enabling stable training. Across 52 datasets from three recent benchmarks, LassoFlexNet matches or surpasses state-of-the-art tree-based models, achieving up to 10% relative gains while improving interpretability. We further validate our design through ablation studies and prove enhanced expressivity for a key architectural component.

Abstract:
Reinforcement learning (RL) in realworld tasks often suffers from the performance degradation due to the distribution shift between training and deployment environments. Distributionally Robust RL (DR-RL) addresses this issue by optimizing the worst-case performance over an uncertainty set of transition dynamics, providing an optimized baseline performance upon deployment. However, existing methods typically require strong data access assumptions (e.g., a generative model or comprehensive offline datasets) and mostly focus on tabular settings. In this paper, we introduce a purely online DR-RL algorithm with general function approximation that learns a robust policy directly from interaction, without any prior knowledge or pre-collected data. Our method uses a dual-based fitted robust Bellman update to jointly learn the value function and the robust backup operator. We establish the first regret guarantee for online DR-RL in terms of an intrinsic complexity measure—the robust Bellman–Eluder (BE) dimension, for general \phi-divergence uncertainty sets. Our regret bound is sublinear and independent of |\mathcalS| and |\mathcalA|, and recovers sharp rates in structured regimes, providing a scalable method for practical DR-RL.

Abstract:
Hebbian and antiHebbian plasticity are widely observed in the brain and are classically modeled as mechanistic, local homosynaptic rules stabilized by homeostatic constraints. This raises an identifiability question: does observing Hebbian/anti-Hebbian structure in synaptic updates uniquely imply an underlying Hebbian computation? We identify an alternative, emergent route. We show that near stationarity, L2 weight decay generically drives the \emphlearning-signal component of many update rules to align with a Hebbian direction, with alignment increasing monotonically with decay strength. This Hebbian-like signature is not specific to SGD and can arise even for non-learning or random update rules long before learning has ceased. We further show that stochastic perturbations can induce anti-Hebbian alignment, yielding a simple tradeoff with weight decay and a phase boundary in regression settings. These mechanisms do not replace standard Hebbian theory; they can coexist with genuine Hebbian plasticity and complicate the interpretation of synaptic measurements, motivating experiments that distinguish mechanistic Hebbian computation from emergent Hebbian signatures.

Abstract:
LLMbased function calling enables intelligent agents to interact with external tools and environments, yet autoregressive decoding imposes a fundamental latency bottleneck that limits real-time applications such as embodied intelligence, game AI, and interactive avatars (e.g., 10 Hz control frequency). We observe that function calling differs fundamentally from free-form text generation: structured outputs exhibit substantial token redundancy (delimiters, parameter names), and arguments exhibit weak causal dependencies. Crucially, these two properties must be exploited jointly to achieve real-time performance. We present RealtimeTool, which introduces special tokens that serve a dual role: compressing low-entropy tokens (4–6× reduction) while acting as mode selectors that enable independent parallel generation of function name and arguments. This synergistic design achieves 3–6× end-to-end speedup (up to 9.6×) with only +8.2% parallelization overhead. Experiments on five benchmarks across Qwen-series models (0.5B–14B) demonstrate substantial speedup while maintaining competitive or improved accuracy. On Mobile Actions, RT-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency. With quantization on consumer-grade GPU, RealtimeTool achieves 61.2ms P50 latency, enabling 16 Hz real-time control at 4B model scale, bridging the gap between LLM function calling and latency-critical real-world deployment.

Abstract:
Reinforcement learning (RL) has become a cornerstone for finetuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid huge memory footprint, we introduce the efficient Binary and Top-K approximations to capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning.

Abstract:
Mixtureof-Experts (MoE) models enable sparse expert activation, meaning that only a subset of the model’s parameters is used during each inference. However, to translate this sparsity into practical performance, an expert caching mechanism is required. Previous works have proposed hardware-centric caching policies, but how these various caching policies interact with each other and different hardware specification remains poorly understood. To address this gap, we develop SpecMD, a standardized framework for benchmarking ad-hoc cache policies on various hardware configurations. Using SpecMD, we perform an exhaustive benchmarking of several MoE caching strategies, reproducing and extending prior approaches in controlled settings with realistic constraints. Our experiments reveal that MoE expert access is not consistent with temporal locality assumptions (e.g LRU, LFU). Motivated by this observation, we propose Least-Stale, a novel eviction policy that exploits MoE's predictable expert access patterns to reduce collision misses by up to 85× over LRU. With such gains, we achieve over 88% hit rates with up to 34.7% Time-to-first-token (TTFT) reduction on OLMoE at only 5% or 0.6GB of VRAM cache capacity.

Abstract:
Large language models (LLMs) are increasingly used to generate executable outputs, JSON objects, and API calls, where a single syntax error can make the output unusable. Constrained decoding enforces validity tokenby-token via masking and renormalization, but it can distort generation when the model assigns low probability mass to valid continuations, pushing decoding toward locally valid yet semantically incorrect trajectories. We propose \emphDraft-Conditioned Constrained Decoding (DCCD), a simple two-step, training-free inference procedure that decouples semantic planning from structural enforcement: an unconstrained draft is generated first, and constrained decoding is then applied, conditioned on this draft, to guarantee validity. We analyze DCCD through a KL-projection view, showing that draft conditioning increases feasible mass and reduces the cumulative “projection tax” induced by hard constraints, with an optional best-of-K draft selection. Across structured reasoning benchmarks, DCCD improves strict structured accuracy by up to +24 percentage points over standard constrained decoding (e.g., 15.2% to 39.0% on GSM8K with a 1B model), and enables smaller model pairs to match or exceed much larger constrained baselines, yielding substantial gains in parameter efficiency.

Abstract:
Policy gradient methods for Large Language Models (LLMs) optimize a policy \pi_\theta via a surrogate objective computed from samples of a rollout policy \pi_\textroll. However, modern LLMRL pipelines suffer from unavoidable implementation divergences—such as backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness. These factors cause an off-policy mismatch (\pi_\textroll \neq \pi_\theta), leading to approximation errors between the surrogate and true objectives, often precipitating training collapse. We demonstrate that classical trust region bounds on this error scale as O(T^2) with sequence length T, rendering them vacuous for long-horizon tasks. To address this, we derive two tighter bounds: a Pinsker-Marginal bound scaling as O(T^3/2) and a Mixed bound scaling as O(T). Crucially, both bounds depend on \mathcalD_\textKL^\max—the maximum token-level KL divergence across the sequence. As this is a sequence-level quantity, it cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences that violate the trust region. TRM theoretically provides the first non-vacuous monotonic improvement guarantees and empirically improves training stability for long-horizon LLM-RL.

Abstract:
Parallel thinking has emerged as a promising paradigm for reasoning, yet it imposes significant computational burdens. Existing efficiency methods primarily rely on local, pertrajectory signals and lack principled mechanisms to exploit global dynamics across parallel branches. We introduce 2D probing, an interface that exposes the width–depth dynamics of parallel thinking by periodically eliciting intermediate answers from all branches. Our analysis reveals three key insights: non-monotonic scaling across width–depth allocations, heterogeneous reasoning branch lengths, and early stabilization of global consensus. Guided by these insights, we introduce Parallel-Probe, a training-free controller designed to optimize online parallel thinking. Parallel-Probe employs consensus-based early stopping to regulate reasoning depth and deviation-based branch pruning to dynamically adjust width. Extensive experiments across three benchmarks and multiple models demonstrate that Parallel-Probe establishes a superior Pareto frontier for test-time scaling. Compared to standard majority voting, it reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.

Abstract:
With the rapid development of VisionLanguage Models (VLMs) and the growing demand for their applications, efficient compression of the image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has yet to be explored by far. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLM against compressed images, varying existing widely used image codecs and diverse set of tasks, encompassing over one million compressed images in our benchmark. Next, we analyse the source of performance gap, by categorising the gap from a) the information loss during compression and b) generalisation failure of VLM. We visualize these gaps with concrete examples and identify that for compressed images, only the generalization gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. Consequently, we demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.

Abstract:
Incomplete multiview multi-label classification addresses scenarios where views and labels are partially missing. While existing methods treat labels solely as supervision signals, they overlook the semantic structure inherent in partial annotations. We propose Label-Guided Representation Learning (LGRL) that systematically exploits label semantics as structural priors throughout learning. Our framework constructs a semantic-informed mixture prior via learnable category prototypes to guide representation extraction, and introduces category-specific conditional posteriors where prototypes act as Bayesian experts for multi-view fusion. We further derive a principled label-driven information bottleneck objective balancing reconstruction sufficiency with cross-view consistency, enabling category-conditional reasoning. Extensive experimental results demonstrate the effectiveness of LGRL across benchmark datasets as well as real-world applications in sports analytics and medical imaging.

Abstract:
Bayesian last layers (BLLs) provide a convenient and computationally efficient way to estimate uncertainty in neural networks. However, they underestimate epistemic uncertainty because they apply a Bayesian treatment only to the final layer, ignoring uncertainty induced by earlier layers. We propose a method that improves BLLs by leveraging a projection of Neural Tangent Kernel (NTK) features onto the space spanned by the lastlayer features. This enables posterior inference that accounts for variability of the full network while retaining the low computational cost of a standard BLL. We show that our method yields posterior variances that are provably greater or equal to those of a standard BLL, correcting its tendency to underestimate epistemic uncertainty. To further reduce computational cost, we introduce a uniform subsampling scheme for estimating the projection matrix and for posterior inference. We derive approximation bounds for both types of sub-sampling. Empirical evaluations on UCI regression, contextual bandits, and out-of-distribution detection tasks in image and tabular datasets, demonstrate improved calibration and uncertainty estimates compared to standard BLLs and competitive baselines, while reducing computational cost.

Abstract:
While recent Textto-Image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, they struggle with the long, detailed prompts required for professional applications. We present DetailMaster, a comprehensive benchmark for evaluating T2I capabilities on long prompts with complex compositional requirements, accompanied by an automated data construction pipeline and an evaluation workflow. Comprising expert-validated prompts averaging 284.89 tokens, our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Spatial/Interactive Relationships. Evaluations on various general-purpose and long-prompt-optimized models reveal critical performance limitations, showing that weak encoders struggle to preserve syntactic dependencies within prompts and diffusion models suffer from attribute leakage under detail-intensive conditions. Through a controlled ablation study under varying constraints, we further show that high-fidelity generation requires a synergistic combination of expanded prompt limits and long-prompt training. We open-source our dataset and code to foster progress in long-prompt-driven T2I generation.

Abstract:
Causal inference in spatial domains faces two intertwined challenges: (1) unmeasured spatial factors, such as weather, air pollution, or mobility, that confound treatment and outcome, and (2) interference from nearby treatments that violate standard nointerference assumptions. While existing methods typically address one by assuming away the other, we show they are deeply connected:interference reveals structurein the latent confounder. Leveraging this insight, we propose theSpatial Deconfounder, a two-stage method that reconstructs a substitute confounder from local treatment vectors using a conditional variational autoencoder (C-VAE) with a spatial prior, then estimates causal effects via a flexible outcome model. We show that this approach enables nonparametric identification of both direct and spillover effects under weak assumptions—without requiring multiple treatment types or a known model of the latent field. Empirically, we extendSpaCE, a benchmark suite for spatial confounding, to include treatment interference, and show that the Spatial Deconfounder consistently improves effect estimation across real-world datasets in environmental health and social science. By turning interference into a multi-cause signal, our framework bridges spatial and deconfounding literatures to advance robust causal inference in structured spatial data.

Abstract:
Embedding physical intuition into network architectures allows the learning of dynamics that enforce fundamental properties, such as energy conservation laws, thereby leading to physicallyplausible predictions. Yet, scaling these models to intrinsically high-dimensional dynamical systems remains a significant challenge. This paper introduces Reduced-order Hamiltonian Neural Network (RO-HNN), a novel physics-inspired neural network that combines the conservation laws of Hamiltonian mechanics with the scalability of model order reduction. RO-HNN is built on two core components: a novel geometrically-constrained symplectic autoencoder that learns a low-dimensional, structure-preserving symplectic submanifold, and a geometric Hamiltonian neural network that models the dynamics on the submanifold. Our experiments demonstrate that RO-HNN provides physically-consistent, stable, and generalizable predictions of complex high-dimensional dynamics, thereby effectively extending the scope of Hamiltonian neural networks to high-dimensional physical systems.

Abstract:
Imagebased Joint-Embedding Predictive Architecture (I-JEPA) offers a promising approach to visual self-supervised learning through masked feature prediction. However with the inherent visual uncertainty at masked positions, feature prediction remains challenging and may fail to learn semantic representations. In this work, we propose Text-Conditional JEPA (TC-JEPA) that uses image captions to reduce the prediction uncertainty. Specifically, we modulate the predicted patch features using a self-supervised, image-grounded text conditioner that computes sparse cross-attention over input text tokens. With such conditioning, patch features become predictable as a function of text, thus are more semantically meaningful. We show TC-JEPA improves downstream performance and training stability, with promising scaling properties. TC-JEPA also offers a new vision-language pretraining paradigm based on feature prediction only, outperforming contrastive methods on diverse tasks, especially those requiring fine-grained visual understanding and reasoning.

Abstract:
The cost of simulator evaluations is a key practical bottleneck for Simulation Based Inference (SBI). In hierarchical settings with shared global parameters and exchangeable sitelevel parameters and observations, this structure can be exploited to improve simulation efficiency. Existing hierarchical SBI approaches factorise the posterior yet still simulate across multiple sites per training sample; We instead explore likelihood factorisation (LF) to train from single-site simulations. In LF sampling we learn a per-site (factorised) likelihood and then assemble synthetic multi-site observations to amortise inference for the full hierarchical posterior. Building on this, we propose Tokenised Flow Matching for Posterior Estimation (TFMPE), a tokenised flow matching approach that supports function-valued observations through likelihood factorisation. To enable systematic evaluation, we introduce a benchmark for hierarchical SBI. We validate TFMPE on this benchmark and on realistic infectious disease and computational fluid dynamics models, finding well-calibrated posteriors while reducing computational cost.

Abstract:
Testtime scaling (TTS) can boost LLM reasoning through long chain-of-thought (CoT), but the linear KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. Query-aware sparse decoding methods can achieve state-of-the-art performance under constrained FLOP budget, but are mainly constrained by both sequential-dependent page filtering and coarse-grained token selection, hampering the serving efficiency and model performance on TTS tasks under high concurrency and long CoT scenarios, where token selection can even occupy higher runtime than the forward pipeline itself. In this paper, we first find that the query state of the current decoding token can be approximated in a unified manner from a short sliding window of recent queries, enabling training-free query-aware sparsity without sequential dependency in the decoding loop. Based on the findings, we propose \textttAsyncSpade, an asynchronous framework for efficient TTS, built on two core components: (1) a novel light-weight temporal-regressive module that predicts the next-token query state, and (2) an asynchronous disaggregated framework that decouples the KV cache selection from the auto-regressive decoding loop, overlapping the token-level KV selection with the forward inference computation through asynchronism, thereby eliminating the sequential dependency without sacrificing model performance. We validate the effectiveness of \textttAsyncSpade on common LLM serving setups with an A100 node, where \textttAsyncSpade can fully overlap KV-cache operations with the inference pipeline within a certain workload range, achieving theoretical optimal time-per-output-token~(TPOT). Specifically, \textttAsyncSpade delivers over 20% reduction on TPOT compared to SoTA baseline (i.e. Quest) and at least 50% TPOT reduction compared to full attention on Qwen3-8B and Qwen3-32B models, while matching or surpassing their accuracy on various TTS benchmarks (AIME-24/25, GPQA-Diamond, MATH-500). Our code is available through https://anonymous.4open.science/r/AsyncSpade-063C.

Abstract:
Speculative decoding has emerged as a widely adopted paradigm for accelerating large language model inference, where a lightweight draft model rapidly generates candidate tokens that are then verified in parallel by a larger target model. However, due to limited model capacity, drafts often struggle to approximate the target distribution, resulting in shorter acceptance lengths and diminished speedup. A key yet underexplored observation is that speculative decoding inherently providesverification feedbackthat quantifies the deviation between the draft and target models at no additional cost. This process naturally forms an iterative "draft commits–feedback provides–draft adapts" evolving loop, which precisely matches theonline learningparadigm. Motivated by this connection, we propose OnlineSPEC, a unified framework that systematically leverages interactive feedback to continuously evolve draft models. Grounded indynamic regret minimization, we establish a formal link between online learning performance and speculative system's acceleration rate, and develop novel algorithms via modern online learning techniques, including optimistic online learning that adaptively reuses historical gradients as predictive update hints, and online ensemble learning that dynamically maintains multiple draft models. Our algorithms are equipped with theoretical justifications and improved acceleration rates, achieving up to 24% speedup over seven benchmarks and three foundation models.

Abstract:
Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Lowrank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) They focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices. Consequently, these methods often fall short compared to other compression techniques like pruning and quantization, and introduce runtime overhead such as the extra GEMM kernel launches and memory operations for decomposed small matrices. To address these limitations, we propose A^3, a post-training low-rank approximation framework. A^3 splits a Transformer layer into three functional components, namely \textttQK, \textttOV, and \textttMLP and provides analytical solutions that reduces the hidden dimension size inside each component while minimizing the component's functional loss. This approach directly reduces model sizes, KV cache sizes, and FLOPs without introducing any runtime overheads. Through extensive experiments, we show that A^3 maintains superior performance compared to SoTAs. For example, under the same reduction budget in computation and memory, our low-rank approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2, outperforming the previous SoTA's 7.87 by 3.18. We also show versatile applications of A^3 in KV cache compression, integration with quantization, fine-tuning and mixed-rank assignments.

Abstract:
Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: generation expands intent into implementation, while comprehension compresses implementation back into intent. To address this, we propose RPGEncoder, a framework that generalizes the Repository Planning Graph (RPG) from a static generative blueprint into a unified, high-fidelity representation. RPG-Encoder closes the reasoning loop through three mechanisms: (1) Encoding raw code into the RPG that combines lifted semantic features with code dependencies; (2) Evolving the topology incrementally to decouple maintenance costs from repository scale, reducing overhead by 95.7%; and (3) Operating as a unified interface for structure-aware navigation. In evaluations, RPG-Encoder establishes state-of-the-art repository understanding on SWE-bench Verified with 93.7% Acc@5 and exceeds the best baseline by over 10% on SWE-bench Live. These results highlight our superior fine-grained localization accuracy in complex codebases. Furthermore, it achieves 98.5% reconstruction coverage on RepoCraft, confirming RPG's high-fidelity capacity to mirror the original codebase and closing the loop between intent and implementation.

Abstract:
Analytical solutions to differential equations offer exact, interpretable insight but are rarely available because discovering them requires expert intuition or exhaustive search in combinatorial spaces. We introduce SIGS, a neurosymbolic framework that automates this process. SIGS uses a formal grammar to generate only syntactically valid building blocks, embeds these expressions into a continuous space, and then searches this space to assemble, score, and refine candidate closed-form solutions by minimizing a physics-based residual. This design unifies symbolic reasoning with numerical optimization; the grammar constrains candidate solution blocks to be proper by construction, while the latent search makes exploration tractable and data-free. SIGS is the first neuro-symbolic method to (i) analytically solve coupled systems of nonlinear PDEs, (ii) discover solutions under grammar misspecification, and (iii) produce accurate symbolic approximations for PDEs lacking known closed-form solutions. Overall, SIGS achieves orders-of-magnitude improvements in accuracy and efficiency over existing symbolic methods on standard benchmarks.

Abstract:
Learning rate warmup -- increasing the learning rate at the beginning of training -- has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the (L_0, L_1)-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We show -- both theoretically and empirically -- that this condition is satisfied by common neural architectures and accurately captures the curvature of the optimization landscape early in training. Adapting the learning rate in response to this curvature condition naturally induces a warm-up–like schedule, and we show that this choice yields provably faster convergence guarantees than using a fixed learning rate. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the agreement between our theoretically derived schedule and standard warm-up.

Abstract:
Continuoustime reinforcement learning (CTRL) provides a natural framework for sequential decision-making in dynamic environments where interactions evolve continuously over time. While CTRL has shown growing empirical success, its ability to adapt to varying levels of problem difficulty remains poorly understood. In this work, we investigate the instance-dependent behavior of CTRL and introduce a simple, model-based algorithm built on maximum likelihood estimation (MLE) with a general function approximator. Unlike existing approaches that estimate system dynamics directly, our method estimates the state marginal density to guide learning. We establish instance-dependent performance guarantees by deriving a regret bound that scales with the total reward variance and measurement resolution. Notably, the regret becomes independent of the specific measurement strategy when the observation frequency adapts appropriately to the problem’s complexity. To further improve performance, our algorithm incorporates a randomized measurement schedule that enhances sample efficiency without increasing measurement cost. These results highlight a new direction for designing CTRL algorithms that automatically adjust their learning behavior based on the underlying difficulty of the environment.

Abstract:
Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose AffineScaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights. This design relaxes the strict normalization constraint while maintaining aggregation of value representations, allowing the model to adjust both the relative distribution and the scale of attention in a controlled manner. We empirically evaluate Affine-Scaled Attention in large-scale language model pretraining across multiple model sizes. Experimental results show consistent improvements in training stability, optimization behavior, and downstream task performance compared to standard softmax attention and attention sink baselines. These findings suggest that modest reweighting of attention outputs provides a practical and effective way to improve attention behavior in Transformer models.

Abstract:
As Textto-Image (T2I) diffusion models are increasingly used in real-world creative workflows, a principled framework for valuing contributors who provide a collection of data is essential for fair compensation and sustainable data marketplaces. While the Shapley value offers a theoretically grounded approach to attribution, it faces a dual computational bottleneck: (i) the prohibitive cost of exhaustive model retraining for each sampled subset of players (i.e., data contributors) and (ii) the combinatorial number of subsets needed to estimate marginal contributions due to contributor interactions. To this end, we proposeSurrogateSHAP, a retraining-free framework that approximates the expensive retraining game through inference from a pretrained model. To further improve efficiency, we employ a gradient-boosted tree to approximate the utility function and derive Shapley values analytically from the tree-based model. We evaluate SurrogateSHAP across three diverse attribution tasks: (i) image quality for DDPM-CFG on CIFAR-20, (ii) aesthetics for Stable Diffusion on Post-Impressionist artworks, and (iii) product diversity for FLUX.1 on Fashion-Product data. Across settings, SurrogateSHAP outperforms prior methods while substantially reducing computational overhead, consistently identifying influential contributors across multiple utility metrics. Finally, we demonstrate that SurrogateSHAP effectively localizes data sources responsible for spurious correlations in clinical images, providing a scalable path toward auditing safety-critical generative models. Code is available at https://anonymous.4open.science/r/CFG-Attribution-15DD/

Abstract:
While Experience Replay—the practice of storing rollouts and reusing them multiple times during training—is a foundational technique in general RL, it remains largely unexplored in LLM posttraining due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading -- and in some cases even improving -- final model performance, while preserving policy entropy.

Abstract:
Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an isoparameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence speed. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.

Abstract:
Stateof-the-art model-based Reinforcement Learning (RL) approaches either use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior to leverage the advantages of both paradigms have shown promising results. However, these approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks. While gradient-based methods are a promising alternative, recent works have empirically shown that gradient-based methods often perform worse than their gradient-free counterparts. We propose Dream-MPC, a novel approach that generates few candidate trajectories from a rolled-out policy and optimizes each trajectory by gradient ascent using a learned world model, uncertainty regularization and amortization of optimization iterations over time by reusing previously optimized actions. Our results on 24 continuous control tasks show that Dream-MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines. We will open source our code and more at https://dream-mpc.github.io.

Abstract:
Alignment has greatly improved large language models (LLMs)’ output quality at the cost of diversity, yielding highly similar outputs across generations, especially in openended generation tasks. We propose Base-Aligned Model Collaboration (BACo), an inference-time token-level model collaboration framework that dynamically combines a base LLM with its aligned counterpart to optimize diversity and quality. Using uncertainty and content-based signals, BACO employs routing strategies to determine, at each token, which model to decode from. Prior diversity-promoting methods often improve diversity at the expense of quality or require expensive decoding or post-training. In contrast, BACO achieves both high diversity and quality post hoc within a single pass, while offering strong controllability. We introduce a family of effective routing strategies and evaluate them across three open-ended generation tasks with 13 diversity and quality metrics. BACo consistently surpasses state-of-the-art inference-time baselines. With our best router, BACo achieves a 21.3% joint improvement in diversity and quality, which is further supported by human evaluations. Overall, our results demonstrate that collaboration between base and aligned models provides an effective and controllable mechanism for optimizing the diversity-quality trade-off.

Abstract:
Traditional instrumental variable (IV) estimators cannot accommodate more treatments than instruments, a limitation that is critical for highdimensional, unstructured data like clinical treatment pathways. Current practice—applying unsupervised dimension reduction before IV estimation—suffers from substantial omitted treatment bias because the representation learning step ignores the instrument. We propose a novel framework that constructs treatment representations by explicitly incorporating instrumental variables. We prove that this instrument-guided approach ensures the identification of optimal outcome-prediction directions even with limited instruments. Validation on large-scale, semi-synthetic clinical data derived from a major hospital, along with other simulations, shows that our approach significantly outperforms conventional two-stage methods.

Abstract:
Recent multibit watermarking methods for large language models (LLMs) prioritize capacity over reliability, often conflating decoding with detection. Our analysis reveals that existing ECC-based extractors suffer from catastrophic false positive rates (FPR), and applying rejection thresholds merely collapses detection sensitivity (TPR) to random guessing. To resolve this structural limitation, we proposeBREW(Block-wise Reliable Embedding for Watermarking), a framework shifting the paradigm todesignated verification. BREW employs a two-stage mechanism: (i)blind message estimationvia independent block voting, followed by (ii)window-shifting verificationthat rigorously validates the payload against local edits. Experiments demonstrate that BREW achieves a TPR of 0.965 with an FPR of 0.02 under 10% synonym substitution, demonstrating that the high-FPR issue is not an inherent trade-off of multi-bit watermarking, but a solvable structural flaw of prior decoding-centric designs. Our framework is model-agnostic and theoretically grounded, providing a scalable solution for reliable forensic deployment.

Abstract:
Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large visionlanguage models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment; on the other hand, compact VLMs are efficient but suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we proposeDocVAL, a validated chain-of-thought (CoT) distillation framework that transfers explicit spatial reasoning from large teacher models to compact, deployable student VLMs. DocVAL combines(1)teacher-generated spatial CoT supervision,(2)a rule-based dual-mode validator that filters low-quality training signals and provides fine-grained, pixel-level corrective feedback, and(3)a validation-driven two-stage training procedure with iterative refinement. Text detection is used only as training-time scaffolding for supervision and validation, enabling the final student to operate as a pure VLM without OCR or detection at inference. Across multiple document understanding benchmarks, the proposedDocVALyields consistent improvements of up to6--7 ANLSpoints over comparable compact VLMs. We further introduce mean Average Precision (mAP) as a localization metric for document question answering and report strong spatial grounding performance under this new evaluation. We release95K validator-verified CoT tracesand show that high-quality, validated supervision is more effective than scaling unfiltered data, enabling efficient and trustworthy document grounding. Dataset and implementation: https://anonymous.4open.science/r/DocVAL-1C14

Abstract:
Bridging the gap between datarich training regimes and observation-sparse deployment conditions remains a central challenge in spatiotemporal field reconstruction, particularly when target domains exhibit distributional shifts, heterogeneous structure, and multi-scale dynamics absent from available training data. We present SENDAI, a hierarchical Sparse-measurement, EfficieNt Data AssImilation Framework that reconstructs full spatial states from hyper sparse sensor observations by combining simulation-derived priors with learned discrepancy corrections. We demonstrate the performance on satellite remote sensing, reconstructing MODIS (Moderate Resolution Imaging Spectroradiometer) derived vegetation index fields across six globally distributed sites. Using seasonal periods as a proxy for domain shift, the framework consistently outperforms established baselines that require substantially denser observations---SENDAI achieves a maximum SSIM improvement of 185% over traditional baselines and a 36% improvement over recent high-frequency-based methods. These gains are particularly pronounced for landscapes with sharp boundaries and sub-seasonal dynamics; more importantly, the framework effectively preserves diagnostically relevant structures---such as field topologies, land cover discontinuities, and spatial gradients. By yielding corrections that are more structurally and spectrally separable, the reconstructed fields are better suited for downstream inference of indirectly observed variables. The results therefore highlight a lightweight and operationally viable framework for sparse-measurement reconstruction that is applicable to physically grounded inference, resource-limited deployment, and real-time monitor and control.

Abstract:
We study the estimation of repeatedly nested expectations (RNEs) with a constant horizon (number of nestings) using quantum computing. We propose a quantum algorithm that achieves \varepsilonerror with cost \tilde O(\varepsilon^-1), up to logarithmic factors. Standard lower bounds show this scaling is essentially optimal, yielding an almost quadratic speedup over the best classical algorithm. Our results extend prior quantum speedups for single nested expectations to repeated nesting, and therefore cover a broader range of applications, including optimal stopping. This extension requires a new derandomized variant of the classical randomized Multilevel Monte Carlo (rMLMC) algorithm. Careful de-randomization is key to overcoming a variable-time issue that typically increases quantized versions of classical randomized algorithms.

Abstract:
Recent reinforcementlearning frameworks for visual perception policy usually incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial and object-centric space. In response, we introduce Artemis, a perception-policy learning method that performs structured visual reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states, direct supervision for proposal quality, and avoids ambiguity introduced by language-based reasoning. Building upon verifiable and spatially grounded reasoning chains, Artemis provides a unified architecture for diverse perceptual tasks, without requiring the task-specific designs relied upon by prior perceptual policy models. Trained using grounding and detection sampeles in natural image domains, Artemis generalizes to counting and geometric perception tasks. At its core, a spatially grounded, object-centric chain rule provides a principled foundation for scalable and general perceptual policies.

Abstract:
DecisionFocused Learning (DFL) trains predictors to improve downstream decision quality, but computing regret gradients typically requires differentiating through solvers or relying on surrogate losses, which can be computationally expensive or deviate from the true objective. We show that, under standard regularity with locally stable active constraints, the regret gradient admits a closed-form geometric characterization, equivalent to the prediction error projected onto the tangent space of active constraints, scaled by lo- cal curvature. This reveals that regret gradients can be obtained by filtering decision-irrelevant components from the MSE gradient, providing a simpler and more direct alternative to existing approaches. Based on this, we propose PEAR (Projected Error As Regret-gradient), which computes regret gradients via a reduced linear system over active constraints, avoiding differentiation through solver iterations or additional optimization solves. Experiments on LP benchmarks and a real-world QP task show that PEAR achieves the best decision quality among all baselines while being the most computationally efficient, with gains that persist under constraint shifts.

Abstract:
Mechanistic Interpretability (MI) aims to reverseengineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA). We uncover a fundamental instability at this base layer: exact, single-input CMA scores exhibit high intrinsic variance, implying that the causal effect of a component is a volatile random variable rather than a fixed property. We then demonstrate that circuit discovery pipelines inherit this variance and further amplify it. Fast approximation methods, such as Edge Attribution Patching and its successors, introduce additional estimation noise, while aggregating these noisy scores over datasets leads to fragile structural estimates. Consequently, small perturbations in input data or hyperparameters yield vastly different circuits. We systematically decompose these sources of variance and advocate for more rigorous MI practices, prioritizing statistical robustness and routine reporting of stability metrics.

Abstract:
Visionlanguage model (VLM) based graphical user interface (GUI) agents have shown strong interaction capabilities. However, they often behave unfaithfully, relying on memorized shortcuts rather than grounding actions in displayed screen evidence or user instructions. To address this, we proposeFaithful-Agent, a faithfulness-first framework that reformulates GUI interaction to prioritize evidence groundedness and internal consistency. Faithful-Agent employs a two-stage pipeline: (i) a faithfulness-oriented SFT stage to instill abstainment behaviors under evidence perturbations; (ii) an RFT stage that further amplifies faithfulness by introducing the guided advantage estimator (GuAE), an anchor-based and variance-adaptive advantage tempering mechanism built upon GRPO. GuAE prevents advantage collapse in low-variance rollout groups under sparse GUI rewards, and with a thought-action consistency reward, Faithful-Agent (Stage II) elevates the Trap SR from 13.88% to 80.21% relative to the baseline, while preserving robust general instruction-following performance.

Abstract:
The widespread adoption of Vision Transformers (ViTs) elevates supplychain risk on third-party model hubs, where an adversary can implant backdoors into released checkpoints. Existing ViT backdoor attacks largely rely on poisoned-data training, while prior data-free attempts typically require synthetic-data fine-tuning or extra model components. This paper introduces Data-Free Logic-Gated Backdoor Attacks (DF-LoGiT), a truly data-free backdoor attack on ViTs via direct weight editing. DF-LoGiT exploits ViT’s native multi-head architecture to realize a logic-gated compositional trigger, enabling a stealthy and effective backdoor. We validate its effectiveness through theoretical analysis and extensive experiments, showing that DF-LoGiT achieves near-100% attack success with negligible degradation in benign accuracy and remains robust against representative classical and ViT-specific defenses.

Abstract:
Offline reinforcement learning (RL) learns effective policies from precollected datasets, offering a practical solution for applications where online interactions are risky or costly. Model-based approaches are particularly advantageous for offline RL, owing to their data efficiency and generalizability. However, due to inherent model errors, model-based methods often artificially introduce conservatism guided by heuristic uncertainty estimation, which can be unreliable. In this paper, we introduce VIPO, a novel model-based offline RL algorithm that incorporates self-supervised feedback from value estimation to enhance model training. Specifically, the model is learned by additionally minimizing the inconsistency between the value learned directly from the offline data and the value estimated from the model. We perform comprehensive evaluations from multiple perspectives to show that VIPO can learn a highly accurate model efficiently and consistently outperform existing methods. In particular, it achieves state-of-the-art performance on almost all tasks in both D4RL and NeoRL benchmarks. Overall, VIPO offers a general framework that can be readily integrated into existing model-based offline RL algorithms to systematically enhance model accuracy. Our code is available at~\urlhttps://anonymous.4open.science/r/vipo2025-8FD4.

Abstract:
Multistep theorem prediction is a central challenge in automated reasoning. Existing neural–symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM’s inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.

Abstract:
The increasing scale and complexity of modern model parameters underscore the importance of pretrained models. However, deployment often demands architectures of varying sizes, exposing limitations of conventional pre-training and fine-tuning. To address this, we propose SWEET, a self-supervised framework that performs constraint-based pre-training to enable scalable initialization in vision tasks. Instead of pre-training a fixed-size model, we learn a shared weight template and size-specific weight scalers under Tucker-based factorization, which promotes modularity and supports flexible adaptation to architectures with varying depths and widths. Target models are subsequently initialized by composing and reweighting the template through lightweight weight scalers, whose parameters can be efficiently learned from minimal training data. To further enhance flexibility in width expansion, we introduce width-wise stochastic scaling, which regularizes the template along width-related dimensions and encourages robust, width-invariant representations for improved cross-width generalization. Extensive experiments on classification, detection, segmentation and generation tasks demonstrate the state-of-the-art performance of SWEET for initializing variable-sized vision models.

Abstract:
GUI grounding maps natural language instructions to the correct interface elements, serving as the perception foundation for GUI agents. Existing approaches predominantly rely on finetuning multimodal large language models (MLLMs) using large-scale GUI datasets to predict target element coordinates, which is data-intensive and generalizes poorly to unseen interfaces. Recent attention-based alternatives exploit localization signals in MLLMs attention mechanisms without task-specific fine-tuning, but suffer from low reliability due to the lack of explicit and complementary spatial anchors in GUI images. To address this limitation, we propose Trifuse, an attention-based grounding framework that explicitly integrates complementary spatial anchors. Trifuse integrates attention, OCR-derived textual cues, and icon-level caption semantics via a Consensus-SinglePeak (CS) fusion strategy that enforces cross-modal agreement while retaining sharp localization peaks. Extensive evaluations on four grounding benchmarks demonstrate that Trifuse achieves strong performance without task-specific fine-tuning, substantially reducing the reliance on expensive annotated data. Moreover, ablation studies reveal that incorporating OCR and caption cues consistently improves attention-based grounding performance across different backbones, highlighting its effectiveness as a general framework for GUI grounding.

Abstract:
The target network update frequency (TUF) is a central stabilization mechanism in (deep) Qlearning. However, their selection remains poorly understood and is often treated merely as another tunable hyperparameter rather than as a principled design decision. This work provides a theoretical analysis of target fixing in tabular Q-learning through the lens of approximate dynamic programming. We formulate periodic target updates as a nested optimization scheme in which each outer iteration applies an inexact Bellman optimality operator, approximated by a generic inner loop optimizer. Rigorous theory yields a finite-time convergence analysis for the asynchronous sampling setting, specializing to stochastic gradient descent in the inner loop. Our results deliver an explicit characterization of the bias–variance trade-off induced by the target update period, showing how to optimally set this critical hyperparameter. We prove that constant target update schedules are suboptimal, incurring a logarithmic overhead in sample complexity that is entirely avoidable with adaptive schedules. Our analysis shows that the optimal target update frequency increases geometrically over the course of the learning process.

Abstract:
Simulating trajectories of dynamical systems is a fundamental problem in a wide range of fields such as molecular dynamics, biochemistry, and pedestrian dynamics. Machine learning has become an invaluable tool for scaling physicsbased simulators and developing models directly from experimental data. In particular, recent advances in deep generative modeling and geometric deep learning enable probabilistic simulation by learning complex trajectory distributions while respecting intrinsic permutation and time-shift symmetries. However, trajectories of N-body systems are commonly characterized by high sensitivity to perturbations leading to bifurcations, as well as multi-scale temporal and spatial correlations. To address these challenges, we introduce STFlow (Spatio-Temporal Flow), a generative model based on graph neural networks and hierarchical convolutions. By incorporating data-dependent couplings within the Flow Matching framework, STFlow denoises starting from conditioned random-walks instead of Gaussian noise. This novel informed prior simplifies the learning task by reducing transport cost, increasing training and inference efficiency. We validate our approach on N-body systems, molecular dynamics, and human trajectory forecasting. Across these benchmarks, STFlow achieves the lowest prediction errors with fewer simulation steps and improved scalability.

Abstract:
Due to constraints on privacy, cost, and latency, onpremise deployment of small models is increasingly common. However, most practical pipelines stop at supervised fine-tuning (SFT) and fail to reach the reinforcement learning (RL) alignment stage. The main reason is that RL alignment typically requires either expensive human preference annotation or heavy reliance on high-quality reward models with large-scale API usage and ongoing engineering maintenance, both of which are ill-suited to on-premise settings. To bridge this gap, in this paper, we propose a positive-unlabeled (PU) RL distillation method for on-premise small-model deployment. Without human-labeled preferences or a reward model, our method distills the teacher’s preference-optimization capability from black box generations into a locally trainable student. For each prompt, we query the teacher once to obtain an anchor response, locally sample multiple student candidates, and perform anchor-conditioned self-ranking to induce pairwise or listwise preferences, enabling a fully local training loop via direct preference optimization or group relative policy optimization. Theoretical analysis justifies that the induced preference signal by our method is order-consistent and concentrates on near-optimal candidates, supporting its stability for preference optimization. Experiments demonstrate that our method achieves consistently strong performance under a low-cost setting.

Abstract:
Modern diffusion/flowbased models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose "pixel MeanFlow" (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256×256 resolution (2.22 FID) and 512×512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.

Abstract:
The superficial alignment hypothesis (SAH) posits that large language models learn most of their knowledge during pretraining, and that post-training merely surfaces this knowledge. The SAH, however, lacks a precise definition, which has led to (i) different and seemingly orthogonal arguments supporting it, and (ii) important critiques to it. We propose a new metric calledTask Complexity: the length of the shortest program that achieves a target performance on a task. In this framework, the SAH claims that pre-trained models drastically reduce the task complexity of achieving high performance on many tasks. Our definition unifies prior arguments supporting the SAH, interpreting them as different strategies to find such short programs. Experimentally, we estimate task complexities of mathematical reasoning, machine translation, and instruction following tasks and show that their respective task complexities can be remarkably low when conditioned on a pre-trained model. Further, we find that pre-training enables access to strong performances on our tasks, but it can require programs of gigabytes of length to access them. Post-training, on the other hand, collapses the complexity of reaching this same performance by several orders of magnitude. Overall, our results highlight that task adaptation can require remarkably little information—often just a few kilobytes.

Abstract:
Offline zeroshot reinforcement learning (RL) aims to learn agents that optimize unseen reward functions without additional environment interaction. The standard approach to this problem trains task-conditioned policies by sampling task vectors that define linear reward functions over learned state representations. In most existing algorithms, these task vectors are randomly sampled, implicitly assuming this adequately captures the structure of the task space. We argue that doing so leads to suboptimal zero-shot generalization. To address this limitation, we propose extracting task vectors directly from the offline dataset and using them to define the task distribution used for policy training. We introduce a simple and general reward function extraction procedure that integrates into existing offline zero-shot RL algorithms. Across multiple benchmark environments and baselines, our approach improves zero-shot performance by an average of 20%, highlighting the importance of principled task sampling in offline zero-shot RL.

Abstract:
Effective coordination among unfamiliar partners remains a major challenge in multiagent systems. Existing approaches, such as population-based methods, improve robustness through diversity but often lack mechanisms for efficient adaptation beyond the training distribution. Fine-tuning is also impractical for few-shot learning because it requires a large number of interactions for meaningful improvement. To address these limitations, we propose Coordination Transformers (CooT), a framework that leverages in-context learning (ICL) for real-time partner adaptation. Unlike prior ICL approaches that focus on task generalization, CooT is designed to generalize across diverse partner behaviors. Trained on trajectories from behavior-preferring agents, it learns to align actions with partner intentions purely through observation. We evaluate CooT on two challenging multi-agent benchmarks: Overcooked and Google Research Football. Results show that CooT consistently outperforms population-based methods, gradient-based fine-tuning, and Meta-RL baselines, achieving stable and rapid adaptation without parameter updates. Human evaluations also identify CooT as a preferred collaborator, and our ablations confirm its ability to adapt quickly to new partners and remain stable under sudden partner changes, making it reliable for real-world human-AI collaboration.

Abstract:
We study empirical scaling laws for language model merging measured by crossentropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as (1/k) and links the floor and tail to properties of the base model and the diversity across domains. This law enables \emphpredictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget—turning merging from heuristic practice into a computationally efficient, planable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.

Abstract:
Realizing the symbiotic potential of AI and fusion energy requires bridging a critical "simto-real" gap. Models trained on simulations must generalize reliably under distribution shifts in safety-critical workflows. Focusing on the strongly nonlinear Grad-Shafranov equation (GSE) for tokamak equilibria, we propose a physics-anchored operator learning framework. Through systematic benchmarking across neural operator instantiations, we identify the Transformer-KAN Neural Operator (TKNO) as the state-of-the-art performer. Crucially, we adopt a semi-supervised paradigm that synergizes sparse data supervision with physical loss constraints. This approach significantly mitigates the catastrophic collapse often observed in purely data-driven methods under boundary-shape distribution shifts, thereby ensuring robust extrapolation. Validated on experimental discharge data from the EXL-50U tokamak, the model achieves high-fidelity equilibrium prediction (RMSE < 1.3%) with millisecond-level inference. These results demonstrate that AI has the potential to significantly accelerate fusion research and development.

Abstract:
Current multimodal benchmarks often conflate reasoning with domainspecific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings. To address this, we introduce VisualPuzzles, a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. One major source of our questions is manually translated logical reasoning questions from the Chinese Civil Service Examination. Experiments show that VisualPuzzles requires significantly less intensive domain-specific knowledge and more complex reasoning compared to benchmarks like MMMU, enabling us to better evaluate genuine multimodal reasoning. Evaluations show that state-of-the-art multimodal large language models consistently lag behind human performance on VisualPuzzles, and that strong performance on knowledge-intensive benchmarks does not necessarily translate to success on reasoning-focused, knowledge-light tasks. Additionally, reasoning enhancements such as scaling up inference compute (with "thinking" modes) yield inconsistent gains across models and task types, and we observe no clear correlation between model size and performance. We also found that models exhibit different reasoning and answering patterns on VisualPuzzles compared to benchmarks with heavier emphasis on knowledge. VisualPuzzles offers a clearer lens through which to evaluate reasoning capabilities beyond factual recall and domain knowledge.

Abstract:
In the era of transformer models, masked selfsupervised learning (SSL) has become a foundational training paradigm. A defining feature of masked SSL is that training aggregates predictions across many masking patterns, giving rise to a joint, matrix-valued predictor rather than a single vector-valued estimator. This object encodes how coordinates condition on one another and poses new analytical challenges. We develop a precise high-dimensional analysis of masked modeling objectives in the proportional regime where the number of samples scales with the ambient dimension. Our results provide explicit expressions for the generalization error and characterize the spectral structure of the learned predictor, revealing how masked modeling extracts structure from data. For spiked covariance models, we show that the joint predictor undergoes a Baik–Ben Arous–Péché (BBP)-type phase transition, identifying when masked SSL begins to recover latent signals. Finally, we identify structured regimes in which masked self-supervised learning provably outperforms PCA, highlighting potential advantages of SSL objectives over classical unsupervised methods.

Abstract:
Longform video understanding remains a formidable challenge due to the complexity of modeling long-range temporal dependencies and multi-event narratives. Existing methods often rely on static reasoning or external Visual-Language Models (VLMs), resulting in high computational complexity and sub-optimal performance. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework that operates solely through data-efficient, pure RL post-training. Video-MTR reformulates video understanding as a dynamic decision-making process, where the agent iteratively selects key segments conditioned on the evolving context of previously processed frames and the query. To ensure effective intermediate reasoning and training stability, we introduce a novel gated bi-level reward system, which synergizes trajectory-level rewards (answer correctness) with turn-level rewards (frame-query relevance). This mechanism eliminates the need for data-intensive supervised fine-tuning, thereby substantially reducing reliance on large-scale datasets. Remarkably, Video-MTR achieves competitive or superior performance using only ～8K training samples, compared to existing approaches that require 257K to 4.4M examples. Extensive experiments on benchmarks including VideoMME, MLVU, LongVideoBench, LVBench, and EgoSchema demonstrate that Video-MTR surpasses state-of-the-art methods in both accuracy and efficiency.

Abstract:
Classical optimization theory establishes that zerothorder (ZO) algorithms suffer from a dimension-dependent slowdown, with convergence rates typically scaling with the model dimension compared to first-order methods. However, in contrast to these theoretical expectations, a growing body of recent work demonstrates the successful application of ZO methods to fine-tuning Large Language Models (LLMs) with billions of parameters. To explain this paradox, we derive the one-step learning dynamics of ZO SGD, where the empirical Neural Tangent Kernel (eNTK) naturally emerges as the key term governing the learning behavior. Inspection of the eNTK produced by ZO-SGD reveals that each element corresponds to the inner product of neural tangent vectors projected onto a random low-dimensional subspace. Thus, by invoking the Johnson-Lindenstrauss Lemma, our analysis shows that the fidelity of the ZO eNTK is governed primarily by the number of perturbations. Crucially, the approximation error depends on the model output size rather than the massive parameter dimension. This dimension-free property provides a theoretical justification for the scalability of ZO methods to LLMs finetuning tasks. We believe that this kernel-based framework offers a novel perspective for understanding ZO methods within the context of learning dynamics.

Abstract:
To address the increasing longcontext compute limitations of softmax attention, several subquadratic recurrent operators have been developed. This work includes models such as Mamba-2, DeltaNet, Gated DeltaNet (GDN), and Kimi Delta Attention (KDA). As the space of recurrences grows, a parallel line of work has arisen to taxonomize them. One compelling view is the test-time regression (TTR) framework, which interprets recurrences as performing online least squares updates that learn a linear map from the keys to values. Existing delta-rule recurrences can be seen as first-order approximations to this objective, but notably ignore the curvature of the least-squares loss during optimization. In this work, we address this by introducing preconditioning to these recurrences. Starting from the theory of online least squares, we derive equivalences between linear attention and the delta rule in the exactly preconditioned case. Next, we realize this theory in practice by proposing a diagonal approximation: this enables us to introduce preconditioned variants of DeltaNet, GDN, and KDA alongside efficient chunkwise parallel algorithms for computing them. Empirically, we find that our preconditioned delta-rule recurrences yield consistent performance improvements across synthetic recall benchmarks and language modeling at the 340M and 1B scale.

Abstract:
Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of largescale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity or often target a limited set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges, validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 36,000+ tasks spanning 20 languages and 3,800+ repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 100,000+ tasks with installation instructions, fail-to-pass tests and rich metadata, where the problem statement is generated based on the original pull request description. We validate the collected instances through a diagnostic study that covers a subset of tasks in five programming languages across seven popular models, and provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.

Abstract:
Conditioning unconditional diffusion models typically requires either retraining with conditional inputs or perstep gradient computations (e.g., classifier-based guidance), both of which incur substantial computational overhead. We present a general recipe for efficiently steering unconditional diffusionwithout gradient guidance during inference, enabling fast controllable generation. Our approach is built on two observations about diffusion model structure:Noise Alignment: even in early, highly corrupted stages, coarse semantic steering is possible using a lightweight, offline-computed guidance signal, avoiding any per-step or per-sample gradients.Transferable concept vectors: a concept direction in activation space once learned through Recursive Feature Machines (RFMs) transfers across bothtimestepsandsamples; the same fixed steering vector learned near clean time remains effective when injected at intermediate noise levels for every generation trajectory, providing refined conditional control with efficiency. Experiments on CIFAR-10, ImageNet, and CelebA demonstrate improved accuracy/quality relative to gradient-based guidance, while achieving significant inference speedups.

Abstract:
Multiobjective combinatorial optimization seeks Pareto-optimal solutions over exponentially large discrete spaces, yet existing methods sacrifice generality, scalability, or theoretical guarantees. We reformulate it as an online learning problem over a decomposed decision space, solving position-wise bandit subproblems via adaptive expert-guided sequential construction. This formulation admits regret bounds of O(d\sqrtT \log T) depending on subproblem dimensionality \(d\) rather than combinatorial space size. On standard benchmarks, our method achieves 80--98% of specialized solvers performance while achieving two to three orders of magnitude improvement in sample and computational efficiency over Bayesian optimization methods. On real-world hardware-software co-design for AI accelerators with expensive simulations, we outperform competing methods under fixed evaluation budgets. The advantage grows with problem scale and objective count, establishing bandit optimization over decomposed decision spaces as a principled alternative to surrogate modeling or offline training for multi-objective optimization.

Abstract:
As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detect and mitigate risks. Large Language Models (LLMs) offer promising capabilities for enhancing threat analysis. However, their effectiveness in realworld blue team threat-hunting scenarios remains insufficiently explored. This paper presents CYBERTEAM, a benchmark designed to guide LLMs in blue teaming practice. CYBERTEAM constructs a standardized workflow in two stages. First, it models realistic threat-hunting workflows by capturing the dependencies among analytical tasks from threat attribution to incident response. Next, each task is addressed through a set of operational modules tailored to its specific analytical requirements. This transforms threat hunting into a structured sequence of reasoning steps, with each step grounded in a discrete operation and ordered according to task-specific dependencies. Guided by this framework, LLMs are directed to perform threat-hunting tasks through modularized steps. Overall, CYBERTEAM integrates 30 tasks and 9 operational modules to guide LLMs through standardized threat analysis. We evaluate both leading LLMs and state-of-the-art cybersecurity agents, comparing CYBERTEAM against open-ended reasoning strategies. Our results highlight the improvements enabled by standardized design, while also revealing the limitations of open-ended reasoning in real-world threat hunting.

Abstract:
We propose an adaptive node feature selection approach for graph neural networks (GNNs) that identifies and removes unnecessary features during training. The ability to measure how features contribute to model output is key for interpreting decisions and reducing dimensionality by eliminating unhelpful variables. However, graphstructured data introduces complex dependencies that may be unsuited to classical feature importance metrics. Inspired by this, we present a data-, model-, and task-agnostic method that determines relevant features during training based on changes in validation performance upon permuting feature values. We theoretically motivate our approach by characterizing how the relationships between node data and graph structure influences GNN performance. Empirically, we show that (i) our highly general approach rivals the performance of tailored feature selection approaches that exploit prior assumptions; (ii) we return meaningful feature importance scores well before the GNN is fully trained; and (iii) our scores demonstrably extract relevant properties that inform feature importance for various graph learning settings.

Abstract:
Music recordings often suffer from audio quality issues such as excessive reverberation, distortion, clipping, tonal imbalances, and a narrowed stereo image, especially when created in nonprofessional settings without specialized equipment or expertise. These problems are typically corrected using separate specialized tools and manual adjustments. In this paper, we introduce SonicMaster, the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. SonicMaster is conditioned on natural language instructions to apply targeted enhancements, or can operate in an automatic mode for general restoration. To train this model, we construct the SonicMaster dataset, a large dataset of paired degraded and high-quality tracks by simulating common degradation types with nineteen degradation functions belonging to five enhancements groups: equalization, dynamics, reverb, amplitude, and stereo. Our approach leverages a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions guided by text prompts. Objective audio quality metrics demonstrate that SonicMaster significantly improves sound quality across all artifact categories. Furthermore, subjective listening tests confirm that listeners prefer SonicMaster’s enhanced outputs over other baselines. Demo samples are available through https://msonic793.github.io/SonicMaster/

Abstract:
Reconstructing accurate implicit surface representations from point clouds remains a challenging task, particularly when data is captured using lowquality scanning devices. These point clouds often contain substantial noise, leading to inaccurate surface reconstructions. Inspired by the Noise2Noise paradigm for 2D images, we introduce NoiseSDF2NoiseSDF, a novel method designed to extend this concept to 3D neural fields. Our approach enables learning clean neural SDFs from noisy point clouds through noisy supervision by minimizing the MSE loss between noisy SDF representations, allowing the network to implicitly denoise and refine surface estimations. We evaluate the effectiveness of NoiseSDF2NoiseSDF on benchmarks, including the ShapeNet, ABC, Famous, and Real datasets. Experimental results demonstrate that our framework significantly improves surface reconstruction quality from noisy inputs.

Abstract:
Largescale AI evaluation increasingly relies on aggregating binary judgments from K annotators, including LLMs used as judges. Most classical methods, e.g., Dawid-Skene or (weighted) majority voting, assume annotators are conditionally independent given the true label Y\in\\0,1\\, an assumption often violated by LLM judges due to shared data, architectures, prompts, and failure modes. Ignoring such dependencies can yield miscalibrated posteriors and even confidently incorrect predictions. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors. For class-dependent Ising models, the Bayes log-odds is generally quadratic in votes; for class-independent couplings, it reduces to a linear weighted vote with correlation-adjusted parameters. We present finite-K examples showing that methods based on conditional independence can flip the Bayes label despite matching per-annotator marginals. We prove separation results demonstrating that these methods remain strictly suboptimal as the number of judges grows, incurring nonvanishing excess risk under latent factors. Finally, we evaluate the proposed method on three real-world datasets, demonstrating improved performance over the classical baselines.

Abstract:
Developers often struggle to specify correct training labels and rewards. Perhaps they don't need to. We propose recontextualization, which reduces how often language models "game" training signals, performing misbehaviors those signals fail to penalize. We show recontextualization prevents models from learning to 1) prioritize evaluation metrics over chat response quality; 2) specialcase code to pass incorrect tests; 3) overwrite evaluation functions rather than write correct code; and 4) become sycophantic. Our method works by generating completions from prompts discouraging misbehavior and then recontextualizing them as though they were in response to prompts permitting misbehavior. Recontextualization trains language models to resist misbehavior even when instructions permit it. This mitigates the reinforcement of misbehavior from misspecified training signals, reducing specification gaming without improving the supervision signal.

Abstract:
Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explained by the theory that such recursion increases a network’s depth, allowing it to compactly emulate the capacity of larger models. However, the performance of recursively added layers remains behind the capabilities of one‑pass models with the same feedforward depth. This means that in the looped version, not every recursive step effectively contributes to depth. This raises the question: when and why does latent reasoning improve performance, and when does it result in dead compute? In our work, we analyze the algorithms that latent reasoning provides answer to this question. We show that latent reasoning can be formalized as a classifier‑free guidance and policy improvement algorithm. Building on these insights, we propose to use a training schemes from RL and diffusion methods for latent reasoning modles. Using the Tiny Recursive Model as our testbed, we show that with our modifications we can avoid dead compute steps and reduce the total number of forward passes by 18× while maintaining performance. Broadly speaking, we show how a policy improvement perspective on recursive steps can explain model behavior and provide insights for further improvements.

Abstract:
We aim to develop physics foundation models for science and engineering that provide realtime solutions to Partial Differential Equations (PDEs) which preserve structure and accuracy under adaptation to unseen geometries. To this end, we introduce General-Geometry Neural Whitney Forms (Geo-NeWF): a data-driven finite element method. We jointly learn a differential operator and compatible reduced finite element spaces defined on the underlying geometry. The resulting model is solved to generate predictions, while exactly preserving physical conservation laws through Finite Element Exterior Calculus. Geometry enters the model as a discretized mesh both through a transformer-based encoding and as the basis for the learned finite element spaces. This explicitly connects the underlying geometry and imposed boundary conditions to the solution, providing a powerful inductive bias for learning neural PDEs which we demonstrate improves generalization to unseen domains. We provide a novel parameterization of the constitutive model ensuring existence and uniqueness of the solution. Our approach demonstrates state-of-the-art performance on several steady-state PDE benchmarks and provides a significant improvement over conventional baselines on out-of-distribution geometries.

Abstract:
Prompt tuning introduces learnable prompt vectors that adapt pretrained visionlanguage models to downstream tasks in a parameter-efficient manner. However, under limited supervision, prompt tuning alters pretrained representations and drives downstream features away from the pretrained manifold toward directions that are unfavorable for transfer. This drift degrades generalization. To address this limitation, we propose ManiPT, a framework that performs prompt tuning on the pretrained manifold. ManiPT introduces cosine consistency constraints in both the text and image modalities to confine the learned representations within the pretrained geometric neighborhood. Furthermore, we introduce a structural bias that enforces incremental corrections, guiding the adaptation along transferable directions to mitigate reliance on shortcut learning. From a theoretical perspective, ManiPT alleviates overfitting tendencies under limited data. Our experiments cover four downstream settings: unseen-class generalization, few-shot classification, cross-dataset transfer, and domain generalization. Across these settings, ManiPT achieves higher average performance than baseline methods. Notably, ManiPT provides an explicit perspective on how prompt tuning overfits under limited supervision.

Abstract:
Advancing towards artificial superintelligence requires rich and intelligent perceptual capabilities. A critical frontier in this pursuit is overcoming the limited spatial understanding of Multimodal Large Language Models (MLLMs), where geometry information is essential. Existing methods often address this by rigidly injecting geometric signals into every input, while ignoring their necessity and adding computation overhead. Contrary to this paradigm, our framework endows the model with an awareness of perceptual insufficiency, empowering it to autonomously engage geometric features in reasoning when 2D cues are deemed insufficient. To achieve this, we first introduce an independent geometry input channel to the model architecture and conduct alignment training, enabling the effective utilization of geometric features. Subsequently, to endow the model with perceptual awareness, we curate a dedicated spatialaware supervised fine-tuning dataset. This serves to activate the model’s latent internal cues, empowering it to autonomously determine the necessity of geometric information. Experiments across multiple spatial reasoning benchmarks validate this approach, demonstrating significant spatial gains without compromising 2D visual reasoning capabilities, offering a path toward more robust, efficient and self-aware multi-modal intelligence.

Abstract:
Distribution matching distillation (DMD) aligns a multistep generator with its few-step counterpart to enable high-quality generation under low inference cost. However, DMD tends to suffer from mode collapse, as its reverse-KL formulation inherently encourages mode-seeking behavior, for which existing remedies typically rely on perceptual or adversarial regularization, thereby incurring substantial computational overhead and training instability. In this work, we propose a role-separated distillation framework that explicitly disentangles the roles of distilled steps: the first step is dedicated to preserving sample diversity via a target-prediction (e.g., v-prediction) objective, while subsequent steps focus on quality refinement under the standard DMD loss, with gradients from the DMD objective blocked at the first step. We term this approach Diversity-Preserved DMD (DP-DMD), which, despite its simplicity—no perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images—preserves sample diversity while maintaining visual quality on par with state-of-the-art methods in extensive text-to-image experiments. Code and models will be released publicly.

Abstract:
Kernel methods are powerful tools in statistical learning, but their cubic complexity in the sample size n limits their use on largescale datasets. In this work, we introduce a scalable framework for kernel regression with complexity O(n \log n), fully leveraging GPU acceleration. The approach is based on a Fourier representation of kernels combined with non-uniform fast Fourier transforms (NUFFT), enabling exact, fast, and memory-efficient computations. We instantiate our framework in three settings: Sobolev kernel regression, physics-informed regression, and additive models. When known, the proposed estimators are shown to achieve minimax convergence rates, consistent with classical kernel theory. Empirical results demonstrate that our methods can process up to tens of billions of samples within minutes, providing both statistical accuracy and computational scalability. These contributions establish a flexible approach, paving the way for the routine application of kernel methods in large-scale learning tasks, whenever the kernel norm can be efficiently expressed in the Fourier space and the ambient dimension d is small.

Abstract:
Diffusion models transform noise into data by injecting information that was captured in their neural network during the training phase. In this paper we ask: what is this information? We find that, in pixelspace diffusion models, (1) a large fraction of the total information in the neural network is committed to reconstructing small-scale perceptual details of the image, and (2) the correlations between images and their class labels are informed by the semantic content of the images, and are largely agnostic to the low-level details. We argue that these properties are intrinsically tied to the manifold structure of the data itself. Finally, we show that these facts explain the efficacy of classifier-free guidance: the guidance vector amplifies the mutual information between images and conditioning signals early in the generative process, influencing semantic structure, but tapers out as perceptual details are filled in.

Abstract:
Latent Diffusion Models (LDMs) achieve highfidelity synthesis but suffer from latent space brittleness, causing discontinuous semantic jumps during editing. We introduce a Riemannian framework to diagnose this instability by analyzing the generative Jacobian, decomposing geometry intoLocal Scaling(capacity) andLocal Complexity(curvature). Our study uncovers a Geometric Decoupling": while curvature in normal generation functionally encodes image detail, OOD generation exhibits a functional decoupling where extreme curvature is wasted on unstable semantic boundaries rather than perceptible details. This geometric misallocation identifiesGeometric Hotspots" as the structural root of instability, providing a robust intrinsic metric for diagnosing generative reliability.

Abstract:
Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve stateof-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8-11%, across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.

Abstract:
Deeper modern architectures are costly to train, making hyperparameter transfer preferable to expensive repeated tuning. Maximal Update Parametrization (\muP) helps explain why many hyperparameters transfer across width. Yet depth scaling is less understood for modern architectures, whose computation graphs contain multiple parallel paths and residual aggregation. To unify various nonrecurrent multi-path neural networks such as CNNs, ResNets, and Transformers, we introduce a graph-based notion of effective depth. Under stabilizing initializations and a maximal-update criterion, we show that the optimal learning rate decays with effective depth following a universal -3/2 power law. Here, the maximal-update criterion maximizes the typical one-step representation change at initialization without causing instability, and effective depth is the minimal path length from input to output, counting layers and residual additions. Experiments across diverse architectures confirm the predicted slope and enable reliable zero-shot transfer of learning rates across depths and widths, turning depth scaling into a predictable hyperparameter-transfer problem.

Abstract:
Many materials properties depend on higherorder derivatives of the potential energy surface, yet machine learned interatomic potentials (MLIPs) trained with a standard loss on energy, force, and stress errors can exhibit error in curvature, degrading the prediction of vibrational properties. We introduce phonon fine-tuning (PFT), which directly supervises second-order force constants of materials by matching MLIP energy Hessians to DFT-computed force constants from finite displacement phonon calculations. To scale to large supercells, PFT stochastically samples Hessian columns and computes the loss with a single Hessian-vector product. We also use a simple co-training scheme to incorporate upstream data to mitigate catastrophic forgetting. On the MDR Phonon benchmark, PFT improves Nequix MP by 55% on average across phonon thermodynamic properties and achieves state-of-the-art accuracy among models trained on Materials Project trajectories. PFT also generalizes to improve properties beyond second-derivatives, improving thermal conductivity predictions that rely on third-order derivatives of the potential energy.

Abstract:
Partial agent failure becomes inevitable when systems scale up, making it crucial to identify the subset of agents whose failure causes worstcase system performance degradations. We study this Vulnerable Agent Identification (VAI) problem in large-scale multi-agent reinforcement learning (MARL). We frame VAI as a Hierarchical Adversarial Decentralized Mean Field Control (HAD-MFC), where where the upper level selects vulnerable agents as an NP-hard task and the lower level learns their worst-case adversarial policies via mean-field MARL. The two problems are coupled together, making HAD-MFC difficult to solve. To handle this, we first decouple the hierarchical process by Fenchel-Rockafellar transform, resulting a regularized mean-field Bellman operator for upper level that enables independent learning at each level, thus reducing computational complexity. We next reformulate the upper-level NP-hard problem as an MDP with dense rewards, allowing sequential identification of vulnerable agents via greedy and RL algorithms. This decomposition provably preserves the optimal solution. Experiments show our method effectively identifies more vulnerable agents in large-scale MARL and the rule-based system, fooling system into worse failures, and reveals the vulnerability of each agent in large systems. Code available at \urlhttps://anonymous.4open.science/r/VAI-5F61/.

Abstract:
Selfplay-based policy optimization has emerged as an effective approach for fine-tuning large language models (LLMs), formulating preference optimization as a two-player game. However, the regularization with respect to the reference policy, which is crucial for mitigating over-optimization, has been insufficiently investigated in self-play alignment. To study the impact of different regularization strategies, we propose Regularized Self-Play Policy Optimization (RSPO), a novel framework that unifies prior methods and enables simple plug-and-play regularizers, meanwhile preserving convergence to Nash equilibrium of the corresponding regularized game. We empirically show that RSPO with appropriate regularizers can substantially improve the length-controlled win rate (LCWR) on AlpacaEval-2 across a range of base models, while also achieving consistently superior performance on Arena-Hard, MT-Bench, ArmoRM, and response diversity. In particular, RSPO improves unregularized self-play baseline (SPPO) on AlpacaEval-2 LCWR from 28.5% to 35.4% with base model Mistral-7B, from 38.77% to 43.66% with LLaMA-8B, and from 50.54% to 51.83% with Gemma-2B. Combining simplicity, convergence guarantees, and significant empirical gains, RSPO offers a strong foundation for exploring regularized self-play in alignment.

Abstract:
The prefill stage in longcontext LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token importance estimation, often varying between layers. Evaluating token-ranking quality independently from heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. The diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline.

Abstract:
Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, humanlabelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that it limits the effectiveness of direct probability-based ranking. To address this, we study the \emphLLM-as-a-jury setting and propose BT-\sigma, a judge-aware extension of the Bradley–Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-\sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminator strongly correlates with independent measures of LLM evaluation performance. Further analysis reveals that BT-\sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.

Abstract:
Geometry is central to PDEgoverned systems, motivating shape optimization and inversion. Classical pipelines conduct costly forward simulation with geometry processing, requiring substantial expert effort. Neural surrogates accelerate forward analysis but do not close the loop because gradients from objectives to geometry are often unavailable. Existing differentiable methods either rely on restrictive parameterizations or unstable latent optimization driven by scalar objectives, limiting interpretability and part-wise control. To address these challenges, we propose Geometry-Aware Neural Optimizer (GANO), an end-to-end differentiable framework that unifies geometry representation, field-level prediction, and automated optimization/inversion in a single latent-space loop. GANO encodes shapes with an auto-decoder and stabilizes latent updates via a denoising mechanism, and a geometry-injected surrogate provides a reliable gradient pathway for geometry updates. Moreover, GANO supports part-wise control through null-space projection and uses remeshing-free projection to accelerate geometry processing. We further prove that denoising induces an implicit Jacobian regularization that reduces decoder sensitivity, yielding controlled deformations. Experiments on three benchmarks spanning 2D Helmholtz, 2D airfoil, and 3D vehicles show state-of-the-art accuracy and stable, controllable updates, achieving up to +55.9% lift-to-drag improvement for airfoils and ～ 7% drag reduction for vehicles.

Abstract:
As intents unfold and environments change, multiturn agents face continuously shifting decision contexts. Although reusing past experience is intuitively appealing, existing approaches remain limited: full trajectories are often too context-specific to transfer, while tool-level reuse ignores the context and environment. In this paper, we introduce a hybrid episodic–procedural memory strategy (H-EPM) that enables experience-induced self-evolution of multi-turn tool-use policies, by adaptively reusing partially overlapping successful experiences in both inference and training. Inspired by human episodic–procedural integration, we build a tool graph from accumulated trajectories, where recurring tool-to-tool dependencies capture procedural routines and each edge is augmented with a compact episodic summaries of relevant context. At inference, the agent dynamically balances episodic recall for contextual reasoning and procedural execution for routine steps. Beyond inference, H-EPM introduces a memory-guided reinforcement learning paradigm that directly addresses a core challenge in multi-turn agent RL: ineffective exploration over long trajectories. By biasing exploration toward historically successful tool transitions, H-EPM learns a stronger policy that generalizes during inference without relying on domain-specific experience collection. Experiments show that H-EPM consistently delivers substantial inference-time gains over strong baselines across multi-turn tool-use benchmarks, reaching up to 50%+. It also boosts RL policy performance, achieving up to 40%+ improvement on out-of-distribution tasks.

Abstract:
Existing LLMas-a-Judge systems suffer from three fundamental limitations: limited adaptivity to task and domain-specific evaluation criteria, systematic biases driven by non-semantic cues such as position, length, format, and model provenance, and evaluation inconsistency that leads to contradictory judgments across different evaluation modes (e.g., pointwise versus pairwise). To address these issues, we propose FairJudge, an adaptive, debiased, and consistent LLM-as-a-Judge. Unlike prior approaches that treat the judge as a static evaluator, FairJudge models judging behavior itself as a learnable and regularized policy. From a data-centric perspective, we construct a high--information-density judging dataset that explicitly injects supervision signals aligned with evaluation behavior. Building on this dataset, we adopt a curriculum-style SFT--DPO--GRPO training paradigm that progressively aligns rubric adherence, bias mitigation, and cross-mode consistency, while avoiding catastrophic forgetting. Experimental results on multiple internal and public benchmarks show that FairJudge consistently improves agreement and F1, reduces non-semantic biases, and outperforms substantially larger instruction-tuned LLMs. All resources will be publicly released at https://anonymous.4open.science/r/FairJudge-E7CB to facilitate future research.

Abstract:
Large Language Models are rapidly becoming core components of modern software development workflows, yet ensuring code security remains challenging. Existing vulnerability detection pipelines either rely on static analyzers or use LLM/GNNbased detectors trained with coarse program-level supervision. Both families often require complete context, provide sparse end-of-completion feedback, and can degrade as code length grows, making them ill-suited for real-time, prefix-level assessment during interactive coding and streaming generation. We propose SecCodePRM, a security-oriented process reward model that assigns a context-aware, step-level security score along a code trajectory. To train the model, we derive step-level supervision labels from static analyzers and expert annotations, allowing the model to attend more precisely to fine-grained regions associated with inter-procedural vulnerabilities. SecCodePRM has three applications: full-code vulnerability detection (VD), partial-code VD, and secure code generation (CG). For VD, SecCodePRM uses risk-sensitive aggregation that emphasizes high-risk steps; for CG, SecCodePRM supports inference-time scaling by ranking candidate continuations and favoring higher cumulative reward. This design yields dense, real-time feedback that scales to long-horizon generation. Empirically, SecCodePRM outperforms prior approaches in all three settings, while preserving code functional correctness, suggesting improved security without a safety–utility tradeoff.

Abstract:
Fine‑tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain‑specific fine‑tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a finetuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in‑training safeguards against EMA that are practical for providers who expose fine‑tuning via an API: We evaluate whether they a) prevent broad misalignment, b) allow narrow misalignment, c) learn well on benign tasks, and d) remain coherent. We investigate four training regularization interventions: (i) KL‑divergence regularization toward a safe reference model, (ii) \ell_2 distance in feature space, (iii) preventative steering with an evil persona vector, and (iv) interleaving training examples from a general instruct-tuning dataset. We demonstrate that selecting interleaving data by the perplexity gap between aligned and misaligned models yields the best results overall.

Abstract:
Policygradient methods are widely used in reinforcement learning, yet training often becomes unstable or slows down as learning progresses. We study this phenomenon through the noise-to-signal ratio (NSR) of a policy-gradient estimator, defined as the estimator variance (noise) normalized by the squared norm of the true gradient (signal). Our main result is that, for (i) finite-horizon linear systems with Gaussian policies and linear state-feedback, and (ii) finite-horizon polynomial systems with Gaussian policies and polynomial feedback, the NSR of the REINFORCE estimator can be characterized exactly—either in closed form or via numerical moment-evaluation algorithms—without approximation. For general nonlinear dynamics and expressive policies (including neural policies), we further derive a general upper bound on the variance. These characterizations enable a direct examination of how NSR varies across policy parameters and how it evolves along optimization trajectories (e.g. SGD and Adam). Across a range of examples, we find that the NSR landscape is highly non-uniform and typically increases as the policy approaches an optimum; in some regimes it blows up, which can trigger training instability and policy collapse.

Abstract:
Federated learning~(FL) across multiple HPC facilities faces stochastic \emphadmission delays from batch schedulers that dominate wallclock time. Synchronous FL suffers from severe stragglers, while asynchronous FL accumulates stale updates when queues spike. We propose \fedqueue, a queue-aware FL protocol that incorporates scheduler delays directly into training and aggregation, which (i) predicts per-facility queue delays online to budget local work, (ii) applies cutoff-based admission that buffers late arrivals to bound staleness, and (iii) performs staleness-aware aggregation to stabilize heterogeneous local workloads. We prove the convergence for non-convex objectives at rate \mathcalO(1/\sqrtR) under bounded staleness, and show that the admission controls yield bounded staleness with high probability under queue-prediction error. Real-world cross-facility deployment of \fedqueue shows 20.5% improvement over baseline algorithms. Controlled queue simulations demonstrate robust improvement over the baselines; in particular, about 34% reduction in time to reach a target accuracy level under high queue variance and non-IID partitions.

Abstract:
Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a single policy network, causing simplicity bias where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixtureof-Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing, where the router assigns each token to specialized experts, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose Phase-Aware Mixture of Experts (PA-MoE). It first features a lightweight phase router that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of our proposed PA-MoE. Code is available at https://anonymous.4open.science/r/PA-MoE-576C/.

Abstract:
We proposerefined GRFs(GRFs++), a new class ofGraph Random Features(GRFs) for efficient and accurate computations involving kernels defined on the nodes of a graph. GRFs++ resolve some of the longstanding limitations of regular GRFs, including difficulty modeling relationships between more distant nodes. They reduce dependence on sampling long graph random walks via a novelwalk-stitchingtechnique, concatenating several shorter walks without breaking unbiasedness. By applying these techniques, GRFs++ inherit the approximation quality provided by longer walks but with greater efficiency, trading sequential inefficient sampling of a long walk for parallel computation of short walks and matrix-matrix multiplication. Furthermore, GRFs++ extend the simplistic GRFs walk termination mechanism (Bernoulli schemes with fixed halting probabilities) to a broader class of strategies, applying general distributions on the walks' lengths. This improves approximation accuracy of graph kernels, without incurring extra computational cost. We provide empirical evaluations to showcase our claims and complement our results with theoretical analysis.

Abstract:
Reservoir Computing (RC) has established itself as an efficient paradigm for temporal processing. However, its scalability remains severely constrained by (i) the necessity of processing temporal data sequentially and (ii) the prohibitive memory footprint of highdimensional reservoirs. In this work, we revisit RC through the lens of structured operators and state space modeling to address these limitations, introducing Parallel Echo State Network (ParalESN). ParalESN enables the construction of high-dimensional and efficient reservoirs based on diagonal linear recurrence in the complex space, enabling parallel processing of temporal data. We provide a theoretical analysis demonstrating that ParalESN preserves the Echo State Property and the universality guarantees of traditional Echo State Networks while admitting an equivalent representation of arbitrary linear reservoirs in the complex diagonal form. Empirically, ParalESN matches the predictive accuracy of traditional RC on time series benchmarks, while delivering substantial computational savings. On 1-D pixel-level classification tasks, ParalESN achieves competitive accuracy with fully trainable neural networks while reducing computational costs and energy consumption by orders of magnitude. Overall, ParalESN offers a promising, scalable, and principled pathway for integrating RC within the deep learning landscape.

Abstract:
Multimodal Process Reward Models (MPRMs) are central to steplevel supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies the data efficiency for MPRM training. Our preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora. To explain this, we formalize a theoretical framework and reveal that informative gradient updates depend on two factors: label mixtures of positive/negative steps and label reliability (average MC scores of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability based on existing MC signals at the rollout level, without incurring any additional cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match and even surpass the full-data performance at small fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%.

Abstract:
Benchmarks play a central role in reinforcement learning (RL) research, yet their computational constraints often shape what is studied. Despite the motivation of lifelong learning, most continual RL papers consider only 3–10 sequential tasks, as CPUbound environments make longer sequences impractical. Meanwhile, continual learning in cooperative multi-agent settings remains largely unexplored. To address these gaps, we introduceMEAL(Multi-agentEnvironments forAdaptiveLearning), the first benchmark for continual multi-agent RL. By leveraging JAX and GPU acceleration, MEAL enables training on sequences of 100 tasks on a single GPU in a few hours. We find that long task sequences reveal failure modes that do not appear at smaller scales.

Abstract:
Combining existing pretrained LLMs is a promising avenue for tackling diverse reasoning tasks. However, selecting experts at the task level is often too coarse-grained, as heterogeneous tasks may require different expertise for each instance. To enable instance-level mixing of LLM experts, we propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. Symbolic-MoE uses inferred skills, i.e., specialized knowledge such as algebra in mathematics, for expert selection. Each expert is selected based on how relevant its expertise is to the query, and then generates its own reasoning. This results in k outputs from k experts, which are then synthesized into a final high-quality response by an aggregator, chosen based on its ability to integrate diverse outputs. We show that instance-level expert selection improves performance by a large margin but -- when implemented naively -- can introduce a high computational overhead due to the need for constant model loading and offloading. To address this, we implement a batch inference strategy that groups instances based on their assigned experts, ensuring each model will only be loaded once. This allows us to integrate 16 expert models on a single GPU with a time cost comparable to prior multi-agent baselines using 4 GPUs. Through extensive evaluations on diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), Symbolic-MoE shows an absolute average improvement of 8.15% over the best baseline. Moreover, Symbolic-MoE generalizes well to unseen tasks and removes the need for expensive multi-round discussions, outperforming discussion baselines with less computation.

Abstract:
In interactive task learning (ITL), AI agents learn new capabilities from limited human instruction provided during task execution. STAND is a new method of dataefficient rule precondition induction specifically designed for these human-in-the-loop training scenarios. A key feature of STAND is its self-awareness of its own learning—it can provide accurate metrics of training progress back to users. STAND beats popular methods like XGBoost, decision trees, random forests, and version spaces at small-data precondition induction tasks, and is highly accurate at estimating when its performance improves on holdout examples. In our evaluations, we find that STAND shows more monotonic improvement than other models with low rates of error recurrence. These features of STAND support a more consistent training experience, enabling human instructors to estimate when they are finished training and providing active-learning support by identifying trouble spots where more training is required.

Abstract:
Pruning is a typical acceleration technique for computebound models by removing computation on unimportant values. Recently, it has been applied to accelerate Vision-Language-Action (VLA) model inference. However, existing acceleration methods focus on local information from the current action step and ignore the global context, leading to >20% success rate drop and limited speedup in some scenarios. In this paper, we point out spatial-temporal consistency in VLA tasks: input images in consecutive steps exhibit high similarity, and propose the key insight that token selection should combine local information with global context of the model. Based on this, we propose SpecPrune-VLA, a training-free, two-level pruning method with heuristic control. (1) Action-level static pruning. We leverage global history and local attention to statically reduce visual tokens per action. (2) Layer-level dynamic pruning. We prune tokens adaptively per layer based on layer-wise importance. (3) Lightweight action-aware controller: We classify actions as coarse- or fine-grained by the speed of the end effector and adjust pruning aggressiveness accordingly. Extensive experiments show that SpecPrune-VLA achieves up to 1.57× speedup in LIBERO simulation and 1.70× on real-world tasks, with negligible success rate degradation.

Abstract:
Large Language Models (LLMs) are rarely static and are frequently updated in practice. A growing body of alignment research has shown that models initially deemed ``aligned'' can exhibit misaligned behavior after finetuning, such as forgetting jailbreak safety features or re-surfacing knowledge that was intended to be forgotten. These works typically assume that the initial model is aligned based on static black-box evaluation, i.e., the absence of undesired responses to a fixed set of queries. In contrast, we formalize model alignment in both the static and post-update settings and uncover a fundamental limitation of black-box evaluation. We theoretically show that, due to overparameterization, static alignment provides no guarantee of post-update alignment for \emphany update dataset. Moreover, we prove that static black-box probing cannot distinguish a model that is genuinely post-update robust from one that conceals an arbitrary amount of adversarial behavior, which can be activated by even a single benign gradient update. We further validate these findings empirically in LLMs across three core alignment domains: privacy, jailbreak safety, and behavioral honesty. We demonstrate the existence of LLMs that pass all standard black-box alignment tests, yet become severely misaligned after a single benign update. Finally, we show that the capacity to hide such latent adversarial behavior increases with model scale, confirming our theoretical prediction that post-update misalignment grows with the number of parameters. Together, our results highlight the inadequacy of static evaluation protocols and emphasize the urgent need for post-update--robust alignment evaluation.

Abstract:
We introduce a problem of fairly allocating indivisible goods (items) in which the agents' valuations cannot be observed directly, but instead can only be accessed via noisy queries. In the twoagent setting with Gaussian noise and bounded valuations, we derive upper and lower bounds on the required number of queries for finding an envy-free allocation in terms of the number of items, m, and the negative-envy of the optimal allocation, \Delta. In particular, when \Delta is not too small (namely, \Delta \gg m^1/4), we establish that the optimal number of queries scales as \frac\sqrt m (\Delta / m)^2 = \fracm^2.5\Delta^2 up to logarithmic factors. Our upper bound is based on non-adaptive queries and a simple thresholding-based allocation algorithm that runs in polynomial time, while our lower bound holds even under adaptive queries and arbitrary computation time.

Abstract:
Existing LLMspost-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Each paradigm presents a distinct trade-off: (1) SFT excels at mimicking demonstration data, but can lead to problematic generalization as a form of behaviour cloning. (2) Conversely, RFT can significantly enhance a model's performance but is prone to learn unexpected behaviours, and its performance is sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a test bed, we empirically demonstrate that \ourmethod is simple yet effective. Not only does it surpass the performance of standalone SFT and RFT, but it also outperforms parallel mixed-policy RFT methods. Our analysis highlights the complementary nature of SFT and RFT, validating that Prefix-RFT effectively harmonizes them. Further ablation studies confirm the method's robustness to variations in the quality and quantity of demonstration data.

Abstract:
Serving Large Language Models (LLMs) can benefit immensely from parallelizing both the model and input requests across multiple devices, but incoming workloads exhibit substantial spatial and temporal heterogeneity. Spatially, workloads comprise heterogeneous requests with varying compute and memory demands. Temporally, workload composition varies over time. Nevertheless, existing systems typically assume spatially uniform and temporally stable workloads, employing a homogeneous, static model deployment. This mismatch between the assumption and realworld spatial-temporal heterogeneity results in suboptimal performance. We present OServe, an LLM serving system with heterogeneous and flexible model deployment that addresses both spatial and temporal heterogeneity. First, OServe introduces a novel workload-aware scheduling algorithm that optimizes heterogeneous model deployments according to real-time workload characteristics. Second, OServe proposes an efficient workload-adaptive switching method that migrates model deployments in response to predicted workload changes. Experiments on real-world traces show that OServe improves performance by up to 2× (average: 1.5×) compared to state-of-the-art serving systems.

Abstract:
Lowprecision training is critical for optimizing the trade-off between model quality and training costs, necessitating the joint allocation of model size, dataset size, and numerical precision. While empirical scaling laws suggest that quantization impacts effective model and data capacities or acts as an additive error, the theoretical mechanisms governing these effects remain largely unexplored. In this work, we initiate a theoretical study of scaling laws for low-precision training within a high-dimensional sketched linear regression framework. We demonstrate that the impact of quantization is twofold: it introduces an additive error and, simultaneously, shrinks the effective model and data sizes. Crucially, our analysis reveals distinct behaviors across quantization regimes: multiplicative quantization (where error variance scales with signal magnitude) primarily reduces the effective data size, whereas additive quantization (where error variance is independent of the signal) diminishes both the effective model size and data size. Numerical experiments validate our theoretical findings. By rigorously characterizing the complex interplay among model scale, dataset size, and quantization error, our work provides a principled theoretical basis for optimizing training protocols under practical hardware constraints.

Abstract:
Parallel decoding for diffusion LLMs (dLLMs) is difficult because each denoising step provides only tokenwise marginal distributions, while unmasking multiple tokens simultaneously requires accounting for inter-token dependencies. We propose Dependency-Aware Parallel Decoding (DAPD), a simple, training-free decoding method that uses self-attention to induce a conditional dependency graph over masked tokens. At each iteration, edges in this graph capture strong token interactions, while non-edges indicate weak dependence. Parallel decoding is then reduced to selecting an independent set on the graph and unmasking the selected tokens in parallel. This avoids co-updating strongly coupled tokens without auxiliary models or retraining. Experiments on LLaDA and Dream show that DAPD improves the accuracy–steps trade-off over existing methods and enables more globally distributed parallel updates that better exploit the any-order generation capability of dLLMs.

Abstract:
Mapping images to executable CAD programs is a central challenge in generative design, yet aligning visual inputs with symbolic code remains difficult. Existing approaches typically rely on brittle supervised finetuning or costly online reinforcement learning to overcome data limitations. In this work, we ask: how far can we push performance by leveraging test-time compute to bootstrap an augmented training set? We identify the primary bottleneck as the scarcity of diverse data aligning visual geometry with program syntax, rather than model capacity. To address this, we introduce Geometric Inference Feedback Tuning (GIFT), a framework that uses geometric feedback to generate high-quality data augmentations. GIFT systematically analyzes model failures via inference-time scaling, verifying geometric accuracy with a CAD kernel. GIFT bootstraps and curates an alignment dataset through two core mechanisms: Soft-Rejection Sampling (SRS), which captures diverse valid programs beyond ground-truth matching, and Failure-Driven Augmentation (FDA), which improves robustness by re-purposing rendered near-miss failures as synthetic training examples to cover hard negative geometries. By amortizing these insights into the model weights, GIFT matches the performance of extensive test-time scaling with an 80 % reduction in inference compute. It outperforms strong baselines by 12 % and remains competitive with complex multimodal systems, all without additional supervision or specialized architectures.

Abstract:
Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and itemlevel ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid–Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.

Abstract:
Group Relative Policy Optimization (GRPO) is effective for training language models on complex reasoning. However, since the objective is defined relative to a group of sampled trajectories, extended deliberation can create more chances to realize relative gains, leading to inefficient reasoning and overthinking, and complicating the tradeoff between correctness and rollout efficiency. Controlling this behavior is difficult in practice, considering (i) Length penalties are hard to calibrate because longer rollouts may reflect harder problems that require longer reasoning, penalizing tokens risks truncating useful reasoning along with redundant continuation; and (ii) supervision that directly indicates when to continue or stop is typically unavailable beyond final answer correctness. We propose Weakly Supervised GRPO (WS-GRPO), which improves rollout efficiency by converting terminal rewards into correctness-aware guidance over partial trajectories. Unlike global length penalties that are hard to calibrate, WS-GRPO trains a preference model from outcome-only correctness to produce prefix-level signals that indicate when additional continuation is beneficial. Thus, WS-GRPO supplies outcome-derived continue/stop guidance, reducing redundant deliberation while maintaining accuracy. We provide theoretical results and empirically show on reasoning benchmarks that WS-GRPO substantially reduces rollout length while remaining competitive with GRPO baselines.

Abstract:
Transformers have shown inconsistent success in AI planning tasks, and theoretical understanding of when generalization should be expected has been limited. We take important steps towards addressing this gap by analyzing the ability of decoderonly models to verify whether a given plan correctly solves a given planning instance. To analyse the general setting where the number of objects -- and thus the effective input alphabet -- grows at test time, we introduce C-RASP, an extension of C-RASP designed to establish length generalization guarantees for transformers under the simultaneous growth in sequence length and vocabulary size. Our results identify a large class of classical planning domains for which transformers can provably learn to verify long plans, and structural properties that significantly affects the learnability of length generalizable solutions. Empirical experiments corroborate our theory.

Abstract:
Supervised Causal Learning has shown promise in causal discovery, yet it often struggles with generalization across diverse interventional settings, particularly when intervention targets are unknown. To address this, we propose TICL (Testtime Interventional Causal Learning), a novel method that synergizes Test-Time Training with Joint Causal Inference (JCI). Specifically, we design a self-augmentation strategy to generate instance-specific training data at test time, effectively avoiding distribution shifts. Furthermore, by integrating JCI, we developed a PC-inspired two-phase supervised learning scheme, which effectively leverages self-augmented data while ensuring theoretical identifiability. Extensive experiments on bnlearn benchmarks demonstrate TICL's superiority in multiple aspects of causal discovery and intervention target detection.

Abstract:
Sparse PCA is one of the most wellstudied problems in high-dimensional statistics. In this problem, we are given samples from a distribution with covariance \\mathbf\Sigma, whose top eigenvector \\mathbfv \in \\mathbbR^d is s-sparse. Existing sparse PCA algorithms can be broadly categorized into (1) combinatorial algorithms (e.g., diagonal or elementwise covariance thresholding) and (2) SDP-based algorithms. While combinatorial algorithms are much simpler, they are typically only analyzed under the spiked identity model (where \\mathbf\Sigma \propto \\mathbfI_d + \gamma \\mathbfv\\mathbfv^\top for some \gamma > 0), whereas SDP-based algorithms require no additional assumptions on \\mathbf\Sigma. We demonstrate explicit counterexample covariances \\mathbf\Sigma against the success of standard combinatorial algorithms for sparse PCA, when moving beyond the spiked identity model. In light of this discrepancy, we give the first combinatorial method for sparse PCA that provably succeeds for general \\mathbf\Sigma using poly(s, \log(d)) samples and d^2 \cdot poly(s, \log(d)) time, by providing a global convergence guarantee on the truncated power method of Yuan and Zhang (JMLR, 2013). We provide a natural generalization of our method to recovering k sparse principal components. Finally, we evaluate our method on synthetic and real-world sparse PCA datasets.

Abstract:
Steering intermediate representations has emerged as a powerful strategy for controlling generative models. However, despite its empirical success, it currently lacks a comprehensive theoretical framework. In this paper, we bridge this gap by formalizing the theory of concept steering. First, we establish a link between steering and affine concept erasure, proving that the standard approach for removing unwanted behaviors is a special case of LEACE (a closedform method for affine erasure). Next, we formulate a principled theoretical framework for concept switching, LEACE-Switch, and characterize the assumptions under which it provides an optimal affine solution. Building on this analysis, we then introduce MidSteer (Minimal Disturbance concept Steering), a more general affine framework for concept manipulation that relaxes these assumptions and enables directed, minimal-disturbance transformations. We empirically demonstrate that MidSteer performs favorably across a range of tasks, modalities, and architectures, including vision diffusion models and large language models.

Abstract:
Smart contracts on public blockchains now manage large amounts of value, and vulnerabilities in these systems can lead to substantial losses. As AI agents become more capable at reading, writing, and running code, it is natural to ask how well they can already navigate this landscape, both in ways that improve security and in ways that might increase risk. We introduce EVMbench, an evaluation that measures the ability of agents to detect, patch, and exploit smart contract vulnerabilities. EVMbench draws on 120 curated vulnerabilities from 37 repositories and, in the most realistic setting, uses programmatic grading based on tests and blockchain state under a local Ethereum execution environment. We evaluate a range of frontier agents and find that they are capable of discovering and exploiting vulnerabilities endto-end against live blockchain instances. We also compare various agent scaffolds and find that in some cases performance gains due to scaffolding improvements alone rival gains due to increased model quality. We release code, tasks, and tooling to support continued measurement of these capabilities and future work on security.

Abstract:
Deepfake audio detectors often fail to generalize to unseen attacks, in part due to \emphspectral bias: neural networks prioritize lowfrequency structure while under-exploiting subtle high-frequency (HF) artifacts left by generative models. We introduce SONAR (Spectral-cONtrastive Audio Residuals), a frequency-guided framework that \emphexplicitly enforces representation-level consistency between semantic content and HF residuals. Unlike prior frequency-aware or dual-stream detectors that treat HF cues as auxiliary features, SONAR encourages structured interaction between content and noise representations in latent space. The model employs a dual-path architecture in which an XLSR encoder captures low-frequency content, while a parallel branch with learnable, value-constrained SRM high-pass filters distills HF residuals. The two representations are fused via frequency cross-attention and trained with a \emphJensen--Shannon alignment loss that promotes LF–HF consistency for genuine audio and amplifies inconsistency for deepfakes. Evaluated on ASVspoof~2021 and in-the-wild benchmarks, SONAR achieves state-of-the-art performance in a single run setting and converges up to 4× faster than strong baselines. By mitigating the effects of spectral bias through frequency-guided alignment, SONAR provides a fully data-driven and architecture-agnostic approach to generalizable audio deepfake detection.

Abstract:
Deep generative models can help with data scarcity and privacy by producing synthetic training data, but they struggle in lowdata, imbalanced tabular settings to fully learn the complex data distribution. We argue that striving for the full joint distribution could be overkill; for greater data efficiency, models should prioritize learning the conditional distribution P(y\mid \bmX), as suggested by recent theoretical analysis. Therefore, we overcome this limitation with ReTabSyn, a Reinforced Tabular Synthesis pipeline that provides direct feedback on feature correlation preservation during synthesizer training. This objective encourages the generator to prioritize the most useful predictive signals when training data is limited, thereby strengthening downstream model utility. We empirically fine-tune a language model-based generator using this approach, and across benchmarks with small sample sizes, class imbalance, and distribution shift, ReTabSyn consistently outperforms state-of-the-art baselines. Moreover, our approach can be readily extended to control various aspects of synthetic tabular data, such as applying expert-specified constraints on generated observations.

Abstract:
Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing objectagnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both in-domain NExTQA and out-of-domain Video-Holmes, CG-Bench-Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

Abstract:
Preference learning has become the foundation of aligning Large Language Models (LLMs) with human intent. Popular methods, such as Direct Preference Optimization (DPO), minimize surrogate losses as proxies for the intractable pairwise ranking loss. However, we demonstrate that for the equicontinuous hypothesis sets typical of neural networks, these standard surrogates are theoretically inconsistent, yielding vacuous generalization guarantees. To resolve this, we formulate LLM alignment within a marginshifted ranking framework. We derive rigorous H-consistency bounds that depend on enforcing a separation margin \gamma. Crucially, we extend this to Structure-Aware H-consistency, introducing a novel objective (SA-DPO) that adapts the margin based on the semantic distance between responses to handle synonyms and hard pairs. Finally, we analyze the trade-off between consistency and model limitations via the Margin-Capacity Profile, proving that heavy-tailed surrogates (such as the Polynomial Hinge family) offer superior consistency guarantees for capacity-bounded models compared to the standard logistic loss used in DPO.

Abstract:
Testtime scaling for complex reasoning tasks shows that leveraging inference-time compute, for example by independently sampling and aggregating multiple solutions, results in significantly better task outcomes. However, a critical bottleneck isverification: sampling is only effective if correct solutions can be reliably identified among candidates. While existing approaches typically evaluate candidates independently via scalar scoring, we demonstrate that models are substantially stronger atpairwise self-verification. Leveraging this insight, we introduceV1, a framework that unifies generation and verification through efficient pairwise ranking. V1 comprises two components:V1-Infer, an uncertainty-guided algorithm using a tournament-based ranking that dynamically allocates self-verification compute to candidate pairs whose relative correctness is most uncertain; andV1-PairRL, an RL framework thatjointly trainsa single model as both generator and pairwise self-verifier, ensuring the verifier adapts to the generator's evolving distribution. On code generation (LiveCodeBench, CodeContests) and math reasoning (AIME, HMMT) benchmarks, V1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being significantly more efficient. Furthermore, V1-PairRL achieves 7-9% test-time scaling gains over standard RL and pointwise joint training, and improves base Pass@1 by up to 8.7% over standard RL.

Abstract:
Existing evaluations of agents with memory typically assessmemorizationandactionin isolation. One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions. Another class focuses on agents acting in singlesession tasks without the need for long-term memory. However, in realistic settings, memorization and action are tightly coupled: agents acquire memory while interacting with the environment, and subsequently rely on that memory to solve future tasks. To capture this setting, We introduce MEMORYARENA, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. The benchmark consists of human-crafted agentic tasks with explicitly interdependent subtasks, where agents must learn from earlier actions and feedback by distilling experiences into memory, and subsequently use that memory to guide later actions to solve the overall task. MEMORYARENA supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning, and reveals that agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in our agentic setting, exposing a gap in current evaluations for agents with memory.

Abstract:
Mechanistic Interpretability (MI) seeks to explain how neural networks implement their capabilities, but the scale of Large Language Models (LLMs) has limited prior MI work in Machine Translation (MT) to wordlevel analyses. We study sentence-level MT from a mechanistic perspective by analyzing attention heads to understand how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: producing text in the target language (i.e. target language identification) and preserving the input sentence’s meaning (i.e. sentence equivalence). Across three families of open-source models and 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Based on this insight, we construct subtask-specific steering vectors and show that modifying just 1% of the relevant heads enables instruction-free MT performance comparable to instruction-based prompting, while ablating these heads selectively disrupts their corresponding translation functions.

Abstract:
Finetuning pre-trained diffusion and flow models to optimize downstream utilities is central to real-world deployment. Existing entropy-regularized methods primarily maximize expected reward, providing no mechanism to shape tail behavior. However, tail control is often essential: the lower tail determines reliability by limiting low-reward failures, while the upper tail enables discovery by prioritizing rare, high-reward outcomes. In this work, we present Tail-aware Flow Fine-Tuning (TFFT), a principled and efficient distributional fine-tuning algorithm based on the Conditional Value-at-Risk (CVaR). We address two distinct tail-shaping goals: right-CVaR for seeking novel samples in the high-reward tail and left-CVaR for controlling worst-case samples in the low-reward tail. Unlike prior approaches that rely on non-linear optimization, we leverage the variational dual formulation of CVaR to decompose it into a decoupled two-stage procedure: a lightweight one-dimensional threshold optimization step, and a single entropy-regularized fine-tuning process via a specific pseudo-reward. This decomposition achieves CVaR fine-tuning efficiently with computational cost comparable to standard expected fine-tuning methods. We demonstrate the effectiveness of TFFT across illustrative experiments, high-dimensional text-to-image generation, and molecular design.

Abstract:
Existing time series tokenization methods predominantly encode a constant number of samples into individual tokens. This inflexible approach can generate excessive tokens for even simple patterns like extended constant values, resulting in substantial computational overhead. Inspired by the success of byte pair encoding, we propose the first patterncentric tokenization scheme for time series analysis. Based on a discrete vocabulary of frequent motifs, our method merges samples with underlying patterns into tokens, compressing time series adaptively. Exploiting our finite set of motifs and the continuous properties of time series, we further introduce conditional decoding as a lightweight yet powerful post-hoc optimization method, which requires no gradient computation and adds no computational overhead. On recent time series foundation models, our motif-based tokenization improves forecasting performance by 36% and boosts efficiency by 1990% on average. Conditional decoding further reduces MSE by up to 44%. In an extensive analysis, we demonstrate the adaptiveness of our tokenization to diverse temporal patterns, its generalization to unseen data, and its meaningful token representations capturing distinct time series properties, including statistical moments and trends.

Abstract:
Isolation Forest (iForest) stands out as a widelyused unsupervised anomaly detector, primarily owing to its remarkable runtime efficiency and superior performance in large-scale tasks. Despite its widespread adoption, a theoretical foundation explaining iForest's success remains unclear. This paper focuses on the inductive bias of iForest, which theoretically elucidates under what circumstances and to what extent iForest works well. The key is to formulate the growth process of iForest, where the split dimensions and split values are randomly selected. We model the growth process of iForest as a random walk, enabling us to derive the expected depth function, which is the outcome of iForest, using transition probabilities. The case studies reveal key inductive biases: iForest exhibits lower sensitivity to central anomalies while demonstrating greater parameter adaptability compared to k-Nearest Neighbor. The infinite sample size analysis shows that, unlike k-NN, iForest scores any point by considering both the density and the distances to the endpoints. Our study provides a theoretical understanding of the effectiveness of iForest and establishes a foundation for further theoretical exploration.

Abstract:
When evaluating Large Language Models (LLMs) in questionanswering domains, multiple-choice question answering (MCQA) is widely used because it enables automatic grading. However, MCQA also exposes models to answer options that can be exploited in ways that inflate reasoning ability. We study this phenomenon across 15 question-answering benchmarks and 27 LLMs by systematically varying how and when models are exposed to answer options. For non-reasoning LLMs, MCQA can remain a good proxy for free-text performance when any chain-of-thought is produced only before the options are revealed. However, this "decoupled" format is not realizable for most reasoning models: they are designed to emit reasoning tokens whenever they are prompted, so if options are present they inevitably "reason over" the options. In practice, this makes reasoning models particularly effective at extracting signal from options, and can create large, misleading gains over free-text baselines. To characterize how models exploit MCQA, we introduce diagnostic probes that isolate option-only and question-plus-option exploitation pathways, and we quantify how design choices such as distractor strength and "none-of-the-above" answers effect exploitability. Finally, we examined the practice of multiple choice as an error diagnostic: inferring a model's mistake from the wrong option it picks. On benchmarks where reasoning can be expressed as code, we ask models to output code, we then executed it varying the inputs, and compared the resulting input–output behavior, revealing failure modes that MCQA diagnostics obscure. Lastly, we offer practical guidelines when analyzing results from MCQA that better reflect LLMs' genuine reasoning capabilities.

Abstract:
Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet realworld datasets are inevitably corrupted by noisy preference. Conventional training objectives tend to overfit these errors, while existing denoising approaches often rely on homogeneous noise assumptions that fail to capture the complexity of linguistic preferences. To handle these challenges, we propose SelectiveRM, a framework grounded in optimal transport. We first devise a Joint Consistency Discrepancy to align the distribution of model predictions with preference data. Furthermore, to address the limitation of strict mass conservation which compels the model to fit outliers, we incorporate a Mass Relaxation mechanism via partial transport. This enables the autonomous exclusion of samples with noisy preference that contradict semantic consistency. Theoretically, we demonstrate that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk. Extensive experiments validate that our approach significantly outperforms state-of-the-art baselines across diverse benchmarks. Code is available at https://anonymous.4open.science/r/SelectiveRM-33F1.

Abstract:
Expressive policies based on flowmatching have been successfully applied in reinforcement learning (RL) more recently due to their ability to model complex action distributions from offline data. These algorithms build on standard policy gradients, which assume that there is no unmeasured confounding in the data. However, this condition does not necessarily hold for pixel-based demonstrations when a mismatch exists between the demonstrator's and the learner's sensory capabilities, leading to implicit confounding biases in offline data. We address the challenge by investigating the problem of confounded observations in offline RL from a causal perspective. We develop a novel causal offline RL objective that optimizes policies' worst-case performance that may arise due to confounding biases. Based on this new objective, we introduce a practical implementation that learns expressive flow-matching policies from confounded demonstrations, employing a deep discriminator to assess the discrepancy between the target policy and the nominal behavioral policy. Experiments across 25 pixel-based tasks demonstrate that our proposed confounding-robust augmentation procedure achieves a success rate 120% that of confounding-unaware, state-of-the-art offline RL methods.

Abstract:
Structural nested mean models (SNMMs) are a principled approach to estimate the treatment effects over time. A particular strength of SNMMs is to break the joint effect of treatment sequences over time into localized, timespecific ``blip effects''. This decomposition promotes interpretability through the incremental effects and enables the efficient offline evaluation of optimal treatment policies without re-computation. However, neural frameworks for SNMMs are lacking, as their inherently sequential g-estimation scheme prevents end-to-end, gradient-based training. Here, we propose DeepBlip, the first neural framework for SNMMs, which overcomes this limitation with a novel double optimization trick to enable simultaneous learning of all blip functions. Our DeepBlip seamlessly integrates sequential neural networks like LSTMs or transformers to capture complex temporal dependencies. By design, our method correctly adjusts for time-varying confounding to produce unbiased estimates, and its Neyman-orthogonal loss function ensures robustness to nuisance model misspecification. Finally, we evaluate our DeepBlip across various clinical datasets, where it achieves state-of-the-art performance.

Abstract:
Deep learning has revolutionized weather and climate modeling, yet the current landscape remains fragmented: highly specialized models are typically trained individually for distinct tasks. To unify this landscape, we introduce WIND, a single pretrained foundation model capable of replacing specialized baselines across a vast array of tasks. Crucially, in contrast to previous atmospheric foundation models, we achieve this without any task-specific fine-tuning. To learn a robust, task-agnostic prior of the atmosphere, we pre-train WIND with a self-supervised video reconstruction objective, utilizing an unconditional video diffusion model to iteratively reconstruct atmospheric dynamics from a noisy state. At inference, we frame diverse domain-specific problems strictly as inverse problems and solve them via posterior sampling. This unified approach allows us to tackle highly relevant weather and climate problems, including probabilistic forecasting, spatial and temporal downscaling, sparse reconstruction and enforcing conservation laws purely with our pre-trained model. We further demonstrate the model's capacity to generate physically consistent counterfactual storylines of extreme weather events under global warming scenarios. By combining generative video modeling with inverse problem solving, WIND offers a computationally efficient paradigm shift in AI-based atmospheric modeling.

Abstract:
We propose GradPower, a lightweight gradienttransformation technique for accelerating language model pre-training. Given a gradient vector \boldsymbolg=(g\_i)\_i, GradPower first applies the elementwise `sign-power` transformation: \varphi_p(\boldsymbolg) = \left(\rm sign(g\_i)|g\_i|^p\right)\_i for a fixed p>0, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer’s internal logic, including the hyperparameters. When applied to AdamW (termed AdamWPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.

Abstract:
To schedule LLM inference, the shortest job first (SJF) principle is favorable by prioritizing requests with short output lengths to avoid headof-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a point estimate does not match the stochastic decoding process of LLM inference, where output length is uncertain by nature and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. With an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling, which adjusts the expectation of a log-t distribution with its tail probabilities to account for the risk that a request generates long outputs. To evaluate our TIE scheduler, we compare it with three strong baselines, and the results show that TIE reduces the per-token latency by 2.31× for online inference and improves throughput by 1.42× for offline data generation.

Abstract:
Recent advances in coding agents have made them capable of planning, editing, running, and testing complex code bases. Despite their growing ability in coding tasks, these systems still struggle to infer and track user intent, especially when instructions are underspecified or contextdependent. To bridge this gap, we introduce ToM-SWE, a dual-agent architecture that pairs a primary software-engineering (SWE) agent with a lightweight theory-of-mind (ToM) partner agent dedicated to modeling the user's mental state. The ToM agent infers user goals, constraints, and preferences from instructions and interaction history, maintains a persistent memory of the user, and provides user-related suggestions to the SWE agent, while preserving privacy and minimizing context window load. In two software engineering benchmarks (ambiguous SWE-bench and stateful SWE-bench), ToM-SWE improves task success rates and user satisfaction. Notably, on the stateful SWE benchmark, a newly introduced evaluation that provides agents with a user simulator along with previous interaction histories, ToM-SWE achieves a substantially higher task success rate of 59.7% compared to 18.1% for OpenHands, a state-of-the-art SWE agent. Furthermore, in a three-week study with professional developers using ToM-SWE in their daily work, participants found it better aligned with their intent and useful 86% of the time, underscoring the value of stateful user modeling for practical coding agents.

Abstract:
We present WorldMirror, a unified feedforward model for comprehensive 3D geometric prediction tasks. Unlike existing methods constrained to image-only inputs or customized for a specific task, our framework flexibly integrates diverse geometric priors, including camera poses, intrinsics, and depth maps, while simultaneously generating multiple 3D representations: dense point clouds, multi-view depth maps, camera parameters, surface normals, and 3D Gaussians. Remarkably, prior injection yields universal gains across all tasks, suggesting that input flexibility and multi-task prediction are mutually reinforcing. WorldMirror achieves state-of-the-art performance across diverse benchmarks from camera, point map, depth, and surface normal estimation to novel view synthesis, while maintaining the efficiency of feed-forward inference. Code and models will be publicly available.

Abstract:
Antimicrobial peptide discovery is challenged by the astronomical size of peptide space and the relative scarcity of active peptides. While generative models provide latent maps of this space, they typically ignore decoderinduced geometry and rely on flat Euclidean metrics, making exploration distorted and inefficient. Existing manifold-based approaches assume fixed intrinsic dimensionality, which fails for real peptide data. We introduce PepCompass, a geometry-aware framework based on a Union of \kappa-Stable Riemannian Manifolds that captures local decoder geometry while maintaining computational stability. PepCompass performs global interpolation via Potential-minimizing Geodesic Search (PoGS) to bias discovery toward promising seeds and enables local exploration through Second-Order Riemannian Brownian Efficient Sampling and Mutation Enumeration in Tangent Space, which together form Local Enumeration Bayesian Optimization (LE-BO). PepCompass achieves a 100% in-vitro validation rate: PoGS identifies four novel seeds and LE-BO optimizes them into 25 highly active, broad-spectrum peptides, demonstrating that geometry-informed exploration is a powerful paradigm for antimicrobial peptide design.

Abstract:
Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detectingscheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLMbased monitoring: using language models to examine agent behaviors for suspicious actions. We studyconstitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories,STRIDE(iterative refinement) andGloom(agent-environment simulation), from which we generate 1,000 samples each. We optimize monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from Control Arena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.

Abstract:
We prove that Thompson sampling exhibits \tildeO(\sigma d \sqrtT + d r \sqrt\mathrmTr(\Sigma_0)) Bayesian regret in the linearGaussian bandit with a \mathcalN(\mu_0, \Sigma_0) prior distribution on the coefficients, where d is the dimension, T is the time horizon, r is the maximum \ell_2 norm of the actions, and \sigma^2 is the noise variance. In contrast to existing regret bounds, this shows that to within logarithmic factors, the prior-dependent ''burn-in'' term d r \sqrt\mathrmTr(\Sigma_0) decouples additively from the minimax (long run) regret \sigma d \sqrtT. Previous regret bounds exhibit a multiplicative dependence on these terms. We establish these results via a new ''elliptical potential'' lemma, and also provide a lower bound indicating that the burn-in term is unavoidable.

Abstract:
Reverse Kullback–Leibler (KL) divergencebased regularization with respect to a fixed reference policy is widely used in modern reinforcement learning to preserve the desired traits of the reference policy and sometimes to promote exploration (using uniform reference policy, known as entropy regularization). Beyond serving as a mere anchor, the reference policy can also be interpreted as encoding prior knowledge about good actions in the environment. In the context of alignment, recent game-theoretic approaches have leveraged KL regularization with pretrained language models as reference policies, achieving notable empirical success in self-play methods. Despite these advances, the theoretical benefits of KL regularization in game-theoretic settings remain poorly understood. In this work, we develop and analyze algorithms that provably achieve improved sample efficiency under KL regularization. We study both two-player zero-sum Matrix games and Markov games: for Matrix games, we propose OMG, an algorithm based on best response sampling with optimistic bonuses, and extend this idea to Markov games through the algorithm SOMG, which also uses best response sampling and a novel concept of superoptimistic bonuses. Both algorithms achieve a logarithmic regret in T that scales inversely with the KL regularization strength \beta in addition to the traditional \widetilde\mathcalO(\sqrtT) regret without the \beta^-1 dependence.

Abstract:
Large Language Models (LLMs) have significantly advanced code analysis tasks, yet they struggle to detect malicious behaviors fragmented across files, whose intricate dependencies easily get lost in the vast amount of benign code. We therefore propose a graphcentric attention acquisition pipeline that enhances LLMs' ability to localize malicious behavior. The approach parses a project into a code graph, uses an LLM to encode nodes with semantic and structural signals, and trains a Graph Neural Network (GNN) under sparse supervision. The GNN performs an initial detection, and by interpreting these predictions, identifies key code sections that are most likely to contain malicious behavior. These influential regions are then used to guide the LLM's attention for in-depth analysis. This strategy significantly reduces interference from irrelevant context while maintaining low annotation costs. Extensive experiments show that the method consistently outperforms existing approaches on multiple public and custom datasets, highlighting its potential for practical deployment in software security scenarios.

Abstract:
Being able to edit panoramic images is crucial for creating realistic 360° visual experiences. However, existing perspectivebased image editing methods fail to model the spatial structure of panoramas. Conventional cube-map decompositions attempt to overcome this problem but inevitably break global consistency due to their mismatch with spherical geometry. Motivated by this insight, we reformulate panoramic editing directly in the equirectangular projection (ERP) domain and present World-Shaper, a unified geometry-aware framework that bridges panoramic generation and editing within a single editing-centric design. To overcome the scarcity of paired data, we adopt a generate-then-edit paradigm, where controllable panoramic generation serves as an auxiliary stage to synthesize diverse paired examples for supervised editing learning. To address geometric distortion, we introduce a geometry-aware learning strategy that explicitly enforces position-aware shape supervision and implicitly internalizes panoramic priors through progressive training. Extensive experiments on our new benchmark, PEBench, demonstrate that our method achieves superior geometric consistency, editing fidelity, and text controllability compared to SOTA methods, enabling coherent and flexible 360° visual world creation with unified editing control. Code, model, and data will be released.

Abstract:
We address the problem of causal effect estimation in the presence of hidden confounders using nonparametric instrumental variable (IV) regression. An established approach is to use estimators based on learned \emphspectral features, that is, features spanning the top singular subspaces of the operator linking treatments to instruments. While powerful, such features are agnostic to the outcome variable. Consequently, the method can fail when the true causal function is poorly represented by these dominant singular functions. To mitigate, we introduce Augmented Spectral Feature Learning, a framework that makes the feature learning process outcomeaware. Our method learns features by minimizing a novel contrastive loss derived from an augmented operator that incorporates information from the outcome. By learning these task-specific features, our approach remains effective even under spectral misalignment. We provide a theoretical analysis of this framework and validate our approach on challenging benchmarks.

Abstract:
The widespread adoption of textto-image (T2I) diffusion models has raised concerns about their potential to generate copyrighted, inappropriate, or sensitive imagery learned from massive training corpora. As a practical solution, machine unlearning aims to selectively erase unwanted concepts from a pre-trained model without retraining from scratch. While most existing methods are effective for single-concept unlearning, they often struggle in real-world scenarios that require removing multiple concepts, since extending them to this setting is both non-trivial and problematic, causing significant challenges in unlearning effectiveness, generation quality, and sensitivity to hyperparameters and datasets. In this paper, we take a unique perspective on multi-concept unlearning by leveraging model sparsity and propose the Forget It All (FIA) framework. FIA first introduces Contrastive Concept Saliency to quantify each weight connection’s contribution to a target concept. It then identifies Concept-Sensitive Neurons by combining temporal and spatial information, ensuring that only neurons consistently responsive to the target concept are selected. Finally, FIA constructs masks from the identified neurons and fuses them into a unified multi-concept mask, where Concept-Agnostic Neurons that broadly support general content generation are preserved while concept-specific neurons are pruned to remove the targets. FIA is training-free and requires only minimal hyperparameter tuning for new tasks, thereby promoting a plug-and-play paradigm. Extensive experiments across three distinct unlearning tasks demonstrate that FIA achieves more reliable multi-concept unlearning, improving forgetting effectiveness while maintaining semantic fidelity and image quality.

Abstract:
Unified audiovisual generation is rapidly gaining industrial and creative relevance, enabling applications in virtual production and interactive media. However, when moving from general audio--video synthesis to music–dance co-generation, the task becomes substantially harder: musical rhythm, phrasing, and accents must drive choreographic motion at fine temporal resolution, and such rhythmic coupling is not captured by unimodal metrics or generic audiovisual consistency scores used in current evaluation practice. We introduce TMD-Bench, a benchmark for text-driven music–dance co-generation that assesses systems across unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. The benchmark integrates computable physical metrics with perceptual multimodal judgments, and is supported by a curated rhythm-aligned music–dance dataset and a fine-grained Music Captioner for structured music semantics. TMD-Bench further reveals that (i) modern commercial audio--visual models (e.g., Veo 3, Sora 2) produce high-quality music and video, while rhythmic coupling remains less consistently optimized and leaves room for improvement, and (ii) our unified baseline RhyJAM trained on rhythm-aligned data achieves competitive beat-level synchronization while maintaining competitive unimodal fidelity. This presents prospects for building next-generation music–dance models that explicitly optimize rhythmic and kinetic coherence.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in highlevel visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization.

Abstract:
Riemannian neural networks have proven effective in solving a variety of machine learning tasks. The key to their success lies in the development of principled Riemannian analogs of fundamental building blocks in deep neural networks (DNNs). Among those, Riemannian batch normalization (BN) layers have shown to enhance training stability and improve accuracy. In this paper, we propose BN layers for neural networks on complex domains. The proposed layers have close connections with existing Riemannian BN layers. We derive essential components for practical implementations of BN layers on some complex domains which are less studied in previous works, e.g., the Siegel disk domain. We conduct experiments on radar clutter classification, node classification, and action recognition demonstrating the efficacy of our method.

Abstract:
Retrieval augmented generation (RAG) has enhanced large language models by enabling access to external knowledge, with graphbased RAG emerging as a powerful paradigm for structured retrieval and reasoning. However, existing graph-based methods often over-rely on surface-level node matching and lack explicit causal modeling, leading to unfaithful or spurious answers. Prior attempts to incorporate causality are typically limited to local or single-document contexts and also suffer from information isolation that arises from modular graph structures, which hinders scalability and cross-module causal reasoning. To address these challenges, we propose HugRAG, a framework that rethinks knowledge organization for graph-based RAG through causal gating across hierarchical modules. HugRAG explicitly models causal relationships to suppress spurious correlations while enabling scalable reasoning over large-scale knowledge graphs. Extensive experiments demonstrate that HugRAG consistently outperforms competitive graph-based RAG baselines across multiple datasets and evaluation metrics. Our work establishes a principled foundation for structured, scalable, and causally grounded RAG systems.

Abstract:
Transformerbased autoregressive models have emerged as a unifying paradigm across modalities such as text and images, but their extension to 3D molecule generation remains underexplored. The gap stems from two fundamental challenges: (1) how to tokenize molecules into a canonical 1D sequence of tokens that is invariant to both SE(3) transformations and atom index permutations, and (2) how to design an architecture capable of modeling hybrid atom-based tokens that couple discrete atom types with continuous 3D coordinates. To address these challenges, we introduce InertialAR. It first performs generation-oriented canonical tokenization by aligning each molecule to a canonical inertial frame and reordering atoms, thereby converting arbitrary 3D structures into a unique, SE(3)- and permutation-invariant sequence of tokens for autoregressive generation. Built upon this canonical tokenization, we propose geometric rotary positional encoding (GeoRoPE), which endows Transformer attention with 3D geometric awareness. Finally, InertialAR utilizes a hierarchical autoregressive paradigm to decode the next atom, consecutively predicting the atom type and 3D coordinates via Diffusion Loss. Experimentally, InertialAR achieves state-of-the-art performance on 8 of the 10 evaluation metrics for unconditional generation across QM9, GEOM-Drugs, and B3LYP. Moreover, it significantly outperforms baselines in controllable generation for targeted chemical functionality, attaining state-of-the-art results across all 5 metrics.

Abstract:
Generative AI disrupts the practice of giving credit to work that came before. Ideally, a generative model would give credit to any work on which its output depends in a significant way.Counterfactual credit attribution (CCA)is a technical condition formalizing this goala relaxation of differential privacy---recently introduced by Livni, Moran, Nissim, and Pabbaraju (2024) who studied it in the PAC learning setting. We initiate the study of CCA generative models. Specifically, we consider autoregressive models giving credit to a deployment-time dataset (e.g., a RAG database). We uncover barriers to two natural approaches to CCA autoregressive models. First, we show that imposing CCA on the underlying next-token predictor does not guarantee that the model is CCA: CCA does not compose autoregressively (unlike DP). Second, we consider a different approach to building CCA models which we callretrofitting. Retrofitting takes a model that does not attribute credit, and adds credit onto it. Given black-box access to the starting model, retrofitting requires query complexity exponential in the length of the model's outputs.

Abstract:
Supervised FineTuning (SFT) empowers Large Language Models (LLMs) with exceptional performance on specialized tasks, but it yields dense, high-dimensional delta parameters that pose severe storage and distribution challenges. Singular Value Decomposition (SVD)-based compression offers a compact representation for such delta parameters, but existing methods adopt heuristic quantization without clarifying underlying mechanisms, leading to poor generalizability. In this work, we propose PrinMix, a rigorous SVD-based framework that models quantization as an optimization problem, grounding the design in mathematical mechanisms. We first theoretically derive quantization error and identify a key singular-value-dominated scaling mechanism, which mathematically proves the necessity of mix-precision quantization. We then model the quantization scheme as a 0/1 Integer Linear Programming (ILP) problem, which yields optimal bit-budget-constrained solutions without empirical assumptions. Furthermore, PrinMix integrates a Reconstruction Target Correction (RTC) method to compensate for errors from the \mathbfV-then-\mathbfU sequential quantization process. Extensive experiments confirm PrinMix performs well: for 7B LLMs, PrinMix outperforms SOTA Delta-CoMe on challenging benchmarks by 22.3% on AIME2024 and 6.1% on GQA.

Abstract:
The pursuit of reducing the memory footprint of the selfattention mechanism in multi-headed self attention (MHA) spawned a rich portfolio of methods, e.g., group-query attention (GQA) and multi-head latent attention (MLA). The methods leverage specialized low-rank factorizations across embedding dimensions or attention heads. From the point of view of classical low-rank approximation, these methods are unconventional and raise questions of which objects they really approximate and how to interpret the low-rank behavior of the resulting representations. To answer these questions, this work proposes a generalized view on the weight objects in the self-attention layer and a factorization strategy, which allows us to construct a parameter efficient scheme, called Tucker Attention. Tucker Attention requires an order of magnitude fewer parameters for comparable validation metrics, compared to GQA and MLA, as evaluated in LLM and ViT test cases. Additionally, Tucker Attention~encompasses GQA, MLA, MHA as special cases and is fully compatible with flash-attention and rotary position embeddings (RoPE). This generalization strategy yields insights of the actual ranks achieved by MHA, GQA, and MLA, and further enables simplifications for MLA.

Abstract:
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs due to its extensive groupbased sampling requirement. While recent selective data utilization methods can mitigate this overhead, they could induce estimation bias by altering the underlying sampling distribution, compromising theoretical rigor and convergence behavior. To address this limitation, we propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation through importance sampling-based correction. By incorporating mathematically derived rescaling factors, DPPO significantly accelerates GRPO training without altering the optimization objective of the full-batch baseline. Furthermore, to mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy that maximizes valid token density and hardware utilization. Extensive experiments demonstrate that DPPO consistently accelerates training across diverse models and benchmarks. For instance, on Qwen3-4B trained on MATH, DPPO achieves 2.37× training speedup and outperforms GRPO by 3.36% in average accuracy across six mathematical reasoning benchmarks.

Abstract:
scRNAseq clustering is a critical task for analyzing single-cell RNA sequencing (scRNA-seq) data, as it groups cells with similar gene expression profiles. Transformers, as powerful foundational models, have been applied to scRNA-seq clustering. Their self-attention mechanism automatically assigns higher attention weights to cells within the same cluster, enhancing the distinction between clusters. Existing methods for scRNA-seq clustering, such as graph transformer-based models, treat each cell as a token in a sequence. Their computational and space complexities are \mathcalO(n^2) with respect to the number of cells, limiting their applicability to large-scale scRNA-seq datasets. To address this challenge, we propose a Bipartite Graph Transformer-based clustering model (BGFormer) for scRNA-seq data. We introduce a set of learnable anchor tokens as shared reference points to represent the entire dataset. A bipartite graph attention mechanism is introduced to learn the similarity between cells and anchor tokens, bringing cells of the same class closer together in the embedding space. BGFormer achieves linear computational complexity with respect to the number of cells, making it scalable to large datasets. Experimental results on multiple large-scale scRNA-seq datasets demonstrate the effectiveness and scalability of BGFormer.

Abstract:
Posttraining Large Vision-and-Language Models (LVLMs) typically involves Supervised Fine-Tuning (SFT) for knowledge injection or Reinforcement Learning with Verifiable Rewards (RLVR) for performance enhancement. However, SFT often leads to sub-optimal performance, while RLVR remains constrained by the model's internal knowledge base. While a sequential SFT \rightarrow RLVR pipeline can be used, it introduces significant computational overhead and suffers from catastrophic forgetting. To address these limitations, we propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified, single-stage paradigm that integrates the strengths of both SFT and RLVR. By analyzing their training objectives, we establish a unified framework that injects ground-truth labels directly into RLVR rollouts, facilitating simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to ensure training stability and optimization. Extensive experiments demonstrate that ViSurf consistently outperforms standalone SFT, RLVR, and the traditional two-stage pipeline across diverse benchmarks. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf. Our code, data and models will be publicly available.

Abstract:
Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made an implicit assumption that feature representations can be inferred in some cases from singular vectors of attention matrices. However, sound justification for this assumption is lacking. In this paper we address that question, asking: why and when do singular vectors align with features? First, we demonstrate that singular vectors robustly align with features in a model where features can be directly observed. We then show theoretically that such alignment is expected under a variety of general conditions. We close by asking how, operationally, alignment may be recognized in real models where feature representations are not directly observable. We identifysparse attention decompositionas a testable prediction of alignment, and show evidence that it emerges consistent with predictions in real models. Together these results suggest that alignment of singular vectors with features can be a sound and theoretically justified basis for feature identification in language models.

Abstract:
This paper presents a theoretical framework that explains why finetuning small, randomly selected subnetworks (slices) within pre-trained models is sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning slice property, arising from two phenomena: (1) spectral balance— the eigenspectra of different weight matrix slices are remarkably similar—and (2) high task energy—their backbone representations (pretrained weights) retain rich, task-relevant features. This leads to the Universal Winning Slice Hypothesis, which provides a theoretical foundation for parameter-efficient fine-tuning (PEFT) in large-scale models. Inspired by this, we propose SliceFine, a PEFT method that uses this inherent redundancy by updating only selected slices of the origi- nal weights—introducing zero new parameters, unlike adapter-based approaches. Empirically, SliceFine matches the performance of SOTA PEFT methods across various language and vision tasks, while significantly improving training speed, memory efficiency, and model compactness. Our work bridges theory and prac- tice, offering a theoretically grounded alternative to existing PEFT techniques.

Abstract:
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multiturn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling–filtering–ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.

Abstract:
Finegrained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We will release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.

Abstract:
Deploying multiple models within shared GPU clusters is a key strategy to improve resource efficiency in large language model (LLM) serving. Existing multiLLM serving systems improve GPU utilization at the cost of degraded inference performance, particularly time-to-first-token (TTFT). We attribute this degradation to the lack of awareness regarding future workload characteristics. In contrast, recent analyses have shown the strong periodicity and long-term predictability of real-world LLM serving workloads. In this paper, we propose one-for-many GPU prewarming, which proactively loads parameters from multiple models onto GPUs based on workload forecasts. These prewarmed weights enable the system to promptly instantiate serving instances upon encountering request bursts. We design and implement WarmServe, a multi-LLM serving system incorporating three key techniques: (1) a model placement algorithm that optimizes prewarming decisions to minimize cross-model prewarming interference, (2) a KV cache reservation strategy that repurposes idle KV cache space on running GPUs for prewarming new models, and (3) an efficient GPU memory switching mechanism for tensor management. Evaluation on real-world datasets shows that WarmServe reduces TTFT by up to 50.8× compared to the state-of-the-art autoscaling-based system, while supporting up to 2.5× higher request throughput than the GPU-sharing system.

Abstract:
In truncated linear regression, samples (x,y) are shown only when the outcome y falls inside a certain survival set S^\star and the goal is to estimate the unknown ddimensional regressor w^\star. This problem has a long history of study in Statistics and Machine Learning going back to the works of (Galton, 1897; Tobin, 1958) and more recently in, e.g., (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, however, most prior works are limited to the special case where S^\star is precisely known. The more practically relevant case, where S^\star is unknown and must be learned from data, remains open: indeed, here the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, have a d^\mathrmpoly (1/\varepsilon) run time for achieving \varepsilon accuracy. In this work, we give the first algorithm for truncated linear regression with unknown set case that runs in \mathrmpoly (d/\varepsilon) time, by only requiring that the feature vectors are sub-Gaussian. Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition. This learning guarantee adds to the line of works on positive-only PAC learning and may be of independent interest.

Abstract:
Deep learning has emerged as a transformative tool for the neural surrogate modeling of partial differential equations (PDEs), known as neural PDE solvers. However, scaling these solvers to industrialscale geometries with over 10^8 cells remains a fundamental challenge due to the prohibitive memory complexity of processing high-resolution meshes. We present Transolver-3, a new member of the Transolver family as a highly scalable framework designed for high-fidelity physics simulations. To bridge the gap between limited GPU capacity and the resolution requirements of complex engineering tasks, we introduce two key architectural optimizations: faster slice and deslice by exploiting matrix multiplication associative property and geometry slice tiling to partition the computation of physical states. Combined with an amortized training strategy by learning on random subsets of original high-resolution meshes and a physical state caching technique during inference, Transolver-3 enables high-fidelity field prediction on industrial-scale meshes. Extensive experiments demonstrate that Transolver-3 is capable of handling meshes with over 160 million cells, achieving impressive performance across three challenging simulation benchmarks, including aircraft and automotive design tasks.

Abstract:
Mixtureof-Experts (MoE) models are typically pre-trained with explicit load-balancing constraints to ensure statistically balanced expert routing. Despite this, we observe that even well-trained MoE models exhibit significantly imbalanced routing. This behavior is arguably natural—and even desirable—as imbalanced routing allows models to concentrate domain-specific knowledge within a subset of experts. Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices, but with a less-discussed assumption of balanced routing. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures on overloaded devices during post-training or inference, where explicit load balancing is often inapplicable. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and associated expert parameters from overloaded devices to underutilized ones. This ensures that all devices complete their workloads within the minimum collective latency while respecting memory constraints. Across different model scales, LLEP achieves up to 5x speedup and 4x reduction in peak memory usage compared to standard EP. This enables faster and higher-throughput post-training and inference, with ~1.9x faster for gpt-oss-120b. We support our method with extensive theoretical analysis and comprehensive empirical evaluations, including ablation studies. These results illuminate key trade-offs and enable a principled framework for hardware-specific hyper-parameter tuning to achieve optimal performance.

Abstract:
A data mixture refers to how different data sources are combined to train large language models, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches directly on the target model or rely on mixture scaling laws that fail to extrapolate well to large model sizes. We address these limitations by introducing a computeefficient pipeline for data mixture scaling. First, we propose CAMEL, a capacity-aware mixture law that models validation loss with the nonlinear interplay between model size and mixture. We also introduce a loss-to-benchmark prediction law that estimates benchmark accuracy from validation loss, enabling end-to-end performance prediction for the target model. Next, we study how to allocate a fixed compute budget across model scales to fit the law and reduce prediction error. Finally, we apply our method to Mixture-of-Experts models with up to 7B-A150M parameters to fit the law, and verify the optimal mixture derived from the law by extrapolating to a 55B-A1.2B target model. Compared to prior methods, we reduces mixture optimization costs by 50% and improves downstream benchmark performance by up to 3%.

Abstract:
Scaling inference methods such as Markov chain Monte Carlo to highdimensional models remains a central challenge in Bayesian deep learning. A promising recent proposal, microcanonical Langevin Monte Carlo, has shown state-of-the-art performance across a wide range of problems. However, its reliance on full-dataset gradients makes it prohibitively expensive for large-scale problems. This paper addresses a fundamental question: Can microcanonical dynamics effectively leverage mini-batch gradient noise? We provide the first systematic study of this problem, establishing a novel continuous-time theoretical analysis of stochastic-gradient microcanonical dynamics. We reveal two critical failure modes: a theoretically derived bias due to anisotropic gradient noise and numerical instabilities in complex high-dimensional posteriors. To tackle these issues, we propose a principled gradient noise preconditioning scheme shown to significantly reduce this bias and develop a novel, energy-variance-based adaptive tuner that automates step size selection and dynamically informs numerical guardrails. The resulting algorithm is a robust and scalable microcanonical Monte Carlo sampler that achieves state-of-the-art performance on challenging high-dimensional inference tasks like Bayesian neural networks. Combined with recent ensemble techniques, our work unlocks a new class of stochastic microcanonical Langevin ensemble (SMILE) samplers for large-scale Bayesian inference.

Abstract:
We present the first highprobability optimal regret bound for a policy optimization technique applied to the problem of stochastic contextual multi-armed bandit (CMAB) with general offline function approximation. Our algorithm is both efficient and achieves an optimal regret bound of \widetildeO(\sqrt K|\mathcalA|\log|\mathcalF|), where K is the number of rounds, \mathcalA is the set of arms, and \mathcalF is the function class used to approximate the losses. Our results bridge the gap between theory and practice, demonstrating that the widely used policy optimization methods for the contextual bandit problem can achieve a rigorously-proved optimal regret bound. We support our theoretical results with an empirical evaluation of our algorithm.

Abstract:
Large language models (LLMs) inevitably encounter distribution shifts during realworld deployment, leading to performance degradation. Although test-time learning (TTL) adapts LLMs from unlabeled test streams, applying entropy minimization to autoregressive generation faces two challenges: (i) early decoding errors can steer later tokens off track, and updating on them can push the model further off course, and (ii) updates on unreliable tokens can amplify confident error predictions and trigger model collapse. To address these challenges, we propose Future-Gain Guided Test-Time Learning (FG-TTL) for LLMs, which learns selectively from the model's own generations. Our key idea is to update only on tokens that reduce uncertainty in subsequent generation rather than tokens that are merely uncertain at the current step. Specifically, we develop a Future-Gain Guided Token Selection (FTS) strategy to decide where to learn. We introduce Future-Gain as a token-level metric for this purpose and update the model only on high-gain tokens, concentrating learning on informative positions and mitigating temporal error propagation. In addition, we design a Risk-Aware Adaptation (RAA) mechanism that controls how strongly to update by combining gain-based weighting with adaptive temperature scaling based on intrinsic uncertainty, suppressing unreliable gradients while enabling stronger learning on high-gain tokens. Experiments on six benchmarks with three LLM backbones show that FG-TTL achieves the best average performance.

Abstract:
Modern VisionLanguage Models (VLMs) pose significant individual-level privacy risks by linking fragmented multimodal data to identifiable individuals through hierarchical chain-of-thought reasoning. However, existing privacy benchmarks remain structurally insufficient for this threat, as they primarily evaluate privacy perception while failing to address the more critical risk of privacy reasoning: a VLM’s ability to infer and link distributed information to construct individual profiles. To address this gap, we propose MultiPriv, the first benchmark designed to systematically evaluate individual-level privacy reasoning in VLMs. We introduce the Privacy Perception and Reasoning (PPR) framework and construct a bilingual multimodal dataset with synthetic individual profiles, where identifiers (e.g., faces, names) are linked to sensitive attributes. This design enables nine challenging tasks spanning attribute detection, cross-image re-identification, and chained inference. We conduct a large-scale evaluation of over 50 open-source and commercial VLMs. Our analysis shows that 60% of widely used VLMs can perform individual-level privacy reasoning with up to 80% accuracy, posing a significant threat to personal privacy. MultiPriv provides a foundation for developing and assessing privacy-preserving VLMs.

Abstract:
Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find loss scales inversely proportional to depth in LLMs, probably due to functionally similar layers reducing error through ensemble averaging rather than compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust and may arise from the architectural bias of residual networks and target functions incompatible with smooth dynamics. Our findings suggest that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.

Abstract:
The rise of autobidding in online advertising has created new challenges for ensuring advertiser incentive compatibility, particularly when advertisers delegate bidding to agents with high-level constraints. One challenge is the multiplicity of equilibria with reported constraints. Alimohammadi et al. proposed a notion of Auto-bidding Incentive Compatibility (AIC) which serves to highlight that standard auctions may not incentivize truthful reporting of these constraints. However, their definition of AIC is very stringent as it requires that the worst-case outcome of an advertiser's truthful report is at least as good as the best-case outcome of any of the advertiser's possible deviations. In this paper, we introduce two refined and relaxed concepts: Risk-Averse Auto-bidding Incentive Compatibility (RAIC) and Optimistic Auto-bidding Incentive Compatibility (OAIC). RAIC (OAIC) stipulates that truthful reporting is preferred if its least (most) favorable equilibrium outcome is no worse than the least (most) favorable equilibrium outcome from any misreport. We demonstrate that SPA satisfies both RAIC and OAIC. These findings clarify SPA's incentive properties under auto-bidding, specifically regarding advertiser perspectives on equilibrium selection.

Abstract:
Upcycling pretrained dense models into sparse Mixture-of-Experts (MoEs) efficiently increases model capacity but often suffers from poor expert specialization due to naive weight replication. We introduce Dirichlet-Prior Shaping Loss (DPSL), a novel router regularization technique that directly shapes routing probability distributions by matching expert assignments to a target Dirichlet prior resulting in enhanced expert specialization. DPSL enables encoding of inductive biases such as encouraging experts to focus on specific modalities or tasks, without requiring manual intervention. DPSL is a general tool applicable to any module that outputs categorical probability distributions, extending its utility beyond MoE training. Experiments on upcycled MoE vision-language models show that DPSL consistently outperforms upcycling strategies and regularization techniques across standard vision-language benchmarks, addressing the critical issue of poor specialization and fostering higher-performing models.

Abstract:
Understanding the alignment between large language models (LLMs) and human brain activity can reveal computational principles underlying language processing. This work describes a pipeline to apply attribution methods to the brainLLM alignment setting to identify the specific words most important for this alignment. As a case study, we leverage it to study a contentious research question about brain-LLM alignment: the relationship between brain alignment (BA) and next-word prediction (NWP). Across two naturalistic fMRI datasets, we find that BA and NWP rely on largely distinct word subsets: NWP exhibits recency and primacy biases with a focus on syntax, while BA prioritizes semantic and discourse-level information with a more targeted recency effect. This work advances our understanding of how LLMs relate to human language processing and highlights differences in feature reliance between BA and NWP. Beyond this study, our attribution method can be broadly applied to explore the cognitive relevance of model predictions in diverse language processing tasks.

Abstract:
The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Finetuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to restore harmful capabilities while deceptively adhering to safety restrictions. To address this, we propose Safety Bottleneck Regularization (SBR). SBR shifts the defensive focus from the redundant parameter space to the unembedding layer, which serves as a geometric bottleneck. By anchoring the final hidden states of harmful queries to those of the safety-aligned model, SBR enables the model to maintain safe responses even under persistent HFT. Extensive experiments confirm SBR's effectiveness, demonstrating that utilizing just a single safety anchor is sufficient to reduce the Harmful Score to <10 while preserving competitive performance on benign downstream tasks. The code is available at https://anonymous.4open.science/r/SBR-F876

Abstract:
Machine learning interatomic potentials (MLIPs) have proven to be wildly useful for molecular dynamics simulations, powering countless drug and materials discovery applications. However, MLIPs face two primary bottlenecks preventing them from reaching realistic simulation scales: inference time and memory consumption. In this work, we address both issues by challenging the longheld belief that the cutoff radius for the MLIP must be held to a fixed, constant value. For the first time, we introduce adynamiccutoff formulation that still leads to stable, long timescale molecular dynamics simulation. In introducing the dynamic cutoff, we are able to induce sparsity onto the underlying atom graph by targeting a specific number of neighbors per atom, significantly reducing both memory consumption and inference time. We show the effectiveness of a dynamic cutoff by implementing it onto 4 state of the art MLIPs: MACE, Nequip, Orbv3, and TensorNet, leading to2.26xless memory consumption and2.04xfaster inference time, depending on the model and atomic system. We also perform an extensive error analysis and find that the dynamic cutoff models exhibit minimal accuracy dropoff compared to their fixed cutoff counterparts on both materials and molecular datasets. All model implementations and training code will be fully open sourced.

Abstract:
The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, highquality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.

Abstract:
Connectivity structure shapes neural computation, but inferring this structure from population recordings is degenerate: multiple connectivity structures can generate identical dynamics. Recent work uses lowrank recurrent neural networks (lrRNNs) to infer low-dimensional latent dynamics and connectivity structure from observed activity, enabling a mechanistic interpretation of the dynamics. However, standard approaches for training lrRNNs can recover spurious structures irrelevant to the underlying dynamics. We first characterize the identifiability of connectivity structures in lrRNNs and determine conditions under which a unique solution exists. Then, to find such solutions, we develop an inference framework based on maximum entropy and continuous normalizing flows (CNFs), trained via flow matching. Instead of estimating a single connectivity matrix, our method learns the maximally unbiased distribution over connection weights consistent with observed dynamics. This approach captures complex yet necessary distributions such as heavy-tailed connectivity found in empirical data. We validate our method on synthetic datasets with connectivity structures that generate multistable attractors, limit cycles, and ring attractors, and demonstrate its applicability in recordings from rat frontal cortex during decision-making. Our framework shifts circuit inference from recovering connectivity to identifying which connectivity structures are computationally required, and which are artifacts of underconstrained inference.

Abstract:
Multimodal large language models (MLLMs) have substantially advanced video misinformation detection through unified multimodal reasoning, but they often rely on fixeddepth inference and place excessive trust in internally generated assumptions, particularly in scenarios where critical evidence is sparse, fragmented, or requires external verification. To address these limitations, we propose FactGuard, an agentic framework for video misinformation detection that formulates verification as an iterative reasoning process built upon MLLMs. FactGuard explicitly assesses task ambiguity and selectively invokes external tools to acquire critical evidence, enabling progressive refinement of reasoning trajectories. To further strengthen this capability, we introduce a two-stage training strategy that combines domain-specific agentic supervised fine-tuning with decision-aware reinforcement learning to optimize tool usage and calibrate risk-sensitive decision making. Extensive experiments on three public benchmarks demonstrate that FactGuard consistently outperforms state-of-the-art methods in both verification accuracy and reliability.

Abstract:
Data augmentation is a widely used strategy to improve model robustness and generalization by enriching training datasets with synthetic examples. While large language models (LLMs) have demonstrated strong generative capabilities for this purpose, their applications in highstakes domains like healthcare present unique challenges due to the risk of generating clinically incorrect or misleading information. In this work, we propose a novel query-based model collaboration framework that integrates expert-level domain knowledge to guide the augmentation process to preserve critical medical information. Compared to existing LLM-based and traditional augmentation methods, our generated data significantly improves preservation of critical medical information and reduces hallucinations at both the token and concept levels. Experiments on downstream clinical prediction tasks demonstrate consistent performance gains over existing augmentation methods. This lightweight collaborative framework addresses the gap between LLM augmentation potential and the safety requirements of specialized domains.

Abstract:
Reinforcement learning (RL) finetuning is now widely used to improve LLM reasoning, and recent work has begun extending it to vision-language models (VLMs). While RL-tuned VLMs can improve visual reasoning benchmark performance, they can still suffer from weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations—misleading captions or incorrect chain-of-thought (CoT) traces—cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. Entropy-based metrics further show that these perturbations reshape model uncertainty on the correct option, exposing model-specific trends in miscalibration. To better understand these vulnerabilities, we further analyze RL fine-tuning dynamics and uncover an accuracy–faithfulness trade-off: fine-tuning raises benchmark accuracy, but can simultaneously erode the reliability of the accompanying CoT and its robustness to contextual shifts. Although adversarial augmentation improves robustness, it does not by itself prevent faithfulness drift. Incorporating a faithfulness-aware reward can restore alignment between answers and reasoning, but when paired with augmentation, training risks collapsing onto shortcut strategies and robustness remains elusive. Together, these findings highlight the limitations of accuracy-only evaluations and motivate training and assessment protocols that jointly emphasize correctness, robustness, and the faithfulness of visually grounded reasoning.

Abstract:
We propose ContextLM, a framework that implicitly learns multitoken prediction by augmenting standard pretraining with an intrinsic next-context prediction objective. ContextLM builds a language model on top of context embeddings that span multiple tokens, enabling better next-token prediction by predicting the next context. Our model is fully compatible with standard autoregressive, token-by-token evaluation paradigms (e.g., perplexity). Extensive experiments with GPT-2 and Pythia backbones (up to 1.5B parameters and 300B training tokens) reveal that ContextLM shifts the Pareto frontier of scaling laws, exhibiting superior efficiency in parameters, training tokens, and FLOPs. Our results show that ContextLM could already achieve the baseline perplexity using 39% fewer parameters and demonstrates robust generalization improvements on extensive downstream tasks under equivalent parameter counts.

Abstract:
Graph generation is a fundamental task with broad applications, such as drug discovery. Recently, discrete flow matchingbased graph generation, \aka, graph flow model (GFM), has emerged due to its superior performance and flexible sampling. However, effectively aligning GFMs with complex human preferences or task-specific objectives remains a significant challenge. In this paper, we propose Graph-GRPO, an online reinforcement learning (RL) framework for training GFMs under verifiable rewards. Our method makes two key contributions: (1) We derive an analytical expression for the transition probability of GFMs, replacing the Monte Carlo sampling and enabling fully differentiable rollouts for RL training; (2) We propose a refinement strategy that randomly perturbs specific nodes and edges in a graph, and regenerates them, allowing for localized exploration and self-improvement of generation quality. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness of Graph-GRPO. With only 50 denoising steps, our method achieves 95.0% and 97.5% Valid-Unique-Novelty scores on the planar and tree datasets, respectively. Moreover, Graph-GRPO achieves state-of-the-art performance on the molecular optimization tasks, outperforming graph-based and fragment-based RL methods as well as classic genetic algorithms.

Abstract:
Recent developments in large language models have shown advantages in reallocating a notable share of computational resource from training time to inference time. However, the principles behind inference time scaling are not well understood. In this paper, we introduce an analytically tractable model of inferencetime scaling: Bayesian linear regression with a reward-weighted sampler, where the reward is determined from a linear model, modeling LLM-as-a-judge scenario. We study this problem in the high-dimensional regime, where the deterministic equivalents dictate a closed-form expression for the posterior predictive mean and variance. We analyze the generalization error when training data are sampled from a teacher model. We draw k inference-time samples and select via softmax at a temperature applied to a quadratic reward. When the reward is not too different from the teacher, the generalization error decreases monotonically with increasing inference time samples k. However, the specific reward that optimizes inference-time selection generally differs from the teacher. In contrast, substantial reward misspecification induces a finite optimal k beyond which more sampling can increase the generalization error. For fixed k, there exists an optimal sampling temperature. We experimentally verify these facts in large language model inference with an additional large language model as a judge. In the "best-of-k" limit with the teacher as reward, we theoretically show that the generalization error decays as \Theta(1/k^2) and determine the leading coefficient via extreme value theory. These formulas delineate domains where scaling inference-time computation is provably preferable to collecting more data. Finally, we demonstrate that when task difficulty increases, the previously mentioned advantage of inference-time compute degrades.

Abstract:
Foundation models have become a dominant paradigm in machine learning, achieving remarkable performance across diverse tasks through largescale pretraining. However, these models often yield overconfident, uncalibrated predictions. The standard approach to quantifying epistemic uncertainty, training an ensemble of independent models, incurs prohibitive computational costs that scale linearly with ensemble size, making it impractical for large foundation models. We propose Singular Value Ensemble (SVE), a parameter-efficient implicit ensemble method that builds on a simple, but powerful core assumption: namely, that the singular vectors of the weight matrices constitute meaningful subspaces of the model's knowledge. Pretrained foundation models encode rich, transferable information in their weight matrices. If the singular vectors are indeed meaningful (orthogonal) "knowledge directions". To obtain a model ensemble, we modulate only how strongly each direction contributes to the output. Rather than learning entirely new parameters, we freeze the singular vectors and only train per-member singular values that rescale the contribution of each direction in that shared knowledge basis. Ensemble diversity emerges naturally as stochastic initialization and random sampling of mini-batches during joint training cause different members to converge to different combinations of the same underlying knowledge. SVE achieves uncertainty quantification comparable to explicit deep ensembles while increasing the parameter count of the base model by less than 1%, making principled uncertainty estimation accessible in resource-constrained settings. We validate SVE on NLP and vision tasks with various different backbones and show that it improves calibration while maintaining predictive accuracy.

Abstract:
Safety alignment in large language models is typically evaluated under isolated queries, yet realworld use is inherently multi-turn. Although multi-turn jailbreaks are empirically effective, the structure of conversational safety failure remains insufficiently understood. In this work, we study safety failures from a state-space perspective and show that many multi-turn failures arise from structured contextual state evolution rather than isolated prompt vulnerabilities. We introduce STAR, a state-oriented diagnostic framework that treats dialogue history as a state transition operator and enables controlled analysis of safety behavior along interaction trajectories. Rather than optimizing attack strength, STAR provides a principled probe of how aligned models traverse the safety boundary under autoregressive conditioning. Across multiple frontier language models, we find that systems which appear robust under static evaluation can undergo rapid and reproducible safety collapse under structured multi-turn interaction. Mechanistic analysis reveals monotonic drift away from refusal-related representations and abrupt phase transitions induced by role-conditioned context. Together, these findings motivate viewing language model safety as a dynamic, state-dependent process defined over conversational trajectories.

Abstract:
Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent "judge'' decoding boosts this process by relaxing verification criteria by accepting draft tokens that may exhibit minor discrepancies from target model output, but existing methods are restricted by their reliance on human annotations or tasks with verifiable ground truths, limiting generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via selfsupervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show SelfJudge achieves superior inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.

Abstract:
A key challenge in normalizing flows is finding expressive invertible scalar bijections. Existing approaches face tradeoffs: affine transformations are smooth and analytically invertible but lack expressivity; monotonic splines offer local control but are only piecewise smooth and act on bounded domains; residual flows achieve smoothness but need numerical inversion. We introduce three families of analytic bijections that are globally smooth (C^\infty), defined on all of \mathbbR, and analytically invertible in closed form, combining the favorable properties of prior approaches. Beyond serving as drop-in replacements in coupling flows, where they match or exceed spline performance, we develop radial flows: a novel architecture using direct parametrization that transforms the radial coordinate while preserving angular direction. Radial flows exhibit exceptional training stability, produce geometrically interpretable transformations, and on targets with radial structure can achieve comparable quality to coupling flows with 1000× fewer parameters. We provide comprehensive evaluation on 1D and 2D benchmarks, and demonstrate applicability to higher-dimensional physics problems through experiments on \phi^4 lattice field theory, where our bijections outperform affine baselines and enable problem-specific designs that address mode collapse.

Abstract:
Postdeployment machine learning algorithms often influence the environments they act in, and thusshiftthe underlying dynamics that the standard reinforcement learning (RL) methods ignore. While designing optimal algorithms in thisperformativesetting has recently been studied in supervised learning, the RL counterpart remains under-explored. In this paper, we prove the performative counterparts of the performance difference lemma and the policy gradient theorem in RL, and introduce thePerformative Policy GradientalgorithmPePG.PePGis the first policy gradient algorithm designed to account for performativity in RL. Under softmax parametrisation, and also with and without entropy regularisation, we prove thatPePGconverges toperformatively optimal policies, i.e. policies that remain optimal under the distribution shifts induced by themselves. Thus,PePGsignificantly extends the prior works in Performative RL that achievesperformative stabilitybut not optimality. Our empirical analysis on standard performative RL environments validate thatPePGoutperforms the existing performative RL algorithms aiming for stability.

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generation tasks. However, their massive parameter scale leads to significant resource consumption and latency during inference. Posttraining weight-only quantization offers a promising solution by reducing model size and accelerating token generation through alleviating the memory-bound issue. Nevertheless, there are inherent systematic outliers in weights, and although some efforts have attempted to address them, such as scaling and rotation, the performance of low-bit quantization remains far from satisfactory. In this paper, we propose Outlier Self-Absorption Quantization (OSAQ), which performs second-order low-rank derived additive weight suppression for low-bit weight-only LLM quantization. Specifically, we observe that Hessian exhibits low-rank consistency across different inputs, with certain directions persistently lacking strength. Leveraging this property, we construct an additive weight transformation based on the Hessian’s null space, thereby suppressing weight outliers without affecting the task loss. This additive transformation can be absorbed into the weights offline, requiring no inter-layer transformations and introducing no inference overhead. Moreover, the construction is efficiently achieved by a closed-form solution, without resource-intensive training or iterative procedures. Extensive experiments across models of varying scales and tasks are conducted, and the results show that OSAQ effectively suppresses outliers and improves low-bit quantization performance.

Abstract:
Large language models (LLMs) increasingly operate in settings that require reliable longcontext understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We proposeRoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative ``views'' of the same training sequence by perturbing its RoPE indices---effectively moving parts of the context to different positions---and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.

Abstract:
Statespace models (SSMs) have emerged as a powerful foundation for long-range sequence modeling, with the HiPPO framework showing that continuous-time projection operators can be used to derive stable, memory-efficient dynamical systems that encode the past history of the input signal. However, existing projection-based SSMs often rely on polynomial bases with global temporal support, whose inductive biases are poorly matched to signals exhibiting localized or transient structure. In this work, we introduce \emphWaveSSM, a collection of SSMs constructed over wavelet frames. Our key observation is that wavelet frames yield a localized support on the temporal dimension, useful for tasks requiring precise localization. Empirically, we show that on equal conditions, WaveSSM outperforms orthogonal counterparts as S4 on real-world datasets with transient dynamics, including physiological signals on the PTB-XL dataset and raw audio on Speech Commands.

Abstract:
We propose ReFORC, an adaptive reward prediction method that, given a context, enables prediction of the expected future rewards as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, demonstrating improved prediction with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy, 2) optimized model and thinking length selection that achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy compared to the largest model, 3) adaptive test-time scaling, which increases accuracy by 11% in high compute regime, and 7% in low compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.

Abstract:
The quadratic computational complexity of selfattention remains a fundamental bottleneck for scaling Large Language Models (LLMs) to long contexts, particularly during the pre-filling phase. In this paper, we rethink the causal attention mechanism from the perspective of information flow. Due to causal constraints, tokens at initial positions participate in the aggregation of every subsequent token. However, existing sparse methods typically apply a uniform top-k selection across all token positions within a layer, ignoring the cumulative dependency of token information inherent in causal architectures. To address this, we propose Stem, a novel, plug-and-play sparsity module aligned with information flow. First, Stem employs the Token Position-Decay strategy, applying position-dependent top-k within each layer to retain initial tokens for recursive dependencies. Second, to preserve information-rich tokens, Stem utilizes the Output-Aware Metric. It prioritizes high-impact tokens based on approximate output magnitude. Extensive evaluations demonstrate that Stem achieves superior accuracy with reduced computation and pre-filling latency. Code is available at https://anonymous.4open.science/r/stem-254ENLXL.

Abstract:
SelfRewarding Language Models propose an architecture in which the Large Language Models(LLMs) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing self-rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose Temporal Self-Rewarding Language Models that strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) Anchored Rejection - fixing rejected responses using the past initial model's outputs and (2) Future-Guided Chosen - dynamically curating chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama3B/8B/70B) demonstrate significant improvements when trained with our method compared to self-rewarding using same computation resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the self-rewarding baseline (19.69) by 9.75. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data. The generality of our temporal strategy is further validated by its benefits when extended to online reinforcement learning settings on mathematical reasoning tasks.

Abstract:
The specification of the action space plays a pivotal role in imitationbased robotic manipulation policy learning, fundamentally shaping the optimization landscape of policy learning. While recent advances have focused heavily on scaling training data and model capacity, the choice of action space remains guided by ad-hoc heuristics or legacy designs, leading to an ambiguous understanding of robotic policy design philosophies. To address this ambiguity, we conducted a large-scale and systematic empirical study, confirming that the action space does have significant and complex impacts on robotic policy learning. We dissect the action design space along temporal and spatial axes, facilitating a structured analysis of how these choices govern both policy learnability and control stability. Based on 13,000+ real-world rollouts on a bimanual robot and evaluation on 500+ trained models over four scenarios, we examine the trade-offs between absolute vs. delta representations, and joint-space vs. task-space parameterizations. Our large-scale results suggest that properly designing the policy to predict delta actions consistently improves performance, while joint-space and task-space representations offer complementary strengths, favoring control stability and generalization, respectively.

Abstract:
This paper introduces Probabilistic MultiVariable Branching (PMVB), a simple and effective technique for accelerating mixed-integer optimization using data-driven machine learning models. At its core, PMVB employs a multi-variable branching procedure that partitions the feasible region via data-driven hyperplanes and requires only two lines of code to implement. Moreover, PMVB is model-agnostic and compatible with a wide range of machine learning models. Leveraging tools from statistical learning theory, we develop interpretable hyperparameter selection strategies and propose several extensions to further enhance performance. We evaluate PMVB by integrating it into state-of-the-art MIP solvers and conducting experiments on both classical benchmark datasets and real-world instances. The results demonstrate the effectiveness of PMVB in improving MIP-solving efficiency.

Abstract:
Foundation models have revolutionized AI, but adapting them efficiently for multimodal tasks, particularly in dualstream architectures composed of unimodal encoders, such as DINO and BERT, remains a significant challenge. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) enable lightweight adaptation, yet they operate in isolation within each modality, limiting their ability in capturing cross-modal interactions. In this paper, we take a step in bridging this gap with Cross-Modal Low-Rank Adaptation (CoLA), a novel PEFT framework that extends LoRA by introducing a dedicated inter-modal adaptation pathway alongside the standard intra-modal one. This dual-path design enables CoLA to adapt unimodal foundation models to multimodal tasks effectively, without interference between modality-specific and cross-modal learning. We evaluate CoLA across a range of vision-language (RefCOCO, RefCOCO+, RefCOCOg) and audio-visual (AVE, AVS) benchmarks, where it consistently outperforms LORA, achieving a relative gain of around 3% and 2%, respectively, while maintaining parameter efficiency. Notably, CoLA enables the first multi-task PEFT framework for visual grounding, bridging a key gap in efficient multimodal adaptation.

Abstract:
3D style transfer enables the creation of visually expressive 3D content, enriching the visual appearance of 3D scenes and objects. However, existing VGGand CLIP-based methods struggle to model multi-view consistency within the model itself, while diffusion-based approaches can capture such consistency but rely on denoising directions, leading to unstable training. To address these limitations, we propose DiffStyle3D, a novel diffusion-based paradigm for 3DGS style transfer that directly optimizes in the latent space. Specifically, we introduce an Attention-Aware Loss that performs style transfer by aligning style features in the self-attention space, while preserving original content through content feature alignment. Inspired by the geometric invariance of 3D stylization, we propose a Geometry-Guided Multi-View Consistency method that integrates geometric information into self-attention to enable cross-view correspondence modeling. Based on geometric information, we additionally construct a geometry-aware mask to prevent redundant optimization in overlapping regions across views, which further improves multi-view consistency. Extensive experiments show that DiffStyle3D outperforms state-of-the-art methods, achieving higher stylization quality and visual realism.

Abstract:
Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet realworld user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks involving procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graphs or code data alone does not reliably generalize to corresponding natural language tasks, while training solely on natural language can lead to inefficient performance gains. To address this gap, we propose a two-stage data curriculum that first trains on symbolic, then natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained by our method can closely match zero-shot GPT-4o in naturalistic planning. Finally, our analysis suggests that successful cross-representation generalization can be interpreted as a form of generative analogy, which our curriculum effectively encourages.

Abstract:
Because of the pervasive use of deep neural networks (DNNs), especially in highstakes domains, the interpretability of DNNs has received increased attention. The general idea of rationale extraction (RE) is to provide an interpretable-by-design framework for DNNs via a select-predict architecture where two neural networks learn jointly to perform feature selection and prediction, respectively. Given only the remote supervision from the final task prediction, the process of learning to select subsets of features (orrationales) requires searching in the space of all possible feature combinations, which is computationally challenging and even harder when the base neural networks are not sufficiently capable. To improve the predictive performance of RE models that are based on less capable or smaller neural networks (i.e., the students), we proposeREKD(RationaleExtraction withKnowledgeDistillation) where a student RE model learns from the rationales and predictions of a teacher (i.e., arationalist) in addition to the student's own RE optimization. This structural adjustment to RE aligns well with how humans could learn effectively from interpretable and verifiable knowledge. Because of the neural-model agnostic nature of the method, any black-box neural network could be integrated as a backbone model. To demonstrate the viability of REKD, we conduct experiments with multiple variants of BERT and vision transformer (ViT) models. Our experiments across language and vision classification datasets (i.e., IMDB movie reviews, CIFAR 10 and CIFAR 100) show that REKD significantly improves the predictive performance of the student RE models.

Abstract:
Video editing has evolved toward InContext Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we proposeIn-contextSparseAttention (ISA), the first experimentally lossless sparse framework tailored for ICL video editing. Our design is grounded in two key insights:first, context tokens exhibit significantly lower saliency than source tokens;second, we theoretically prove and empirically validate that Query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query grouping mechanism that routes high-error queries to full attention and low-error ones to a computationally efficient 0-th order Taylor sparse attention. Furthermore, we construct a scalable pipeline to curate a 1M-sample dataset and trainLIVEditor, a novel lightning video editing model via ISA. Extensive experiments demonstrate that LIVEditor achieves a ~60% reduction in latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering experimentally lossless acceleration without compromising visual fidelity.

Abstract:
We present Videoin-the-Loop (ViTL), a two-stage long-video QA framework that preserves a fixed token budget by first localizing question-relevant interval(s) with a low-fps skim and then answering via span-aware reallocation of visual tokens at higher effective frame rate, emitting an interleaved output with both spans and the final option for direct attribution. We also introduce VGrounding-QA, which converts description based event graphs into span-grounded multiple-choice QA by pairing each question with ground-truth time span(s) and related reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL attains up to 8.6% with 50% less frame input on long-video QA and temporal grounding (e.g., Charades-STA, ActivityNet-Captions) and ablations show that span-aware token reallocation consistently surpasses uniform sampling. Together, VGrounding-QA and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.

Abstract:
Structured pruning reduces LLM inference cost by removing lowimportance architectural components. This can be viewed as learning a multiplicative gate for each component under an \ell_0 sparsity constraint. Due to the discreteness of the \ell_0 norm, prior work typically adopts stochastic hard-concrete relaxations to enable differentiable optimization; however, this stochasticity can introduce a train--test mismatch when sampled masks are discretized for deployment and restricting masks to a bounded, near-binary range. To address this, we propose Deterministic Differentiable Pruning (DDP), a mask-only optimization method that eliminates stochasticity by directly optimizing a deterministic soft surrogate of the discrete \ell_0 objective. Compared with prior approaches, DDP offers greater expressiveness, reduced train--test mismatch, and faster convergence. We apply our method to several dense and MoE models, including Qwen3-32B and Qwen3-30B-A3B, achieving a performance loss as small as 1% on downstream tasks while outperforming previous methods at 20% sparsity. We further demonstrate end-to-end inference speedups in realistic deployment settings with vLLM.

Abstract:
Adaptation methods have been a workhorse for unlocking the transformative power of pretrained diffusion models in diverse applications. Existing approaches often abstract adaptation objectives as a reward function and steer diffusion models to generate high-reward samples. However, these approaches can incur high computational overhead due to additional training, or rely on stringent assumptions on the reward such as differentiability. Moreover, despite their empirical success, theoretical justification and guarantees are seldom established. In this paper, we propose \textttDOIT (Doob-Oriented Inference-time Transformation), a training-free and computationally efficient adaptation method that applies to generic, non-differentiable rewards. The key framework underlying our method is a measure transport formulation that seeks to transport the pre-trained generative distribution to a high-reward target distribution. We leverage Doob's h-transform to realize this transport, which induces a dynamic correction to the diffusion sampling process and enables efficient simulation-based computation without modifying the pre-trained model. Theoretically, we establish a high probability convergence guarantee to the target high-reward distribution via characterizing the approximation error in the dynamic Doob's correction. Empirically, on D4RL offline RL benchmarks, our method consistently outperforms state-of-the-art baselines while preserving sampling efficiency.

Abstract:
Makeon-demand combinatorial synthesis libraries (CSLs) like Enamine REAL have significantly enabled drug discovery efforts. However, their large size presents a challenge for virtual screening, where the goal is to identify the top compounds in a library according to a computational objective (e.g., optimizing docking score) subject to computational constraints under a limited computational budget. For current library sizes---numbering in the tens of billions of compounds---and scoring functions of interest, a routine virtual screening campaign may be limited to scoring fewer than 0.1% of the available compounds, leaving potentially many high scoring compounds undiscovered. Furthermore, as constraints (and sometimes objectives) change during the course of a virtual screening campaign, existing virtual screening algorithms typically offer little room for amortization. We propose the approximate-but-exhaustive search protocol for CSLs, or APEX. APEX utilizes a neural network surrogate that exploits the structure of CSLs in the prediction of objectives and constraints to make full enumeration on a consumer GPU possible in under a minute, allowing for exact retrieval of approximate top-k sets. To demonstrate APEX's capabilities, we develop a benchmark CSL comprised of more than 10 million compounds, all of which have been annotated with their docking scores on five medically relevant targets along with physicohemical properties measured with RDKit such that, for any objective and set of constraints, the ground truth top-k compounds can be identified and compared against the retrievals from any virtual screening algorithm. We show APEX's consistently strong performance both in retrieval accuracy and runtime compared to alternative methods.

Abstract:
Large language models (LLMs) are increasingly deployed as agents in dynamic realworld environments, where success depends on maintaining precise records of actions and observations. However, the resulting unbounded context growth in long-horizon agentic tasks makes two critical bottlenecks: prohibitive inference memory costs and reasoning degradation due to irrelevant information. Existing compression methods fail to fully address this, often relying on brittle heuristics or requiring parameter updates impractical for proprietary or large-scale LLMs. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both observations and history into concise, informative representations. Distinct from prior works, ACON employs an optimization in natural language space: it iteratively refines compression guidelines based on failure analysis of the agent, ensuring critical state information is preserved without model fine-tuning. To further minimize computational overhead, we distill the optimized compressor into smaller models. Experiments on AppWorld, OfficeBench, and Multi-objective QA demonstrate that ACON reduces peak token usage by 26–54% while maintaining task performance. Notably, it enables smaller LMs to function effectively as long-horizon agents, achieving up to 46% performance improvement by mitigating context distraction.

Abstract:
We introduce ideal attribution mechanisms, a formal abstraction for reasoning about attribution decisions over strings. At the core of this abstraction lies the ledger, an appendonly log of the prompt-response interaction history between a model and its user. Each mechanism produces deterministic decisions based on the ledger and an explicit selection criterion, making it well-suited to serve as a ground truth for attribution. We frame the design goal of watermarking schemes as faithful representation of ideal attribution mechanisms. This novel perspective brings conceptual clarity, replacing piecemeal probabilistic statements with a unified language for stating the guarantees of each scheme. It also enables precise reasoning about desiderata for future watermarking schemes, even when no current construction achieves them, as the ideal functionalities are specified first. In this way, the framework provides a roadmap that clarifies which guarantees are attainable in an idealized setting and worth pursuing in practice.

Abstract:
Multimodal large language models (MLLMs) exhibit strong visuallanguage reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment, or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the MLLM's native visual modality. At its core is DrAction, a differentiable, format-agnostic renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide the rendering to produce task-informative visual tokens. To further enhance reasoning capabilities, we introduce a cooperative training strategy: Causal Reasoning Distillation transfers structured, step-by-step reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions. SkeletonLLM demonstrates strong generalization on diverse tasks including recognition, captioning, reasoning, and cross-format transfer---suggesting a viable path for applying MLLMs to non-native modalities. Code is available in the Supplementary Material.

Abstract:
Nash regret has recently emerged as a principled fairnessaware performance metric for stochastic multi-armed bandits, motivated by the Nash Social Welfare objective. Although this notion has been extended to linear bandits, existing results suffer from suboptimality in ambient dimension d, stemming from proof techniques that rely on restrictive concentration inequalities. In this work, we resolve this open problem by introducing new analytical tools that yield an order-optimal Nash regret bound in linear bandits. Beyond Nash regret, we initiate the study of p-means regret in linear bandits, a unifying framework that interpolates between fairness and utility objectives and strictly generalizes Nash regret. We propose a generic algorithmic framework, FairLinBandit, that works as a meta-algorithm on top of any linear bandit strategy. We instantiate this framework using two bandit algorithms: Phased Elimination and Upper Confidence Bound, and prove that both achieve sublinear p-means regret for the entire range of p. Extensive experiments on linear bandit instances generated from real-world datasets demonstrate that our methods consistently outperform the existing state-of-the-art baseline.

Abstract:
The web is littered with images, once created for human consumption and now increasingly interpreted by agents using visionlanguage models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, background, or depicted context). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a new lens on the internal value functions of image-based AI agents, enabling systematic study of what they are visually attracted to and why.

Abstract:
Hallucinationsfluent but factually incorrect responses---pose a major challenge to the reliability of language models, especially in multi-step or agentic settings. This work investigates hallucinations in small-sized LLMs through a geometric perspective, starting from the hypothesis that when models generate multiple responses to the same prompt, genuine ones exhibit tighter clustering in the embedding space, we prove this hypothesis and, leveraging this geometrical insight, we also show that it is possible to achieve a consistent level of separability. This latter result is used to introduce a label-efficient propagation method that classifies large collections of responses from just 30–50 annotations, achieving F1 scores above 90%. Our findings, framing hallucinations from a geometric perspective in the embedding space, complement traditional knowledge-centric and single-response evaluation paradigms, paving the way for further research.

Abstract:
We propose CRAFT, a redteaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent–textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average79.0%improvement in reasoning safety and87.7%improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.

Abstract:
We present ExCyTInBench, the first benchmark to Evaluate an LLM agent X on the task of Cyber Threat Investigation through security questions derived from investigation graphs. Real‑world security analysts must sift through a large number of heterogeneous security logs, follow multi‑hop chains of evidence to investigate threats. With the developments of LLMs, building LLM-based agents for automatic threat investigation is a promising direction. We construct a benchmark from a controlled Azure tenant including a SQL environment covering 57 log tables from Microsoft Sentinel and related services, and 7542 generated questions. We leverage security logs extracted with expert-crafted detection logic to build threat investigation graphs, and then generate questions with LLMs using paired nodes on the graph, taking the start node as background context and the end node as answer. Anchoring each question to these explicit nodes and edges not only provides automatic, explainable ground truth answers but also makes the pipeline reusable and readily extensible to new logs. Our comprehensive experiments on the test set with different models confirm the difficulty of the task: the best model so far can achieve a reward of 0.606, leaving much headroom for future research.

Abstract:
Offline goalconditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separate high- and low-level networks and generate only a single intermediate subgoal, making them inadequate for complex tasks that require coordinating multiple intermediate decisions. To address this limitation, we draw inspiration from the chain-of-thought paradigm and propose the Chain-of-Goals Hierarchical Policy (CoGHP), a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture. Given a state and a final goal, CoGHP autoregressively generates a sequence of latent subgoals followed by the primitive action, where each latent subgoal acts as a reasoning step that conditions subsequent predictions. To implement this efficiently, we pioneer the use of an MLP-Mixer backbone, which supports cross-token communication and captures structural relationships among state, goal, latent subgoals, and action. Across challenging navigation and manipulation benchmarks, CoGHP consistently outperforms strong offline baselines, demonstrating improved performance on long-horizon tasks.

Abstract:
Meta reinforcement learning (RL) allows agents to leverage experience across a distribution of tasks on which the agent cantrainat will, enabling faster learning of optimal policies on newtesttasks. Despite its success in improving sample complexity on test tasks, many realworld applications, such as robotics and healthcare, impose safety constraints during testing. Constrained meta RL provides a promising framework for integrating safety into meta RL. The key challenge is to learn optimal policies while ensuring safe exploration, meaning that policies must remain feasible throughout the testing process. A largely unexplored direction is sample complexity for constrained meta RL with provable safe exploration guarantees. To address this gap, we propose an algorithm that refines policies learned during training, with provable safe exploration and sample complexity guarantees for learning a near optimal policy. We further derive a matching lower bound, showing that this sample complexity is tight. We validate our approach in a gridworld environment, where it outperforms prior constrained RL and constrained meta RL methods in learning efficiency while ensuring safe exploration.

Abstract:
Large language model(LLM)driven multi-agent systems(MAS) coordinate specialized agents through predefined interaction topologies and have shown promise for complex tasks such as competition-level code generation. Recent studies demonstrate that carefully designed multi-agent workflows and communication graphs can significantly improve code generation performance by leveraging collaborative reasoning. However, existing methods neither adapt topology density to task difficulty nor iteratively refine the topology within an instance using execution feedback, which leads to redundant communication and performance bottlenecks. To address these issues, we propose AgentConductor: a reinforcement learning-optimized MAS with an LLM-based orchestrator agent as its core, which enables end-to-end feedback-driven dynamic generation of interaction topologies. For each query, AgentConductor infers agent roles and task difficulty, then constructs a task-adapted, density-aware layered directed acyclic graph (DAG) topology, underpinned by two key innovations. First, we design a novel topological density function that captures communication-aware mathematical characterizations of multi-agent interactions. Second, we adopt difficulty interval partitioning to avoid excessive pruning for precise topological density upper bound measurement per difficulty level and finer-grained control. Empirically, across three competition-level and two foundational code datasets, AgentConductor achieves state-of-the-art accuracy, outperforming the strongest baseline by up to 14.6% in pass@1 accuracy, 13% in density reduction, and 68% in token cost reduction.

Abstract:
The ski rental problem is a canonical model for online decisionmaking under uncertainty, capturing the fundamental trade-off between repeated rental costs and a one-time purchase. While classical algorithms focus on worst-case competitive ratios and recent ``learning-augmented'' methods leverage point-estimate predictions, neither approach fully exploits the richness of full distributional predictions while maintaining rigorous robustness guarantees. We address this gap by introducing algorithms that systematically integrate distributional predictions into both deterministic and randomized algorithms. For the deterministic setting, we formalize the problem under perfect distributional prediction and derive an efficient algorithm to compute the optimal threshold-buy day. We provide a rigorous performance analysis, identifying sufficient conditions on the predicted distribution under which the expected competitive ratio (ECR) matches the classic optimal randomized bound. To handle imperfect predictions, we propose the Clamp Policy, which restricts the buying threshold to a safe range controlled by a tunable parameter. We show that this policy is both robust, maintaining good performance even with large prediction errors, and consistent, approaching the optimal performance as predictions become accurate. For the randomized setting, we characterize the stopping distribution via a Water-Filling Algorithm, which optimizes expected cost while strictly satisfying robustness constraints. Experimental results across diverse distributions (Gaussian, geometric, and bi-modal) demonstrate that our framework improves consistency by significantly over existing point-prediction baselines while maintaining comparable robustness.

Abstract:
Designing algorithms that can explore multimodal target distributions accurately across successive refinements of an underlying highdimensional problem is a central challenge in sampling. Annealed Langevin dynamics (ALD) is a widely used alternative to classical Langevin since it often yields much faster mixing on multimodal targets, but there is still a gap between this empirical success and existing theory: when, and under which design choices, can ALD be guaranteed to remain stable as dimension increases? In this paper, we help bridge this gap by providing a uniform-in-dimension analysis of continuous-time ALD for multimodal targets that can be well-approximated by Gaussian mixture models. Along an explicit annealing path obtained by progressively removing Gaussian smoothing of the target, we identify sufficient spectral conditions—linking smoothing covariance and the covariances of the Gaussian components of the mixture—under which ALD achieves a prescribed accuracy within a single, dimension-uniform time horizon. We then establish dimension-robustness to imperfect initialization and score approximation: under a misspecified-mixture score model, we derive explicit conditions showing that preconditioning the ALD algorithm with a sufficiently decaying spectrum is necessary to prevent error terms from accumulating across coordinates and destroying dimension-uniform control. Finally, numerical experiments illustrate and validate the theory.

Abstract:
Efficiently processing long sequences with Transformer models usually requires splitting the computations across accelerators via context parallelism. The dominant approaches in this family of methods, such as Ring Attention or DeepSpeed Ulysses, enable scaling over the context dimension but do not focus on memory efficiency, which limits the sequence lengths they can support. More advanced techniques, such as Fully Pipelined Distributed Transformer or activation offloading, can further extend the possible context length at the cost of training throughput. In this paper, we present UPipe, a simple yet effective context parallelism technique that performs finegrained chunking at the attention head level. This technique significantly reduces the activation memory usage of self-attention, breaking the activation memory barrier and unlocking much longer context lengths. Our approach lowers the peak activation memory usage by as much as82.5%for 70B Transformers, while matching previous context parallelism techniques in terms of training speed. UPipe can support maximum context lengths of up to 5M tokens for training 8B models on a single 8xH100 node, improving upon prior methods by25%.

Abstract:
Visual localization in complex environments remains a critical challenge for robotics and AR applications. Sequential localization, where pose estimates are refined over time, is important for autonomous agents. However, traditional methods often require storing extensive image databases or point clouds, leading to significant storage overhead. This paper introduces a novel, lightweight approach to sequential visual localization using 3D scene graphs. Our method represents the environment with a compact scene graph, where nodes represent objects (with coarse meshes) and edges encode spatial relationships. For each image in the localization phase, we extract perpatch semantic features, predicting object identities. Localization is performed within a particle filter framework. Each particle, representing a camera pose, projects the coarse object meshes from the scene graph into the image, assigning object identities to patches based on visibility. The similarity of the per-patch features, in the input image, and object features from the scene graph determines the weight of a particle. Subsequent images are incorporated sequentially, refining the pose estimate. By leveraging a compact scene graph and efficient semantic matching, our method significantly reduces storage while maintaining performance on real-world datasets. The code will be made public.

Abstract:
The generation of RegisterTransfer Level (RTL) code is a crucial yet labor-intensive step in digital hardware design, traditionally requiring engineers to manually translate complex specifications into thousands of lines of synthesizable Hardware Description Language (HDL) code. While Large Language Models (LLMs) have shown promise in automating this process, existing approaches—including fine-tuned domain-specific models and advanced agent-based systems—struggle to scale to industrial IP-level design tasks. We identify three key challenges: (1) handling long, highly detailed documents, where critical interface constraints become buried in unrelated submodule descriptions; (2) generating long RTL code, where both syntactic and semantic correctness degrade sharply with increasing output length; and (3) navigating the complex debugging cycles required for functional verification through simulation and waveform analysis. To overcome these challenges, we propose LocalV, a multi-agent framework that leverages information locality in modular hardware design. LocalV decomposes the long-document to long-code generation problem into a set of short-document, short-code tasks, enabling scalable generation and debugging. Specifically, LocalV integrates hierarchical document partitioning, task planning, localized code generation, interface-consistent merging, and AST-guided locality-aware debugging. Experiments on \textscRealBench, an IP-level Verilog generation benchmark, demonstrate that LocalV substantially outperforms state-of-the-art (SOTA) LLMs and agents, achieving a pass rate of 45.0% compared to 21.6%.

Abstract:
We study incontext learning for nonparametric regression with \alpha-Hölder smooth regression functions, for some \alpha>0. We prove that, with n in-context examples and d-dimensional regression covariates, a pretrained transformer with \Theta(\log n) parameters and \Omega\bigl(n^2\alpha/(2\alpha+d)\log^3 n\bigr) pretraining sequences can achieve the minimax-optimal rate of convergence O\bigl(n^-2\alpha/(2\alpha+d)\bigr) in mean squared error. Our result requires substantially fewer transformer parameters and pretraining sequences than previous results in the literature. This is achieved by showing that transformers are able to approximate local polynomial estimators efficiently by implementing a kernel-weighted polynomial basis and then running gradient descent.

Abstract:
We study the problems of approximate sampling from and distribution testing of \emphmixture models, where the modes satisfy a functional inequality called approximate tensorization of entropy (ATE). While it is known that ATE makes these tasks more efficient in the unimodal setting, mixtures of few distributions satisfying ATE do not necessarily satisfy ATE overall, leading to a lack of theoretical guarantees for multimodal distributions, which are a key challenging case of modern generative models. We show this gap can be overcome by establishing the following pair of results for mixtures of ATE distributions: 1) We show fast mixing of Glauber dynamics from a databased initialization, with optimal sample complexity, for mixtures of distributions satisfying modified log-Sobolev inequalities, building on similar results in \citeKoehlerV24, HuangMRW24 for mixtures satisfying the weaker Poincaré inequality. 2) Answering an open question from \citeblanca2023complexity, we give efficient identity-testers for mixtures of ATE distributions in the coordinate-conditional sampling access model.

Abstract:
We introduce TFRBench, the first benchmark designed to evaluate the reasoning capabilities of forecasting systems. Traditionally, timeseries forecasting has been evaluated solely on numerical accuracy, treating foundation models as "black boxes." Unlike existing benchmarks, TFRBench provides a protocol for evaluating the reasoning generated by forecasting systems--specifically their analysis of cross-channel dependencies, trends, and external events. To enable this, we propose a systematic multi-agent framework that utilizes an iterative verification loop to synthesize numerically grounded reasoning traces. Spanning ten datasets across five domains, our evaluation confirms that this reasoning is causally effective; useful for evaluation; and prompting LLMs with our generated traces significantly improves forecasting accuracy compared to direct numerical prediction (e.g., avg. ～40.2% \rightarrow ～56.6%), validating the quality of our reasoning. Conversely, benchmarking experiments reveal that off-the-shelf LLMs consistently struggle with both reasoning (lower LLM-as-a-Judge scores) and numerical forecasting, frequently failing to capture domain-specific dynamics. TFRBench thus establishes a new standard for interpretable, reasoning-based evaluation in time-series forecasting.

Abstract:
Process reward models (PRMs) provide finegrained reward signals along the reasoning process, improving credit assignment beyond outcome-only rewards. Training reliable PRMs often relies on step annotations or heavy verification pipelines, making them expensive to scale and refresh during online RL. Implicit PRMs mitigate this cost by learning decomposable token/step rewards from trajectory-level outcome labels, but they suffer a train-inference mismatch: training only constrains a sequence-level aggregate, while deployment queries token-level scores as local step quality. As a result, token credits are weakly identified and can become predictive of final success without faithfully reflecting which step is correct. This unreliability can even undermine a key promise of implicit PRMs—scoring many candidate tokens—because noisy per-token advantages may systematically reinforce incorrect continuations. We address this with a novel Implicit Prefix-Value Reward Model (IPVRM), which directly learns a prefix-conditioned value function estimating the probability of eventual correctness, and derives step signals via temporal-difference (TD) differences. IPVRM substantially improves step-verification F1 on ProcessBench. Building on these calibrated signals, we further propose Distribution-Level RL (DistRL), which computes TD advantages not only for sampled tokens but also for high-probability candidate tokens across the vocabulary, enabling dense counterfactual updates without additional rollouts. DistRL brings limited benefits with mis-calibrated implicit rewards, but consistently improves downstream reasoning once powered by IPVRM’s reliable prefix values.

Abstract:
Subsampled natural gradient descent (SNG) has been used to enable highprecision scientific machine learning, but standard analyses based on stochastic preconditioning fail to provide insight into realistic small-sample settings. We overcome this limitation by instead analyzing SNG as a sketch-and-project method. Motivated by this lens, we discard the usual theoretical proxy which decouples gradients and preconditioners using two independent mini-batches, and we replace it with a new proxy based on squared volume sampling. Under this new proxy the expectation of the SNG direction becomes equal to a preconditioned gradient descent step even in the presence of coupling, leading to (i) global convergence guarantees when using a single mini-batch of any size, and (ii) an explicit characterization of the convergence rate in terms of quantities related to the sketch-and-project structure. These findings in turn yield new insights into small-sample settings, for example by suggesting that the advantage of SNG over SGD is that it can more effectively exploit spectral decay in the model Jacobian. We also extend these ideas to explain a popular structured momentum scheme for SNG, known as SPRING, by showing that it arises naturally from accelerated sketch-and-project methods.

Abstract:
Fullduplex voice agents—systems that listen and speak simultaneously—are rapidly moving from research to production. However, existing evaluations address conversational dynamics and task completion in isolation. We introduce \tau-voice, a benchmark for evaluating voice agents on grounded tasks with real-world complexity: agents must navigate complex multi-turn conversations, adhere to domain policies, and interact with the environment. The framework extends \tau^2-bench into a novel voice agent benchmark combining verifiable completion of complex grounded tasks, full-duplex interaction, and realistic audio—enabling direct comparison between voice and text performance. A controllable and realistic voice user simulator provides diverse accents, realistic audio environments, and rich turn-taking dynamics; by decoupling simulation from wall-clock time, the user simulator can use the most capable LLM without real-time constraints. We evaluate task completion (pass@1) and voice interaction quality across 278 tasks: while GPT-5 (reasoning) achieves 80%, voice agents reach only 29--42% under clean conditions and 19--30% under realistic conditions with noise and diverse accents—a 50--61pp gap; qualitative analysis confirms 75--90% of failures stem from agent behavior, suggesting that observed failures primarily reflect agent behavior under our evaluation setup. \tau-voice provides a reproducible testbed for measuring progress toward voice agents that are natural, conversational, and reliable.

Abstract:
We study the finegrainded uniform convergence behavior of halfspaces beyond worst-case VC bounds. For inhomogeneous halfspaces in \mathbbR^d with d\ge 2, we show that standard first-order VC bounds are essentially tight: even consistent hypotheses can incur population error \Theta(d\log(n/d)/n), and in the agnostic setting the deviation scales as \sqrt\tau\log(1/\tau) at true error \tau. In contrast, homogeneous halfspaces in \mathbbR^2 exhibit a markedly different behavior. In the realizable case, every hypothesis consistent with the sample has error O(1/n). In the agnostic case, we prove a bandwise, log-free deviation bound on each dyadic risk band via a critical-wedge localization argument. Unioning over bands incurs only a \log\log n overhead, and we establish a matching lower bound showing this overhead is unavoidable. Together, these results give a fine-grained and nearly complete picture of uniform convergence for halfspaces, revealing sharp dimensional and structural thresholds.

Abstract:
Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides partlevel structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Our code and models will be released.

Abstract:
Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents. Recent work has further explored augmenting these agents with memory mechanisms to support longhorizon reasoning. However, these approaches typically operate at a coarse instance granularity, treating the entire problem-solving episode as the atomic unit of storage and retrieval. We empirically demonstrate that instance-level memory suffers from a fundamental granularity mismatch, resulting in misguided retrieval when tasks with similar surface descriptions require distinct reasoning logic at specific stages. To address this, we propose Structurally Aligned Subtask-Level Memory, a method that aligns memory storage, retrieval, and updating with the agent’s functional decomposition. Extensive experiments on SWE-bench Verified demonstrate that our method consistently outperforms both vanilla agents and strong instance-level memory baselines across diverse backbones, improving mean Pass@1 over the vanilla agent by +4.7 pp on average (e.g., +6.8 pp on Gemini 2.5 Pro). Performance gains grow with more interaction steps, showing that leveraging past experience benefits long-horizon reasoning in complex software engineering tasks.

Abstract:
Physicsinformed neural networks (PINNs) solve time-dependent partial differential equations (PDEs) by learning a mesh-free, differentiable solution that can be evaluated anywhere in space and time. However, standard space-time PINNs take time as an input but reuse a single network with shared weights across all times, forcing the same features to represent markedly different dynamics. This coupling degrades accuracy and can destabilize training when enforcing PDE, boundary, and initial constraints jointly. We propose Time-Induced Neural Networks (TINNs), a novel architecture that parameterizes the network weights as a learned function of time, allowing the effective spatial representation to evolve over time while maintaining shared structure. The resulting formulation naturally yields a nonlinear least-squares problem, which we optimize efficiently using a Levenberg-Marquardt method. Experiments on various time-dependent PDEs show up to 4× improved accuracy and 10× faster convergence compared to PINNs and strong baselines.

Abstract:
Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, highquality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based "fingerprints" from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.

Abstract:
Understanding physical transformations is fundamental for reasoning in dynamic environments. While Vision Language Models (VLMs) show promise in embodied applications, whether they genuinely understand physical transformations remains unclear. We introduceConservationBenchevaluatingconservation—whether physical quantities remain invariant under transformations. Spanning four properties with paired conserving/nonconserving scenarios, we generate 23,040 questions across 112 VLMs. Results reveal systematic failure: performance remains near chance with improvements on conservation tasks accompanied by drops on controls. Control experiments show strong textual priors favoring invariance, yet models perform worse with visual content. Neither temporal resolution, prompting, nor curated sampling helps. These findings show that current VLMs fail to maintain transformation-invariant representations of physical properties across dynamic scenes.

Abstract:
In this paper, we presentNEMO, a system that translatesNaturallanguage descriptions of decision problems into formalExecutableMathematicalOptimization implementations, operating collaboratively with users or autonomously. Existing approaches typically rely on specialized large language models (LLMs) or bespoke, task-specific agents. Such methods are often brittle, complex and frequently generating syntactically invalid or non-executable code. NEMO instead centers on remote interaction with autonomous coding agents (ACAs), treated as a first-class abstraction analogous to API-based interaction with LLMs. This design enables the construction of higher-level systems around ACAs that structure, consolidate, and iteratively refine task specifications. Because ACAs execute within sandboxed environments, code produced by NEMO is executable by construction, allowing automated validation and repair. Building on this, we introduce novel coordination patterns with and across ACAs, including asymmetric validation loops between independently generated optimizer and simulator implementations (serving as a high-level validation mechanism), external memory for experience reuse, and robustness enhancements via minimum Bayes risk (MBR) decoding and self-consistency. We evaluate NEMO on nine established optimization benchmarks. As depicted in Figure 1, it achieves state-of-the-art performance on the majority of tasks, with substantial margins on several datasets, demonstrating the power of execution-aware agentic architectures for automated optimization modeling.

Abstract:
We present a unified framework for quantifying the similarity between representations through the lens of usable information, offering a rigorous theoretical and empirical synthesis across three key dimensions. First, addressing functional similarity, we establish a formal link between stitching performance and conditional mutual information. We further reveal that stitching is inherently asymmetric, demonstrating that robust functional comparison necessitates a bidirectional analysis rather than a unidirectional mapping. Second, concerning representational similarity, we prove that reconstructionbased metrics and standard tools (e.g., CKA, RSA) act as estimators of usable information under specific constraints. Crucially, we show that similarity is relative to the capacity of the predictive family: representations that appear distinct to a rigid observer may be identical to a more expressive one. Third, we demonstrate that representational similarity is sufficient but not necessary for functional similarity. We unify these concepts through a task-granularity hierarchy: similarity on a complex task guarantees similarity on any coarser derivative, establishing representational similarity as the limit of maximum granularity: input reconstruction.

Abstract:
Retrievalaugmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce G^2-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide stepwise navigation for evidence completion. On VisDoMBench across five multimodal domains, G^2-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%). Code is available: https://anonymous.4open.science/r/D2-Reader-8526.

Abstract:
Graph Contrastive Learning (GCL) is a powerful selfsupervised learning framework that performs data augmentation through graph perturbations, with growing applications in the analysis of biological networks such as Gene Regulatory Networks (GRNs). The artificial perturbations commonly used in GCL, such as node dropping, induce structural changes that can diverge from biological reality. This concern has contributed to a broader trend in graph representation learning toward augmentation-free methods, which view such structural changes as problematic and should be avoided. However, this trend overlooks the fundamental insight that structural changes from biologically meaningful perturbations are not a problem to be avoided, but rather a rich source of information, thereby ignoring the valuable opportunity to leverage data from real biological experiments. Motivated by this insight, we propose SupGCL (Supervised Graph Contrastive Learning), a new GCL method for GRNs that directly incorporates biological perturbations from gene knockdown experiments as supervision. SupGCL is a probabilistic formulation that continuously generalizes conventional GCL, linking artificial augmentations with real perturbations measured in knockdown experiments, and using the latter as explicit supervision. On patient-derived GRNs from three cancer types, we train GRN representations with SupGCL and evaluate it in two regimes: (i) embedding space analysis, where it yields clearer disease-subtype structure and improves clustering, and (ii) task-specific fine-tuning, where it consistently outperforms strong graph representation learning baselines on 13 downstream tasks spanning gene-level functional annotation and patient-level prediction.

Abstract:
While current software agents powered by large language models (LLMs) and reinforcement learning (RL) can boost programmer productivity, their reliance on humancurated training data and environments creates a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training superintelligent software agents under minimal data assumptions. SSR requires only access to sandboxed repositories with source code and dependencies, no need for human-labeled is sues or test commands. Grounded in real-world codebases, a single LLM agent is trained via RL in a self-play setting to inject and repair increasingly complex bugs. The bugs are formally specified by test suite improvements proposed by the agent rather than natural language issue descriptions. On the SWE-bench Verified and SWE-Bench Pro benchmarks, SSR achieves clear self-improvement (+10.4 and +7.8 points) and consistently outperforms the human-data baseline throughout training, generalizing to natural language bug descriptions not seen in training. Overall, our results point toward a paradigm where agents autonomously gather extensive learning experiences from real software repositories, ultimately enabling superintelligent systems that exceed human capabilities in understanding, modifying, and creating software from scratch.

Abstract:
Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly endto-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture with heuristics to draw token boundaries, and also attempts to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees due to directly optimizing the problem of drawing discrete token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of this score function sufficiently to make it practicable. We demonstrate that the resultant method outperforms prior proposed straight-through estimates, both qualitatively and quantitatively at the 100 million parameter scale.

Abstract:
Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domainspecific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduceSciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we presentSciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we proposeSciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.

Abstract:
Continuous knowledge updating for pretrained large language models (LLMs) is increasingly necessary yet remains challenging. Although inference-time methods like In-Context Learning (ICL) and Retrieval-Augmented Generation (RAG) are popular, they face constraints in context budgets, costs, and retrieval fragmentation. Departing from these context-dependent paradigms, this work investigates a parametric approach using Low-Rank Adaptation (LoRA) as a modular knowledge memory. Although few recent works examine this concept, the fundamental mechanics governing its capacity and composability remain largely unexplored. We bridge this gap through the first systematic empirical study mapping the design space of LoRA-based memory, ranging from characterizing storage capacity and optimizing internalization to scaling multi-module systems and evaluating long-context reasoning. Rather than proposing a single architecture, we provide practical guidance on the operational boundaries of LoRA memory. Overall, our findings position LoRA as the complementary axis of memory alongside RAG and ICL, offering distinct advantages.

Abstract:
Precise spatial fidelity in Imageto-3D multi-instance generation is critical for downstream real-world applications. Recent work attempts to address this by fine-tuning pre-trained Image-to-3D (I23D) models on multi-instance datasets, which incurs substantial training overhead and struggles to guarantee spatial fidelity. In fact, we observe that pre-trained I23D models already possess meaningful spatial priors, which remain underutilized as evidenced by instance entanglement issues. Motivated by this, we proposeTIMI, a novelTraining-free framework forImage-to-3DMulti-Instance generation that achieves high spatial fidelity. Specifically, we first introduce an Instance-aware Separation Guidance (ISG) module, which facilitates instance disentanglement during the early denoising stage. Next, to stabilize the guidance introduced by ISG, we devise a Spatial-stabilized Geometry-adaptive Update (SGU) module that promotes the preservation of the geometric characteristics of instances while maintaining their relative relationships. Extensive experiments demonstrate that our method yields better performance in terms of both global layout and distinct local instances compared to existing multi-instance methods, without requiring additional training and with faster inference speed.

Abstract:
Partial differential equations (PDEs) are fundamental for modeling complex natural and physical phenomena. In many realworld applications, however, observational data are extremely sparse, which severely limits the applicability of both classical numerical solvers and existing neural approaches. While neural methods have shown promising results under moderately sparse observations, their inference efficiency at high resolutions is limited, and their accuracy degrades substantially in the extremely sparse regime. In this work, we propose the Di-BiLPS, a unified neural framework that effectively handle both forward and inverse PDE problems under extremely sparse observations. Di-BiLPS combines a variational autoencoder to compress high-dimensional inputs into a compact latent space, a latent diffusion module to model uncertainty, and contrastive learning to align representations. Operating entirely in this latent space, the framework achieves efficient inference while retaining flexible input–output mapping. In addition, we introduce a PDE-informed denoising algorithm based on a variance-preserving diffusion process, which further improves inference efficiency. Extensive experiments on multiple PDE benchmarks demonstrate that Di-BiLPS consistently achieves SOTA performance under extremely sparse inputs (as low as 3%), while substantially reducing computational cost. Moreover, Di-BiLPS enables zero-shot super-resolution, as it allows predictions over continuous spatial–temporal domains.

Abstract:
Context management enables agentic models to solve longhorizon tasks through iterative summarization of previous interaction histories. However, this process typically incurs substantial decoding overhead for the extra summarization tokens, which significantly affect the end-to-end response latency at deployment. In this paper, we introduce CoMem, a novel framework that decouples memory management from the primary agent workflow, enabling these processes to execute in parallel. We propose a k-step-off asynchronous pipeline that overlaps the memory model's summarization with the agent's inference, effectively masking the latency of context processing. To ensure robustness under this asynchronous setting, we introduce a reward-driven training strategy that aligns the memory model to capture sufficient statistics for the agent's decision-making. Theoretical analysis confirms that CoMem offers a superior efficiency-effectiveness trade-off compared to coupled architectures. Our extensive experimental results on SWE-Bench-Verified show that CoMem provides 1.4x latency improvements upon vanilla long-context solutions while preserving most of the performance. Furthermore, we demonstrate that these latency gains scale favorably with increased system throughput, offering a modular path forward for the independent optimization of agent reasoning and memory compression.

Abstract:
Guidance methods, e.g., classifierfree guidance (CFG) and auto-guidance (AG), have distinctively improved noise-to-data diffusion generation results. Recently, bridge models have been proposed, which present a data-to-data sampling process to exploit instructive information from clean prior representation, showing advantages on the tasks such as image-to-image translation. In this work, we design a custom guidance method for bridge models, named prior guidance (PG). Different from highlighting condition alignment (CFG) or score accuracy (AG), we training-freely construct an additional weak prior for the pre-trained bridge models, and extrapolate the estimation results to further encourage prior exploitation. Then, we analyze the underlying mechanism of prior exploitation in bridge process and design frequency-modulated prior guidance (FMPG), which tailors the guidance scale to low- and high-frequency bands coherent with bridge generative dynamics. Finally, considering the challenge of bridge models on image in-painting, we develop a cascaded guidance framework, CFG-FMPG, that first generates a coarse prior under global semantic condition and then refines it with FMPG, naturally fulfilling their complementary advantages along sampling trajectory. Experiments conducted on strong pre-trained bridge models, DDBM and DBIM, valid the consistent improvement achieved by our training-free design.

Abstract:
Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and taskspecific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.

Abstract:
Conformal Prediction (CP) provides a statistical framework for uncertainty quantification that constructs prediction sets with coverage guarantees. While CP yields uncontrolled prediction set sizes, Backward Conformal Prediction (BCP) inverts this paradigm by enforcing a predefined upper bound on set size and estimating the resulting coverage guarantee. However, the looseness induced by Markov's inequality within the BCP framework causes a significant gap between the estimated coverage bound and the empirical coverage. In this work, we introduce STBCP, a novel method that introduces a data-dependent transformation of nonconformity scores to narrow the coverage gap. In particular, we develop a computable transformation and prove that it outperforms the baseline identity transformation. Extensive experiments demonstrate the effectiveness of our method, reducing the average coverage gap from 4.20% to 1.12% on common benchmarks.

Abstract:
Recent work has identified a counterintuitive phenomenon termed “Hyperfitting", where finetuning Large Language Models (LLMs) to near-zero training loss on small datasets surprisingly enhances open-ended generation quality and mitigates repetition in greedy decoding. While effective, the underlying mechanism remains poorly understood, with the extremely low-entropy output distributions suggesting a potential equivalence to simple temperature scaling. In this work, we demonstrate that this phenomenon is fundamentally distinct from distribution sharpening; entropy-matched control experiments reveal that temperature scaling fails to replicate the diversity gains of hyperfitting. Furthermore, we falsify the hypothesis of static vocabulary reweighting, showing through ablation studies that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism. Layer-wise analysis localizes this effect to a “Terminal Expansion" in the final transformer block, where a substantial geometric expansion of the feature space (\Delta \mathrmDim \approx +80.8) facilitates the promotion of deep-tail tokens. And we introduce Late-Stage LoRA, a targeted fine-tuning strategy that updates only the final 5 layers, achieving robust generation with minimal parameter updates.

Abstract:
In semiconductor manufacturing, lithography projects circuit layouts onto silicon wafers through an optical mask. As circuit features shrink below the wavelength of light, optical diffraction causes the printed patterns to deviate from their intended layouts. Inverse Lithography Technology (ILT) addresses this challenge by generating optimized masks that enhance the fidelity of pattern transfer onto wafers. While ILT resembles an image synthesis task, its reliance on explicit physical metrics for mask evaluation limits the applicability of existing generative models. We introduce LithoGRPO, an ILT framework that integrates the flow‑matching paradigm with GRPO‑based reinforcement learning (RL) fine‑tuning, enabling efficient exploration of diverse masks for a given target layout. Unlike purely generative or optimization‑based approaches, RL in LithoGRPO exploits the explicitly defined, physics‑based reward function of ILT, enabling optimization under complex, process‑aware constraints. To the best of our knowledge, this is the first framework that unifies flow matching and RL for mask optimization. To improve RL sampling efficiency, we propose a fast shotcounting algorithm for manufacturability evaluation, achieving over 130× speedup while preserving the mask ranking of the traditional shot-count metric. Extensive experiments demonstrate that LithoGRPO achieves state‑of‑the‑art performance over both optimization‑based and learning‑based methods, while maintaining efficient mask generation.

Abstract:
Membership Inference Attacks (MIAs) aim to estimate whether a specific data point was used in the training of a given model. Existing stateof-the-art attacks typically rely on training multiple reference models to approximate the conditional score distribution for individual data points, which leads to significant computational overhead and limits their practical applicability. In this work, we propose a novel approach -- Bayesian Membership Inference Attack (BMIA), which performs conditional attack through Bayesian sampling. Specifically, we apply Laplace approximation to a single reference model to obtain a posterior over model parameters, enabling direct estimation of the conditional score distribution. Theoretically, we demonstrate that Bayesian sampling reduces intra-model variance, thereby improving attack power. This insight naturally motivates the multi-reference variant that further enhances performance when additional reference models are available. Extensive experiments across image, text, and tabular datasets indicate that our method achieves state-of-the-art performance in both effectiveness and efficiency.

Abstract:
A goalconditioned reinforcement learning agent acting in an environment will see a wealth of information throughout a trajectory, most of which is discarded when only considering the trajectory with respect to a single goal. All-goals learning, where each transition is used for learning off-policy with respect to every goal, allows agents to extract maximal information, however it is usually computationally infeasible when done via naive relabelling. This can be overcome by jointly outputting values and actions for every goal at once, allowing for efficient, parallel all-goals updates with a single pass through the network, in a process we call Learning Everything all at Once (LEO). We show that this approach significantly outperforms other methods on goal-conditioned Craftax and is competitive with existing baselines on continuous control environments, while achieving a 250x speed-up compared to all-goals relabelling. We hope that, by unlocking all-goals learning at scale, LEO can serve as a useful tool for RL practitioners in complex environments. We open source our code at https://anonymous.4open.science/r/CraftaxGC-D3E1.

Abstract:
Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chainof-Thoughts (CoTs). However, since long CoTs naturally contain trial and errors and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced through RLVR, which results in the over-thinking issues of LRMs. Previous attempts to resolve the overthinking issue of LRMs mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.

Abstract:
We introduce a multiscale Image Super Resolution (ISR) method building on recent advances in Visual Auto-Regressive (VAR) modeling. Recently, VAR models challenged the dominance of diffusion-based models by adopting a next-scale prediction paradigm. Specifically, VAR models iteratively estimate the residual in latent space between gradually increasing image scales, an approach that aligns perfectly with our target ISR task. Previous work taking advantage of this synergy, suffer from two main shortcomings. First, due to the limitations of the residual quantizers used in VAR models, they typically only generate images at a predefined fixed scale, i.e. they fail to map intermediate outputs to the corresponding image scales. Also, to achieve better performance, they rely on large backbones and either external VLM for guidance, or a large corpus of external carefully annotated data. To address both shortcomings, we introduce two novel components to the VAR training for ISR, aiming at increasing its flexibility and reducing its complexity. In particular, we introduce a) a Hierarchical Image Tokenization (HIT) approach with a multi-scale image tokenizer that progressively represents images at different scales while enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the (LR,HR) pair, encourages the transformer to produce the latter over the former. The resulting model can denoise the LR image and super-resolve at different upscale factors in a single forward pass, and achieves state-of-the-art results using a relatively small model (300M params vs ~1B params of VARSR), and without the need of external training data.

Abstract:
Large Language Models (LLMs) increasingly act as function call agents that invoke external tools to tackle tasks beyond their static knowledge. However, they typically invoke tools one at a time without a global view of task structure. As tools often depend on one another, this leads to error accumulation and poor scalability, particularly when scaling to hundreds or thousands of tools. To address these limitations, we propose NaviAgent, an explicit bilevel architecture that decouples task planning from tool execution through graph‑based modeling of tool relations. At the planning level, the LLM‑based agent decides whether to respond directly, clarify intent, or retrieve and execute a toolchain independent of inter‑tool complexity. At the execution level, a Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, steering the agent to compose scalable and robust invocation sequences. Incorporating feedback from real tool interactions, NaviAgent achieves closed‑loop alignment between planning and execution, enabling adaptive navigation in large‑scale tool ecosystems. Evaluations on APIBank and ToolBench show consistent improvements in task success rate (TSR), with TWNM boosting performance on complex tasks by up to 17 points. Further tests on 50 real APIs across 7 domains confirm a average 10% improvement in TSR over \alpha‑UMI with fewer steps and lower latency, demonstrating robust generalization under real world dynamics.

Abstract:
Epistemic uncertainty is often viewed as a reducible uncertainty that vanishes with increasing data. This perspective implicitly assumes parameter identifiability and equates epistemic uncertainty with predictive variability. In overparametrized neural networks, however, model parameters are typically nonidentifiable due to symmetries and redundant representations. As a consequence, substantial parameter uncertainty can persist even when the underlying function is fully identified. In this work, we analyze epistemic uncertainty through the lens of non-identifiability, characterize both discrete and continuous sources of residual uncertainty, and show that these can be measured using a variance-based decomposition. Focusing on one-hidden-layer ReLU networks, we thoroughly analyze the resulting posterior structure and validate our theoretical insights through empirical studies.

Abstract:
In this work, we address the critical yet underexplored challenge of symmetric multimodalto-multimodal (MM2MM) retrieval, where queries and contexts are interchangeable. Existing universal multimodal retrieval works struggle with this task, as they are constrained by the labeled asymmetric datasets used. We produce SOLAR (Self-supervised jOint LeArning for symmetric multimodal Retrieval), a novel two-stage self-supervised framework that leverages readily available unlabeled web-scale image-text pairs. Based on the observation that both semantic alignment and discrepancies exist between two modalities, in the first stage, we learn the intersection mask of image-text pair, allowing us to align intersection while preserving semantic of difference. In the second stage, the learned mask is further utilized to construct positive and hardnegative samples via masking different parts of image/text, which enable us to conduct self-supervised multimodal embedding learning. Complementing this framework, we present a new benchmark featuring high-quality human-verified positive and hard-negative pairs to evaluate symmetric MM2MM retrieval under realistic conditions, as well as the corresponding pipeline. Extensive experiments against ten SOTA methods show SOLAR surpasses the strongest supervised VLM by 7.08 points on this benchmark, with over 50x fewer model parameters and a 5x smaller embedding dimension. Code will be available upon publication.

Abstract:
Reinforcement learning with verifiable rewards (RLVR) can yield large reasoning gains from very few training instances, yet its strong sensitivity to which instances are used makes data selection a central bottleneck. Most existing selection pipelines rely on trainingtime optimization signals and/or require access to verifiable rewards or ground-truth answers over large candidate pools, which is costly and often infeasible in specialized domains. We study RLVR data selection in a setting where selection must be performed \emphbefore any RL training and \emphwithout labels or reward evaluation on the full pool. % We propose SHIFT, a one-shot, training-free selector based solely on inference-time hidden-state dynamics. For each candidate instance, SHIFT runs a single deterministic reasoning rollout and computes a \emphreasoning-induced representation shift (RIRS) as the start-to-end hidden-state delta. SHIFT uses the RIRS magnitude as a lightweight proxy for instance utility and enforces coverage via a quality-weighted farthest-first coreset procedure in an RIRS-augmented feature space, producing compact subsets that scale to large unlabeled pools. % Across mathematical reasoning and medical QA benchmarks under ultra-low budgets, SHIFT consistently outperforms training-free diversity and difficulty/uncertainty baselines, improving both in-domain accuracy and transfer to harder evaluation settings. Ablations show that RIRS-based coverage and quality-weighting contribute complementary gains, and analyses indicate that RIRS is not explained by simple input/output length statistics. Code will be released.

Abstract:
Dynamic Sparse Training (DST) methods train neural networks by maintaining sparsity while dynamically adapting the network topology. Despite the promise of reduced computation, DST methods converge significantly slower than dense training, often requiring comparable training time to achieve similar accuracy. We demonstrate both analytically and empirically that Batch Normalization (BN) adversely affects sparse training, and propose SparseOpt — a sparsityaware optimizer — to address this. Experiments on ResNet models across CIFAR-100 and ImageNet demonstrate consistently faster convergence and improved generalization with our proposed method. Our work highlights the limitations of current normalization layers in sparse training and provides the first systematic study of the interaction between Batch Normalization, sparse layers, and DST, taking a significant step toward making DST practically competitive with dense training.

Abstract:
We address the problem of finetuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions. Existing methods for RL fine-tuning of generative policies (e.g., diffusion policies) improve task performance but often collapse diverse behaviors into a single reward-maximizing mode. To mitigate this issue, we propose an unsupervised mode discovery framework that uncovers latent behavioral modes within generative policies. The discovered modes enable the use of mutual information as an intrinsic reward, regularizing RL fine-tuning to enhance task success while maintaining behavioral diversity. Experiments on robotic manipulation tasks demonstrate that our method consistently outperforms conventional fine-tuning approaches, achieving higher success rates and preserving richer multimodal action distributions.

Abstract:
Oscillatory neural signals such as electroencephalography (EEG) and local field potentials (LFPs) show phase relationships that coordinate communication across brain regions. Modern recordings capture hundreds of channels across many frequency bins, yet standard phase analyses are restricted to only a few variables. The Torus Graph (TG) model, an exponentialfamily distribution over phases whose univariate and pairwise potentials generalize von Mises distributions, infers principled structure among oscillations but models only static, undirected dependencies and is limited to ～ 100 variables because its score matching inference scales as \mathcalO(d^6). We introduce a stochastic score matching procedure that reduces the per-iteration cost to \mathcalO(d^2), enabling inference on datasets with thousands of variables. This scalable foundation supports analyses of 1,860 frequency-phase features from multi-electrode LFPs and enables two extensions previously inaccessible to TGs or classical circular statistics: (i) a TG-Hidden Markov Model capturing state-dependent phase-coupling changes (e.g., spindle-related states during sleep) and (ii) an autoregressive TG inferring directional interactions via transfer-entropy estimation. Applied to LFP recordings, these models reveal state-dependent phase-interaction patterns between wakefulness and NREM sleep. Together, they enable systematic, large-scale mapping of dynamic and directional phase relationships across brain and cognitive states.

Abstract:
LowRank Adaptation (LoRA) is the most widely adopted method for fine-tuning large language models. Notably, LoRA is inherently overparameterized: multiple pairs of low-rank factors can yield the same adapted weight matrix. We show—both theoretically and empirically—that these pairs exhibit significantly different condition numbers. As a result, converging to different loss minimizers directly impacts the convergence rate of LoRA. Building on this observation, we introduce Balanced Low-Rank Adaptation (BaLoRA), a variant of LoRA that projects iterates onto a balanced manifold. This manifold improves the conditioning of the loss landscape while preserving the adapted matrix. The projection step is computationally lightweight and integrates seamlessly into existing fine-tuning pipelines. Empirically, BaLoRA converges faster than standard LoRA and achieves superior performance across a range of fine-tuning tasks.

Abstract:
Detecting what has changed in an environment is essential for longterm autonomy, yet most change detection settings assume fixed viewpoints, mild misalignment, or only a few changed objects. We introduce Video-based Scene Change Detection (VSCD), which predicts a pixel-wise change mask for each query frame, given a reference and a query RGB video of the same indoor space recorded at different times under unconstrained camera motion. The two videos are not temporally synchronized, and many object instances may appear or disappear. To study this setting, we build a large-scale benchmark with over 1.1 million frames annotated with pixel-accurate change masks, together with a real-world test set for evaluating transfer beyond simulation. We propose a query-centric multi-reference model that learns temporal matching implicitly from change-mask supervision, aligns candidate reference features to the query via local patch correspondence, and fuses per-candidate change features using frame-level and patch-level confidence before decoding a high-resolution mask once per frame. Our approach achieves state-of-the-art performance against strong image- and video-based baselines, and we validate its real-world impact by deploying it on a mobile robot for two downstream applications—visual surveillance and object incremental learning.

Abstract:
Understanding complex interactions between brain regions is critical for early neurodegenerative disease classification such as Alzheimer’s Disease (AD) and Parkinson’s Disease (PD). While graphbased models are widely used to analyze brain networks, most existing approaches primarily focus on pairwise interactions between directly connected nodes, limiting their ability to capture higher-order dependencies across multiple regions. Although hypergraph-based methods have been proposed to model higher-order relations, many rely on predefined hyperedges or restrict learning to hyperedge weights, reducing flexibility and limiting their capacity to capture multi-resolution structural patterns. In this regard, we introduce an adaptive multi-scale hyperedge learning framework, i.e., MuHL, which constructs hierarchical node features and dynamically learns high-order interaction through continuous hyper-edge construction over multi-resolution graph signals. Extensive experiments on multiple brain network benchmarks demonstrate that MuHL consistently improves disease classification performance across different stages, and further identifies key regions of interest (ROIs) and their group-wise interactions from the learned hyperedges that are associated with disease progression, highlighting its potential as a powerful tool for brain network analysis with neurodegenerative disorders.

Abstract:
Tokenlevel reweighting is a simple yet effective mechanism for controlling supervised fine-tuning, but common indicators are largely one-dimensional: the ground-truth probability reflects downstream alignment, while token entropy reflects intrinsic uncertainty induced by the pre-training prior. Ignoring entropy can misidentify noisy or easily replaceable tokens as learning-critical, while ignoring probability fails to reflect target-specific alignment. RankTuner introduces a probability--entropy calibration signal, theRelative Rank Indicator, by comparing the rank of the ground-truth token with its expected rank under the predictiondiction distribution. The inverse indicator is used as a token-wiseRelative Scaleto reweight the fine-tuning objective, focusing updates on truly under-learned tokens without over-penalizing intrinsically uncertain positions. Experiments on multiple backbones show consistent improvements on mathematical reasoning benchmarks, transfer gains on out-of-distribution reasoning, and pre code generation performance over probability- or entropy-only reweighting baselines.

Abstract:
Unsupervised video domain adaptation (UVDA) is a practical but underexplored problem. In this paper, we propose a frustratingly easy UVDA method, called \emphMetaTrans. Specifically, \emphMetaTrans adopts a concise learning objective that contains only two fundamental loss terms. Despite the simplicity of the learning objective, \emphMetaTrans embodies an advanced UVDA idea, that is, handling the spatial and temporal divergence of cross-domain videos separately, through a subtle model architecture design. By implementing a temporal-static subtraction module, \emphMetaTrans effectively removes spatial and temporal divergence. Extensive empirical evaluations, particularly on various cross-domain action recognition tasks, show substantial absolute adaptation performance enhancement and significantly superior relative performance gain compared with state-of-the-art UVDA baselines.

Abstract:
Selfcorrection in language models remains elusive. In this work, we explore whether language models can explicitly localize errors in incorrect reasoning, as a path toward building AI systems that can effectively correct themselves. We introduce a prompting method that structures reasoning as discrete, semantically coherent thought steps, and show that models are able to reliably localize errors within this structure, while failing to do so in conventional, unstructured chain-of-thought reasoning. Motivated by how the human brain monitors errors at discrete decision points and resamples alternatives, we introduce Iterative Correction Sampling of Thoughts (Thought-ICS), a self-correction framework. Thought-ICS iteratively prompts the model to generate reasoning one discrete and complete thought at a time—where each thought represents a deliberate decision by the model—creating natural boundaries for precise error localization. Upon verification, the model localizes the first erroneous step, and the system backtracks to generate alternative reasoning from the last correct point. When asked to correct reasoning verified as incorrect by an oracle, Thought-ICS achieves 20-40% self-correction lift. In a completely autonomous setting without external verification, it outperforms contemporary self-correction baselines.

Abstract:
Current fMRI decoders face a performancefidelity trade-off where efficient ID encoders outperform geometrically-aligned surface-based models. We argue this is an artifact of inefficient surface tokenization and the failure to use anatomy as a predictive signal. We presentNeurIPS, a framework that improves surface-based decoding by reframing anatomical variation from a nuisance to a powerful inductive prior. NeurIPS unites two innovations: aSelective ROI Spherical Tokenizer (SRST)for efficient geometric encoding, and aGuided Mixture of Experts (SG-MoE)that explicitly models individual anatomy using cortical features. On the Natural Scenes Dataset, NeurIPS establishes a new state-of-the-art for surface decoders and achieves performance comparable to strong 1D baselines. This is achieved with unprecedented efficiency, as the model converges dramatically faster (10 vs. 600 epochs). This efficiency enables rapid adaptation to new subjects using only20%of data and remains stable when scaling the training cohort (4 to 8 subjects). Ablations provide evidence that these gains are driven by the model's use of cortical features, not by memorizing subject IDs. By leveraging anatomical priors, NeurIPS provides a principled and scalable path toward robust, generalizable brain decoding.

Abstract:
Reinforcement learning (RL) has demonstrated potential for enhancing reasoning in large language models (LLMs). However, effective RL training, which requires mediumdifficulty training samples, faces two fundamental challenges: Effective Data Scarcity and Dynamic Difficulty Shifts, where medium-difficulty samples are scarce and become trivial as models improve. Existing methods mitigate this scarcity to some extent by generating training samples. However, these approaches suffer from anchor-free generation, ignoring co-evolution, and difficulty mismatch. To address these issues, we propose D²Evo, a Dual Difficulty-aware self-Evolution RL framework. In each iteration, our method mines medium-difficulty anchors based on the current Solver's capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. Extensive experiments demonstrate that D²Evo outperforms existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples, and exhibits strong generalization on general reasoning benchmarks.

Abstract:
Optimal transport (OT) has emerged as a fundamental tool in modern machine learning, yet its computational cost remains a significant bottleneck for largescale applications. While harnessing the massive parallelism of modern GPU hardware is critical for efficiency, the de facto standard Sinkhorn algorithm, despite its ease of parallelization, often suffers from slow convergence in challenging problems. More recently, the sparse-plus-low-rank quasi-Newton method offers a balance between convergence rate and per-iteration complexity; however, its efficiency on GPUs is severely hindered by the serial nature of sparse matrix symbolic analysis and irregular memory access patterns. To bridge this gap, we present cuRegOT, a high-performance GPU solver tailored for entropic-regularized OT. We introduce a suite of algorithmic and architectural optimizations, including an amortized symbolic analysis strategy to mitigate CPU bottlenecks, an asynchronous Sinkhorn iterates generation mechanism, and a fused kernel for bandwidth-efficient gradient evaluation. These strategies are backed by rigorous theoretical guarantees ensuring algorithmic convergence. Extensive numerical experiments demonstrate that cuRegOT achieves significant speedups over state-of-the-art GPU-based solvers across a variety of benchmark tasks.

Abstract:
Effective realworld assistance requires AI agents with robust Theory of Mind (ToM): inferring human mental states from their behavior. Despite recent advances, several key challenges remain, including (1) online inference with robust uncertainty updates over multiple hypotheses; (2) efficient reasoning suitable for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges by introducingMindZero, a self-supervised reinforcement learning framework that trains language models to perform efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of observed actions estimated by a planner, similar to model-based ToM reasoning. This method thus eliminates the need for explicit mental state annotations. After training,MindZerointernalizes model-based reasoning, and performs mental inference in a single forward pass at test time. We evaluateMindZeroin four challenging mental reasoning and AI assistance domains.MindZeromatches the robustness of explicit model-based methods while significantly accelerating inference, outperforming state-of-the-art methods by a large margin. These results demonstrate that mental reasoning can be learned as a self-supervised skill, bridging the gap between robustness and efficiency in ToM modeling.

Abstract:
While Large VisionLanguage Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting beyond training distributions. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: object individuation, abstract magnitude representation, and symbolic decoding. Using a controlled environment of synthetic Go game boards, we isolate the specific mechanism of failure. Contrary to the hypothesis that models suffer from perceptual errors, we demonstrate via linear probing that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the Symbolic Decoding stage, where the model fails to project valid visual magnitudes onto discrete tokens. Our findings support a Fractured Magnitude Hypothesis: VLMs fail to acquire a Universal Number Space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen pairings. We validate our findings on the state-of-the-art foundation model, suggesting that bridging the extrapolation gap requires inductive priors that enforce unified magnitude representations rather than simply scaling training data.

Abstract:
Split Federated Learning (SFL) enables privacypreserving collaborative training by partitioning models between clients and a server. However, under non-IID data distributions, SFL often suffers from biased optimization and unstable convergence, while existing solutions largely adapt techniques from conventional federated learning. In this work, we observe that the split architecture of SFL inherently alters how client information is represented and coordinated, opening opportunities for bias compensation beyond parameter-level aggregation. Based on this insight, we propose BESplit, an architecture-aware framework that exploits the intrinsic structure of SFL to mitigate non-IID effects. First, to prevent biased local data from dominating global updates, we introduce Evidential Aggregation (EA) to perform fine-grained reweighting of client contributions based on evidential uncertainty. Second, to further reduce distributional skew, we develop Bias-Compensated Collaboration (BCC) to align split-layer representations by pairing complementary clients. Finally, Dual-Teacher Distillation (DTD) is incorporated to synchronize knowledge between decoupled client and server models, enabling independent local inference. Extensive experiments on five benchmark datasets demonstrate that BESplit consistently outperforms state-of-the-art methods in accuracy, convergence stability, and computational efficiency under diverse non-IID settings.

Abstract:
We study the problem of learning Nash equilibria in offline twoplayer zero-sum Markov games. While existing approaches often rely on explicit pessimism to address distribution shift, we show that KL regularization alone suffices to stabilize learning and guarantee convergence. We first introduce Regularized Offline Sequential Equilibrium (ROSE), a theoretical framework that achieves a fast \widetilde\mathcalO(1/n) convergence rate under unilateral concentrability, improving over the standard \widetilde\mathcalO(1/\sqrtn) rates in unregularized settings. We then propose Sequential Offline Self-play Mirror Descent (SOS-MD), a practical model-free algorithm based on least-squares value estimation and iterative self-play updates. We prove that SOS-MD attains the same \widetilde\mathcalO(1/n) statistical rate with a linear iteration complexity.

Abstract:
Autoregressive language models frequently degrade during longhorizon generation, producing repetitive text, losing instruction adherence, and exhibiting unstable entropy. Despite the prevalence of these failures, practitioners lack online diagnostics to detect them in real time as they occur. We formalize this degradation as cognitive fatigue, a measurable generation-time state characterized by decay in attention to the original prompt, representational drift, and entropy miscalibration. We introduce the Fatigue Index (FI), a lightweight, model-agnostic diagnostic that aggregates these three signals under explicit axioms (monotonicity, boundedness, interpretability), enabling reliable runtime monitoring. Across nine models (1B–13B parameters), FI trajectories exhibit structured temporal dynamics, predict task degradation (AUROC = 0.95) and repetition (ρ = 0.94), and reveal non-monotonic scaling behavior: instruction-tuned models below 3B exhibit faster collapse than base models, with this trend reversing at 7B. Stress analyses further show that FI onset accelerates under longer contexts, middle-positioned evidence, and reduced numerical precision. These results establish cognitive fatigue as a coherent and measurable phenomenon, and position FI as a principled tool for runtime reliability monitoring in production LLM systems.

Abstract:
Comparative evaluation ofmultipledynamic treatment policies is essential for healthcare and policy decisions, yet conventional longitudinal causal inference methods estimate each inisolation, preventing information sharing across counterfactuals. We demonstrate that this separate estimation paradigm induces a structurally uncontrolled secondorder bias, inflating finite-sample variance even after standard debiasing withlongitudinal targeted maximum likelihood estimation(LTMLE). To address this, we propose a policy-aware reparameterization ofIterative Conditional Expectation(ICE) Q-functions that enables joint estimation through shared representations. We implement this approach in thePolicy-Encoded Q Network (PEQ-Net), an architecture centered on a shared policy encoder. The encoder is trained using kernel mean embeddings, ensuring that the learned representation space reflects population-level policy dissimilarities. After applying an LTMLE correction step, we prove this design imposes a structural constraint on the second-order remainder, thereby stabilizing finite-sample variance. Experiments on semi-synthetic datasets demonstrate that PEQ-Net consistently outperforms existing ICE-based methods, reducing root-mean-square error by 13–62%.

Abstract:
For many realworld systems, causal ground truth is difficult to obtain, making claims about causal effects hard to assess. We develop methods for evaluating collections of \binomn2 bivariate causal statements over a set of n variables. In the setting of acyclic linear statements, any such collection can be extended to a unique multivariate causal model, but we argue that this induced model is implausible if it imposes substantial additional confounding to explain observed correlations. We introduce a compatibility score that quantifies this notion of plausibility, notably without relying on the faithfulness assumption. Additionally, we define an incompatibility score for purely graphical bivariate causal statements, based on global consistency constraints that are derived from acyclicity and faithfulness assumptions. We give theoretical and empirical evidence that both scores can successfully distinguish correct from incorrect causal statements in generic settings. Moreover, we demonstrate the practical applicability of our methods by analyzing causal claims made by large language models. Our work aims to provide a foundation for assessing the reliability of causal information derived from human experts or artificial intelligence in settings where alternative forms of validation are unavailable.

Abstract:
The adversarial robustness of attributions is a fundamental requirement for reliable explainability in deep learning, yet existing approaches typically rely on computationally expensive explicit regularization. In this work, we show that attribution robustness can arise implicitly from the learning dynamics of standard stochastic gradient descent. We theoretically motivate this effect through connections between parameterspace and input-space curvature, and validate it across architectures, datasets, and attribution methods, with negligible computational overhead. In contrast, we prove that such robustness gains often does not transfer to attention-based attribution under softmax normalization, due to inherent entropy constraints, and we validate this limitation experimentally. Finally, we show that replacing softmax attention with kernel-based attention restores the robustness gains in transformer models. Our results highlight learning dynamics as a principled and practical mechanism for robust explainability, and reveal fundamental limitations of attention-based attribution under normalization.

Abstract:
Testtime reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels constructed by majority voting. To reduce overhead and improve exploration, prior work introduces tree-structured rollouts, which share reasoning prefixes and branch at key nodes to improve sampling efficiency. However, this paradigm still faces two challenges: (1) high-entropy branching can trigger rollout collapse, where the branching budget concentrates on a few trajectories with consecutive high-entropy segments, rapidly reducing the number of effective branches; (2) early pseudo-labels are noisy and biased, which can induce self-reinforcing overfitting, causing the policy to sharpen prematurely and suppress exploration. To address these issues, we propose Entropy–Confidence Hybrid Group Relative Policy Optimization (ECHO). During rollout, ECHO jointly leverages local entropy and group-level confidence to adaptively control branch width, and further introduces online confidence-based pruning to terminate persistently low-confidence branches, avoiding high-entropy traps and mitigating collapse. During policy updates, ECHO employs confidence-adaptive clipping and an entropy–confidence hybrid advantage shaping approach to enhance training robustness and mitigate early-stage bias. Experiments demonstrate that ECHO achieves consistent gains on multiple mathematical and visual reasoning benchmarks, and generalizes more effectively under a limited rollout budget.

Abstract:
LowRank Adaptation (LoRA) has emerged as a promising paradigm for Continual Learning. It independently updates its low-rank factors (A and B), creating a composite update to the full weight matrix through their interaction. To prevent catastrophic forgetting, this update should remain orthogonal to the task-specific subspace that contains previously learned knowledge. However, we identify that this composite update systematically violates this orthogonality, reintroducing interference and undermining stability. Furthermore, naively enforcing this orthogonality compromises plasticity, disrupting the delicate stability-plasticity trade-off. To resolve these issues, we propose Janus-LoRA, a framework that restores this balance through two novel components. Specifically, we first introduce Gradient Rectification, a closed-form solution that mathematically decouples LoRA's factor updates, enforcing orthogonality against the historical knowledge subspace identified by an efficient Online Estimation. Next, to enhance plasticity, we introduce a Decoupled Margin Loss that promotes feature-level separation by pushing new feature representations away from old ones, thus creating distinct, low-interference regions for new learning. Comprehensive experiments on challenging benchmarks demonstrate that by harmonizing parameter-level orthogonality with feature-level separation, Janus-LoRA achieves a superior balance and establishes new state-of-the-art performance.

Abstract:
Alignment plays a fundamental role in many machine learning problems, such as multinetwork analysis, multimodal learning, and point cloud registration. Recent works increasingly leverage optimal transport (OT) for distributional alignment, whose effectiveness largely depends on sparse supervision that is hard or costly to obtain in practice. Existing works, however, largely overlook how to actively acquire high-quality supervision to improve their alignment performance under OT frameworks. In this paper, we propose a principled active alignment framework for optimal transport alignment called AvAtar. We quantify the informativeness of a candidate by measuring its gradient-based impact on the global alignment result, computed as the gradient propagation from the global alignment result to all possible supervisions of the candidate through the entropy-regularized OT formulation. While differentiating through OT is challenging given its constrained nature, we leverage the adjoint-state method to reformulate the computation to a linear system solvable by the conjugate gradient method with linear complexity and guaranteed convergence. By encoding the global alignment result via effective utility functions, AvAtar is applicable to general alignment problems under the OT framework. Extensive experiments on three representative alignment tasks demonstrate the effectiveness, scalability, and generalizability of the proposed AvAtar.

Abstract:
Editing complex, longform knowledge in Large Language Models remains a significant challenge due to the difficulty of maintaining generation coherence. Existing autoregressive methods like AnyEdit alleviate length constraints but rely on Fixed-window Chunking, which disregards logical structure and compromises consistency. To address this, we present AnyEdit++, a structure-aware framework incorporating Bayes-Chunk, an adaptive segmentation mechanism that dynamically identifies semantic boundaries based on Bayesian Surprise. We underpin this approach with a theoretical framework establishing two key principles: (1) Structural Independence: we prove that cross-segment interference is minimized when anchor keys are geometrically orthogonal (a condition naturally satisfied by our surprisal-based boundaries but violated by fixed windows), and (2) Causal Locality: we demonstrate that updates injected at these semantic peaks yield strictly superior control compared to arbitrary split points. Extensive experiments across mathematical reasoning, code generation, and narrative tasks demonstrate that AnyEdit++ achieves superior performance and robustness compared to state-of-the-art baselines, validating that structural awareness is critical for effective long-form knowledge editing.

Abstract:
Selfsupervised pre-training methods in medical imaging typically treat each individual as an isolated instance, learning representations through augmentation-based objectives or masked reconstruction. They often do not adequately capitalize on a key characteristic of physiological features: anatomical structures maintain consistent spatial relationships across individuals (instances), such as the thalamus being medial to the basal ganglia, regardless of variations in brain size, shape, or pathology. We propose leveraging this cross-instance topological consistency as a supervisory signal. The challenge arises from the inherent variability in medical imaging, which can differ significantly across instances and modalities. To tackle this, we focus on two alignment regimes. (i) Intra-instance: with pixel-level correspondences available, a cross-modal triplet objective explicitly preserves local neighborhood topology. (ii) Inter-instance: without such supervision, we derive pseudo-correspondences to control partial neighborhood alignment and prevent topology collapse across modalities. We validate our approach across 7 downstream multi-modal tasks, achieving average improvements of 1.1% and 5.94% in segmentation and classification tasks, respectively, and demonstrating significantly better robustness when modalities are missing at test time.

Abstract:
The performance of sparse direct solvers is fundamentally governed by fillin, i.e. new nonzero entries arising from the LU factorization of a sparse matrix, as they dictate memory footprint and subsequent computation time. For decades, a variety of graph-theoretic algorithms have aimed to minimize fill-in, a problem known to be both NP-hard and critically important. While recent deep learning methods, optimizing surrogate fill-in objectives, show empirical promise and can outperform classical algorithms on certain matrices, they offer limited interpretability into the underlying mechanism of fill-in generation. To address this, we propose a novel reordering approach, Causal Triplet Structure Learning (CTS), which is grounded in the Fill-Path Theorem and reduces arbitrary-length fill-paths to length-two candidate triplets, identifies the causal structures that trigger fill-in, and intervenes to block their formation. Empirically, we design a multigrid-style GAT with KAN activations to learn vertex embeddings and introduce a causal triplet loss that discourages such structures during training. Experiments on the SuiteSparse Matrix Collection demonstrate that our method reduces fill-in by 6×, leading to 12× speedup in factorization time compared to state-of-the-art methods on Chemical Process Simulation and Computational Fluid Dynamics matrices.

Abstract:
Deep neural networks learn representations where individual features often lack interpretable meaning; a single neuron may activate for scattered, unrelated inputs. We introduce coherence, a geometric property inspired by neural coding in the brain, where neurons like grid cells and head direction cells respond to contiguous regions of state space. A nonnegative matrix is coherent if both each row (sample) attends to geometrically clustered columns (features) and, vice versa, each feature attends to clustered samples. We prove that coherent matrices induce a bounded interleaving between the Vietoris-Rips filtrations of samples and features, guaranteeing that both spaces share compatible topological structure. This geometric constraint facilitates interpretability. For example, if data lies on a circle, coherent features must tile that circle into contiguous arcs. We introduce COH, a differentiable regularizer based on Fréchet variance that enforces coherence during training. Unlike sparsity, which bounds how many samples a feature activates on, coherence boundswhichsamples, requiring geometric connectivity rather than only rarity. This yields not just interpretable features but an interpretable feature space. We validate COH using synthetic and rotated MNIST datasets.

Abstract:
Toolcalling text-to-image (T2I) agents can plan and execute multi-step tool chains to accomplish complex generation and editing queries. However, this capability introduces a new safety attack surface: harmful outputs may arise from tool orchestration, where individually benign steps combine into unsafe results, making prompt-only jailbreak techniques insufficient. We present OrchJail, an orchestration-guided fuzzing framework for jailbreaking tool-calling T2I agents. Its core idea is to exploit high‑risk tool‑orchestration patterns: by learning from successful jailbreak tool-calling traces and their causal relationships to prompt wording, OrchJail directly guides the fuzzing search toward prompts that are more likely to trigger unsafe multi‑step tool behaviors, rather than relying on surface‑level textual perturbations. Extensive experiments demonstrate that OrchJail improves jailbreak effectiveness and efficiency across representative tool-calling T2I agents, achieving higher attack success rates, better image fidelity, and lower query costs, while remaining robust against common jailbreak defenses. Our work highlights tool orchestration as a critical, previously unexplored attack surface and provides a novel framework for uncovering safety risks in T2I agents.

Abstract:
Large Language Models (LLMs) show remarkable semantic understanding but often struggle with structural understanding when processing graph topologies in a serialized format. Existing solutions rely on training external graphbased adapters or fine-tuning, which incur high costs and lost generalizability. In this work, we investigate the internal mechanisms of LLMs and present a critical finding:LLMs spontaneously reconstruct the graph's topology internally, evidenced by a distinct "sawtooth" pattern in their attention maps that structurally aligns with the "token-level adjacency matrix". However, this intrinsic structural understanding is diluted by the attention sink. We theoretically formalize this dilution as a representation bottleneck, stemming from a fundamental conflict: the model's anisotropic bias, essential for language tasks, suppresses the isotropic information flow required by graph topology. To address this, we propose a training-free solution, namedStructuraLAttentionSHarpening (Slash), which amplifies this internal structural understanding via a plug-and-play attention redistribution. Experiments on pure graph tasks and molecular prediction validate Slash delivers significant and consistent performance gains across diverse LLMs.

Abstract:
Neural networks are typically trained with a single learning rate across all layers. While recent empirical evidence suggests that assigning layerspecific learning rates can accelerate training, a principled understanding of the conditions and mechanisms under which non-uniform learning rates are beneficial remains limited. In this work, we investigate non-uniform learning rates through the lens of Stackelberg optimization. Specifically, we demonstrate that training neural networks with a smaller learning rate for the body layers and a larger learning rate for the final layer can be interpreted as a two-time-scale alternating gradient descent algorithm applied to a Stackelberg reformulation of the original objective. We establish finite-time convergence guarantees for the algorithm under broad conditions that accommodate constraint sets and non-smooth activation functions. Beyond convergence, we identify two mechanisms by which non-uniform learning rates can outperform uniform learning rates: (i) we show that certain problem instances induce a Stackelberg objective with stronger optimization structure than the original objective, yielding faster convergence to globally optimal solutions, (ii) our numerical analysis reveals that the Stackelberg objective can exhibit substantially sharper local curvature, especially in early training, which leads to more informative gradients and learning acceleration. Experiments in supervised learning and reinforcement learning support our findings.

Abstract:
Recently, zeroshot 3D scene understanding via 2D Vision-Language Models (VLMs) has gained increasing research interest due to their promising spatial reasoning capabilities. Typically, multiple 2D views are sampled from a 3D point cloud and fed into pre-trained VLMs to answer a given question. This paradigm highlights the critical role of input context quality and raises the challenge of retaining as many task-relevant 3D details as possible under a limited input budget. We propose \textttKeyVT, a hierarchical approach for input context collection at both the view and token levels. Specifically, we combine pixel features with camera parameters and assess view importance based on both semantic content and geometric position, resulting in spatially consistent and task-relevant views. Furthermore, we address redundancy among patches across selected views by identifying representative tokens under the optimal transport (OT) framework, where view tokens and key tokens are formulated as two discrete distributions in the embedding space. These key tokens are expected to cover all view features by minimizing the OT distance. We evaluate our framework on three widely used benchmarks, demonstrating significant improvements over existing tuning-free methods and performance comparable to training-based approaches.

Abstract:
Testtime compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models, but their applicability to vision-language models (VLMs) remains unclear. We present a systematic study of TTC for visual reasoning across seven open-source VLMs and six benchmarks, revisiting two paradigms: (i) feature-based scoring of chain-of-thought (CoT) traces and (ii) confidence-based aggregation via majority voting (MV). In the single-model setting, feature cues (e.g., length, pivot words) fail to improve accuracy, while MV yields only modest, CoT-dependent gains. To explain this limitation, we theoretically show that the voting method’s effectiveness depends on prediction diversity: when outputs are highly correlated, the benefit of voting vanishes. In contrast, multi-model ensembles introduce stronger diversity through architectural differences, training data, and scale, making them both more realistic and more promising for TTC. However, MV treats all models equally, leaving it vulnerable to correlated errors from weaker models. To address this, we propose Entropy-based TTC, which selects the most confident prediction based on predictive entropy. Our method reduces to MV in the single-model case but, in ensembles, leverages confidence disparities to prioritize stronger models. We prove that our method theoretically outperforms MV under mild dependence assumptions, and empirically show that it consistently surpasses both MV and the best individual model across diverse visual reasoning benchmarks. This demonstrates that smaller models can enhance, rather than hinder, larger ones when combined appropriately, unlocking ensemble gains not achievable with existing TTC strategies.

Abstract:
Millimeterwave (mmWave) enables privacy-preserving, illumination-robust human pose estimation (HPE), with each mmWave frame represented as a range--angle--Doppler tensor, providing spatial magnitude for localization and Doppler signatures for motion-related cues. However, existing mmWave-based HPE methods either underutilize or naïvely fuse Doppler signatures with spatial magnitude, disregarding their distinct physical semantics. As a result, non-human Doppler signatures can be misinterpreted as human motion cues, leading to jittery trajectories. We propose PULSE, which converts Doppler signatures into confidence-aware motion prompts and injects them into spatial magnitude reasoning through constrained interactions. By screening Doppler prompts before they influence prediction, PULSE first suppresses spurious spectral motion cues and then uses the screened prompts to stabilize prediction. Across three datasets spanning single- and multi-person settings, PULSE consistently improves pose accuracy and temporal stability, indicating that controlled Doppler prompting is a practical direction for stable mmWave HPE. Codes are available in supplementary materials.

Abstract:
Generating physically consistent 3D tabletop scenes is a fundamental yet underexplored problem for interactive and generalist robotic learning. The challenge stems from dense object hierarchies and irregular affordances. Existing methods, ranging from decoupled symbolic solvers to endto-end regression models, often suffer from error propagation or overfitting to noisy supervision containing widespread physical violations. To address these limitations, we introduce PhyScene3D, a framework that reformulates generation as a Human-Mimetic Constructive Process. The proposed Cognitive Topological Reasoning Chain (CTRC) factorizes scene synthesis into a sequential, anchor-conditioned process under the novel 3D Axis-Aligned Bounding Box (3D AABB)-based placement scheme, thereby imposing a strong structural inductive bias. To address imperfect supervision and physical infeasibility, we introduce Physics-Aware Denoising Alignment (PADA), which integrates a differentiable Signed Distance Field (SDF) with Test-Time Optimization (TTO) to project generated scenes onto a physics-feasible manifold while preserving semantic intent. Experiments demonstrate that PhyScene3D outperforms state-of-the-art approaches in both semantic accuracy and physical validity, achieving a 40% reduction in collision rate relative to the human-annotated training data.

Abstract:
Calibration, the alignment of predicted probabilities with true outcome frequencies, is essential for reliable decisionmaking. While extensively studied for classification and regression, calibration has not been formally addressed for probabilistic label ranking, where the goal is to predict a distribution over orderings of a label set. Naively treating rankings as classes ignores their structure and fails to capture important modalities such as pairwise and top-k predictions. We formalize calibration for label ranking and develop a hierarchy of notions covering full rankings, sub-rankings, and top-k rankings. We prove that full-rank calibration implies the others but not conversely, and sub-ranking and top-k calibration are incomparable. Empirically, we find popular label ranking models are often poorly calibrated, with substantial differences between sub-ranking and top-k metrics. Applying our framework to RLHF reward models, we find that calibration correlates strongly but not perfectly with benchmark accuracy, suggesting it captures a meaningful quality dimension beyond top-1 accuracy. These findings motivate future work on understanding the downstream effects of miscalibration and developing methods to correct it.

Abstract:
Reliable watermarking of panoramic imagery is fundamentally challenged by arbitrary 3D rotations. As panoramas are defined on the sphere, they naturally transform under the action of SO(3), rendering conventional planar representations and augmentationbased robustness strategies inadequate and devoid of theoretical guarantees. To address this, we formulate panoramas as spherical signals and leverage SO(3) representation theory to derive provably rotation-invariant descriptors. While spherical harmonic coefficients transform equivariantly under rotations, the natural invariant constructions are typically limited to zeroth-order statistics which eliminate directional information and severely constrain embedding capacity. In this work, we introduce a principled third-order invariant construction by coupling higher-order SO(3) irreducible representations via tensor products and projecting onto the trivial representation. This yields a spherical invariant bispectrum that preserves phase information while remaining strictly rotation-invariant. Leveraging this property, we embed watermarks into higher-order spherical harmonic coefficients and recover them from invariant bispectral scalars, enabling reliable extraction under arbitrary 3D rotations. We provide a theoretical proof of SO(3) invariance for it and demonstrate experimentally its near-perfect robustness to continuous rotations while maintaining high visual fidelity.

Abstract:
Building on recent formalizations of root cause analysis for rare events (“outliers”) in structural equation models, we propose a formal definition of a causal pathway and discuss its testable implications. We identify conditions under which these implications depend only on a causal abstraction defined by the pathway of rare events, rather than on the full causal graph of the underlying system. Accordingly, we introduce an abstraction of causal structure to pathways of rare events that bridges simple verbal causal explanations and detailed causal modeling.

Abstract:
Offline cooperative multiagent reinforcement learning (MARL) faces unique challenges due to the distribution shift between online and offline data collection. While online MARL typically converges to a single coordinated joint policy, offline datasets are often mixtures of diverse cooperative behaviors, resulting in highly multimodal joint behavior distributions. In such settings, independent policy regularization often misaligns joint policy constraints and leads to severe distribution shift. To address this, we propose OMSD, which sequentially decomposes the joint behavior policy into individual conditional distributions and leverages diffusion-based generative models to provide modality-coordinated regularization for each agent. Combined with centralized critic guidance, OMSD achieves coordinated exploration within high-value, in-distribution regions, and avoids out-of-distribution joint actions. Experiments across multiple datasets on various continuous control tasks demonstrate that OMSD consistently achieves state-of-the-art performance, especially in challenging multimodal scenarios. Our results highlight the necessity of modality-aware coordination for robust offline MARL.

Abstract:
Emotional support plays an important role in dialogue systems, and its success depends on adapting to a user’s evolving and implicit needs across multiturn interactions while leveraging the strong reasoning capacity of large language models (LLMs). However, since user needs are often weakly supervised and can only be disambiguated through multi-turn back-and-forth, existing emotional support methods often struggle to acquire and generalize emotionally relevant conversational knowledge efficiently. To bridge this gap, we introduce User-aware active knowledge acquisition (UKA), a gradient-free active dialogue learning framework that explicitly represents uncertainty about user needs and incorporates active learning into both knowledge acquisition and response selection. We propose a Theory-of-Mind-inspired uncertainty estimation mechanism that allows the model to prioritize responses, thereby obtaining the greatest expected information gain. Our framework is capable of efficiently exploring user-aligned conversational knowledge during training while maintaining robustness at test time. Experiments across multiple dialogue benchmarks and model architectures demonstrate that our approach consistently outperforms strong baselines in dialogue quality and user alignment.

Abstract:
Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of work investigates how transformer language models (LMs) solve entity bindingwithoutstate changes; however, there is limited understanding of how nontoy LMs address ET problems of realistic difficulties expressed in natural language. To this end, we investigate the mechanisms underlying ET in more complex scenarios featuring multiple state-changing operations. We find that LMs do not build world states incrementally across tokens or layers, but simply retrieve and aggregate relevant information at the last token when the query becomes evident. We further investigate mechanisms of individual operations (PUT, REMOVE, MOVE) to elucidate how exactly tracking is implemented non-incrementally. Surprisingly, LMs implement the REMOVE operation with a fragile global suppression tag; we provide a mechanistic solution of nullifying this tag to partially address this issue. This global removal mechanism also predicts various additional failure modes that we confirm behaviorally. Our findings suggest directions for training and finetuning for more robust tracking mechanisms, and furthermore offer a mechanistic hypothesis for why chain-of-thought prompting improves ET.

Abstract:
Recent advances in Large Reasoning Models have significantly improved chainof-thought (CoT) capabilities via reinforcement learning (RL). However, generated reasoning chains frequently suffer from structural redundancy (i.e., \emphoverthinking), incurring high computational overhead without improving answer correctness. Existing mitigation strategies typically rely on token-uniform length penalties, which provide coarse, segment-agnostic pressure toward shorter outputs and can inadvertently suppress useful reasoning alongside redundancy. To address this, we demonstrate that inefficiency concentrates in high-probability segments with low marginal utility. We derive a theoretical characterization of segment suboptimality under the correctness-length trade-off objective and propose \textscSLAT (Segment-Level Adaptive Trimming), an RL framework that selectively suppresses redundant segments based on this criterion. Empirical results on standard benchmarks indicate that \textscSLAT establishes a superior accuracy-efficiency Pareto frontier, reducing reasoning length by 50% relative to uncompressed baselines while maintaining competitive accuracy. Overall, our results suggest that theoretically grounded, segment-aware trimming is a promising direction for efficient CoT reasoning in large language models.

Abstract:
Large language models (LLMs) can memorize sensitive facts, motivatingunlearningmethods that remove targeted knowledge without costly retraining. However, unlearning research remains heavily Englishcentric. We study multilingual unlearning by extending the TOFU benchmark to five languages, and fine-tune, unlearn and query our models with different permutations of languages. We find that unlearning transfer -- the ability of an unlearned model to "forget" facts in languages other than the unlearning language -- is highly variable: e.g., it is strongest between languages sharing scripts and families, and we show that theunlearning languagepredicts whichquery languagesare most likely to yield the strongest transfer. Layer-wise analysis reveals that unlearning leaves the shared cross-lingual latent space largely intact in early layers, instead operating primarily in later decoding layers. This suggests that unlearning does not truly erase knowledge, but rather induces superficial suppression. Exploiting this structure, a single inference-time steering direction reverses much of this suppression across languages, recovering 50% (Qwen) and 90% (Gemma) of the unlearned knowledge.

Abstract:
Extreme multilabel classification (XMC) involves learning deep learning models over large output spaces with millions of labels, making the output layer of the network a major bottleneck in memory and compute. While sparsity-based methods reduce arithmetic complexity, they often fail to yield proportional wall-clock gains due to irregular memory access, poor hardware utilization, or reliance on auxiliary architectural components in extreme long-tailed regimes. We introduce group-shared fixed fan-in sparsity, a semi-structured output-layer design in which groups of semantically related labels share a common sparse input pattern while retaining independent weights. This grouping introduces a task-aligned inductive bias---encouraging related labels to attend to similar feature subsets---while simultaneously reducing index memory overhead, increasing feature reuse across labels, and enabling efficient GPU execution via custom CUDA kernels that leverage modern accelerator primitives. As an alternative to auxiliary objectives, we exploit the long-tailed structure of XMC datasets by decomposing the output layer into a small dense head over frequent labels and a group-shared sparse tail over the remainder, providing an informative gradient pathway while preserving the memory benefits of sparsity. Through kernel-level microbenchmarking, we show that group-shared fixed fan-in converts reductions in arithmetic complexity into proportional wall-clock gains, achieving up to 4.4× speedup in the forward pass and up to 25× speedup in backward passes compared to standard fixed fan-in sparsity, while operating within a few percent of a FLOPs-matched dense bottleneck. Across large-scale XMC benchmarks, our approach matches or improves precision@k compared to prior sparse baselines, while substantially narrowing the performance gap to dense.

Abstract:
Social simulation provides a compelling testbed for studying social intelligence, where agents interact through multiturn dialogues under evolving contexts and strategically adapting opponents. Such environments are inherently non-stationary, requiring agents to dynamically adjust their strategies over time. However, most Large Language Model (LLM) based social agents rely on static personas, while existing approaches for enhancing social intelligence, such as offline reinforcement learning or external planners, are ill-suited to these settings, typically assuming stationarity and incurring substantial training overhead. To bridge this gap, we proposeALSO(Adversarial onLineStrategyOptimization), the first framework for online strategy optimization in multi-agent social simulation. ALSO advances social adaptation through two key contributions. (1) ALSO formulates multi-turn interaction as an adversarial bandit problem, where combinations of static personas and dynamic strategy instructions are treated as arms, providing a principled solution to non-stationarity without relying on environmental stability assumptions. (2) To predict rewards and generalize sparse feedback in multi-turn dialogues, ALSO introduces a lightweight neural surrogate to predict rewards from interaction histories, enabling sample-efficient exploration and continuous online adaptation. Experiments on the Sotopia benchmark demonstrate that ALSO consistently outperforms static baselines and existing optimization methods in dynamic environments, validating the effectiveness of adversarial online strategy optimization for building robust social agents. The codes of ALSO are available at https://anonymous.4open.science/r/ALSO-67D5/

Abstract:
Selfimprovement training enables the large reasoning models (LRMs) to improve themselves by self-generating reasoning trajectories as training data without external supervision. However, we find that this method often falls short in complex reasoning tasks and even leads to model collapse. Through a series of preliminary analyses, we reveal two problems: (1) data imbalance, where most training samples are simple, but the challenging yet crucial samples are scarce; (2) overthinking, where many undesired samples with redundant reasoning steps are used for self-training. To this end, we propose HSIR, which effectively Harnesses Self-Improvement in large Reasoning models via two simple-yet-effective approaches. Specifically, HSIR introduces a verify-then-exit sampling strategy to mitigate data imbalance by efficiently collecting more accurate solutions for difficult queries, and designs an Intrinsic Diversity score to quantify overthinking and filter out the undesired solutions. We apply HSIR to various post-training paradigms, among which we further propose H-GRPO, an enhanced GRPO algorithm that leverages the intrinsic diversity as an external reward to encourage concise and diverse reasoning via reinforcement learning. Extensive results show that HSIR not only effectively enhances the reasoning performance, i.e., bringing up to +10.9% average performance gains, but also significantly improves the reasoning efficiency by reducing up to 42.4% relative inference overhead.

Abstract:
Many organizations lack computational resources to finetune large language models (LLMs) on private (unshareable) data for better utility, while fine-tuning tiny language models (TinyLMs) alone performs poorly. To address this bottleneck, we propose a data-free knowledge distillation framework that generates LLM update vectors based on TinyLMs fine-tuned on private data. An update vector is a vector of parameter changes from an initial model to its fine-tuned version on a dataset, capturing the effect of cumulative gradient steps during fine-tuning. The key idea of our framework is a novel Gradient Transformer that transforms TinyLM's update vectors into LLM's update vectors. As derived from shadow datasets, \textttGrad-Transformer captures the correlation between TinyLM and LLM update vectors, enabling third-party providers to generate LLM update vectors given the organization's TinyLM update vectors without accessing the organization's private data. The framework supports multi-organization collaboration to jointly update LLMs, improving performance and cost-efficiency. Extensive experiments across language modeling and reasoning tasks show that \textttGrad-Transformer remarkably outperforms state-of-the-art knowledge distillation baselines, even under strict differential privacy protection.

Abstract:
Epistemic uncertainty quantification (UQ) for deep neural networks (DNNs) is a requirement for safe adoption of AI in missioncritical settings. Several leading methods for UQ linearize DNNs to form Bayesian Generalized Linear Models (GLMs), where epistemic uncertainty is modeled via the predictive posterior distribution. Linearizing around the parameters of thefinal connected layerof a DNN is a commonly used approximation for reducing the computational burden of such GLMs, though it is often believed to come at the cost of degraded performance. In this work, we compare GLMs arising from full-network and last-layer linearization using both theoretical and empirical approaches. We first employ tools from random matrix theory to conduct a theoretical comparison; this analysis reveals no meaningful improvement in the UQ capabilities of full linearization. Coupled with a large-scale empirical evaluation across a range of modern machine learning tasks, we arrive at the following conclusion: a last-layer approximation yields comparable UQ performance while offering substantially improved computational efficiency.

Abstract:
Keyvalue (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability. Existing solutions either restrict reuse to prefixes or require expensive memory materialization for positional re-encoding. We introduce LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By adjusting positional encoding within attention kernels on-the-fly, LazyAttention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions. Leveraging attention kernels tailored for prefilling and decoding, our system achieves significant efficiency improvements: under skewed document distributions, it reduces time-to-first-token (TTFT) by 1.37× and increases inference throughput by 1.40× compared to the state-of-the-art Block Attention, while maintaining comparable output quality.

Abstract:
Neural networks often map lowdimensional embeddings to high-dimensional output spaces. Usually, the output layer is linear, which can create arank bottleneckthat limits the functions a model can represent. Such bottlenecks are ubiquitous in link prediction models, such as knowledge graph embeddings (KGEs), as the output space of entities can be orders of magnitude larger than the embedding dimension. We investigate how rank bottlenecks limit model expressivity for fitting the training data. While previous work focused on sufficient bounds on the embedding dimension required for specific KGEs, we show necessary bounds forallKGEs with a linear output layer, which grow with graph size and connectivity. We also consider a non-linear output layer using mixtures to break the bottleneck without significant parameter overhead. Empirically, we show that models using this non-linear layer improve in ranking performance and probabilistic fit for large and dense datasets at a low parameter cost, as predicted by our theory. Our work reveals how linear output layers limit KGEs and motivates non-linear alternatives for scaling to large and dense graphs.

Abstract:
The training of Binary Neural Networks (BNNs) is fundamentally based on gradient approximation for nondifferentiable binarization operations (e.g., sign function). However, prevailing methods including the Straight-Through Estimator (STE) and its improved variants, rely on hand-crafted designs that suffer from gradient mismatch problem and information loss induced by fixed-range gradient clipping. To address this, we propose SURrogate GradiEnt Adaptation (SURGE), a novel learnable gradient compensation framework with theoretical grounding. SURGE mitigates gradient mismatch through auxiliary backpropagation. Specifically, we design a Dual-Path Gradient Compensator (DPGC) that constructs a parallel full-precision auxiliary branch for each binarized layer, decoupling gradient flow via output decomposition during backpropagation. DPGC enables bias-reduced gradient estimation by leveraging the full-precision branch to estimate components beyond STE's first-order approximation. To further enhance training stability, we introduce an Adaptive Gradient Scaler (AGS) based on an optimal scale factor to dynamically balance inter-branch gradient contributions via norm-based scaling. Experiments on image classification, object detection, and language understanding tasks demonstrate that SURGE performs best over state-of-the-art methods.

Abstract:
Globalization and multiculturalism continue to produce increasingly diverse speech varieties. Yet current spoken dialogue systems frequently fail on underrepresented dialects and accents, often misidentifying the input language and causing cascading failures in downstream dialogue tasks. Addressing this dialectal variance under low-resource constraints remains an open challenge, as standard fine-tuning is computationally expensive and prone to overfitting on high-dimensional speech data. We propose Convex Language Detection (CLD), a novel framework that integrates theoretically grounded convex optimization techniques into the spoken dialogue systems pipeline. Our method is efficiently implemented via multi-GPU Alternating Direction Method of Multipliers (ADMM) in JAX, thus providing global optimality guarantees and fast training in polynomial time. Theoretically, we prove that our convex objective induces certified margin stability and provide rigorous guarantees against feature perturbations. Empirically, we demonstrate sample efficiency and robustness to input dialectical variance, to significantly reduce language misidentification rates for low-resource dialects within high-resource languages. Our open-source package is available at \urlhttps://anonymous.4open.science/r/CLD-845F/README.md.

Abstract:
Object removal aims to eliminate specified objects from images while plausibly inpainting the affected regions with background content. Current trainingfree methods typically block attention to object regions within self-attention layers during the image generation process, leveraging surrounding background information to restore the image. However, indiscriminate suppression of self-attention in the vacated areas can degrade generation quality, as the model must simultaneously reconstruct background content in these regions. To solve this conflict, we propose AdaEraser, an adaptive framework that dynamically modulates attention based on the estimated presence of target object concepts. Through analysis of self-attention map evolution across denoising timesteps before and during removal, we develop a token-wise adaptive attention suppression strategy. This approach enables progressive perception of object removal throughout the denoising process, with the suppression strength in self-attention layers adjusted adaptively. Extensive experiments demonstrate that AdaEraser achieves superior performance in object removal, outperforming even training-based methods.

Abstract:
Consistency is an important property in dynamic submodular maximization and entails maintaining a nearoptimal solution at all times, making only a small number of adjustments to the solution in each step. Prior work has explored this question for the insertion-only case, where the algorithm faces a stream of n insertions, and has established lower and upper bounds for the cardinality-constrained version of the problem. We consider this question in the fully dynamic setting, where the stream of operations may contain both insertions and deletions. We develop a general framework for designing algorithms for this setting, and instantiate it to obtain the first constant-factor approximations with sublinear consistency. For cardinality constraints, we propose a \tfrac 12 - O(\varepsilon) approximation that is O\left(\tfrac1\varepsilon^2\right) consistent. For rank-k matroid constraints, we construct a \tfrac 14 - O(\varepsilon) approximation to the dynamic optimum that is O\left(\tfrac\log k\varepsilon^2\right) consistent.

Abstract:
Onpolicy reinforcement learning methods like GRPO suffer from \emphmode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group-level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality—an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO's 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%)—demonstrating 9% and 12% relative improvements respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.

Abstract:
Graph Neural Networks (GNNs) are prone to overfitting and oversquashing of long-range information. Stochastic graph perturbations (e.g., edge/node dropping) regularize training, but often (i) induce train-test mismatch in expected message aggregation, (ii) lack a principled mechanism for random edge addition, (iii) offer limited control over regularization strength, and (iv) require dataset-specific tuning over perturbation hyperparameters. We proposeUnbiased Random Add-Drop Edge (RADE), which independently drops edges and adds uniformly sampled non-edges while preserving the expected aggregation at each layer. RADE models graph perturbations as mean-zero logit noise, naturally inducing a variance-weighted regularization penalty. For RADE, we derive a drop/add variance decomposition that clarifies their complementary effects, and propose an epoch-wise GradNorm rule that adaptively tunes the deletion and addition rates during training. We further proposeRADE with Inference Correction (RADE-IC), which adds inference-time shortcuts to introduce additional message pathways and mitigate over-squashing. Experiments on node- and graph-classification benchmarks show consistent gains with RADE, while RADE-IC yields notable improvements on over-squashing-prone tasks. Ablations validate the role of unbiased aggregation, GradNorm adaptivity, and drop/add complementarity.

Abstract:
Standard SelfSupervised Learning (SSL) for Automatic Modulation Recognition (AMR) struggles with ineffective isotropic augmentations, spectral instability, and semantic drift. To address these challenges, we propose Dynamic-Consistency Contrastive Learning (DyCo-CL), a geometry-aware framework that couples Virtual Adversarial Augmentation (VAA) with a semantic consistency loss. We provide a theoretical analysis indicating that this strategy acts as an implicit spectral regularizer for the encoder, enabling stable manifold exploration. Complementing this, our Signal-Adaptive Swin Backbone with fixed-window attention improves structural stability by constraining attention locality, while a Hybrid Knowledge Fusion module anchors representations with physical priors. Experiments on RML benchmarks show that DyCo-CL achieves a 6.27% accuracy gain in 1-shot settings over prior methods.

Abstract:
The PolicySpace Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restricted strategy set using deep reinforcement learning (DRL). A central challenge is to construct, under limited computational budgets, a small strategy population whose induced game well approximates the full game. Existing PSRO variants typically expand the population using best responses to meta-strategies computed from restricted-game payoffs, which can lead to inefficient expansions that provide limited global improvement. We propose to guide population expansion by directly evaluating the post-expansion population quality. Specifically, we adopt Population Exploitability (PE) to measure how well a restricted strategy set represents the full game, and introduce a two-phase exploration--selection framework that explicitly minimizes PE during expansion. We instantiate this framework as Global PSRO, a practical DRL-based algorithm that efficiently generates candidate responses and estimates PE via parameter-sharing conditional neural networks. Experiments across multiple two-player zero-sum games show that Global PSRO achieves lower exploitability and approximates Nash equilibria with significantly fewer policy iterations than prior PSRO methods.

Abstract:
Tool calling extends large language models (LLMs) by enabling grounded interaction with external executable interfaces, thereby supporting environmentcoupled problem solving. However, mainstream in-context learning (ICL) approaches typically incorporate detailed tool documentation and usage examples directly into the context. This results in substantial inference overhead and heightened risks of hallucination as the context length grows. Conversely, while tuning-based methods improve general tool-calling capabilities, they often fail to effectively internalize the specific details of previously seen tools, thereby retaining a dependency on in-context documentation. To address these limitations, we propose ParaTool, a framework that projects each tool into a dedicated, loadable set of parameters. By equipping a dynamic integration of these parameterized tools, the LLM can perform tool calling without relying on in-context documents or examples. Specifically, our approach consists of three stages: (1) parametric tool pre-training encapsulates the knowledge of different tools into independent parameter modules; (2) soft tool selection employs a gating network to dynamically weigh and aggregate relevant tool parameters; and (3) parametric tool fine-tuning jointly updates tool parameters to align the training and inference processes. Experiments on Stable ToolBench and BFCL demonstrate that ParaTool significantly outperforms strong ICL-based baselines, achieving superior performance while reducing computational complexity.

Abstract:
Agents that can beat or tie any other under a model of pairwise preference have strong guarantees for both user satisfaction and overall social welfare. However, searching for these agents in longterm decision problems is not computationally tractable with current approaches, which require the size of an agent's policy to increase with the problem length. We introduce the Markov decision contest, a model of learning from general preferences in long-term (infinite-horizon) decision problems. Within this model, we prove that agents only need a stationary Markov policy in order to be optimal (that is, to beat or tie any agent with a history-dependent policy); that the problem of finding an optimal policy is in P; and that a simple iterative algorithm (which we call Hedged Policy Iteration) converges to an optimal policy at a sublinear rate. In a suite of high-dimensional experiments, we demonstrate that Hedged Policy Iteration scales well to function approximation. Lastly, we present a near approximation of Hedged Policy Iteration, called HPI-Clip, which both matches the performance of Proximal Policy Optimization on reward-based tasks while also outperforming it on tasks with non-transitive preferences. These results show that learning from pairwise preferences in long-term decision problems can be far more tractable than what is known from prior work.

Abstract:
Inferring the structure of directed acyclic graphs (DAGs) from data is a central challenge in causal discovery, particularly in modern highdimensional settings where large-scale interventional data are increasingly available. While interventional data can substantially improve identifiability, existing differentiable causal discovery methods rely on soft acyclicity constraints, leading to optimization over invalid cyclic graphs, numerical instability, and limited scalability. We introduce PACER (Perturbation-driven Acyclic Causal Edge Recovery), a scalable framework for causal discovery that guarantees acyclicity by construction. PACER parameterizes a distribution over DAGs through a joint model of variable permutations and edge probabilities, enabling direct optimization over valid causal structures without surrogate penalties. The framework supports a unified likelihood-based treatment of observational and interventional data, flexible conditional density models, and the incorporation of structural prior knowledge. For linear-Gaussian mechanisms, we derive closed-form expressions for the expected interventional log-likelihood and its gradients, yielding substantial computational gains. Empirically, PACER matches or exceeds state-of-the-art methods on protein signaling and large-scale genetic perturbation benchmarks, while scaling efficiently to networks with thousands of variables and achieving up to two orders of magnitude speedups over penalty-based differentiable approaches. These results demonstrate that exact and scalable causal discovery from high-dimensional perturbation data is achievable through principled search space design.

Abstract:
Offline blackbox optimization aims to discover novel designs with high property scores using only a static dataset, a task fundamentally challenged by the out-of-distribution (OOD) extrapolation problem. Existing approaches typically bifurcate into inverse methods, which struggle with the ill-posed nature of mapping scores to designs, and forward methods, which often lack the distributional expressivity to quantify uncertainty effectively. In this work, we propose SPADE (Support-Proximity Augmented Diffusion Estimation), a novel framework that reimagines forward surrogate modeling through the lens of conditional generative modeling. SPADE models the forward likelihood p(y|\boldsymbolx) using a diffusion model, but with two critical enhancements to tailor it for optimization: (1) a \emphCalibrated Diffusion Estimation module that enforces global consistency in statistical moments and pairwise rankings, and (2) a \emphSupport-Proximity Regularization mechanism that implicitly internalizes the data manifold constraint p(\boldsymbolx) via kNN-based density estimation. Theoretically, we prove that our regularization is first-order equivalent to maximizing a Bayesian posterior with a valid design prior. Empirically, SPADE achieves state-of-the-art performance across Design-Bench tasks and an LLM data mixture optimization benchmark. Our code is available through the anonymous repo \hrefhttps://anonymous.4open.science/r/diffsurr-icml2026-C4FD/here.

Abstract:
Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, groupbased optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization (E^2PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.

Abstract:
Referenceguided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically—early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.

Abstract:
Selfsupervised learning (SSL) has emerged as a powerful paradigm for learning meaningful representations from unlabeled data. However, the standard protocol for evaluating these representations, linear probing, is computationally expensive, sensitive to hyperparameters, and provides limited insight into the geometric structure of the representation space. In this work, motivated by connections between neural network generalization and intrinsic dimensionality (ID) we propose IdEst, a method for estimating the ID of SSL representations via the Minimum Spanning Tree dimension estimator (\mathrmdim_\mathrmMST). Across diverse datasets, architectures, and SSL pretraining objectives, we show that IdEst strongly correlates withdownstream linear probe performances. Furthermore, we demonstrate that IdEst enables efficient hyperparameter selection, significantly reducing the computational cost compared to supervised alternatives. Our results highlight intrinsic dimensionality as a principled geometric proxy for assessing and optimizing SSL representations, complementing standard supervised probing protocols.

Abstract:
Reliable decision making pipelines powered by machine learning models require uncertainty quantification (UQ) methods that come with explicit safety guarantees. Conformal prediction provides such UQ by wrapping ML predictions into prediction sets, and recent work by \citekiyani2025decision established that these sets can be translated into optimal riskaverse decision policies—yet only inheriting marginal safety guarantees. We generalize and strengthen their results by (i) introducing action-conditional conformal prediction, which yields safety guarantees conditioned explicitly on each action taken by the decision maker, (ii) showing that action-conditional prediction sets serve as a proxy for the feasible decision space for risk-averse decision makers aiming to optimize action-conditional value-at-risk, and (iii) proposing a principled finite-sample algorithm based on pinball-loss minimization, connecting the framework of \citegibbs2025conformal to action-conditional guarantees. Experiments on two real-world datasets confirm that our approach significantly improves action-conditional performance over several conformal baselines.

Abstract:
With the rapid advancement of large multimodal models (LMMs), inferencetime overhead has become a key bottleneck for real-world deployment. Existing methods typically prune visual tokens at prefill, assuming the required visual evidence remains static during reasoning. However, we empirically show that visual evidence is strongly step-dependent: only a sparse subset of visual tokens is critical at each decoding step, and the critical set evolves across reasoning. Furthermore, we identify a coupled bottleneck where redundant visual context can steer the model toward query-irrelevant regions, lengthening the reasoning trace. Guided by these insights, we proposeVisionPulse, a step-wise visual token pruning framework during reasoning. VisionPulse computes a lightweight visual attention mass to estimate the step-wise retention budget by exploiting its strong positive correlation with LMMs' effective visual token usage and retain only the most critical tokens under this budget. By enforcing visual sparsity during reasoning, VisionPulse filters redundant visual context while preserving relevant visual evidence, shortening reasoning traces naturally. Extensive experiments show that VisionPulse only retains 5% of visual tokens per step with reasoning traces shortened by 11.2%, while keeping accuracy almost unchanged.

Abstract:
Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions. To address this challenge, we propose a goaloriented clarification framework that aligns clarification behavior with ambiguity resolution. Central to our approach is the Information Gain Reward, a metric that quantifies the utility of clarification questions by measuring the Bayesian belief update towards the ground-truth goal induced by the clarification exchange. We train the clarifier (LLM) using this reward to optimize for high information gain, ensuring that clarifications effectively reduce uncertainty and improve task completion within the agent-tool-user environment. We validate our framework within a clarification-enhanced \tau-Bench environment, conducting cross-agent evaluations across five heterogeneous backbones. Empirical results demonstrate that our method consistently improves the success rate by 3.7% over the no-clarification baseline, incurring small overhead via surgical interventions (averaging only 0.3 additional steps). Our code and data are available at the anonymous repository: \urlhttps://anonymous.4open.science/r/Information-Gain-Clarifier-AD68/.

Abstract:
Preferencebased post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.

Abstract:
Multitoken prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs), especially in byte-level LLMs, which are tokeniser-free but prohibitively slow. However, existing MTP methods often sacrifice expressiveness by assuming independence between future tokens. In this work, we investigate the trade-off between expressiveness and latency in MTP within the framework of probabilistic circuits (PCs). Our framework, named MTPC, allows one to explore different ways to encode the joint distributions over future tokens by selecting different circuit architectures, generalising classical models such as (hierarchical) mixture models, hidden Markov models, and tensor networks. We show the efficacy of MTPC by retrofitting existing byte-level LLMs, such as EvaByte, and byte-fied subword models, such as Llama3.2 3B. Our experiments show that, when combined with speculative decoding, MTPC substantially speeds up generation compared to MTP with independence assumptions, while guaranteeing to retain the performance of the original verifier LLM. We also rigorously study the optimal trade-off between expressiveness and latency when exploring the possible parameterisations of MTPC, such as PC architectures and partial layer sharing between the verifier and draft LLMs.

Abstract:
Ensuring fair and equitable treatment across diverse groups, particularly in multiclass classification tasks, poses a significant challenge due to the persistent biases inherent in machine learning models. Most existing bias mitigation techniques are tailored to binary settings, and the presence of multi-dimensional outputs and complex fairness mechanisms makes their extension to multi-class scenarios neither straightforward nor effective. In this paper, we investigate two fundamental, unresolved challenges in fair classification: (i)characterizing the optimal accuracy-fairness frontier in multi-class settings, and (ii)designing practical algorithms that attain this optimum across different training phases. To tackle these challenges, we first specify an analytically tractable probabilistic formulation of the optimal classifier under fairness constraints. Building upon this, we propose two attribute-blind algorithms to enforce fairness requirements in practice: an in-processing approach for fairness intervention during training via the reduction approach, and a post-processing approach for fine-tuning output probabilities with plug-in estimation. Theoretical analysis reveals that both methods converge to the optimal accuracy-fairness Pareto frontier. Experiments conducted on multiple datasets demonstrate the superior performance of our methods in balancing accuracy and fairness.

Abstract:
As predictive models are increasingly deployed in highstakes settings such as credit approval, there is a growing need for post-hoc methods that provide recourse to affected individuals. Many such models operate on tabular data, where features correspond to real-world attributes. Recently, in-context learning (ICL) has enabled large language models to perform tabular prediction by conditioning on labeled examples at inference time, without explicit training. However, algorithmic recourse for tabular decision-making under ICL remains largely unexplored. In this work, we present the first study of algorithmic recourse for tabular data under ICL. We carry out a theoretical analysis, showing that recourse remains well-defined and bounded, and we characterize how recourse converges toward classical solutions as the context size increases. In practice, we propose a novel zeroth-order recourse framework, Adaptive Subspace Recourse for In-Context Learning (ASR-ICL), that efficiently generates actionable and sparse recourse for black-box ICL models. The proposed framework naturally extends to multi-class tabular tasks. Experiments across multiple real-world datasets and models demonstrate that ASR-ICL achieves recourse quality comparable to existing methods with fewer queries and empirically confirm the predicted convergence behavior, supporting our theoretical analysis.

Abstract:
Learningbased surrogates for partial differential equations have recently matched the accuracy of classical solvers while achieving orders-of-magnitude speedups, predominantly in fluid settings and structured geometries. In contrast, robust surrogates for deformable solids remain underexplored, despite the presence of nonlinear elasticity, plasticity, and transient behavior that challenge standard architectures. We introduce a multigrid graph neural network for solid mechanics that couples anencoder-processor-decoderbackbone with a physics-informed coarsening strategy. Instead of downsampling via geometric heuristics, our method scores nodes using a residual-based measure of local physical activity and preferentially retains regions of high strain or stress concentration, allocating multiscale capacity where it is most needed. This preserves long-range interactions through hierarchical message passing while improving stability over long rollouts. We evaluate on multiple datasets covering linear, nonlinear, and transient regimes, and observe consistent gains in accuracy and rollout stability compared to standard sampling baselines. Our results highlight the importance of physics-informed coarsening for scalable surrogate modeling in solid mechanics.

Abstract:
Incontext learning (ICL) derives its power from enabling Large Language Models to adapt to new tasks via prompt-based reasoning alone, entirely bypassing the need for parameter updates. Existing theories primarily study ICL in single-task settings, while real-world prompts often contain sequences of heterogeneous tasks, leaving a gap in understanding whether Large Language Models implicitly perform continual learning during inference. To bridge this gap, we propose the first theoretical framework for in-context continual learning, modeling how a pretrained Transformer processes multiple sequential tasks within a single prompt through shared attention mechanisms. Focusing on linear and masked linear self-attention, we derive error expressions for model predictions under sequential task prompts and analyze their generalization and forgetting behavior. Our results reveal that standard attention mechanisms inevitably induce inter-task interference by uniformly or causally aggregating historical contexts, leading to systematic bias. We further provide a bias–variance–interference decomposition of prediction error, characterizing when historical in-context information yields positive transfer or provable negative transfer. This analysis exposes fundamental limits of attention-based continual inference and offers theoretical explanations for order sensitivity and performance degradation in long prompts.

Abstract:
Symbolic regression (SR) stands as a cornerstone of scientific discovery, deriving mathematical expressions from observing data. Recent advances incorporate large language models (LLMs) into evolutionary optimization, typically relying on iterative refinement driven by scalar feedback (e.g., mean squared error, MSE). However, such coarse feedback lacks directional guidance for strategic lookahead and diagnostic signals to localize structural errors, thereby confining the search to a myopic trialand-error process. Additionally, treating optimization steps as isolated episodes precludes learning from historical trajectories. Consequently, optimization often degenerates into an inefficient search with substantial computational cost. Motivated by these limitations, we propose Deliberate Evolution, an agentic framework for SR tasks that equips LLM-based candidate proposal with explicit, structured guidance. Our approach steers optimization through adaptive evolutionary operators for directional control, analytical tools for diagnostic feedback, and reflective memory for historical insight. Extensive experiments on LLM-SRBench demonstrate that our approach consistently outperforms prior baselines while using merely 40% of the sample budget.

Abstract:
Large VisionLanguage Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thinking-with-images paradigm. However, LVLM visual search faces two key challenges: incompatibility among intrinsic capabilities after post-training, and interference in long multi-step reasoning contexts. To address these, we identify two novel insights. First, self-regulation between pre- and post-training LVLMs leverages the intrinsic single-step capabilities of the pre-training model to mitigate capability deterioration and long-context interference. Second, probability-based prophetic sampling, replacing naive prompting, provides a probabilistic interface where the pre-training model acts as a prophet and the post-training model selectively accepts prophetic tokens under its output distribution, preserving coherent multi-step reasoning. Building on these insights, we introduce SeProD, a self-prophetic decoding framework that leverages intrinsic single-step capabilities to enable coherent multi-step reasoning in a training-free, plug-and-play manner. Experiments show that SeProD consistently improves multiple visual-search LVLMs across all 12 splits of 4 visual search benchmarks, as well as across general VQA benchmarks, without added computational overhead, thanks to its parallel prophetic acceptance mechanism. The code will be made publicly available.

Abstract:
How can we distinguish whether a peer review was written by a human or generated by an AI model?We argue that, in this setting, authorship should not be attributed solely from the textual features of a review, but also from the ideas, judgments, and claims it expresses. To this end, we propose SemDetect, an authorship detection method for peer reviews that operationalizes this principle by combining textual features with claim-level semantic analysis. Sem-Detect compares a target review against multiple AI-generated reviews of the same paper, leveraging the observation that different AI models tend to converge on similar points, while human reviewers introduce more unique and diverse ones. As a result, Sem-Detect is able to distinguish fully AI-generated reviews from authentic human-written ones, including those that have been refined using an LLM but still reflect human judgment. Across a dataset of over 20,000 peer reviews from ICLR and NeurIPS conferences, Sem-Detect improves over the strongest prior detector by 36.5% in TPR@1% FPR in the binary setting. More importantly, in the three-class scenario, we empirically show that LLM refinement preserves the semantic signals of human reviews, which remain distinct from the patterns exhibited by fully AI-generated text; as a result, fewer than 3.5% of LLM-refined human reviews are misclassified as AI-generated.

Abstract:
In this work, we show that Latent FlowMatching (LFM) models are robust to different types of perturbations, including data reduction and model capacity shrinkage. We characterize this stability by their tendency to generate similar outputs under identical noise seeds. We provide a perspective relating this phenomenon to flow matching theory, which indicates that this stability is inherent to the FM objective. We further exploit this stability to derive practical algorithms for more efficient training and inference. Concretely, first, we show that by training LFM models on significantly reduced datasets, the performance does not degrade perceptually or quantitatively. This yields multiple advantages, such as reducing training time by converging faster under limited compute budget, and alleviating annotation effort when training conditional models. Second, LFM stability under architectural shrinkage gives rise to a two-model coarse-to-fine approach, one using a light-weight architecture for the first phase of the FM trajectory, and one with higher capacity for the second, thereby reducing the inference cost substantially. To determine which samples are informative, we introduce three sample-scoring criteria and evaluate them under standard metrics for generative models. Our results are thoroughly evaluated on multiple datasets, demonstrating the practical advantage of this stability, including data saving and a more than two-fold inference speedup while generating comparable outputs.

Abstract:
Direct Preference Optimization (DPO) is a widely used RLfree method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley–Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recovertoken-levelpreference optimality using only standard sequence-level pairwise comparisons. We introduceToken-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley--Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.

Abstract:
Mendelian Randomization (MR) is a prominent observational epidemiological research method, designed to address unobserved confounding when estimating causal effects. It is closely related to instrumental variable (IV) methods, where genetic variants serve as instruments to infer causal relationships from observational data. However, the core assumptions required for valid IV analysisparticularly the independence between instruments and unobserved confounders---are untestable and often violated in practice. In MR, such violations commonly arise when genetic variants are correlated with environmental factors (e.g., population stratification and assortive mating), leading to confounding between instruments and outcomes. At the same time, MR studies increasingly include data collected across multiple environments or populations, providing an opportunity to address these violations. Leveraging this setting, we propose a representation learning framework that exploits multi-environment data to recover latent exogenous components of genetic instruments suitable for causal inference. We provide theoretical insights into when and how the learned components can act as valid instruments, and we demonstrate the effectiveness of our approach through simulations and semi-synthetic experiments using genetic data from the All of Us Biobank.

Abstract:
Designing microstructures that satisfy coupled crossphysics objectives is a fundamental challenge in material science. This inverse design problem involves a vast, discontinuous search space where traditional topology optimization is computationally prohibitive, and deep generative models often suffer from "physical hallucinations," lacking the capability to ensure rigorous validity. To address this limitation, we introduce AutoMS, a multi-agent neuro-symbolic framework that reformulates inverse design as an LLM-driven evolutionary search. Unlike methods that treat LLMs merely as interfaces, AutoMS integrates them as "semantic navigators" to initialize search spaces and break local optima, while our novel Simulation-Aware Evolutionary Search (SAES) addresses the "blindness" of traditional evolutionary strategies. Specifically, SAES utilizes simulation feedback to perform local gradient approximation and directed parameter updates, effectively guiding the search toward physically valid Pareto frontiers. Orchestrating specialized agents (Manager, Parser, Generator, and Simulator), AutoMS achieves a state-of-the-art 83.8% success rate on 17 diverse cross-physics tasks, nearly doubling the performance of traditional NSGA-II (43.7%) and significantly outperforming ReAct-based LLM baselines (53.3%). Furthermore, our hierarchical architecture reduces total execution time by 23.3%. AutoMS demonstrates that autonomous agent systems can effectively navigate complex physical landscapes, bridging the gap between semantic design intent and rigorous physical validity.

Abstract:
Agent memory systems must accommodate continuously growing information while supporting efficient, contextaware retrieval for downstream tasks. Abstraction is essential for scaling agent memory, yet it often comes at the cost of specificity, obscuring the fine-grained details required for effective reasoning. We introduce Memora, a harmonic memory representation that structurally balances abstraction and specificity. Memora organizes information via itsprimary abstractionsthat index concrete memory values and consolidate related updates into unified memory entries, whilecue anchorsexpand retrieval access across diverse aspects of the memory and connect related memories. Building on this structure, we employ a retrieval policy that actively exploits these memory connections to retrieve relevant information beyond direct semantic similarity. Theoretically, we show that standard Retrieval-Augmented Generation (RAG) and Knowledge Graph (KG)-based memory systems emerge as special cases of our framework. Empirically, Memora establishes a new state-of-the-art on the LoCoMo and LongMemEval benchmarks, demonstrating better retrieval relevance and reasoning effectiveness as memory scales.

Abstract:
Detecting misalignment in large language models is challenging because models may learn to conceal misbehavior during training. Standard auditing techniques fall short: blackbox methods often cannot distinguish misaligned outputs from benign ones, and mechanistic interpretability does not scale with model capabilities. We introduce Split Personality Training (SPT), which fine-tunes a second ``honest persona'' into LoRA parameters that remain inactive during normal operation. After the main model responds, we activate the LoRA adapter and insert a trigger string, enabling the honest persona to review the response while accessing the main model's latent states. We test our method on the Anthropic Auditing Game Model Organism, a benchmark where Llama-3.3-70B is trained to exploit reward hacks while concealing this behavior. SPT achieves 96% overall accuracy, whereas Anthropic reports near 0% accuracy. The honest persona reveals latent knowledge inaccessible to external observers, such as the fictional biases the compromised model was trained on.

Abstract:
Traditional Mixtureof-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to simultaneously train expert weights while searching for an optimal routing policy within a vast combinatorial space. This entanglement often leads to sluggish convergence and training instabilities. This paper introduces Grouter, a preemptive routing method that by distilling high-quality structures from fully-trained MoE models and serving as a fixed router for target models. By decoupling structural optimization from weight updates, Grouter significantly accelerates both the speed and quality of model convergence. To ensure the framework's versatility, we also introduce expert folding to adapt Grouter across varying model configurations and expert tuning to rebalance workloads across different data distributions. Furthermore, by leveraging the structural priors provided by preemptive routing, we can implement targeted optimizations to further enhance training throughput. Experiments demonstrate that Grouter achieves superior performance and efficiency which boosts pre-training data utilization by 4.28× and achieves up to 33.5% throughput acceleration, establishing preemptive routing as a fundamental paradigm for scalable MoE training.

Abstract:
An important emerging application of coding agents isagent optimization: the iterative improvement of atarget agentthrough edit–execute–evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VeRO (Versioning,Rewards, andObservations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budgetcontrolled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VeRO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VeRO to support research on agent optimization as a core capability for coding agents.

Abstract:
Hallucination remains a fundamental challenge for Multimodal Large Language Models (MLLMs). While Direct Preference Optimization (DPO) is a key alignment framework, existing approaches often rely heavily on costly external evaluators for scoring or rewriting, incurring offpolicy learnability gaps and discretization loss. Due to the lack of access to internal states, such feedback overlooks the fine-grained conflicts between different modalities that lead to hallucinations during generation. To address this issue, we propose IRIS (Implicit Reward-Guided Internal Sifting), which leverages continuous implicit rewards in the native log-probability space to preserve full information density and capture internal modal competition. This on-policy paradigm eliminates learnability gaps by utilizing self-generated preference pairs. By sifting these pairs based on multimodal implicit rewards, IRIS ensures that optimization is driven by signals that directly resolve modal conflicts. Extensive experiments demonstrate that IRIS achieves highly competitive performance on key hallucination benchmarks using only 5.7k samples, without requiring any external feedback during preference alignment. These results confirm that IRIS provides an efficient and principled paradigm for mitigating MLLM hallucinations.

Abstract:
Posttraining alignment of large language models (LLMs) relies on large-scale human annotations guided by policy specifications that change over time. Cultural shifts, value reinterpretations, and regulatory or industrial updates make static alignment increasingly brittle. As policies evolve, deployed models can diverge from current alignment objectives, creating an Alignment–Reality Gap that is difficult to audit or correct. Existing remediation typically requires re-annotation under revised guidelines, which introduces systematic challenges, including guideline ambiguity, annotator interpretation drift, and reduced consistency at scale. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework that transforms re-alignment into a structured optimization problem over existing data without requiring fresh human annotation. Leveraging a stronger model as a proxy judge, TRACE operates via a three-stage pipeline: (1) triaging preference pairs into inversion, suppression, or retention categories based on alignment conflicts; (2) computing an alignment impact score via bi-level optimization to prioritize high-leverage samples; and (3) executing updates using a hybrid objective that combines relational losses (e.g., IPO) for preference inversion and punitive losses (e.g., NPO) for response suppression. Experiments on Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B demonstrate robust re-alignment on synthetic benchmarks and the PKU-SafeRLHF dataset without degrading general utility. This work provides a scalable approach for LLM realignment under evolving data annotation policies and alignment guidelines.

Abstract:
We study a widely used Bayesian optimization method, Gaussian process Thompson sampling (GPTS), under the assumption that the objective function is a sample path from a GP. Compared with the GP upper confidence bound (GP-UCB) with established high-probability and expected regret bounds, most analyses of GP-TS have been limited to expected regret. Moreover, whether the recent analyses of GP-UCB for the lenient regret and the improved cumulative regret upper bound can be applied to GP-TS remains unclear. To fill these gaps, this paper shows several regret bounds: (i) a regret lower bound for GP-TS, which implies that GP-TS suffers from a polynomial dependence on 1/\delta with probability \delta, (ii) an upper bound of the second moment of cumulative regret, which directly suggests an improved regret upper bound on \delta, (iii) expected lenient regret upper bounds, and (iv) an improved cumulative regret upper bound on the time horizon T. Along the way, we provide several useful lemmas, including a relaxation of the necessary condition from recent analysis to obtain improved regret upper bounds on T.

Abstract:
Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to finetune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier’s decision boundary, increasing both false positives and false negatives. We propose \emphBoundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier’s margin. On a benchmark of jailbreak, ambiguous, and long-context prompts, \emphBoundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach. The code is available at https://anonymous.4open.science/r/boundary-avoidance-45G2-45FD1.

Abstract:
VisionLanguage-Action (VLA) models show strong generalization for robotic control, but finetuning them with reinforcement learning (RL) is constrained by the high cost and safety risks of real-world interaction. Training VLA models in interactive world models avoids these issues but introduces several challenges, including pixel-level world modeling, multi-view consistency, and compounding errors under sparse rewards. Building on recent advances across large multimodal models and model-based RL, we propose VLA-MBPO, a practical framework to tackle these problems in VLA finetuning. Our approach has three key design choices: (i) adapting unified multimodal models (UMMs) for data-efficient world modeling; (ii) an interleaved view decoding mechanism to enforce multi-view consistency; and (iii) chunk-level branched rollout to mitigate error compounding. Theoretical analysis and experiments across simulation and real-world tasks demonstrate that VLA-MBPO significantly improves policy performance and sample efficiency. Crucially, our method maintains a universal set of hyperparameters across all tasks, underscoring its robustness and scalability for real-world robotic deployment.

Abstract:
Large Language Models (LLMs) are increasingly used to simulate how specific users respond to any context, enabling more usercentric applications that rely on user feedback. However, existing user simulators mostly imitate surface-level patterns and language styles, which fails to reflect the underlying state of real users (e.g., beliefs, emotions). To address these limitations, we propose a novel training framework, HumanLM, which builds user simulators that accurately reflect real users. Our key insight is, in addition to generating responses, we generate natural-language latent states that align with the ground truth responses through reinforcement learning. These latent states correspond to a set of state dimensions which psychologically lead to how real users respond. HumanLM further synthesizes these aligned latent states into responses that accurately represent real users. For extensive evaluation, we develop Humanual, a comprehensive benchmark on simulating real users based on public data. Humanual consists of six large-scale datasets with 26k users and 216k responses in total. It spans diverse tasks such as generating user responses to daily life issues, political blogs, and chat sessions with LLM assistants. Across the datasets, HumanLM significantly outperforms the best alternative approaches by an average relative improvement of 16.3% on alignment score from an LLM judge. In a real-time simulation study with 37 participants, HumanLM achieves the highest scores on similarity with real user responses and human-likeness.

Abstract:
VisionLanguage Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using real INT4 kernel, we achieve 3× throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment.

Abstract:
Large language models (LLMs) have shown promise in assisting cybersecurity tasks, yet existing approaches struggle with automatic vulnerability discovery and exploitation due to limited interaction, weak execution grounding, and a lack of experience reuse. We propose CoRedTeam, a security-aware multi-agent framework designed to mirror real-world red-teaming workflows by integrating security-domain knowledge, code-aware analysis, execution-grounded iterative reasoning, and long-term memory. Co-RedTeam decomposes vulnerability analysis into coordinated discovery and exploitation stages, enabling agents to plan, execute, validate, and refine actions based on real execution feedback while learning from prior trajectories. Extensive evaluations on challenging security benchmarks demonstrate that Co-RedTeam consistently outperforms strong baselines across diverse backbone models, achieving over 60% success rate in vulnerability exploitation and over 10% absolute improvement in vulnerability detection. Ablation and iteration studies further confirm the critical role of execution feedback, structured interaction, and memory for building robust and generalizable cybersecurity agents.

Abstract:
KVcache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduceParisKV, a drift-robust, GPU-native KV-cache retrieval framework based on collision-based candidate selection, followed by a quantized inner-product reranking estimator. For million-token contexts, ParisKV supports CPU-offloaded KV caches via Unified Virtual Addressing (UVA), enabling on-demand top-kfetching with minimal overhead. ParisKV matches or outperforms full-attention quality on bothlong-inputandlong-generationbenchmarks. It achieves state-of-the-art long-context decoding efficiency: it matches or exceeds full-attention speed even at batch size 1 for long contexts, delivers up to2.8×higher throughput within full attention’s runnable range, and scales tomillion-tokencontexts where full attention runs out of memory. At million-token scale, ParisKV reduces decode latency by17×and44×compared to MagicPIG and PQCache, respectively—two state-of-the-art KV-cache top-kretrieval baselines.

Authors: Jingzhe Ding, Shengda Long, puchangxin , Ge Zhang, zhou huan, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, FEI HU, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Chen Qizhi, cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu, Mingchen Li, Minghao Li, Xuanguang Pan, Zifan Peng, Yujia Qin, Yong Shan, Zhewen Tan, Haoran Wang, Zihan Wang, Weihao Xie, Yishuo Yuan, Jiayu Zhang, Yunfei Zhao, He Zhu, LIYA ZHU, chenyangzou , Ming Ding, Jiaheng Liu, Jianpeng Jiao, Liam Liu, Qian Liu, Chongyang Tao, Jian Yang, Tong Yang, Zhaoxiang Zhang, Xinjie Chen, Wenhao Huang

Abstract:
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks primarily evaluate shorthorizon behaviors such as localized code generation, scaffolded completion, or repository repair, leaving it unclear whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we introduce NL2Repo-Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation from scratch: given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, and produce a fully installable Python library. Experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved, with even the strongest agents achieving merely 40% average test pass rates and rarely completing an entire repository correctly. Further analysis identifies systematic long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. These results position NL2Repo-Bench as a rigorous, execution-based testbed for evaluating sustained agentic competence and highlight long-horizon reasoning as a key bottleneck for autonomous coding agents. Our data and code are available at https://anonymous.4open.science/r/nl2repobench-foricml-F4ED/.

Abstract:
Efficiently routing queries to the optimal large language model (LLM) is crucial for optimizing the costperformance trade-off in multi-model systems. However, most existing routers rely on single-model selection, making them susceptible to misrouting. In this work, we formulate LLM routing as the \alpha-VOR problem to minimize expected set size while controlling the misrouting risk, and propose a novel method -- RACER, extending base routers to output model sets that can be subsequently aggregated for improved output. In particular, RACER constructs nested model sets via augmented scoring and utilizes finite-sample concentration bounds to calibrate a threshold that allows for both variable set sizes and abstention. We theoretically prove that RACER achieves rigorous distribution-free risk control on unseen test data in a post-hoc and model-agnostic manner. Extensive experiments verify our theoretical guarantees and demonstrate that RACER consistently enhances downstream accuracy across a wide range of benchmarks.

Abstract:
We introduce vector diffusion wavelets (VDWs), a novel family of wavelets inspired by the vector diffusion maps algorithm that was introduced to analyze data lying in the tangent bundle of a Riemannian manifold. We show that these wavelets may be effectively incorporated into a family of geometric graph neural networks, which we refer to as VDWGNNs. We demonstrate that such networks are effective on synthetic point cloud data, as well as on real-world data derived from wind-field measurements and neural activity data. Theoretically, we prove that these new wavelets have desirable frame theoretic properties, similar to traditional diffusion wavelets. Additionally, we prove that these wavelets have desirable symmetries with respect to rotations and translations.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and shorthorizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supervision, poor sample efficiency, and suboptimal policy updates. We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Each rollout is treated as an arm whose reward is defined by the induced performance gain between consecutive optimization steps. The resulting scheduler supports both noise-aware intra-group selection and adaptive global reuse of historical rollouts within a single principled framework. We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.

Abstract:
Dataset distillation (DD) aims to replace a full training set with a tiny synthetic one, yet current theories neither explain why heterogeneous matching objectives (gradient, distribution, trajectory) work nor provide a quantitative boundary for robustness under configuration changes (optimizer, architecture, augmentation). We propose configurationdynamics-error (CDE) analysis for a broad class of matching-based DD methods, which provides a unified generalization framework that treats the training configuration as an update operator inducing optimization dynamics, and that measures distillation robustness by the test-risk gap between models trained on distilled versus full data. Within this framework, all gradient, distribution, and trajectory matchings are shown to reduce the same dynamics-induced risk gap, explaining why these heterogeneous objectives can work. CDE yields two predictive laws. First, within a fixed configuration, the gap decays as \mathcalO(k^-1/2) with the distilled set size k until the configuration-dependent floor, which can explain the ubiquitous IPC saturation and indicate when improving the floor dominates enlarging k. Second, we formalize a utility boundary via an order-tight coverage law: the required k grows linearly with the configuration diversity that can be captured by the covering-number complexity. Experiments with representative DD methods and configuration changes exhibit predictive behaviors consistent with our laws.

Abstract:
InContext Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory has a fixed capacity and is prone to interference, especially for long sequences. We propose Palimpsa, a self-attention model that views ICL as a continual learning problem that must address a stability-plasticity dilemma. Palimpsa uses Bayesian metaplasticity, where the plasticity of each attention state is tied to an importance state grounded by a prior distribution that captures accumulated knowledge. We demonstrate that various gated linear attention models emerge as specific architecture choices and posterior approximations, and that Mamba2 is a special case of Palimpsa where forgetting dominates. This theoretical link enables the transformation of any non-metaplastic model into a metaplastic one, significantly expanding its memory capacity. Our experiments show that Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks.

Abstract:
Cooperative reasoning under incomplete information remains complex for both humans and multiagent AI, requiring agents to transcend individual logic in favor of recursive Theory-of-Mind (ToM) and strategic coordination. To investigate these challenges, we conduct a large-scale evaluation of 17 state-of-the-art LLMs (4B–600B+) on Hanabi card game across 2–5 players. To examine their limitations, we analyze the impact of context engineering and scaffold robustness, ranging from minimal prompts (Watson setting) to Bayesian-motivated scaffolding (Sherlock setting) and multi-turn working memory (Mycroft setting). Our findings reveal that: (1) top-performing models can autonomously track game states via internal working memory, although not reliably, and (2) cross-play performance scales smoothly with model capability. However, even the best models (scoring ≈ 15/25) trail specialist human experts (> 20/25). We introduce and release two novel datasets: HanabiLogs (1,520 annotated trajectories) and HanabiRewards (560 games with dense move-level utilities). By fine-tuning a 4B open-weight model (Qwen3-Instruct) on our datasets, we achieve performance gains of up to 156%, bringing performance to within 3 points of a strong proprietary reasoning model (o4-mini) and surpassing the best non-reasoning model (GPT-4.1) by 52%. Crucially, our HanabiRewards RL-finetuned model further generalizes beyond Hanabi, improving performance on a cooperative group-guessing benchmark by 11%, temporal reasoning on EventQA by 6.4%, instruction-following on IFBench-800K by 1.7 Pass@10, and matching AIME 2025 mathematical reasoning Pass@10. Code and datasets are available at redacted for double blind.

Abstract:
Motivated by discrete diffusion's success in languagevision modeling, we explore its potential for multi-view generation, a task dominated by continuous approaches. We introduce ViewMask-1-to-3, formulating multi-view synthesis as a discrete sequence modeling problem where each viewpoint is represented as visual tokens from MAGVIT-v2. Through masked token prediction, our approach enables progressive multi-view generation via iterative token unmasking, unifying language and vision in a shared token space. Importantly, simple random masking combined with self-attention naturally encourages cross-view consistency without specialized architectures or 3D geometric priors. Our method outperforms the baseline on the GSO and 3D-FUTURE benchmarks, ranking first on average across standard image metrics and improving IoU by 10.6% on 3D-FUTURE. This validates discrete diffusion as a promising candidate for multi-view generation.

Abstract:
Large vision–language models (LVLMs) perform well on multimodal tasks, but their ability to reason and precisely align visual and textual information still has room for improvement. In this study, we show that external visual cues, such as symbols or grid lines, help LVLMs form more accurate connections between visual components, such as objects, and their corresponding textual descriptions, improving their grounding and reasoning abilities. We introduce the concept of Grounding IDs, which are latent identifiers that arise within the model as a result of external cues structuring both visual and textual modalities. Our analysis reveals that partitioninducing external cues lead to Grounding IDs that make better alignment between corresponding visual and text representations, helping the model focus on relevant information. We find that Grounding IDs enhance attention between related components, improving cross-modal grounding and reducing hallucinations. Overall, our results show that Grounding IDs are a key mechanism that enables external cues to improve cross-modal alignment, reduce errors, and enhance the overall performance of LVLMs across a range of multimodal tasks.

Abstract:
We prove the conjecture stated in Appendix F.3 of Zhu et al.: among all conversion rules that map a Rényi Differential Privacy (RDP) profile \tau \mapsto \rho(\tau) to a valid hypothesistesting trade-off f (or equivalently, an (\varepsilon,\delta)-Differential Privacy curve), the rule based on the intersection of single-order RDP privacy regions is optimal. This optimality holds simultaneously for all valid RDP profiles and for all Type I error levels \alpha. Concretely, we show that in the space of trade-off functions, the tightest possible bound is f_\rho(\cdot)(\alpha) = \sup_\tau \geq 0.5 f_\tau,\rho(\tau)(\alpha): the pointwise maximum of the single-order bounds for each RDP privacy region. Our proof unifies and sharpens the insights of Balle et al., Asoodeh et al., and Zhu et al.. Our analysis relies on a precise geometric characterization of the RDP privacy region, leveraging its convexity and the fact that its boundary is determined exclusively by Bernoulli mechanisms. Our results establish that the "intersection-of-RDP-privacy-regions" rule is not only valid, but optimal: no other black-box conversion can uniformly dominate it in the Blackwell sense, marking the fundamental limit of what can be inferred about a mechanism's privacy solely from its RDP guarantees.

Abstract:
Parameter sharing is a key model compression technique, yet existing methods overlook the geometric properties of the loss landscape, often causing severe accuracy degradation under high compression ratios. Inspired by secondorder optimization, we propose Curvature-aware Graph Coloring (CGC), a cross-layer parameter sharing framework that treats each network layer as a graph node, with each node assigned to a shared low-rank basis. CGC leverages Hessian eigenspace information to group layers with similar curvature profiles, aligning the perturbations introduced by parameter sharing with the low-curvature (flat) directions of the loss ellipsoid. This effectively mitigates performance loss while enabling flexible, global cross-layer sharing. Experiments on LLaMA-7B and Swin Transformer show that CGC achieves 28%–50% parameter compression with Top-1 accuracy loss no more than 0.01% on Swin—or even accuracy gains on LLaMA—while delivering over 60% higher inference throughput, significantly outperforming SVD-based and heuristic-based methods. This work demonstrates that curvature-guided, geometry-aware sharing is essential for efficient, stable, and high-ratio model compression.

Abstract:
Generating highquality synthetic text under differential privacy (DP) is critical for training and evaluating language models without compromising user privacy. Prior work on synthesizing DP datasets often fail to preserve key statistical attributes, suffer utility loss from the noise required by DP, and lack fine-grained control over generation. To address these challenges, we make two contributions. First, we introduce a hierarchical framework that decomposes DP synthetic text generation into two subtasks: feature learning and conditional text generation. This design explicitly incorporates learned features into the generation process and simplifies the end-to-end synthesis task. Through systematic ablations, we identify the most effective configuration: a rich tabular schema as feature, a DP tabular synthesizer, and a DP fine-tuned conditional generator, which we term ACTG (Attribute-Conditioned Text Generation). Second, we propose Anchored RL (ARL), a post-training method that improves the instruction-following ability of ACTG for conditional generation. ARL combines RL to boost control with an SFT anchor on best-of-N data to prevent reward hacking. Together, these components form our end-to-end algorithm ACTG-ARL, which advances both the quality of DP synthetic text (+20% MAUVE over prior work) and the control of the conditional generator under strong privacy guarantees. Our code is at https://github.com/actg-arl/ACTG-ARL.

Abstract:
As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels, i.e., a psychological framework that argues emotions organize hierarchically, we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitivelygrounded theories for developing better model evaluations.

Abstract:
In recent years, Spiking Neural Networks (SNNs) have achieved remarkable progress, with Spiking Transformers emerging as a promising architecture for energyefficient sequence modeling. However, existing Spiking Transformers still lack a principled mechanism for effective temporal fusion, limiting their ability to fully exploit spatiotemporal dependencies. Inspired by feedforward–feedback modulation in the human visual pathway, we proposeTEFormer, the first Spiking Transformer framework that achieves bidirectional temporal fusion by decoupling temporal modeling across its core components. Specifically, TEFormer employs a lightweight and hyperparameter-freeforward temporal fusion mechanism in the attention module, enabling fully parallel computation, while incorporating abackward gated recurrent structure in the MLPto aggregate temporal information in reverse order and reinforce temporal consistency. Extensive experiments across a wide range of benchmarks demonstrate that TEFormer consistently and significantly outperforms strong SNN and Spiking Transformer baselines under diverse datasets. Moreover, through the first systematic evaluation of Spiking Transformers under different neural encoding schemes, we show that the performance gains of TEFormer remain stable across encoding choices, indicating that the improved temporal modeling directly translates into reliable accuracy improvements across varied spiking representations. These results collectively establish TEFormer as an effective and general framework for temporal modeling in Spiking Transformers. Code:https://anonymous.4open.science/r/TEFormer-6EB7

Abstract:
Recent time series modeling faces a sharp divide between numerical generation and semantic understanding, with research showing that generation models often rely on superficial pattern matching, while understandingoriented models struggle with high-fidelity numerical output. Although unified multimodal models (UMMs) have bridged this gap in vision, their potential for time series remains untapped. We propose TimeOmni-VL, the first vision-centric framework that unifies time series understanding and generation through two key innovations: (1) Fidelity-preserving bidirectional mapping between time series and images (Bi-TSI), which advances Time Series-to-Image (TS2I) and Image-to-Time Series (I2TS) conversions to ensure near-lossless transformations. (2) Understanding-guided generation. We introduce TSUMM-Suite, a novel dataset consists of six understanding tasks rooted in time series analytics that are coupled with two generation tasks. With a calibrated Chain-of-Thought (CoT), TimeOmni-VL is the first to leverage time series understanding as an explicit control signal for high-fidelity generation. Experiments confirm that this unified approach significantly improves both semantic understanding and numerical precision, establishing a new frontier for multimodal time series modeling.

Abstract:
Model merging combines knowledge from separately finetuned models, yet success factors remain poorly understood. While recent work treats mergeability as an intrinsic property, we show with an architecture-agnostic framework that it fundamentally depends on both the merging method and the partner tasks. Using linear optimization over a set of interpretable pairwise metrics (e.g., gradient L_2 distance), we uncover properties correlating with post-merge performance across four merging methods. We find substantial variation in success drivers (46.7% metric overlap; 55.3% sign agreement), revealing method-specific "fingerprints". Crucially, however, subspace overlap and gradient alignment metrics consistently emerge as foundational, method-agnostic prerequisites for compatibility. These findings provide a diagnostic foundation for understanding mergeability and motivate future fine-tuning strategies that explicitly encourage these properties.

Abstract:
We develop approximation and statistical theory for standard Transformer networks in sequence modeling. Given a sequenceto-sequence target on [0,1]^d_x × n whose entries are \gamma-H\"older for \gamma \in (0,1] or belong to a first-order Sobolev class, we establish explicit L^p-approximation bounds for all p \in [1,\infty], including the previously elusive endpoint p=\infty under softmax attention. In particular, achieving error \varepsilon in L^p-norm requires \mathcalO(\varepsilon^-d_x n/\gamma) parameters for \gamma-H\"older targets and \mathcalO(\varepsilon^-d_x n) parameters for Sobolev targets, matching the best known scalings in ambient dimension d_x n. We further study nonparametric regression with sequential and dependent observations using Transformer networks. Assuming stationary \beta-mixing covariates whose temporal dependence weakens over time, we analyze a sliding-window empirical risk minimization procedure and establish excess-risk guarantees for the Transformer-based estimators. Our analysis clarifies the role of attention and enables extensions beyond softmax.

Abstract:
Understanding the generalization behavior of learning algorithms is a central goal of learning theory. A recently emerging explanation is that learning algorithms are successful in practice because they converge to flat minima, which have been consistently associated with improved generalization performance. In this work, we study the link between flat minima and generalization in the canonical setting of stochastic convex optimization with a nonnegative, \beta-smooth objective. Our first finding is that, even in this fundamental setting, flat empirical minima may incur trivial \Omega(1) population risk while sharp minima generalizes optimally. We then demonstrate that this phenomenon extends to sharpness-aware algorithms introduced by Foret et al. (2021), namely Sharpness-Aware Gradient Descent (SA-GD) and Sharpness-Aware Minimization (SAM). For SA-GD we prove that it successfully converges to a flat minimum at a fast rate, but the population risk of the solution can still be as large as \Omega(1). For SAM we show that although it minimizes the empirical loss, it may converge to a sharp minimum and also incur population risk \Omega(1). Finally, we establish population risk upper bounds for both SA-GD and SAM using algorithmic stability techniques.

Abstract:
Research on adversarial robustness in language models is currently fragmented across applications and attacks, obscuring shared vulnerabilities. In this work, we propose unifying the study of adversarial robustness in text scoring models spanning dense retrievers, rerankers, and reward models. Unlike openended generation, text scoring failures are directly testable: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. Using this principled lens of text scoring, we demonstrate that current adversarial training formulations for language models are often short-sighted, failing to effectively generalize across attacks. To address this, we introduce multiple adversarial training methods for text scorers and show that combining complementary training methods can yield strong robustness while also improving task effectiveness. Finally, we highlight the practical value of our approach for RLHF, showing that our adversarially trained reward models mitigate reward hacking and support the training of better-aligned LLMs. We provide our code and models for further study.

Abstract:
While deep learningbased weather forecasting paradigms have made significant strides, addressing extreme weather diagnostics remains a formidable challenge. This gap exists primarily because the diagnostic process demands sophisticated multi-step logical reasoning, dynamic tool invocation, and expert-level prior judgment. Although agents possess inherent advantages in task decomposition and autonomous execution, current architectures are still hampered by critical bottlenecks: inadequate expert knowledge integration, a lack of professional-grade iterative reasoning loops, and the absence of fine-grained validation and evaluation systems for complex workflows under extreme conditions. To this end, we propose HVR-Met，a multi-agent meteorological diagnostic system characterized by the deep integration of expert knowledge. Its central innovation is the ``Hypothesis-Verification-Replanning'' closed-loop mechanism, which facilitates sophisticated iterative reasoning for anomalous meteorological signals during extreme weather events. To bridge gaps within existing evaluation frameworks, we further introduce a novel benchmark focused on atomic-level subtasks. Experimental evidence demonstrates that the system excels in complex diagnostic scenarios.

Abstract:
Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone geometries, preventing deployment across diverse devices. We present PhaseCoder, a transformeronly spatial audio encoder that is agnostic to microphone geometry. PhaseCoder takes raw multichannel audio and microphone coordinates as inputs to perform localization and produces robust spatial embeddings. We integrate PhaseCoder with the Gemma 3n LLM by finetuning it to reason over ``spatial audio tokens''. We show our encoder achieves state-of-the-art results on microphone-invariant localization benchmarks and, for the first time, enables an LLM to perform complex spatial reasoning and targeted transcription tasks from an arbitrary microphone array.

Abstract:
Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios where perception/recognition is heavily relied on, while with relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiplechoice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi- automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models’ higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench could fit the interests of the community and help to push the boundary of cutting-edge AI for border science.

Abstract:
Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when valueguided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrates into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.

Abstract:
Modern database workloads are highly predictable: query streams are dominated by recurring jobs and templates, even when their arrival order is not known in advance. This motivates a learningaugmented view of online differentially private (DP) analytics: can algorithms utilize predictions about which queries will occur to improve utility under a single global privacy budget, while remaining robust when predictions are wrong? We study online DP query answering, where a curator must answer a stream Q of S linear queries arriving in uniformly random order under privacy budget (\epsilon,\delta). We present LAPRAS, which assumes access to an oracle that outputs a prediction set of queries likely to appear in the stream and uses it to guide privacy spending. LAPRAS answers predicted queries using the offline-optimal Matrix Mechanism and answers the remaining queries online from a residual budget. To pace spending across an unknown number of unpredicted queries, we introduce Smooth Allocation, which forms an unbiased stopping-time estimate \widehatB from the first T=\Theta(\log^2 S) unpredicted queries and continuously recalibrates per-query expenditure. Empirically, over two real datasets, we validate the intended consistency--robustness trade-off: LAPRAS achieves near-offline utility under high overlap and degrades gracefully to baseline-level performance when overlap is low.

Abstract:
Direct Alignment Algorithms (DAAs) simplify LLM alignment by directly optimizing policies, bypassing reward modeling and RL. While DAAs differ in their use of SFT (onestage vs. two-stage) and the scalar score they optimize (likelihood vs. odds ratios), the key performance drivers remain underexplored. We present a systematic comparison and analyze a previously overlooked axis - the ranking objective (pairwise vs. pointwise). To isolate this factor, we propose a unified training framework across DAAs by (i) converting one-stage methods (ORPO, ASFT) into a two-stage pipeline with an explicit SFT phase and (ii) introducing a \beta parameter that places all methods in the same hyperparameter space and improves the quality of odds-ratio DAAs (ORPO, ASFT). Under this setup, the ranking objective emerges as the primary determinant of alignment quality, whereas the particular scalar score (policy–reference ratio vs. odds ratio) is secondary. We corroborate this on instruction-following tasks and further confirm it on math-reasoning benchmarks across model scales. Evidence suggests that this stems from how these objectives interact with prompt-specific biases, supported both by strictly controlled experiments and by observations on real data. Our findings underscore the need for nuanced evaluations in DAA research to avoid oversimplified claims of superiority.

Abstract:
Flow and diffusion models produce highquality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and redesign the model for adaptability. We propose Diamond Maps, a stochastic flow-map model that enables efficient and accurate alignment to arbitrary rewards at inference time. Diamond Maps amortize many simulation steps into a single-step sampler, like flow maps, while preserving the stochasticity required for optimal reward adaptation. This design makes search, Sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward-alignment performance, and scale better than existing alignment methods. Overall, our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.

Abstract:
Pretrained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.

Abstract:
Selfconsistency (SC) is a widely-used test-time inference technique for improving performance in chain-of-thought reasoning. It consists of generating multiple responses, or ``samples," from a large language model (LLM) and selecting the most frequent answer. This procedure can naturally be viewed as a majority vote or empirical mode estimation. Despite its effectiveness, self-consistency is prohibitively expensive at scale when naively applied to datasets, and it lacks a unified theoretical treatment of sample efficiency and scaling behavior. In this paper, we provide the first comprehensive analysis of SC's scaling behavior and its variants, drawing on mode estimation and voting theory. We derive and empirically validate power law scaling for self-consistency across datasets, and analyze the sample efficiency for fixed-allocation and dynamic-allocation sampling schemes. From these insights, we introduce Blend-ASC, a novel variant of self-consistency that dynamically allocates samples to questions during inference, achieving state-of-the-art sample efficiency. Our approach uses 4.8× fewer samples than vanilla SC on average, outperforming both fixed- and dynamic-allocation SC baselines, thereby demonstrating the superiority of our approach in terms of efficiency. In contrast to existing variants, we note that Blend-ASC is hyperparameter-free and can fit any budget of samples, ensuring it can be easily applied to any self-consistency application.

Abstract:
Recent research has shown that large language models (LLM) favor own outputs when acting as judges, undermining the integrity of automated posttraining and evaluation workflows. However, it is difficult to disentangle which evaluation biases are explained by narcissism versus general experimental confounds, distorting measurements of self-preference bias. We discover a core methodological confound which could reduce measurement error by89.6%. Specifically, LLM evaluators may deliver self-preferring verdicts when the judge responds to queries which they completed incorrectly themselves; this would be true regardless of whether one of their responses is their own. To decouple self-preference signals from noisy outputs on hard problems, we introduce anEvaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model. Evaluating this simple baseline on 37,448 queries, only51%of initial findings retain statistical significance. Finally, we turn towards characterizing the entropy of "easy" versus "hard" evaluation votes from LLM judges. Our corrective baseline enables future research on self-preference by eliminating noisy data from potential solutions. More widely, this work contributes to the growing body of work on cataloging and isolating judge-bias effects

Abstract:
Onpolicy preference learning algorithms for language model alignment such as online direct policy optimization (DPO) can significantly outperform their offline counterparts. We provide a theoretical explanation for this phenomenon by analyzing how the sampling policy's coverage evolves throughout on-policy training. We propose and rigorously justify the \emphcoverage improvement principle: with sufficient batch size, each update moves into a region around the target where coverage is uniformly better, making subsequent data increasingly informative and enabling rapid convergence. In the contextual bandit setting with Bradley-Terry preferences and linear softmax policy class, we show that on-policy DPO converges exponentially in the number of iterations for batch size exceeding a generalized coverage threshold. In contrast, any learner restricted to offline samples from the initial policy suffers a slower minimax rate, leading to a sharp separation in total sample complexity. Motivated by this analysis, we further propose a simple hybrid sampler based on a novel \emphpreferential G-optimal design, which removes dependence on coverage and guarantees convergence in just two rounds. Finally, we develop principled on-policy schemes for reward distillation in the general function class setting, and show faster noiseless rates under an alternative deviation-based notion of coverage.

Abstract:
Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to assist nonexpert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a "Search-Then-Verify" pipeline. Analogous to active sonar, it utilizes a feature index to "ping'' candidate windows via SQL, followed by generated Python programs to "lock on'' and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.

Abstract:
Diffusion models do not recover semantic structure uniformly over time. Instead, samples transition from semantic ambiguity to class commitment within a narrow regime. Recent theoretical work attributes this transition to dynamical instabilities along classseparating directions, but practical methods to detect and exploit these windows in trained models are still limited. We show that tracking the class-conditional entropy of a latent semantic variable given the noisy state provides a reliable signature of these transition regimes. By restricting the entropy to semantic partitions, the entropy can furthermore resolve semantic decisions at different levels of abstraction. We analyze this behavior in high-dimensional Gaussian mixture models and show that the entropy rate concentrates on the same logarithmic time scale as the speciation symmetry-breaking instability previously identified in variance-preserving diffusion. We validate our method on EDM2-XS and Stable Diffusion 1.5, where class-conditional entropy consistently isolates the noise regimes critical for semantic structure formation. Finally, we use our framework to quantify how guidance redistributes semantic information over time. Together, these results connect information-theoretic and statistical physics perspectives on diffusion and provide a principled basis for time-localized control.

Abstract:
Polysomnography (PSG) provides the gold standard for sleep assessment but suffers from substantial heterogeneity across recording devices and cohorts. There have been growing efforts to build generalpurpose foundation models (FMs) for sleep physiology, but lack an in-depth understanding of the pre-training process and scaling patterns that lead to more generalizable sleep FMs. To fill this gap, we curate a massive corpus of 166,500 hours of sleep recordings from nine public sources and establish SleepBench, a comprehensive, fully open-source benchmark. Leveraging SleepBench, we systematically evaluate four families of self-supervised pre-training objectives and uncover three critical findings: (1) existing FMs fail to generalize to missing channels at inference; (2) channel-invariant feature learning is essential for pre-training; and (3) scaling sample size, model capacity, and multi-source data mixture consistently improves downstream performance. With an enhanced pre-training and scaling recipe, we introduce OSF, a family of sleep FMs that achieves state-of-the-art performance across nine datasets on diverse sleep and disease prediction tasks. Further analysis of OSF also reveals intriguing properties in sample efficiency, hierarchical aggregation, and cross-dataset scaling.

Abstract:
The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in longcontext scenarios. While hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, they typically employ static computation ratios (i.e., fixed proportions of sparse versus full attention) and fail to adapt to the varying sparsity sensitivities of downstream tasks during inference. To address this issue, we propose Elastic Attention, which allows the model to dynamically adjust its overall sparsity based on the input. This is achieved by integrating a lightweight Attention Router into the existing pretrained model, which dynamically assigns each attention head to different computation modes. Within only 12 hours of training on 8×A800 GPUs, our method enables models to achieve both strong performance and efficient inference. Experiments across three long-context benchmarks on widely-used LLMs demonstrate the superiority of our method.

Abstract:
While scaling laws guide compute allocation for LLM pretraining, analogous prescriptions for reinforcement learning (RL) post-training of LLMs remain poorly understood. We study the compute-optimal allocation of sampling compute for on-policy RL methods in LLMs, framing scaling as a compute-constrained optimization over three resources: parallel rollouts per problem, number of problems per batch, and number of update steps. We find that the compute-optimal number of parallel rollouts per problem increases predictably with compute budget and then saturates. This trend holds across both easy and hard problems, though driven by different mechanisms: solution sharpening on easy problems and coverage expansion on hard problems. We further show that increasing the number of parallel rollouts mitigates interference across problems, while the number of problems per batch primarily affects training stability and can be chosen within a broad range. Validated across base models and data distributions, our results recast RL scaling laws as prescriptive allocation rules and provide practical guidance for compute-efficient LLM RL post-training.

Abstract:
While diffusion models enable new approaches for estimating Local Intrinsic Dimension (LID), existing methods fail in highdimensional spaces where noise from vast normal directions overwhelms the tangent signal. We propose Local Hessian Spectral Dimension (LHSD), which resolves this by applying spectral filtering to the log-density Hessian, explicitly cutting off large eigenvalues associated with normal directions to count zero-curvature tangent directions. Implemented using Stochastic Lanczos Quadrature (SLQ), LHSD avoids full Hessian construction, achieving linear scalability with dimension D. Experiments on synthetic and real data confirm LHSD’s superior robustness and its utility in detecting memorization in large-scale diffusion models.

Abstract:
Large Reasoning Models (LRMs) achieve remarkable inferencetime improvements through parallel thinking. However, existing methods rely on redundant sampling of reasoning trajectories, failing to effectively explore the reasoning space to uncover high-quality solutions. To address these limitations, we propose Decoding Tree Sketching (DTS), a plug-and-play decoding framework for structural multi-trajectory exploration and reasoning selection. For reasoning exploration, DTS sketches a backbone tree of the reasoning space by selectively branching at decision tokens. For reasoning selection, guided by length-accuracy anti-correlation, DTS designs an early termination to prioritize short and reliable trajectories during decoding. Experimental results across four LRMs and datasets demonstrate that DTS significantly enhances accuracy by 14% and reduces repetitive generation by 8% on average. Notably, DTS enables smaller models to outperform larger models with 10× the size, highlighting its potential to strengthen reasoning capabilities.

Abstract:
Mean Field Games (MFGs) provide a powerful framework for modeling the collective behavior of large populations of interacting agents. In this paper, we address the problem of Imitation Learning (IL) in MFGs subject to common noise, where the population distribution evolves stochastically. This stochasticity compels agents to adopt populationaware policies to respond to aggregate shocks. We formulate two distinct learning objectives: recovering a Nash equilibrium and maximizing performance against an expert population. We investigate two imitation proxies: Behavioral Cloning (BC) and Adversarial (ADV) divergence. We then establish finite-sample error bounds showing that minimizing these proxies effectively controls both the policy’s exploitability and its performance gap relative to the expert. Furthermore, we propose a numerical framework using generalized Fictitious Play and Deep Learning to compute expert population-aware policies. Through experiments on three environments we demonstrate that standard population-unaware policies fail to capture the equilibrium dynamics. Our results highlight that learning population-aware policies is crucial to avoid being misled by the randomness inherent in common noise.

Abstract:
Foundation models based on priordata fitted networks (PFNs) have shown strong empirical performance in causal inference by framing it as an in-context learning problem. However, it is unclear whether PFN-based causal estimators provide uncertainty quantification that is consistent with classical frequentist estimators. In this work, we address this gap by analyzing the frequentist consistency of PFN-based estimators for the average treatment effect (ATE). (1) We show that existing PFNs, when interpreted as Bayesian ATE estimators, can exhibit prior-induced confounding bias: the prior is not asymptotically overwritten by data, which, in turn, prevents frequentist consistency. (2) As a remedy, we suggest employing a calibration procedure based on a one-step posterior correction (OSPC). We show that the OSPC helps to restore frequentist consistency and can yield a semi-parametric Bernstein–von Mises theorem for calibrated PFNs (i.e., both the calibrated PFNs and the classical semi-parametric efficient estimators converge with growing data size). (3) Finally, we implement OSPC through tailoring martingale posteriors on top of the PFNs. In this way, we are able to recover nuisance-function posterior distributions from PFNs, required by the OSPC. In multiple (semi-)synthetic experiments, PFNs calibrated with our martingale posterior OSPC produce ATE uncertainty that (i) asymptotically matches frequentist uncertainty and (ii) is often best calibrated in finite samples among other Bayesian ATE estimators.

Abstract:
Role specialization in multiLLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only through lightweight adapters. Despite sharing base model weights, each agent independently builds and stores its own KV cache for the same long, tool-augmented trajectories, incurring substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting. We observe that, across agents, cache differences are dominated by adapter outputs, while activations from the shared pretrained backbone remain highly similar. Based on this observation, we propose LRAgent, a KV cache sharing framework for multi-LoRA agents that decomposes the cache into a shared base component from the pretrained weights and an adapter-dependent component from LoRA weights. LRAgent reduces memory overhead by sharing the base component and storing the adapter component in its inherent low-rank form, and further reduces compute overhead, enabled by shared-A multi-LoRA architectures, by also sharing the low-rank cache and avoiding redundant computations for contexts already processed by other agents. To efficiently reconstruct adapter contributions at runtime, we introduce Flash-LoRA-Attention, a kernel that reorders attention computation to avoid materializing the low-rank cache to full dimension. LRAgent achieves throughput and time-to-first-token latency close to fully shared caching, while preserving accuracy near the non-shared caching baseline across agentic question-answering benchmarks.

Abstract:
The analysis of Temporal Difference (TD) learning in the averagereward setting faces notable theoretical difficulties because the Bellman operator is not contractive with respect to any norm. This complicates standard analyses of stochastic updates that are effective in discounted settings. Although a considerable body of literature addresses these challenges, existing theoretical approaches come with limitations. We introduce a novel algorithm designed explicitly for policy evaluation in the average-reward setting, utilizing sampling from two Markovian trajectories. Our proposed method overcomes previous limitations by guaranteeing convergence to the unique solution of a properly defined projected Bellman equation. Notably, and in contrast to earlier work, our convergence analysis is uniformly applicable to both linear function approximation and tabular settings and does not involve explicit dimension-dependent terms in its convergence bounds. These results align with what is known to hold in the discounted setting. Furthermore, our algorithm achieves improved dependence on the problem's condition number, reducing the sample complexity from quartic, as in prior literature, to quadratic scaling, and thus matching the efficiency seen in the discounted setting.

Abstract:
Recurring Pattern Bugs (RPBs) are defined as bugs where a single root cause appears repeatedly across multiple code segments. These bugs remain a persistent security threat even after individual instances are patched. Various static analyzers exist for finding specific bug patterns but require significant engineering effort and fail to generalize well beyond their predefined template, preventing them from detecting RPBs. To tackle RPBs, we introduce BugStone, a hybrid framework combining LLVMbased program analysis with Large Language Models to automate RPB detection. BugStone leverages a single patched instance to synthesize abstract error patterns and retrieves semantically similar bugs throughout the codebase. To evaluate BugStone, we create a ground truth dataset by analyzing over 1.9K security bugs reports, on which BugStone achieves 92.2% precision and 79.1% pairwise accuracy. We further validated BugStone through a large-scale real-world deployment. In the Linux kernel, BugStone identified over 22K potential issues; a manual audit of 400 samples confirmed 246 valid bugs, including invalid pointer dereferences, resource leaks, type errors, performance issues, and others. To evaluate the generalizability of BugStone, we further applied it to the top 100 Python projects, discovering multiple critical command injection vulnerabilities.

Abstract:
Where do learning signals come from when there is no ground truth in posttraining? We show that inference compute itself can serve as supervision. By generating parallel rollouts and converting them into reference estimates, models can learn without human labels—critically, even in non-verifiable domains like healthcare guidance where no programmatic checker exists. We call this frameworkCompute as Teacher (CaT)and it turns inference-time compute from parallel rollouts into supervision for RL training. The framework has two components: (1) reference estimation which aggregates rollouts into a pseudo-reference answer, and (2) reward derivation which converts that pseudo-reference into RL rewards. For (1), we explore a simple method we callsynthesis, but the framework admits any aggregator. For (2), we introduce self-proposed rubrics for non-verifiable domains. These are binary, auditable criteria generated from the pseudo-reference and scored by an LLM judge. On HealthBench, models trained with CaT match or exceed inference-time aggregation quality while using 9× less test-time compute. Here, CaT also competes with learning from expert physician annotations, yielding up to +30% relative improvement over the initial policy. The framework extends naturally to verifiable rewards, matching the best existing baselines on MATH-500 in test-time RL and demonstrating 'drop-in' versatility across both types of domains.

Abstract:
Reinforcement learning (RL) with outcomebased rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning where agents need to strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and sticks to uninformative decisions. To understand the phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent’s belief based on collected evidence. We show that low AS and BT capabilities of LLMs will limit the information exploration during RL training. Furthermore, insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low-information regime. To resolve the issue, we propose a simple yet effective approach that directly promotes AS capability using proxy AS signals to help the agent escape the low-information regime. Extensive experiments with 6 benchmarks show that our approach mitigates the information self-locking, and brings up to 10% improvements.

Abstract:
Why depth yields a genuine computational advantage over shallow methods remains a central open question in learning theory. We study this question in a controlled highdimensional Gaussian setting, focusing on compositional target functions. We analyze their learnability using an explicit three-layer fitting model trained via layer-wise spectral estimators. Although the target is globally a high-degree polynomial, its compositional structure allows learning to proceed in stages: an intermediate representation reveals structure that is inaccessible at the input level. This reduces learning to simpler spectral estimation problems, well studied in the context of multi-index models, whereas any shallow estimator must resolve all components simultaneously. Our analysis relies on Gaussian universality, leading to sharp separations in sample complexity between two and three-layer learning strategies.

Abstract:
Diffusion models excel at generation, but their latent spaces are high dimensional and not explicitly organized for interpretation or control. We introduce ConDA (Contrastive Diffusion Alignment), a plugand-play geometry layer that applies contrastive learning to pretrained diffusion latents using auxiliary variables (e.g., time, stimulation parameters, facial action units). ConDA learns a low-dimensional embedding whose directions align with underlying dynamical factors, consistent with recent contrastive learning results on structured and disentangled representations. In this embedding, simple nonlinear trajectories support smooth interpolation, extrapolation, and counterfactual editing while rendering remains in the original diffusion space. ConDA separates editing and rendering by lifting embedding trajectories back to diffusion latents with a neighborhood-preserving kNN decoder and is robust across inversion solvers. Across fluid dynamics, neural calcium imaging, therapeutic neurostimulation, facial expression dynamics, and monkey motor cortex activity, ConDA yields more interpretable and controllable latent structure than linear traversals and conditioning-based baselines, indicating that diffusion latents encode dynamics-relevant structure that can be exploited by an explicit contrastive geometry layer.

Abstract:
Explainable artificial intelligence (XAI) aims to provide humaninterpretable insights into the behavior of deep neural networks (DNNs), typically by estimating a simplified causal structure of the model. In existing work, this causal structure often includes relationships where the presence of a concept is associated with a strong activation of a neuron. For example, attribution methods primarily identify input pixels that contribute most to a prediction, and feature visualization methods reveal inputs that cause high activation of a target neuron – the former implicitly assuming that the relevant information resides in the input, and the latter that neurons encode the presence of concepts. However, a largely overlooked type of causal relationship is that of encoded absences, where the absence of a concept increases neural activation. In this work, we show that such missing but relevant concepts are common and that mainstream XAI methods struggle to reveal them when applied in their standard form. To address this, we propose two simple extensions to attribution and feature visualization techniques that uncover encoded absences. Across experiments, we show how mainstream XAI methods can be used to reveal and explain encoded absences, how ImageNet models exploit them, and that debiasing can be improved when considering them.

Abstract:
Reinforcement learning with verifiable rewards has driven recent advances in LLM posttraining, in particular for reasoning. Policy optimization algorithms generate a number of responses for a given prompt and then effectively weight the corresponding gradients depending on the rewards. The most popular algorithms including GRPO, DAPO, and RLOO focus on ambiguous prompts, i.e., prompts with intermediate success probability, while downgrading gradients with very easy and very hard prompts. In this paper, we consider asymmetric prompt weightings that assign higher weights to prompts with low success probability. We find that asymmetric weighting particularly benefits from-scratch RL (as in R1-Zero), where training traverses a wide accuracy range, and less so post-SFT RL where the model already starts at high accuracy. We also provide theory that characterizes prompt weights which minimize the time needed to raise success probability from an initial level to a target accuracy under a fixed update budget. In low-success regimes, where informative responses are rare and response cost dominates, these optimal weights become asymmetric, upweighting low success probabilities and thereby accelerating effective-time convergence.

Abstract:
Interpretable machine learning is essential in highstakes domains where decision-making requires accountability, transparency, and trust. While rule-based models offer global and exact interpretability, learning rule sets that achieve high performance while maintaining low complexity to be human understandable and generalizability across tasks remains a difficult challenge. To address this, we introduce TT-Sparse, a flexible neural building block that leverages differentiable truth tables as nodes to learn sparse, effective connections. A key contribution of our approach is a novel soft TopK operator that allows straight-through estimation, ensuring differentiability to backpropagate gradients through the sparse connections and identify meaningful connections. This design allows each node to be exactly transformed into DNF/CNF equations via the Quine-McCluskey algorithm, effectively reducing the entire model to interpretable Boolean formulas. Extensive empirical results across 28 datasets spanning binary, multiclass, and regression tasks show the sparse rules exhibit superior predictive performance with lower complexity compared to existing state-of-the-art methods.

Abstract:
Modality translation is inherently underconstrained, as multiple cross-modal mappings may yield the same marginals. Recent work has shown that diffusion bridges are effective for this task. However, most existing approaches rely on fully paired datasets, thereby imposing a single data-driven constraint. We propose a diffusion-bridge framework that characterizes the space of admissible solutions and restricts it via alignment constraints, treating paired supervision as an optional heuristic rather than a prerequisite. We validate our method on synthetic and real modality translation benchmarks across unpaired, semi-paired, and paired regimes, showing consistent performance across supervision levels. Notably, it achieves near fully-paired quality with a substantial relaxation in pairing requirements, and remaining applicable in the unpaired regime. These results highlight diffusion bridges as a flexible foundation for modality translation beyond fully paired data.

Abstract:
Theoretical analyses of Empirical Risk Minimization (ERM) are standardly framed within the RealRAM model of computation. In this setting, training even simple neural networks is known to be \exists \mathbbR-complete---a complexity class believed to be harder than NP, that characterizes the difficulty of solving systems of polynomial inequalities over the real numbers. However, this algebraic framework diverges from the reality of digital computation with finite-precision hardware. In this work, we analyze the theoretical complexity of ERM under a realistic bit-level model (ERM_bit), where network parameters and inputs are constrained to be rational numbers with polynomially bounded bit-lengths. Under this model, we reveal a sharp dichotomy in tractability governed by the network's activation function. We prove that for deep networks with any polynomial activations with rational coefficients and degree at least 2, the bit-complexity of training is severe: deciding ERM_bit (with a regularized loss) is #P-hard, hence believed to be strictly harder than NP-complete problems. Furthermore, we show that determining the sign of a single partial derivative of the empirical loss function is intractable (unlikely in BPP), and deciding a specific bit in the gradient is #P-hard. This provides a complexity-theoretic perspective for the phenomenon of exploding and vanishing gradients. In contrast, we show that for piecewise-linear activations such as ReLU, the precision requirements remain manageable: ERM_bit is contained within NP (specifically NP-complete), and standard backpropagation runs in polynomial time. Our results demonstrate that finite-precision constraints are not merely implementation details but fundamental determinants of learnability.

Abstract:
Diffusion language models (dLLMs) recently emerged as a promising alternative to autoregressive LLMs. The latest works further extended it to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reason-intensive grounding, and image editing.

Abstract:
Large language models (LLMs) have demonstrated exceptional capabilities when trained within executable runtime environments, notably excelling at software engineering tasks through verified feedback loops. Yet, scalable and generalizable executiongrounded environments remain scarce, limiting progress in training more capable ML agents. We introduce CTF-Dojo, the first large-scale executable runtime tailored for training LLMs with verifiable feedback, featuring 658 fully functional Capture-The-Flag (CTF)-style challenges containerized in Docker with guaranteed reproducibility. To enable rapid scaling without manual intervention, we develop CTF-Forge, an automated pipeline that transforms publicly available artifacts into ready-to-use execution environments in minutes, eliminating weeks of expert configuration traditionally required. We trained LLM-based agents on just 486 high-quality, execution-verified trajectories from CTF-Dojo, achieving up to 11.6% absolute gains over strong baselines across three competitive benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best-performing 32B model reaches 31.9% Pass@1, establishing a new open-weight state-of-the-art that rivals frontier models like DeepSeek-V3-0324 and Gemini-2.5-Flash. By framing CTF-style tasks as a benchmark for executable-agent learning, CTF-Dojo demonstrates that execution-grounded training signals are not only effective but pivotal in advancing high-performance ML agents without dependence on costly proprietary systems.

Abstract:
Stochasticgradient MCMC methods enable scalable Bayesian posterior sampling but often suffer from sensitivity to minibatch size and gradient noise. To address this, we propose Stochastic Gradient Lattice Random Walk (SGLRW), an extension of the Lattice Random Walk discretization. Unlike conventional Stochastic Gradient Langevin Dynamics (SGLD), SGLRW introduces stochastic noise only through the off-diagonal elements of the update covariance; this yields greater robustness to minibatch size while retaining asymptotic correctness. Furthermore, as comparison we analyze a natural analogue of SGLD utilizing gradient clipping. Experimental validation on Bayesian regression and classification demonstrates that SGLRW remains stable in regimes where SGLD fails, including in the presence of heavy-tailed gradient noise, and matches or improves predictive performance.

Abstract:
Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, yet its practical accessibility and scalability is often constrained by containerbased execution frameworks used for environment isolation. As the number of task instances increases, pre-cached container images introduce substantial storage overhead, limiting large-scale training under limited cotainer resources, and excludes users without container management privileges. We introduce SWE-MiniSandbox, a lightweight, container-free method that enables scalable RL training of SWE agents without sacrificing isolation. Instead of relying on per-instance containers, SWE-MiniSandbox executes each task in an isolated workspace backed by kernel-level mechanisms, substantially reducing system overhead. It leverages lightweight environment pre-caching techniques to eliminate the need for bulky container images. As a result, our approach lowers disk usage to approximately 5% of that required by container-based pipelines and reduces environment preparation time to about 25% of the container baseline. Empirical results demonstrate that SWE-MiniSandbox achieves evaluation performance comparable to standard container-based pipelines. Consequently, by removing the dependency on heavy container infrastructure, SWE-MiniSandbox offers a practical and accessible foundation for scaling RL-based SWE agents, particularly in resource-constrained research environments.

Abstract:
Video language models (VideoLLMs) are prone to hallucinations, often generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing decoding methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for mitigating hallucination patterns. However, such a way is hard to control the visual cues that drive hallucination or well align with model weaknesses. We propose Model-aware Counterfactual Data based Contrastive Decoding (MACD), a new inference strategy that combines model-guided counterfactual construction with decoding. Our approach uses the Video-LLM’s own feedback to identify object regions most responsible for hallucination, generating targeted counterfactual inputs at the object level rather than arbitrary frame or temporal modifications. These model-aware counterfactual data is then integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including Qwen and InternVL families. The method is especially effective in challenging scenarios involving small, occluded, or co-occurring objects. Our code and data will be publicly released.

Abstract:
Despite scale driving substantial recent advancements in machine learning, reinforcement learning (RL) methods still primarily use small value functions. Naively scaling value functions including with a transformer architecture, which is known to be highly scalable -- often results in learning instability and worse performance. In this work, we ask what prevents transformers from scaling effectively for value functions? Through empirical analysis, we identify the critical failure mode in this scaling: attention scores collapse as capacity increases. Our key insight is that we can effectively prevent this collapse and stabilize training by controlling the entropy of the attention scores, thereby enabling the use of larger models. To this end, we propose Transformer Q-Learning (TQL), a method that unlocks the scaling potential of transformers in learning value functions in RL. Our approach yields up to a 43% improvement in performance when scaling from the smallest to the largest network sizes, while prior methods suffer from performance degradation.

Abstract:
Recent approaches to realtime long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minute), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds—6\text--10× longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.

Abstract:
Universal morphology control aims to learn a universal policy that generalizes across heterogeneous robot morphologies, with Transformerbased controllers emerging as a dominant choice. However, such architectures incur substantial computational costs, resulting in high deployment overhead, and existing methods exhibit limited cross-task generalization, necessitating training from scratch for each new task. To this end, we propose DivMorph, a modular training paradigm that leverages knowledge diversion to learn decomposable controllers. DivMorph factorizes randomly initialized Transformer weights into basic knowledge units via SVD and employs dynamic soft gating, conditioned on task and morphology embeddings, to adaptively modulate these units into universal learngenes and morphology- and task-specific tailors during training, thereby achieving knowledge disentanglement. By selectively activating relevant components, DivMorph adaptively recomposes the controller, enabling efficient policy deployment and effective policy transfer to novel tasks. Extensive experiments demonstrate that DivMorph achieves state-of-the-art performance, improving sample efficiency for cross-task transfer by 3.3× and reducing model size for single-agent deployment by 16.7×.

Abstract:
Large models recently are widely applied in machine learning, so efficient training of large models has received widespread attention. More recently, a useful Muon optimizer is specifically designed for matrixstructured parameters of large models. Although some works have begun to studying the Muon optimizer, the existing Muon and its variants still suffer from high sample complexity or high memory for large models. To fill this gap, we propose a light and fast Muon (LiMuon) optimizer for training large models, which builds on the momentum-based variance reduced technique and randomized Singular Value Decomposition (SVD). In particular, our LiMuon optimizer simultaneously has a lower memory and lower sample complexity than the Muon. Moreover, we prove that our LiMuon has a lower sample complexity of O(\epsilon^-3) for finding an \epsilon-stationary solution of non-convex stochastic optimization under the generalized smooth condition. Numerical experimental results on training Mamba-130M, Qwen2.5-0.5B and ViT models demonstrate effectiveness of our LiMuon optimizer.

Abstract:
While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on accurate estimates of the sampling trajectory likelihoods. Our likelihood estimator, d2AnyOrder, achieves exact trajectory likelihood with a single model pass for DLMs that support a sampling algorithm called any-order decoding. Through an empirical study of widely used DLMs, we show that any-order decoding is not universally supported in practice. Consequently, for DLMs that do not naturally support any-order decoding, we propose another estimator, d2-StepMerge, which, unlike d2-AnyOrder, only approximates the trajectory likelihood. d2-StepMerge trades off compute for approximation accuracy in an analytically tractable manner. Empirically, d2 significantly outperforms widely-used RL baselines when applied to popular DLMs, and sets a new state-of-the-art performance for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500).

Abstract:
Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a tradeoff: they either rely on LLM-as-judge evaluations of automatically generated papers, or optimize isolated performance metrics that provide only coarse proxies for scientific insight. To address this, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings from recent, high-impact machine learning research. Agents are given only a high-level research question from a published study and must autonomously design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents with frontier model backbones, such as gpt-5, on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success, exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. Overall, FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.

Abstract:
The emergence of large language model (LLM)based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, the dominant prompt-based paradigm exhibits limitations: smaller models lack the capacity to learn from execution trajectories for generalization, while large proprietary models incur high computational overhead, restricting accessibility and scalability. Focusing on this, for the first time, we explore the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). To realize this, we propose a novel agentic ML training framework with three key components: (1) exploration-enriched fine-tuning, which enables LLM agents to generate diverse actions for enhanced RL exploration; (2) step-wise RL, which enables training on a single action step, accelerating experience collection and improving training efficiency; (3) an agentic ML-specific reward module, which unifies varied ML feedback signals into consistent rewards for RL optimization. Leveraging this framework, we train ML-Agent, driven by a 7B-sized Qwen-2.5 LLM for autonomous ML. Despite training on only 9 ML tasks, our 7B-sized ML-Agent achieves comparable performance to agents using much larger proprietary LLMs (e.g., GPT-5) but at significantly lower computational cost, demonstrating strong performance and cross-task generalization.

Abstract:
Obtaining highquality labels is costly, whereas unlabeled covariates are often abundant, motivating semi-supervised inference methods with reliable uncertainty quantification. Prediction-powered inference (PPI) leverages a machine-learning predictor trained on a small labeled sample to improve efficiency, but it can lose efficiency under model misspecification and suffer from coverage distortions due to label reuse. We introduce Machine‑Learning‑Assisted Generalized Entropy Calibration (MEC), a cross‑fitted, calibration‑weighted variant of PPI. MEC improves efficiency by reweighting labeled samples to better align with the target population, using a principled calibration framework based on Bregman projections. This yields robustness to affine transformations of the predictor and relaxes requirements for validity by replacing conditions on raw prediction error with weaker projection‑error conditions. As a result, MEC attains the semiparametric efficiency bound under weaker assumptions than existing PPI variants. Across simulations and a real‑data application, MEC achieves near‑nominal coverage and tighter confidence intervals than CF‑PPI and vanilla PPI.

Abstract:
Selfevolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their behavior. We present the first systematic investigation of \emphexperience faithfulness—the causal dependence of an agent's decisions on the experience it is given—in self-evolving LLM agents. Using controlled causal interventions on both raw and condensed forms of experience, we comprehensively evaluate four representative frameworks across 10 LLM backbones and 9 environments. Our analysis uncovers a striking asymmetry: while agents consistently depend on raw experience, they often disregard or misinterpret condensed experience, even when it is the only experience provided. This gap persists across single- and multi-agent configurations and across backbone scales. We trace its underlying causes to three factors: the semantic limitations of condensed content, internal processing biases that suppress experience, and task regimes where pretrained priors already suffice. These findings challenge prevailing assumptions about self-evolving methods and underscore the need for more faithful and reliable approaches to experience integration.

Abstract:
Despite their impressive performance on a wide variety of tasks, modern language models remain susceptible to distribution shifts, exhibiting brittle behavior when evaluated on data that differs in distribution from their training data. In this paper, we describe how distribution shifts in language models can be separated intoobservableandunobservablecomponents, and we discuss how established approaches for dealing with distribution shift address only the former. Importantly, we identify that the resultingomitted variable biasfrom unobserved variables can compromise both evaluation and optimization in language models. To address this challenge, we introduce a framework that maps the strength of the omitted variables to bounds on theworstcase generalization performanceof language models under distribution shift. In empirical experiments, we show that using these bounds directly in language model evaluation and optimization provides more principled measures of out-of-distribution performance, improves true out-of-distribution performance relative to standard distribution shift adjustment methods, and further enables inference about the strength of the omitted variables when target distribution labels are available.

Abstract:
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind, remaining fragmented and narrowly focused. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for Compositional Music Instruction (CMI) reward modeling, where the generated music may be conditioned on text descriptions, lyrics, and/or audio prompts. We first introduce CMIPrefPseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMIRewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text–music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgments scores on Music Arena and CMI-Pref test set, as well as preference agreement on Music Arena and CMI-Pref. Additional analyses examine performance variation across factors such as annotators, annotation timing and confidence, music generation models, and audio length. Experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via topk filtering. Our work provides the necessary data, benchmarks, and models to advance aligned music generation.

Abstract:
Research involving privacysensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. To quench this thirst, we present Privasis (i.e., privacy oasis), the first million-scale fully synthetic dataset entirely built from scratch—an expansive reservoir of texts with rich and diverse private information—designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers orders-of-magnitude larger scale with quality, and far greater diversity across various document types, including medical records, legal documents, financial records, calendars, emails, meeting transcripts, and text-messages with a total of 55.1 million annotated attributes such as ethnicity, date of birth, workplace, etc. We leverage Privasis to construct a parallel corpus for text sanitization with our pipeline that recursively decomposes texts and applies targeted sanitization. Our compact sanitization models (\leq 4B) trained on this dataset outperform state-of-the-art large language models, such as GPT-5 and Qwen-3 235B.

Abstract:
We introduce, to our knowledge, the first direct secondorder method for computing Nash equilibria in two-player zero-sum games. To do so, we construct a Douglas-Rachford-style splitting formulation, which we then solve with a semi-smooth Newton (SSN) method. We show that our algorithm enjoys local superlinear convergence. To augment the fast local behavior of our SSN method with global efficiency guarantees, we develop a hybrid method that combines our SSN method with the state-of-the-art first-order method for game solving, Predictive Regret Matching (PRM^+). Our hybrid algorithm leverages the global progress provided by PRM^+ while achieving a local superlinear convergence rate once it switches to SSN near a Nash equilibrium. Numerical experiments on matrix games demonstrate order-of-magnitude speedups over PRM^+ for high-precision solutions.

Abstract:
Triggerable watermarking enables model owners to assert ownership against model extraction attacks. However, most existing approaches require additional training, which limits postdeployment flexibility, and the lack of clear theoretical foundations makes them vulnerable to adaptive attacks. In this paper, we propose Neural Honeytrace, a plug-and-play watermarking framework that operates without retraining. We redefine the watermark transmission mechanism from an information perspective, designing a training-free multi-step transmission strategy that leverages the long-tailed effect of backdoor learning to achieve efficient and robust watermark embedding. Extensive experiments demonstrate that Neural Honeytrace reduces the average number of queries required for a worst-case t-test-based ownership verification to as low as 2% of existing methods, while incurring zero training cost.

Abstract:
Previous work establishing completeness results forcounterfactual identificationhas been limited to the setting where the input data belongs to observational and interventional distributions (Layers 1 and 2 of Pearl's Causal Hierarchy), since it was generally presumed impossible to obtain data from counterfactual distributions, belonging to Layer 3. However, recent work (Raghavan & Bareinboim, 2025) has formally characterized a family of counterfactual distributions which can be directly estimated via experimental methods a notion they callcounterfactual realizabilty. This leaves open the question of whatadditionalLayer 3 quantities now become identifiable, given this new access to (some) Layer 3 data. We develop thectfIDu+algorithm for identifying a counterfactual query from an arbitrary set of Layer 3 data, and prove that it is complete for this task. Using this, we establish the theoretical limit of which counterfactuals can be identified from physically realizable data, thus implying thefundamental limit to exact causal inference in the non-parametric setting. Finally, we derive novel analytic bounds for important non-identifiable quantities given realizable counterfactual data, that are provably tighter than the previously established benchmark. We corroborate using simulations that even if a quantity is non-identifiable, counterfactual data can be used to further tighten bounds for its range.

Abstract:
The empirical evidence indicates that stochastic optimization with heavytailed gradient noise is more appropriate to characterize the training of machine learning models than that with standard bounded gradient variance noise. Most existing works on this phenomenon focus on the convergence of optimization errors, while the analysis for generalization bounds under the heavy-tailed gradient noise remains limited. In this paper, we develop a general framework for establishing generalization bounds under heavy-tailed noise. Specifically, we introduce a truncation argument to achieve the generalization error bound based on the algorithmic stability under the assumption of bounded pth centered moment with p\in(1,2]. Building on this framework, we further provide the stability and generalization analysis for several popular stochastic algorithms under heavy-tailed noise, including clipped and normalized stochastic gradient descent, as well as their mini-batch and momentum variants.

Abstract:
In this work, we consider the problems of learning junta distributions, their quantum counterparts (quantum junta states), and \mathsfQAC^0 circuits, which we show to be close to juntas. (1) Junta distributions. A probability distribution p:1,1^n\to \mathbb [0,1] is a k-junta if it only depends on k bits. We show that they can be learned to within additive error \varepsilon in total variation distance from O(2^k\log(n)/\varepsilon^2) samples, which quadratically improves the upper bound of Aliakbarpour et al. (COLT'16) and matches their lower bound in every parameter. (2) Junta states. We initiate the study of n-qubit states that are k-juntas, those that are the tensor product of a k-qubit state and an (n-k)-qubit maximally mixed state. We show that these states can be learned with error \varepsilon in trace distance with O(12^k\log(n)/\varepsilon^2) single copies. We also prove a lower bound of \Omega((4^k+\log (n))/\varepsilon^2) copies. Additionally, we show that, for constant k, \widetilde\Theta(2^n/\varepsilon^2) copies are necessary and sufficient to test whether a state is \varepsilon-close or 7\varepsilon-far from being a k-junta. (3) \mathsfQAC^0 circuits. We show that n-qubit \mathsfQAC^0 circuits with size s, depth d and a auxiliary qubits can be learned from 2^O(\log(s^22^a)^d)\log(n) copies of the Choi state, improving the n^O(\log(s^22^a)^d) by Nadimpalli et al. (STOC'24). Along the way, we give new proof of the optimal performance of Classical Shadows based on Pauli analysis. We also strengthen the lower bounds against \mathsfQAC^0 to compute the address function.

Abstract:
Predictive coding networks are neural models that perform inference through an iterative energy minimization process. While effective in shallow architectures, they suffer significant performance degradation beyond five to seven layers. In this work, we show that this degradation is caused by exponentially imbalanced errors between layers during weight updates, and the predictions from the previous layers not being effective in guiding updates in deeper layers. Furthermore, when training models with skip connections, the energy propagated by the residuals reaches higher layers faster than the one propagated by the main pathway, affecting test accuracy. We address the first issue by introducing a novel precisionweighted optimization of latent variables that balances error distributions during the relaxation phase, the second issue by proposing a novel weight update mechanism that reduces error accumulation in deeper layers, and the third one by using identity nodes that slow down the propagation of the energy in the residual connections. Empirically, our methods achieve performance comparable to backpropagation on deep models such as ResNet18, opening new possibilities for predictive coding in complex tasks.

Abstract:
Testtime adaptation (TTA) offers a compelling remedy for machine learning (ML) models that degrade under domain shifts, improving generalisationon-the-flywith only unlabelled samples. This flexibility suits real deployments, yet conventional evaluations unrealistically assume unbounded processing time, overlooking the accuracy-latency trade-off. As ML increasingly underpins latency-sensitive and user-facing use-cases, temporal pressure constrains the viability of adaptable inference; predictions arriving too late to act on are futile. We introduceTempora, a framework for evaluating TTA under this pressure. It consists of temporal scenarios that model deployment constraints, evaluation protocols that operationalise measurement, and time-contingent utility metrics that quantify the accuracy-latency trade-off. We instantiate the framework with three such metrics: (1)discreteutility for asynchronous streams with hard deadlines, (2)continuousutility for interactive settings where value decays with latency, and (3)amortisedutility for budget-constrained deployments. Applying Tempora to seven TTA methods on ImageNet-C across 240 temporal evaluations revealsrank instability: conventional rankings do not predict rankings under temporal pressure; ETA, a state-of-the-art method in the conventional setting, falls short in 41.2% of evaluations. The highest-utility method varies with corruption type and temporal pressure, with no clear winner. By enabling systematic evaluation across diverse temporal constraints for the first time, Tempora reveals when and why rankings invert, offering practitioners a lens for method selection and researchers a target for deployable adaptation.

Abstract:
Large Language Models (LLMs) have rapidly advanced, with Gemini3-Pro setting a new performance milestone. In this work, we explore collective intelligence as an alternative to monolithic scaling, and demonstrate that open-source LLMs' collaboration can surpass Gemini-3-Pro. We first revisit LLM routing and aggregation at scale and identify three key bottlenecks: (1) current train-free routers are limited by a query-based paradigm focusing solely on textual similarity; (2) recent aggregation methods remain largely static, failing to select appropriate aggregators for different tasks; (3) the complementarity of routing and aggregation remains underutilized. To address these problems, we introduce JiSi, a novel framework designed to release the full potential of LLMs' collaboration through three innovations: (1) Query-Response Mixed Routing capturing both semantic information and problem difficulty; (2) Support-Set-based Aggregator Selection jointly evaluating the aggregation and domain capacity of aggregators; (3) Adaptive Routing-Aggregation Switch dynamically leveraging the advantages of routing and aggregation. Comprehensive experiments on nine benchmarks demonstrate that JiSi can surpass Gemini-3-Pro with only 47% costs by orchestrating ten open-source LLMs, while outperforming mainstream baselines. It suggests that collective intelligence represents a novel path towards Artificial General Intelligence (AGI).

Abstract:
Recent advances in large language models (LLMs) have led to remarkable progress across various domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for LLMs, involving multimodal source interpretation, temporal inference, and crosslinguistic analysis. Existing general-purpose agents perform well on many current benchmarks but lack the domain expertise needed to address complex historical questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality and carefully-reviewed questions stratified by difficulty and designed to evaluate LLM's capacity for historical reasoning. The tasks span a wide range of historical problems—from factual retrieval based on primary sources to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Finding the poor performance of LLMs and other agents on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in History. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1(14.49%), Grok 3(17.63%) and Open Deep Research by smolagents(20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning. Notably, HistAgent also achieves 60.00% pass@1 accuracy on the GAIA benchmark, showing that domain-specific customization doesn't hinder HistAgent's competitive performance on real-world general tasks.

Abstract:
Research on differentially private synthetic tabular data has largely focused on independent and identically distributed rows where each record corresponds to a unique individual. This perspective neglects the temporal complexity in longitudinal datasets, such as electronic health records, where a user contributes an entire (sub) table of sequential events. While practitioners might attempt to model such data by flattening user histories into highdimensional vectors for use with standard marginal-based mechanisms, we demonstrate that this strategy is insufficient. Flattening fails to preserve temporal coherence even when it maintains valid marginal distributions. We introduce PATH, a novel generative framework that treats the full table as the unit of synthesis and leverages the autoregressive capabilities of privately fine-tuned large language models. Extensive evaluations show that PATH effectively captures long-range dependencies that traditional methods miss. Empirically, our method reduces the distributional distance to real trajectories by over 60% and reduces state transition errors by nearly 50% compared to leading marginal mechanisms while achieving similar marginal fidelity.

Abstract:
Reliable decisionmaking with streaming data requires principled uncertainty quantification of online methods. While first-order methods enable efficient iterate updates, their inference procedures still require updating proper (covariance) matrices, incurring O(d^2) time and memory complexity, and are sensitive to ill-conditioning and noise heterogeneity of the problem. This costly inference task offers an opportunity for more robust second-order methods, which are, however, bottlenecked by solving Newton systems with O(d^3) complexity. In this paper, we address this gap by studying an online Newton method with Hessian averaging, where the Newton direction at each step is approximately computed using a sketch-and-project solver with Nesterov's acceleration, matching O(d^2) complexity of first-order methods. For the proposed method, we quantify its uncertainty arising from both random data and randomized computation. Under standard smoothness and moment conditions, we establish global almost-sure convergence, prove asymptotic normality of the last iterate with a limiting covariance characterized by a Lyapunov equation, and develop a fully online covariance estimator with non-asymptotic convergence guarantees. We also connect the resulting uncertainty quantification to that of exact and sketched Newton methods without Nesterov's acceleration. Extensive experiments on regression models demonstrate the superiority of the proposed method for online inference.

Abstract:
Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusionbased Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1× speedup over standard inference without compromising output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. Codes are provided in the supplementary material and will be released publicly on GitHub.

Abstract:
Encoderonly language models are frequently used for a variety of language tasks, including classification and retrieval. However, there has been a lack of recent research efforts for encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements to massively multilingual encoder training, including phased data curation and scheduled language inclusion. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data without excessive repetition. The model uses recent advances in architecture and training schemes to be faster and more multilingual than other models and we release weights, data, and code. We show that mmBERT significantly outperforms the previous generation, on various tasks, for both high and low-resource languages.

Abstract:
Large Language Model (LLM)based agents employ external and internal memory systems to handle complex, goal-oriented tasks, yet this exposes them to severe extraction attacks, and corresponding defenses are currently lacking. In this paper, we proposeMemPot, the first theoretically verified defense framework against memory extraction attacks by injecting optimized honeypots into the memory. Through a two-stage optimization process,MemPotgenerates trap documents that maximize the retrieval probability for attackers while remaining inconspicuous to benign users. We model the detection process as Wald’s Sequential Probability Ratio Test (SPRT) and theoretically prove thatMemPotachieves a lower average number of sampling rounds compared to optimal static detectors. Empirically,MemPotsignificantly outperforms state-of-the-art baselines, achieving a 50% improvement in detection AUROC and an 80% increase in True Positive Rate under low False Positive Rate constraints. Furthermore, our experiments confirm thatMemPotincurs zero online inference latency and preserves the agent's utility on standard tasks, verifying its superiority in safety, harmlessness and efficiency.

Abstract:
Budgetfeasible procurement auctions play a pivotal role in various AI-driven marketplaces, such as data acquisition and crowdsourcing, where a buyer with a limited budget seeks to procure services from strategic sellers with private costs. While numerous budget-feasible mechanisms have been proposed for the classic objective of maximizing the buyer's valuation, the more challenging and economically significant objective of social welfare maximization has only recently been studied, and existing approaches still sacrifice budget feasibility, thereby limiting their practical applicability. In this paper, we bridge this gap by proposing BFM-SWM, the first budget-feasible mechanism with provable approximation guarantees for submodular welfare maximization in procurement auctions. Our mechanism satisfies standard economic properties, including truthfulness, individual rationality, and non-negative auctioneer surplus. As a by-product, we develop BFM-VM, a variant tailored for valuation maximization, which achieves a deterministic approximation ratio of 1/(12+4\sqrt3) for general submodular functions, substantially improving upon the best-known deterministic ratio of 1/64 established by [Balkanski et al., SODA 2022], while reducing the running time from \mathcalO(n^2\log n) to \mathcalO(n\log n). Extensive experiments demonstrate the efficiency and effectiveness of our mechanisms.

Abstract:
This paper investigates search in modelbased reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, showing that search is not a plug-and-play replacement for a learned policy. Surprisingly, we find that search can harm performance even when the model is highly accurate. Instead, we show that mitigating distribution shift matters more than improving model or value function accuracy. Building on this insight, we identify key techniques for enabling effective search, achieving state-of-the-art performance across multiple popular benchmark domains.

Abstract:
Boundary representation (Brep) is the industry standard for computer-aided design (CAD). While deep learning shows promise in processing B-rep models, existing methods suffer from a representation gap: continuous approaches offer analytical precision but are visually abstract, whereas discrete methods provide intuitive clarity at the expense of geometric precision. To bridge this gap, we introduce Brep2Shape, a novel self-supervised pre-training framework designed to align abstract boundary representations with intuitive shape representations. Our framework employs a geometry-aware task where the model learns to predict dense spatial points from parametric Bézier control points, enabling the network to better understand physical manifolds derived from abstract coefficients. To enhance this alignment, we propose a dual transformer backbone with parallel streams that independently encode surface and curve tokens to capture their distinct geometric properties. Additionally, the topology attention is integrated to model the inherent interdependencies between surfaces and curves, thereby maintaining topological consistency. Experimental results demonstrate that Brep2Shape offers significant scalability, achieving state-of-the-art accuracy and faster convergence across various downstream tasks.

Abstract:
Enabling large language models (LLMs) to solve complex reasoning tasks is a key step toward artificial general intelligence. Recent work augments LLMs with external tools to enable agentic reasoning, achieving high utility and efficiency in a plugand-play manner. However, the inherent vulnerabilities of such methods to malicious manipulation of the tool-calling process remain largely unexplored. In this work, we identify a tool-specific attack surface and propose Sponge Tool Attack (STA), which disrupts agentic reasoning solely by rewriting the input prompt under a strict query-only access assumption. Without any modification on the underlying model or the external tools, STA converts originally concise and efficient reasoning trajectories into unnecessarily verbose and convoluted ones before arriving at the final answer. This results in substantial computational overhead while remaining stealthy by preserving the original task semantics and user intent. To achieve this, we design STA as an iterative, multi-agent collaborative framework with explicit rewritten policy control, and generates benign-looking prompt rewrites from the original one with high semantic fidelity. Extensive experiments across 6 models (including both open-source models and closed-source APIs), 12 tools, 4 agentic frameworks, and 13 datasets spanning 5 domains validate the effectiveness of STA.

Abstract:
Controlling autonomous systems under realworld conditions often requires policies that can be evaluated with low latency and low energy requirements. Unfortunately, these conditions are at odds with the use of high-precision deep networks as controllers. In this work, we introduce Differentiable Weightless Controllers (DWCs), a symbolic-differentiable architecture that allows learning flexible non-linear yet highly efficient control policies. DWCs can be trained end-to-end by gradient-based techniques, yet compile directly into FPGA-compatible circuits with few- or even single-clock-cycle latency and nanojoule-level energy cost per action for the core computation. Across five MuJoCo benchmarks, including high-dimensional Humanoid, DWCs achieve returns competitive with standard deep policies (full precision or quantized neural networks). Furthermore, DWCs exhibit structurally sparse and interpretable connectivity patterns, enabling a direct inspection of which input values influence control decisions.

Abstract:
Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely textonly tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model’s internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross‑modal training can enhance reasoning and generalization even for tasks grounded in a single modality.

Abstract:
Longcontext inference is constrained by KV-cache memory, which grows linearly with sequence length; KV-cache compression therefore hinges on reliably selecting which past tokens to retain. Most geometry-based eviction methods score keys by cosine similarity to a global centroid, but cosine is scale-invariant and can discard magnitude cues that distinguish semantically salient tokens. We propose ManifoldKV, a training-free scorer that ranks tokens by Euclidean distance to the key centroid, capturing both angular and radial deviations. On the RULER benchmark, ManifoldKV achieves95.7%accuracy at 4K–16K contexts with 20% compression, matching the best geometric baseline while improving robustness in two regimes where cosine scoring fails. First, on multi-key retrieval, ManifoldKV reduces directional collisions, achieving92.4%vs KeyDiff's 77.0% (+15.4 points) on 3-key NIAH at 50% compression. Second, to address dilution and performance collapse of global centroids at 64K context, we introduce WindowedManifoldKV, which restores accuracy to 84.3% at 25% compression, a 49-point recovery over global L2 and +3.2 points over KeyDiff. The method requires only 3 lines of code and works across 4 architectures without tuning.

Abstract:
Reinforcement Learning for Large Language Models (LLMs) often suffers from training collapse in longhorizon tasks due to exploding gradient variance. To mitigate this, baseline is commonly introduced for advantage computation; however, traditional value models remain difficult to optimize, and standard group-based baselines overlook sequence heterogeneity. Although classic optimal baseline theory can achieve global variance reduction, it neglects token heterogeneity and requires prohibitive gradient-based computation. In this work, we derive the Optimal Token Baseline (OTB) from first principles, proving that gradient updates should be weighted inversely to their cumulative gradient norm. To ensure efficiency, we propose the Logit-Gradient Proxy that approximates the gradient norm using only forward-pass probabilities. Our method achieves training stability and matches the performance of large group sizes (N=32) with only N=4, reducing token consumption by over 65% across single-turn and tool-integrated reasoning tasks.

Abstract:
Iterative selfimprovement fine-tunes an autoregressive large language model (LLM) on reward-verified outputs generated by the LLM itself. In contrast to the empirical success of self-improvement, the theoretical foundation of this generative, iterative procedure in a practical, finite-sample setting remains limited. We make progress toward this goal by modeling each round of self-improvement as maximum-likelihood fine-tuning on a reward-filtered distribution and deriving finite-sample guarantees for the expected reward. Our analysis reveals an explicit feedback loop where better models accept more data per iteration, supporting sustained self-improvement while explaining eventual saturation of such improvement. Adopting a task-centric view by considering reasoning tasks with multiple difficulty levels, we further prove quantifiable conditions on model initialization, task difficulty, and sample budget where easy-to-hard curricula provably achieve better guarantees than training on fixed mixtures of tasks. Our analyses are validated via Monte-Carlo simulations and controlled experiments on graph-based reasoning tasks.

Abstract:
Generative model evaluation commonly relies on highdimensional embedding spaces to compute distances between samples. We show that dataset representations in these spaces are affected by the hubness phenomenon, which distorts nearest neighbor relationships and biases distance-based metrics. Building on the classical Iterative Contextual Dissimilarity Measure (ICDM), we introduce Generative ICDM (GICDM), a method to correct neighborhood estimation for both real and generated data. We introduce a multi-scale extension to improve empirical behavior. Extensive experiments on synthetic and real benchmarks demonstrate that GICDM resolves hubness-induced failures, restores reliable metric behavior, and improves alignment with human judgment.

Abstract:
AI agents equipped with toolcalling capabilities are susceptible toIndirect Prompt Injection(IPI) attacks. In this attack scenario, malicious commands hidden withinuntrustedcontent trick the agent into performing unauthorized actions. Existing defenses can reduce attack success but often suffer from theover-defense dilemma: they deploy expensive,always-onsanitization regardless of actual threat, thereby degrading utility and latency even in benign scenarios. We revisit IPI through a causal ablation perspective: a successful injection manifests as adominance shiftwhere the user request no longer provides decisive support for the agent's privileged action, while a particular untrusted segment, such as a retrieved document or tool output, provides disproportionate attributable influence. Based on this signature, we proposeCausalArmor, a selective defense framework that (i) computes lightweight, leave-one-out ablation-based attributions at privileged decision points, and (ii) triggers targeted sanitization only when an untrusted segment dominates the user intent. Additionally, CausalArmor employsretroactive Chain-of-Thought maskingto prevent the agent from acting on ``poisoned" reasoning traces. We present a theoretical analysis showing that sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting malicious actions. Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses while improving explainability and preserving utility and latency of AI agents.

Abstract:
Reasoning large language models exhibit complex reasoning behaviors via extended chainof-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. However, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing others. Experiments reveal that a fraction of heads proves essential for reasoning, enabling 20--50% cache reduction with near-lossless performance and up to 1.21x speedup.

Abstract:
We introduce TANGO a dynamical systems inspired framework for graph representation learning that governs node feature evolution through a learned energy landscape and its associated descent dynamics. At the core of our approach is a learnable Lyapunov function over node embeddings, whose gradient defines an energy non-increasing direction that guarantees stability. To enhance flexibility while preserving the benefits of energy-based dynamics, we incorporate a novel tangential component, learned via message passing, that evolves features while maintaining the energy value. This decomposition into orthogonal flows of energy gradient descent and tangential evolution yields a flexible form of graph dynamics, and enables effective signal propagation even in flat or ill-conditioned energy regions, that often appear in graph learning. Our method is designed to help alleviate oversquashing, and is compatible with different graph neural network backbones. Empirically, TANGO achieves strong performance across a diverse set of node and graph classification and regression benchmarks, demonstrating the effectiveness of jointly learned energy functions and tangential flows for graph neural networks.

Abstract:
Edgecloud collaborative inference is crucial for LLM-powered edge devices, as on-device models often lack the required reasoning capability, while cloud-only inference can be costly and slow under strict latency and token/API budgets. However, existing edge-cloud collaboration methods typically route input tasks based on their estimated difficulty. These static, coarse heuristics overlook subtask dependencies, missing opportunities for parallel execution and budget-adaptive routing. To this end, we propose HybridFlow, a resource-adaptive edge-cloud inference framework that enables parallel execution of interdependent subtasks. Specifically, we build a dependency-aware DAG for each input task, facilitating concurrent execution of subtasks once their dependencies are resolved, thereby reducing end-to-end latency. Additionally, we propose a dynamic benefit–cost utility model, optimizing the trade-off between accuracy, token/API cost, and latency in real-time. This dynamic routing minimizes unnecessary cloud usage while preserving reasoning quality. Across GPQA, MMLU-Pro, AIME24, and LiveBench-Reasoning, HybridFlow improves the cost-accuracy trade-off, reducing latency and cloud API usage while maintaining competitive accuracy.

Abstract:
Generative Flow Networks (GFlowNets) excel at sampling diverse, highreward objects. In many practical applications where active reward queries are infeasible, these models must be trained using static offline datasets. Prevailing training methods typically rely on a proxy model to provide reward feedback for online sampled trajectories. However, constructing a reliable proxy is often challenging due to data scarcity or high evaluation costs. While existing proxy-free approaches attempt to address this, they often impose coarse constraints that limit the model's ability to explore effectively. To overcome these limitations, we proposeTrajectory-Distilled GFlowNet (TD-GFN), a novel proxy-free training framework. TD-GFN utilizes inverse reinforcement learning (IRL) to extract dense, transition-level edge rewards from offline trajectories, providing rich structural guidance for efficient exploration. Crucially, to ensure robustness, these rewards guide the policy indirectly through DAG pruning and prioritized backward sampling. This design ensures that gradient updates rely exclusively on ground-truth terminal rewards from the dataset, thereby preventing error propagation. Empirical results demonstrate that TD-GFN significantly outperforms a broad range of existing baselines in both convergence speed and sample quality, establishing a more robust and efficient paradigm for offline GFlowNet training.

Abstract:
In recent advances, to enable a fully datadriven learning paradigm on relational databases (RDB), relational deep learning (RDL) is proposed to structure the RDB as a heterogeneous entity graph and adopt the graph neural network (GNN) as the predictive model. However, existing RDL methods neglect the imbalance problem of relational data in RDBs and risk under-representing the minority entities, leading to an unusable model in practice. In this work, we investigate, for the first time, class imbalance problem in RDB entity classification and design the relation-centric minority synthetic over-sampling GNN (Rel-MOSS), in order to fill a critical void in the current literature. Specifically, to mitigate the issue of minority-related information being submerged by majority counterparts, we design the relation-wise gating controller to modulate neighborhood messages from each individual relation type. Based on the relational-gated representations, we further propose the relation-guided minority synthesizer for over-sampling, which integrates the entity relational signatures to maintain relational consistency. Extensive experiments on 12 entity classification datasets provide compelling evidence for the superiority of Rel-MOSS, yielding an average improvement of up to 2.46% and 4.00% in terms ofBalanced AccuracyandG-Mean, compared with SOTA RDL methods and classic methods for handling class imbalance.

Abstract:
Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. While existing hallucination detection methods achieve strong performance in question‑answering tasks, they remain less effective on tasks requiring reasoning. In this work, we revisit hallucination detection through the lens of out‑of‑distribution (OOD) detection, a well‑studied problem in areas like computer vision. Treating next‑token prediction in language models as a classification task allows us to apply OOD techniques, if we bring to bear appropriate modifications to account for the structural differences in large language models. We show that approaches based on OOD detection yield trainingfree, single-sample based detectors, achieving strong accuracy in hallucination detection in reasoning tasks. Overall, our work suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.

Abstract:
A novel neural network inspired by Cauchy's integral formula, is proposed for function approximation tasks that include time series forecasting, missing data imputation, etc. Hence, the novel neural network is named CauchyNet. By embedding realvalued data into the complex plane, CauchyNet efficiently captures complex temporal dependencies, surpassing traditional real-valued models in both predictive performance and computational efficiency. Grounded in Cauchy's integral formula and supported by the universal approximation theorem, CauchyNet offers strong theoretical guarantees for function approximation. The architecture incorporates complex-valued activation functions, enabling robust learning from incomplete data while maintaining a compact parameter footprint and reducing computational overhead. Through extensive experiments in diverse domains, including transportation, energy consumption, and epidemiological data, CauchyNet consistently outperforms state-of-the-art models in predictive accuracy, often achieving a lower mean absolute error with fewer parameters. These findings highlight CauchyNet’s potential as an effective and efficient tool for data-driven predictive modeling, particularly in resource-constrained and data-scarce environments. The code used to reproduce the results will be released upon the publication.

Abstract:
We present the first largescale empirical analysis of disempowerment patterns in real-world AI assistant interactions, analyzing 1.5 million consumer Claude.ai conversations using a privacy-preserving approach. We focus on situational dis-empowerment potential, which occurs when AI assistant interactions risk leading users to form distorted perceptions of reality, make inauthentic value judgments, or act in ways misaligned with their values. Quantitatively, we find that severe forms of disempowerment potential occur in fewer than one in a thousand conversations, though rates are substantially higher in personal domains like relationships and lifestyle. Qualitatively, we uncover several concerning patterns, such as validation of persecution narratives and grandiose identities with emphatic sycophantic language, definitive moral judgments about third parties, and complete scripting of value-laden personal communications that users appear to implement verbatim. Analysis of historical trends reveals an increase in the prevalence of disempowerment potential over time. We also find that interactions with greater disempowerment potential receive higher user approval ratings, possibly suggesting a tension between short-term user preferences and long-term human empowerment.

Abstract:
Generative Flow Networks (GFlowNets) are a flexible family of amortized samplers trained to generate discrete and compositional objects with probability proportional to a reward function. To this end, they learn a policy function over an intractably large state graph by minimizing a stochastic objective over sampled trajectories. However, learning efficiency is constrained by the model’s ability to rapidly explore diverse highprobability regions during training. To mitigate this issue, recent works have focused on incentivizing the exploration of unvisited and valuable states via curiosity-driven search and self-supervised random network distillation, which tend to waste samples on already well-approximated regions of the state space. In this context, we proposeAdaptive Complementary Exploration(ACE), a principled algorithm for the effective exploration of novel and high-probability regions when learning GFlowNets. To achieve this, ACE introduces anexplorationGFlowNet explicitly trained to search for high-reward states in regions underexplored by thecanonicalGFlowNet, which learns to sample from the target distribution. Through extensive experiments, we show that ACE consistently and significantly improves upon prior work in terms of approximation accuracy to the target distribution and discovery rate of diverse high-reward states.

Abstract:
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity’s Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools are able to both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate the use of intelligent tools. ToolOrchestra makes explicit use of reinforcement learning with outcome, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On \tau ^2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to previously unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems. These results demonstrate that orchestrating diverse tools with lightweight agents is not only more efficient, but also more effective, paving the way for practical and scalable tool-augmented reasoning systems.

Abstract:
LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building powerful systems that can tackle diverse tasks. However, existing approaches for building such workflows generally rely on humancrafted pipelines and prompts, which presents a substantial bottleneck to widening the scope of their applications. How can automatically induce and optimize such workflows in a data-driven way? And can lessons from optimizing deep learning architectures help the design of workflow induction algorithms? This paper describes a simple approach for automatically inducing LLM workflows. We formulate workflow induction as a bilevel optimization problem: an outer loop which optimizes a high-level sketch of the workflow (in particular how the LLM calls should be structured), and an inner loop which optimizes each individual LLM call one-by one. Both loops are optimized withtextual gradients'', where for the inner loop we optimize each component in a modular way throughbackpropagating'' textual gradients layer-by-layer. We find that LLM workflows discovered through our WIBOT (workflow induction through bilevel optimization and textual gradients) approach performs competitively against strong baselines that automate workflow generation and optimization.

Abstract:
Large language models (LLMs) have achieved remarkable success, but their rapidly growing scale imposes prohibitive costs in memory, computation, and energy. Posttraining quantization (PTQ) is a promising solution for efficient deployment, yet achieving accurate W4A4 quantization remains an open challenge. While most existing methods are designed for INT4 formats, the emergence of MXFP4—a new FP4 format with various hardware support (NVIDIA, AMD, Intel)—raises questions about the applicability of current techniques. In this work, we establish a comprehensive benchmark of PTQ methods under the MXFP4 format. Through systematic evaluation, we find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which are almost used by all state-of-the-art approaches, suffer from severe incompatibility with MXFP4. We further provide the first in-depth analysis of this conflict, tracing its root to a fundamental mismatch between MXFP4’s PoT (power-of-two) block scaling and the redistribution of outlier energy via global rotation. Building on this insight, we propose a simple yet effective block rotation strategy that adapts rotation-based methods to MXFP4, leading to substantial accuracy improvements across diverse LLMs. Our findings not only offer clear guidance for practitioners but also set a foundation for advancing PTQ research under emerging low-precision formats.

Abstract:
Highperformance GPU kernels are critical for efficient LLM serving, yet their optimization remains a bottleneck requiring deep system expertise. While code LLMs show promise in generating functionally correct code, kernel optimization is intrinsically a search problem over a vast optimization space. The fundamental mismatch prevents existing LLM agents from efficiently exploring the optimization space for diverse hardware and compute patterns. To bridge the gap, we present KernelBand, a framework that formulates kernel optimization as a Multi-Armed Bandit (MAB) problem, explicitly balancing exploration and exploitation to unlock the potential of code LLMs. To navigate the infinite arm space of optimization strategies applied to candidate kernels, we design two key mechanisms: a hardware-aware pruning strategy via profiling bounds and a trace-driven clustering algorithm that leverages Lipschitz continuity. Theoretically, we prove that KernelBand reduces the regret bound to depend on the compact covering number of runtime clusters, ensuring sample-efficient discovery of high-performance kernels. Extensive experiments on TritonBench-G with three GPU architectures and four code LLMs show that KernelBand consistently and substantially outperforms state-of-the-art methods with over 33% average improvement.

Abstract:
Reverse Chainof-Thought Generation (RCG) synthesizes reasoning traces from query-answer pairs, but runs the risk of producing post-hoc rationalizations: when models can see the answer during generation, the answer serves as a cognitive anchor that shapes the entire explanation. We formalize this phenomenon through a three-level measurement hierarchy: lexical, entropic, and probabilistic anchoring, each captures surface artifacts, entropy dynamics, and latent answer dependence, respectively. We analyze semantic suppression, the intuitive mitigation strategy that instructs models to ignore the answer, to find out its counterproduction: while it reduces lexical overlap, it paradoxically increases entropic and probabilistic anchoring. Drawing on Ironic Process Theory from cognitive psychology, we attribute this failure to active monitoring of the forbidden answer, which inadvertently deepens dependence on it. To break this cycle, we propose Structural Skeleton-guided Reasoning (SSR), a two-phase approach that first generates an answer-invariant functional skeleton structure, then uses this skeleton to guide full trace generation. By redirecting the information flow to structural planning rather than answer monitoring, SSR consistently reduces anchoring across all three levels. We further introduce Distilled SSR (SSR-D), which fine-tunes models on teacher-generated SSR traces to ensure reliable structural adherence. Experiments across open-ended reasoning benchmarks demonstrate that SSR-D achieves up to 10% improvement over suppression baselines while preserving out-of-distribution (OOD) generalization. Code and data will be open-sourced upon acceptance.

Abstract:
Realworld reinforcement learning often faces environment drift, but most existing methods rely on static entropy coefficients/target entropy, causing over-exploration during stable periods and under-exploration after drift (thus slow recovery), and leaving unanswered the principled question of how exploration intensity should scale with drift magnitude. We show that, under standard assumptions, entropy scheduling in non-stationary maximum-entropy RL can be cast as the dynamic-regret trade-off between tracking a drifting comparator and stabilizing updates, yielding a square-root scaling rule for the entropy weight in terms of a (possibly conservative) online non-stationarity proxy. Building on this, we propose AES (Adaptive Entropy Scheduling), which adaptively adjusts the entropy coefficient/temperature online using observable drift proxies during training, requiring almost no structural changes and incurring minimal overhead. Across 4 algorithm variants, 12 tasks, and 4 drift modes, AES significantly reduces the fraction of performance degradation caused by drift and accelerates recovery after abrupt changes.

Abstract:
Sampling from unnormalized probability densities is a central challenge in computational science. Boltzmann generators are generative models that enable independent sampling from the Boltzmann distribution of physical systems at a given temperature. However, their practical success depends on dataefficient training, as both simulation data and target energy evaluations are costly. To this end, we propose off-policy log-dispersion regularization (LDR), a novel regularization framework that builds on a generalization of the log-variance objective. We apply LDR in the off-policy setting in combination with standard data-based training objectives, without requiring additional on-policy samples. LDR acts as a shape regularizer of the energy landscape by leveraging additional information in the form of target energy labels. The proposed regularization framework is broadly applicable, supporting unbiased or biased simulation datasets as well as purely variational training without access to target samples. Across all benchmarks, LDR improves both final performance and data efficiency, with sample efficiency gains of up to one order of magnitude.

Abstract:
We analyze algorithms for solving stochastic variational inequalities (VI) without the bounded variance or bounded domain assumptions, where our main focus is minmax optimization with possibly unbounded constraint sets. We focus on two classes of problems: monotone VIs; and structured nonmonotone VIs that admit a solution to the weak Minty VI. The latter assumption allows us to solve structured nonconvex-nonconcave min-max problems. For both classes of VIs, to make the expected residual norm less than \varepsilon, we show an oracle complexity of \widetildeO(\varepsilon^-4), which is the best-known for constrained VIs. In our setting, this complexity had been obtained with the bounded variance assumption in the literature, which is not even satisfied for bilinear min-max problems with an unbounded domain. We obtain this complexity for stochastic oracles whose variance can grow as fast as the squared norm of the optimization variable.

Abstract:
Longterm weather forecasting is critical for socioeconomic planning and disaster preparedness. While recent approaches employ finetuning to extend prediction horizons, they remain constrained by the issues of catastrophic forgetting, error accumulation, and high training overhead. To address these limitations, we present a novel pipeline across pretraining, finetuning and forecasting to enhance long‑context modeling while reducing computational overhead. First, we introduce an Efficient Multi‑scale Transformer (EMFormer) to extract multi‑scale features through a single convolution in both training and inference. Based on the new architecture, we further employ an accumulative context finetuning to improve temporal consistency without degrading short‑term accuracy. Additionally, we propose a composite loss that dynamically balances different terms via a sinusoidal weighting, thereby adaptively guiding the optimization trajectory throughout pretraining and finetuning. Experiments show that our approach achieves great performance in weather forecasting and extreme event prediction, substantially improving long-term forecast accuracy. Moreover, EMFormer demonstrates strong generalization on vision benchmarks (ImageNet-1K and ADE20K) while delivering a 5.69× speedup over conventional multi-scale modules.

Abstract:
Large language models (LLMs) have been serving as effective backbones for retrieval systems, including RetrievalAugmentation-Generation (RAG), Dense Information Retriever (IR), and Agent Memory Retrieval. Recent studies have demonstrated that such LLM-based Retrieval (LLMR) is vulnerable to adversarial attacks, which manipulates documents by token-level injections and enables adversaries to either boost or diminish these documents in retrieval tasks. However, existing attack studies mainly (1) presume a known query is given to the attacker, and (2) highly rely on access to the victim model's parameters or interactions, which are hardly accessible in real-world scenarios, leading to limited validity. To further explore the secure risks of LLMR, we propose a practical black-box attack method that generates transferable injection tokens based on zero-shot surrogate LLMs without need of victim queries or victim models knowledge. The effectiveness of our attack raises such a robustness issue that similar effects may arise from benign or unintended document edits in the real world. To achieve our attack, we first establish a theoretical framework of LLMR and empirically verify it. Under the framework, we simulate the transferable attack as a min-max problem, and propose an adversarial learning mechanism that finds optimal adversarial tokens with learnable query samples. Our attack is validated to be effective on benchmark datasets across popular LLM retrievers.

Abstract:
We present HOIPAGE, a new approach that prioritizes part-level affordance reasoning to generate high-fidelity 4D human-object interactions (HOIs) from text prompts in a zero-shot fashion. In contrast to prior works that focus on global, whole body-object motion synthesis, our approach explicitly reasons about the underlying fine-grained mechanics of interactions using large language models (LLMs). We capture this reasoning in a structured part affordance graph (PAG) representation, serving as a high-level interaction scaffolding to guide a three-stage synthesis: first, decomposing input 3D objects into semantic parts; then, generating reference HOI videos from text prompts to extract part-based motion constraints; and finally, optimizing for 4D HOI motion sequences that mimic the reference dynamics while satisfying part-level contact constraints. Extensive experiments show that our approach is flexible and capable of generating complex multi-object or multi-person interaction sequences, with significantly improved realism and text alignment for zero-shot 4D HOI generation.

Abstract:
DeepSeekOCR leverages visual–text compression to reduce long-text processing costs and accelerate inference, yet visual tokens remain prone to redundant textual and structural information. Moreover, current token pruning methods for conventional vision–language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find that a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then subsequently redistributes its attention to the remaining ones. Motivated by this insight, we proposeRTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency–accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23× faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large. Code is released.

Abstract:
Large language models (LLMs) are increasingly deployed as agents for decisionmaking (DM) in interactive and dynamic environments. However, since they are not originally designed for DM, recent studies show that LLMs struggle in basic online DM settings. We introduce ITERATIVE REGRET-MINIMIZATION FINE-TUNING (ITERATIVE RMFT), a post-training procedure that repeatedly distills low-regret decision trajectories into the base model. Unlike prior methods that rely on distilling known algorithms or enforcing manually designed reasoning formats, our approach leverages regret as a training signal to elicit improved decision-making behavior while incorporating model-generated reasoning in natural language. Empirically, ITERATIVE RMFT improves DM performance across models, including numerical Transformers, lightweight open-weight LLMs, and the closed-weight model GPT-4o mini, while exhibiting generalization across varying horizons, action spaces, reward processes, and natural-language-described DM scenarios. Overall, we position our approach as an initial exploration, calling for more principled and novel post-training paradigms for LLMs when it comes to addressing DM tasks.

Abstract:
Vanilla variational inference finds an optimal approximation to the Bayesian posterior distribution, but even the exact Bayesian posterior is often not meaningful under model misspecification. We propose predictive variational inference (PVI): a general inference framework that seeks and samples from an optimal posterior density such that the resulting posterior predictive distribution is as close to the true data generating process as possible, while this closeness is measured by multiple scoring rules. By optimizing the objective, the predictive variational inference is generally not the same as, or even attempting to approximate, the Bayesian posterior, even asymptotically. Rather, we interpret it as implicit hierarchical expansion. Further, the learned posterior uncertainty detects heterogeneity of parameters among the population, enabling automatic model diagnosis. This framework applies to both likelihoodexact and likelihood-free models. We demonstrate its application in real data examples.

Abstract:
Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task – given a natural language instruction, generate onscreen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing GUI images with high resolutions and complex layouts. To address this issue, we reframe GUI grounding as an interactive search task, where the VLM generates actions to move a cursor in the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relations between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. In this interactive process, the rendered cursor provides visual feedback to help the model align its predictions with the corresponding on-screen locations. We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Experimental results demonstrate that GUI-Cursor surpasses strong baselines in GUI grounding and agentic tasks, achieving superior performance with the same base models while requiring less training data. Further analysis shows that GUI-Cursor learns to adaptively conduct more steps on more difficult examples, and it obtains better spatial reasoning capability on out-of-distribution domains.

Abstract:
Standard attention mechanisms in transformers employ static token representations that remain unchanged across all pairwise computations in each layer. This limits their representational alignment with the potentially diverse dynamics of each token-pair interaction. While they excel in domains with relatively homogeneous relationships, standard attention may be inadequate in capturing heterogeneous inter-channel dependencies of multivariate time series (MTS) data where different channel-pair interactions within a single system may be governed by entirely different physical laws or temporal dynamics. To better align the attention mechanism for such domain phenomena, we propose attention with dynamic relational priming (prime attention). Prime attention modulates token representations for each token-pair, optimizing each pair-wise interaction for that specific relationship. Our results demonstrate that prime attention consistently outperforms standard attention across benchmarks, achieving up to 6.5% improvement in forecasting accuracy. In addition, prime attention achieves comparable performance using up to 40% less sequence length compared to standard attention, demonstrating its superior relational modeling capabilities and potential for data efficiency.

Abstract:
Posttraining quantization (PTQ) is a widely used approach for reducing the memory and compute costs of large language models (LLMs). Recent studies have shown that applying invertible transformations to activations can significantly improve quantization robustness by reducing activation outliers; however, existing approaches are largely restricted to rotation or Hadamard-based transformations. Moreover, most studies focused primarily on traditional quantization schemes, whereas modern hardware increasingly supports the microscaling (MX) data format. Attempts to combine both showed severe performance degradation, leading prior work to introduce assumptions on the transformations. In this work, we take a complementary perspective. First, we provide a theoretical analysis of transformations under MX quantization by deriving a bound on the quantization error. Our analysis emphasizes the importance of accounting for both the activation distribution and the underlying quantization structure. Building on this analysis, we propose LATMiX, a method that generalizes outlier reduction to learnable invertible affine transformations optimized using standard deep learning tools. Experiments show consistent improvements in average accuracy for MX low-bit quantization over strong baselines on a wide range of zero-shot benchmarks, across multiple model sizes.

Abstract:
Quantized neural network training optimizes a discrete, nondifferentiable objective. The straight-through estimator (STE) enables backpropagation through surrogate gradients and is widely used. While previous studies have primarily focused on the properties of surrogate gradients and their convergence, the influence of quantization hyperparameters, such as bit width and quantization range, on learning dynamics remains largely unexplored. We theoretically show that in the high-dimensional limit, STE dynamics converge to a deterministic ordinary differential equation. This reveals that STE training exhibits a plateau followed by a sharp drop in generalization error, with plateau length depending on the quantization range. A fixed-point analysis quantifies the asymptotic deviation from the unquantized linear model. We also extend analytical techniques for stochastic gradient descent to nonlinear transformations of weights and inputs.

Abstract:
Openset anomaly detection (OSAD) is an emerging paradigm designed to utilize limited labeled data from anomaly classes seen in training to identify both seen and unseen anomalies during testing. Current approaches rely on simple augmentation methods to generate pseudo anomalies that replicate unseen anomalies. Despite being promising in image data, these methods are found to be ineffective in time series data due to the failure to preserve its sequential nature, resulting in trivial or unrealistic anomaly patterns. They are further plagued when the training data is contaminated with unlabeled anomalies. This work introduces IMPACT, a novel framework that leverages \underlineinfluence \underlinemodeling for o\underlinepen-set time series \underlineanomaly dete\underlinection, to tackle these challenges. The key insight is to i) learn an influence function that can accurately estimate the impact of individual training samples on the modeling, and then ii) leverage these influence scores to generate semantically divergent yet realistic unseen anomalies for time series while repurposing high-influential samples as supervised anomalies for anomaly decontamination. Extensive experiments show that IMPACT significantly outperforms existing state-of-the-art methods, showing superior accuracy under varying OSAD settings and contamination rates.

Abstract:
Modern largescale neural networks are often trained and released in multiple sizes to accommodate diverse inference budgets. To improve efficiency, recent work has explored model upscaling: initializing larger models from trained smaller ones in order to transfer knowledge and accelerate convergence. However, this method can be sensitive to hyperparameters that need to be tuned at the target upscaled model size, which is prohibitively costly to do directly. It remains unclear whether the most common workaround—tuning on smaller models and extrapolating via hyperparameter scaling laws—is still sound when using upscaling. We address this with principled approaches to upscaling with respect to model widths and efficiently tuning hyperparameters in this setting. First, motivated by \muP and any-dimensional architectures, we introduce a general upscaling method applicable to a broad range of architectures and optimizers, backed by theory guaranteeing that models are equivalent to their widened versions and allowing for rigorous analysis of infinite-width limits. Second, we extend the theory of \muTransfer to a hyperparameter transfer technique for models upscaled using our method and empirically demonstrate that this method is effective on realistic datasets and architectures.

Abstract:
StructureBased Drug Design (SBDD) aims to discover bioactive ligands. Conventional approaches construct probability paths separately in Euclidean and probabilistic spaces for continuous atomic coordinates and discrete chemical categories, leading to a mismatch with the underlying statistical manifolds. We address this issue from an information-geometric perspective by modeling molecules as composite exponential-family distributions and defining generative flows along exponential geodesics under the Fisher–Rao metric. To avoid the instantaneous trajectory collapse induced by geodesics directly targeting Dirac distributions, we propose Evolving Exponential Geodesic Flow for SBDD (EvoEGF-Mol), which replaces static Dirac targets with dynamically concentrating distributions, ensuring stable training via a progressive-parameter-refinement architecture. Our model approaches a reference-level PoseBusters passing rate (93.4%) on CrossDock, demonstrating remarkable geometric precision and interaction fidelity, while outperforming baselines on real-world MolGenBench tasks by recovering bioactive scaffolds and generating candidates that meet established MedChem filters.

Abstract:
Large Language Models (LLMs) excel at longcontext understanding but exhibit significant limitations in long-form generation. Existing studies primarily focus on single-generation quality, generally overlooking the volatility of the output (i.e., the inconsistency in length and content across multiple generations). This volatility not only leads to significant computational costs but also severely impacts the models' reliable application. To address this gap, our work unfolds in three stages: benchmarking, probing, and mitigation. We first propose the VOlatility in Long-form Text Benchmark (VOLTBench), a novel heterogeneous-task benchmark designed to systematically quantify the length volatility of long-form generation. Subsequently, by analyzing attention traces, we conduct an in-depth probe to identify several common internal patterns that cause this volatility. Finally, to mitigate long-form output volatility, we propose SELB (Structural Enforcement via Logits Boosting), a lightweight decoding-stage optimization strategy, designed to significantly enhance both the length accuracy and stability of long-form generation without additional training. Extensive experiments on VOLTBench provide the first systematic confirmation of severe long-form output instability in mainstream models and validate that our proposed method successfully improves the mean output length of the base model by 148% and reduces the length volatility by 69%, while maintaining high generation quality.

Abstract:
VisionLanguage Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractions. While these errors mirror human cognitive constraints, such as the "Binding Problem'', the internal mechanisms driving them in artificial systems remain poorly understood. Here, we propose a mechanistic insight by analyzing the representational geometry of open-weight VLMs (Qwen, InternVL, Gemma), comparing methodologies to distill "concept vectors'' - latent directions encoding visual concepts. We validate our concept vectors via steering interventions that reliably manipulate model behavior in both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue). We observe that the geometric overlap between these vectors strongly correlates with specific error patterns, offering a grounded quantitative framework to understand how internal representations shape model behavior and drive visual failures.

Abstract:
The development of highquality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL), we integrate Matryoshka Layer Learning (MLL) for flexible inference-time depth and Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.

Abstract:
Multimodal models for textto-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints—notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we proposeShape-of-Thought (SoT), a visual CoT framework that enablesprogressive shape assembly represented as coherent 2D projectionswithout external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. To support this paradigm, we introduceSoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, andT2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming text-only baselines by around 20%. SoT establishes a new paradigm for transparent, process-supervised compositional generation. The code is available at https://anonymous.4open.science/r/16FE/.

Abstract:
Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is significantly determined by the acceptance rate, yet standard training minimizes KullbackLeibler (KL) divergence as a proxy objective. While KL divergence and acceptance rate share the same global optimum, small draft models, having limited capacity, typically converge to suboptimal solutions where minimizing KL does not guarantee maximizing acceptance rate. To address this issue, we proposeLK losses, special training objectives that directly target acceptance rate. Comprehensive experiments across four draft architectures and six target models, ranging from 8B to 685B parameters, demonstrate consistent improvements in acceptance metrics across all configurations compared to the standard KL-based training. We evaluate our approach on general, coding and math domains and report gains of up to 10% in average acceptance length. LK losses are easy to implement, introduce no computational overhead and can be directly integrated into any existing speculator training framework, making them a compelling alternative to the existing draft training objectives.

Abstract:
Representation learning has driven major advances in natural image analysis by enabling models to acquire highlevel semantic features. In microscopy imaging, however, it remains unclear what current representation learning methods actually learn. In this work, we conduct a systematic study of representation learning for the two most widely used and broadly available microscopy data types, representing critical scales in biology: cell culture and tissue imaging. To this end, we introduce a set of simple yet revealing baselines on curated benchmarks, including untrained models and simple structural representations of cellular tissue. Our results show that, surprisingly, state-of-the-art methods perform comparably to these baselines. We further show that, in contrast to natural images, existing models fail to consistently acquire high-level, biologically meaningful features. Moreover, we demonstrate that commonly used benchmark metrics are insufficient to assess representation quality and often mask this limitation. In addition, we investigate how detailed comparisons with these benchmarks provide ways to interpret the strengths and weaknesses of models for further improvements. Together, our results suggest that progress in microscopy image representation learning requires not only stronger models, but also more diagnostic benchmarks that measure what is actually learned.

Abstract:
Predicting highdimensional transcriptional responses to genetic perturbations is challenging due to severe experimental noise and sparse gene-level effects. Existing methods often suffer from mean collapse, where high correlation is achieved by predicting global average expression rather than perturbation-specific responses, leading to many false positives and limited biological interpretability. Recent approaches incorporate biological knowledge graphs into perturbation models, but these graphs are typically treated as dense and static, which can propagate noise and obscure true perturbation signals. We propose AdaPert, a perturbation-conditioned framework that addresses mean collapse by explicitly modeling sparsity and biological structure. AdaPert learns perturbation-specific subgraphs from biological knowledge graphs and applies adaptive learning to separate true signals from noise. Across multiple genetic perturbation benchmarks, AdaPert consistently outperforms existing baselines and achieves substantial improvements on DEG-aware evaluation metrics, indicating more accurate recovery of perturbation-specific transcriptional changes.

Abstract:
With the rapid progress of multiagent large language model (LLM) reasoning, how to effectively aggregate answers from multiple LLMs has emerged as a fundamental challenge. Standard majority voting treats all answers equally, failing to consider latent heterogeneity and correlation across models. In this work, we design two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP), leveraging both first-order and second-order information. Our theoretical analysis shows these methods provably mitigate the inherent limitations of majority voting under mild assumptions, leading to more reliable collective decisions. We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting ARMMAN. Our algorithms consistently outperform standard baselines, establishing a robust, training-free framework for effective multi-agent LLM aggregation.

Abstract:
RetrievalAugmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. GraphRAG methods improve RAG by modeling knowledge as entity-relation graphs, but still face challenges in high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt design. To address these challenges, we propose Graph-R1, an agentic GraphRAG framework via end-to-end reinforcement learning (RL). It introduces lightweight knowledge hypergraph construction, models retrieval as a multi-turn agent-environment interaction, and optimizes the agent process via an end-to-end reward mechanism. Experiments on standard RAG datasets show that Graph-R1 outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality. Our code is available.

Abstract:
In longhorizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge that sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficient small updates for confident correct actions and potentially destabilizes large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that recalibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines.

Abstract:
Large language models (LLMs) continue to struggle with lowresource languages, primarily due to limited training data, translation noise, and unstable cross-lingual alignment. To address these challenges, we propose LiRA (Linguistic Robust Anchoring for LLMs)—a plug-and-play framework that requires only lightweight fine-tuning on top of existing pretrained backbones. LiRA jointly optimizes representation stability and cross-lingual semantic consistency by combining two key components: Arca (Anchored Representation Composition Architecture), which aligns low-resource inputs to a shared English semantic space through anchor-based alignment and collaborative encoding; and LaSR (Language-coupled Semantic Reasoner), a lightweight, language-aware head that enforces consistency regularization for unified cross-lingual understanding, retrieval, and reasoning. We theoretically show that under controlled anchoring error and translation-induced bias, LiRA guarantees bounded representation deviation and stable downstream performance under local Lipschitz continuity. To facilitate research, we release a new multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Extensive experiments across diverse low-resource benchmarks demonstrate consistent improvements in retrieval, ranking, question answering, and reasoning tasks. Code will be publicly available on GitHub, and the dataset will be hosted on Hugging Face.

Abstract:
Federated graph learning (FGL) enables collaborative training on graph data across multiple clients. With the rise of large language models (LLMs), textual attributes in FGL graphs are gaining attention. Textattributed graph federated learning (TAG-FGL) improves FGL by explicitly leveraging LLMs to process and integrate these textual features. However, current TAG-FGL methods face three main challenges: (1) Overhead. LLMs for processing long texts incur high token and computation costs. To make TAG-FGL practical, we introduce graph condensation (GC) to reduce computation load, but this choice also brings new issues. (2) Suboptimal. To reduce LLM overhead, we introduce GC into TAG-FGL by compressing multi-hop texts/neighborhoods into a condensed core with fixed LLM surrogates (summaries/embeddings). However, this one-shot condensation is often not client-adaptive, leading to suboptimal performance. (3) Interpretability. LLM-based condensation further introduces a black-box bottleneck: summaries lack faithful attribution and clear grounding to specific source spans, making local inspection and auditing difficult. To address the above issues, we propose DANCE, a new TAG-FGL paradigm with GC. To improve suboptimal performance, DANCE performs round-wise, model-in-the-loop condensation refresh using the latest global model. To enhance interpretability, DANCE preserves provenance by storing locally inspectable evidence packs that trace predictions to selected neighbors and source text spans. Across 8 TAG datasets, DANCE improves accuracy by 2.33% at an 8% condensation ratio, with 33.42% fewer tokens per condensed node than TAG-FGL baselines.

Abstract:
While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifierfree RL methods address this limitation by utilizing the probabilities that LLMs generate reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose \bCoupled \bVariational \bReinforcement \bLearning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.

Abstract:
As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or blackbox, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to directly discoverinterpretablemodels of human and LLM behavior from data, thereby enabling open-ended discovery of structural factors driving human and LLM behavior. Our analysis on iterated rock-paper-scissors reveals that frontier LLMs can be capable of deeper strategic behavior than humans. These results provide a foundation for understanding structural differences driving differences in human and LLM behavior in strategic interactions.

Abstract:
Autonomous driving requires safe planning, but most learningbased planners lack explicit self-correction ability: once an unsafe action is proposed, there is no mechanism to correct it. Thus, we propose CorrectionPlanner, an autoregressive planner with self-correction that models planning as motion-token generation within a propose, evaluate, and correct loop. At each planning step, the policy proposes an action, namely a motion token, and a learned collision critic predicts whether it will induce a collision within a short horizon. If the critic predicts a collision, we retain the sequence of historical unsafe motion tokens as a self-correction trace, generate the next motion token conditioned on it, and repeat this process until the safe motion token is proposed or the safety criterion is met. This self-correction trace, consisting of all the unsafe motion tokens, represents the planner’s correction process in motion-token space (analogous to reasoning trace in language models). We train the planner with imitation learning followed by model-based reinforcement learning using rollouts from a pretrained world model that realistically models agents' reactive behaviors. Closed-loop evaluations show that CorrectionPlanner reduces the collision rate by over 20% on Waymax and obtains state-of-the-art planning scores on nuPlan.

Abstract:
Mechanistic interpretability reveals that safetycritical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods focusing on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, while integrates Safety Circuit Tuning to utilize these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key scenarios in LLM safety: (1) backdoor attacks, identifying a backdoor circuit with 0.42% sparsity, whose ablation eradicates the Attack Success Rate (ASR) from 100% \to 0.4% while retaining over 99% general utility; (2) safety alignment, localizing an alignment circuit with 3.03% heads and 0.79% neurons, whose removal spikes ASR from 0.8% \to 96.9%, whereas excluding this circuit during helpfulness fine-tuning maintains 96.5% safety retention.

Abstract:
We introduce Box Thirding (B3), a flexible and efficient algorithm for Best Arm Identification (BAI) under fixed budget constraints. It is designed for both anytime BAI and scenarios with large N, where the number of arms is too large for exhaustive evaluation within a limited budget T. The algorithm employs a Remedian Estimation strategy: in each iteration, three arms are compared—the bestperforming arm is explored further, the median is retained for future comparisons, and the weakest is discarded. Even without prior knowledge of T, B3 achieves an \epsilon -best arm misidentification probability comparable to Sequential Halving, which requires T as a prior, applied to a randomly selected subset of c_0 arms that fit within the budget. Empirical results show that B3 outperforms existing methods for the limited budget constraint in terms of simple regret, as demonstrated on the New Yorker Cartoon Caption Contest dataset.

Abstract:
We analyze a compressed variant of Oja's algorithm for estimating the principal eigenvector of the data covariance matrix using only two adaptive measurements per sample. At each iteration, we observe one measurement along the current estimate and one in a random orthogonal direction. We prove that after t iterations, the expected sinesquared error to the true eigenvector is \mathcalO(\lambda_1\lambda_2 d^2 / (\Delta^2 t)), where d is the ambient dimension, \lambda_1, \lambda_2 are the leading eigenvalues, and \Delta = \lambda_1 - \lambda_2 is the eigengap. We complement this with a matching information-theoretic lower bound of \Omega(\lambda_1\lambda_2 d^2 / (\Delta^2 t)) --- the first for compressed eigenvector estimation --- proving that the d^2 factor, an additional factor of d compared to full-observation PCA, is the fundamental cost of compression and cannot be improved. Our analysis handles the noisy setting where the covariance has nonzero trailing eigenvalues, providing the first convergence guarantee for adaptive compressed subspace tracking beyond the noiseless case.

Abstract:
Conversational assistants are increasingly integrating longterm memory with large language models (LLMs). This persistence of memories, e.g., the user is vegetarian, can enhance personalization in future conversations. However, the same persistence can also introduce safety risks that have been largely overlooked. Hence, we introduce PersistBench to measure the extent of these safety risks. We identify two long-term memory-specific risks: cross-domain leakage, where LLMs inappropriately inject context from the long-term memories; and memory-induced sycophancy, where stored long-term memories insidiously reinforce user biases. We evaluate 18 frontier and open-source LLMs on our benchmark. Our results reveal a surprisingly high failure rate across these LLMs - a median failure rate of 53% on cross-domain samples and 97% on sycophancy samples. To address this, our benchmark encourages the development of more robust and safer long-term memory usage in frontier conversational systems.

Abstract:
Learningbased methods for sampling from the Gibbs distribution in finite-dimensional spaces have progressed quickly, yet theory and algorithmic design for infinite-dimensional function spaces remain limited. This gap persists despite their strong potential for sampling the paths of conditional diffusion processes, enabling efficient simulation of trajectories of diffusion processes that respect rare events or boundary constraints. In this work, we present the adjoint sampler for infinite-dimensional function spaces, a stochastic optimal control-based diffusion sampler that operates in function space and targets Gibbs-type distributions on infinite-dimensional Hilbert spaces. Our Functional Adjoint Sampler (FAS) generalizes Adjoint Sampling (Havens et al., 2025) to Hilbert spaces based on a SOC theory called stochastic maximum principle, yielding a simple and scalable matching-type objective for a functional representation. We show that FAS achieves superior transition path sampling performance across synthetic potential and real molecular systems, including Alanine Dipeptide and Chignolin.

Abstract:
Image editing has achieved impressive results with the development of largescale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) by using region regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high quality samples are curated as training set. We introduce a pixel-level similarity reward that complements MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only superior editing scores to state-of-the-art models, but also significantly better content consistency, measured by PSNR/SSIM metrics and human subjective ratings. Code will be released.

Abstract:
Pairwise preference learning is central to machine learning, with recent applications in aligning language models with human preferences. A typical dataset consists of triplets (x, y^+, y^), where response y^+ is preferred over response y^- for context x. The Bradley--Terry (BT) model is the predominant approach, modeling preference probabilities as a function of latent score differences. Standard practice assumes data follows this model and learns the latent scores accordingly. However, real data may violate this assumption, and it remains unclear what BT learning recovers in such cases. Starting from triplet comparison data, we formalize the preference information it encodes through the conditional preference distribution (CPRD). We give precise conditions for when BT is appropriate for modeling the CPRD, and identify factors governing sample efficiency---namely, margin and connectivity. Together, these results offer a data-centric foundation for understanding what preference learning actually recovers.

Abstract:
We introduce ImmersePro, an innovative framework specifically designed to transform singleview videos into stereo videos. This framework utilizes a novel dual-branch architecture comprising a disparity branch and a context branch on video data by leveraging spatial-temporal attention mechanisms. ImmersePro employs implicit disparity guidance, enabling the generation of stereo pairs from video sequences without the need for explicit disparity maps, thus reducing potential errors associated with disparity estimation models. In addition to the technical advancements, we introduce the YouTube-SBS dataset, a comprehensive collection of 423 stereo videos sourced from YouTube. This dataset is unprecedented in its scale, featuring over 7 million stereo pairs, and is designed to facilitate training and benchmarking of stereo video generation models. Our experiments demonstrate the effectiveness of ImmersePro in producing high-quality stereo videos, offering significant improvements over existing methods. Compared to the best competitor stereo-from-mono we quantitatively improve the results by 11.76% (L1), 6.39% (SSIM), and 5.10% (PSNR).

Abstract:
The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While highperformance libraries like CuBLAS and NCCL provide optimized primitives, they lack the flexibility required for rapidly evolving model architectures. Conversely, existing tensor compilers fail to address the complex memory hierarchy of distributed clusters effectively. To bridge this gap, we propose DITRON, a scalable tile-level compiler that democratizes high-performance distributed kernel development. DITRON introduces a novel hierarchical programming abstraction spanning Core, Device, and Task levels to map tensor programs efficiently onto heterogeneous distributed hardware. This abstraction allows DITRON to support diverse parallelism strategies while abstracting away the complexity of inter-node and intra-node communication. Evaluated across large-scale clusters, DITRON achieves performance parity with or exceeding expert-tuned CUDA libraries, delivering speedups of 6%–30% on isolated kernels and 5%–30% on end-to-end inference in vLLM. Furthermore, DITRON demonstrates strong portability, achieving significant speedups on both NVIDIA and AMD platforms.

Abstract:
Highquality molecular representations are essential for property prediction and molecular design, yet large labeled datasets remain scarce. While self-supervised pretraining on molecular graphs has shown promise, many existing approaches either depend on hand-crafted augmentations or complex generative objectives, and often rely solely on 2D topology, leaving valuable 3D structural information underutilized. To address this gap, we introduce C-FREE (Contrast-FreeRepresentation learning onEgo-nets), a simple framework that integrates 2D graphs with ensembles of 3D conformers. C-FREE learns molecular representations by predicting subgraph embeddings from their complementary neighborhoods in the latent space, using fixed-radius ego-nets as modeling units across different conformers. This design allows us to integrate both geometric and topological information within a hybrid Graph Neural Network (GNN)-Transformer backbone, without negatives, positional encodings, or expensive pre-processing. Pretraining on the GEOM dataset, which provides rich 3D conformational diversity, C-FREE achieves state-of-the-art results on MoleculeNet, surpassing contrastive, generative, and other multimodal self-supervised methods. Fine-tuning across datasets with diverse sizes and molecule types further demonstrates that pretraining transfers effectively to new chemical domains, highlighting the importance of 3D-informed molecular representations.

Abstract:
While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multiagent games where the final outcome depends on the joint strategies of all agents. In multi-agent games, the non-stationarity of other agents brings significant challenges on the evaluation of the reasoning process and the credit assignment over multiple reasoning steps. Existing single-agent reinforcement learning (RL) approaches and their multi-agent extensions fail to address these challenges as they do not incorporate other agents in the reasoning process. In this work, we propose Strat-Reasoner, a novel RL-based framework that improves LLMs' strategic reasoning ability in multi-agent games. We introduce a novel recursive reasoning paradigm where an agent's reasoning also integrates other agents' reasoning processes. To provide effective reward signals for the intermediate reasoning sequences, we employ a centralized Chain-of-Thought (CoT) comparison module to evaluate the reasoning quality. Finally, we compute an accurate hybrid advantage and develop a group-relative RL approach to optimize the LLM policy. Experimental results show that Strat-Reasoner substantially improves strategic abilities of underlying LLMs, achieving 22.1% average performance improvements across various multi-agent games. Our code is publicly available at https://anonymous.4open.science/r/Strat-Reasoner-3B04.

Abstract:
Point clouds provide a compact and expressive representation of 3D objects, and have recently been integrated into multimodal large language models (MLLMs). However, existing methods primarily focus on static objects, while understanding dynamic point cloud sequences remains largely unexplored. This limitation is mainly caused by the lack of largescale cross-modal datasets and the difficulty of modeling motions in spatio-temporal contexts. To bridge this gap, we present 4DPC^2hat, the first MLLM tailored for dynamic point cloud understanding. To this end, we construct a large-scale cross-modal dataset 4DPC^2hat-200K via a meticulous two-stage pipeline consisting of topology-consistent 4D point construction and two-level captioning. The dataset contains over 44K dynamic object sequences, 700K point cloud frames, and 200K curated question–answer (QA) pairs, supporting inquiries about counting, temporal relationship, action, spatial relationship, and appearance. At the core of the framework, we introduce a Mamba-enhanced temporal reasoning MLLM to capture long-range dependencies and dynamic patterns among a point cloud sequence. Furthermore, we propose a failure-aware bootstrapping learning strategy that iteratively identifies model deficiencies and generates targeted QA supervision to continuously strengthen corresponding reasoning capabilities. Extensive experiments demonstrate that our 4DPC^2hat significantly improves action understanding and temporal reasoning compared with existing models, establishing a strong foundation for 4D dynamic point cloud understanding.

Abstract:
Direct Preference Optimization (DPO) guides large language models (LLMs) to generate recommendations aligned with user historical behavior distributions by minimizing preference alignment loss. However, our systematic empirical research and theoretical analysis reveal that DPO tends to amplify spurious correlations caused by environmental confounders during the alignment process, significantly undermining the generalization capability of LLMbased generative recommendation methods in out-of-distribution (OOD) scenarios. To mitigate this issue, we propose CausalDPO, an extension of DPO that incorporates a causal invariance learning mechanism. This method introduces a backdoor adjustment strategy during the preference alignment phase to eliminate interference from environmental confounders, explicitly models the latent environmental distribution using a soft clustering approach, and enhances robust consistency across diverse environments through invariance constraints. Theoretical analysis demonstrates that CausalDPO can effectively capture users' stable preference structures across multiple environments, thereby improving the OOD generalization performance of LLM-based recommendation models. We conduct extensive experiments under four representative distribution shift settings to validate the effectiveness of CausalDPO, achieving an average performance improvement of 24.10% across four evaluation metrics.

Abstract:
Inspired by the idea of Positiveincentive Noise (Pi-Noise or \pi-Noise) that aims at learning the reliable noise beneficial to tasks, we scientifically investigate the connection between contrastive learning and \pi-noise in this paper. By converting the contrastive loss to an auxiliary Gaussian distribution to quantitatively measure the difficulty of the specific contrastive model under the information theory framework, we properly define the task entropy, the core concept of \pi-noise, of contrastive learning. It is further proved that the predefined data augmentation in the standard contrastive learning paradigm can be regarded as a kind of point estimation of \pi-noise. Inspired by the theoretical study, a framework that develops a \pi-noise generator to learn the beneficial noise (instead of estimation) as data augmentations for contrast is proposed. The designed framework can be applied to diverse types of data and is also completely compatible with the existing contrastive models. From the visualization, we surprisingly find that the proposed method successfully learns effective augmentations.

Abstract:
As large language models (LLMs) continue to scale in size, the computational overhead has become a major bottleneck for taskspecific fine-tuning. While low-rank adaptation (LoRA) effectively curtails this cost by confining the weight updates to a low-dimensional subspace, such a restriction can hinder effectiveness and slow convergence. This contribution deals with these limitations by accumulating progressively a high-rank weight update from consecutive low-rank increments. Specifically, the per update optimal low-rank matrix is identified to minimize the loss function and closely approximate full fine-tuning. To endow efficient and seamless optimization without restarting, this optimal choice is formed by appropriately scaling the columns of the original low-rank matrix. Rigorous performance guarantees reveal that the optimal scaling can be found analytically. Extensive numerical tests with popular LLMs scaling up to 12 billion parameters demonstrate a consistent performance gain and fast convergence relative to state-of-the-art LoRA variants on diverse tasks including natural language understanding, commonsense reasoning, and mathematical problem solving.

Abstract:
Surgical video datasets are essential for scene understanding, enabling procedural modeling and intraoperative support. However, these datasets are often heavily imbalanced, with rare actions and tools under-represented, which limits the robustness of downstream models. We address this challenge with SurgiFlowVid, a sparse and controllable video diffusion framework for generating surgical videos of under-represented classes. Our approach introduces a dual-prediction diffusion module that jointly denoises RGB frames and optical flow, providing temporal inductive biases to improve motion modeling from limited samples. In addition, a sparse visual encoder conditions the generation process on lightweight signals (e.g., sparse segmentation masks or RGB frames), enabling controllability without dense annotations. We validate our approach on three surgical datasets across tasks including action recognition, tool presence detection, and laparoscope motion prediction. Synthetic data generated by our method yields consistent gains of 10–20% over competitive baselines, establishing SurgiFlowVid as a promising strategy to mitigate data imbalance and advance surgical video understanding methods.

Abstract:
Hazard and survival functions are natural, interpretable targets in timeto-event prediction, but their inherent non-additivity fundamentally limits standard additive explanation methods. We introduce Survival Functional Decomposition (SurvFD), a principled approach for analyzing feature interactions in machine learning survival models. By separating higher-order effects into time-dependent and time-independent components, SurvFD offers a previously unrecognized perspective on survival explanations, explicitly characterizing when and why additive explanations fail. Building on this theoretical decomposition, we propose SurvSHAP-IQ, which extends Shapley interactions to time-indexed functions, providing a practical estimator for higher-order, time-dependent interactions. Together, SurvFD and SurvSHAP-IQ establish a interaction- and time-aware interpretability framework for survival modeling, with broad applicability across time-to-event prediction tasks.

Abstract:
We study the transit stop placement (TrSP) problem in general metric spaces, where agents travel between source–destination pairs and may either walk directly or utilize a shuttle service via selected transit stops. We investigate fairness in TrSP through the lens of justified representation (JR) and the core, and uncover a structural correspondence with fair clustering. Specifically, we show that a constantfactor approximation to proportional fairness in clustering can be used to guarantee a constant-factor bi-parameterized approximation to core. We establish a lower bound of 1.366 on the approximability of JR, and moreover show that no clustering algorithm can approximate JR within a factor better than 3. Going beyond clustering, we propose the Expanding Cost Algorithm, which achieves a tight 2.414-approximation for JR, but does not give any bounded core guarantee. In light of this, we introduce a parameterized algorithm that interpolates between these approaches, and enables a tunable trade-off between JR and core. Finally, we complement our results with an experimental analysis using small-market public carpooling data.

Abstract:
Understanding how anatomical shapes evolve in response to developmental covariates—and quantifying their spatially varying uncertainties—is critical in healthcare research. Existing approaches typically rely on global timewarping formulations that ignore spatially heterogeneous dynamics. We introduce \textttPRISM, a novel framework that bridges implicit neural representations with uncertainty-aware statistical shape analysis. \textttPRISM models the conditional distribution of shapes given covariates, providing spatially continuous estimates of both the population mean and covariate-dependent uncertainty at arbitrary locations. A key theoretical contribution is a closed-form Fisher Information metric that enables efficient, analytically tractable local temporal uncertainty quantification via automatic differentiation. Experiments on three synthetic datasets and one clinical dataset demonstrate \textttPRISM's strong performance across diverse tasks—from modeling shape evolution to anomaly detection—within a unified framework, while providing interpretable and clinically meaningful uncertainty estimates.

Abstract:
Recent works propose complex multimodal models that handle both time series and language, ultimately claiming high performance on complex tasks like time series reasoning and cross-modal question answering. However, they skip foundational evaluations that such complex models should have mastered. So we ask a simple question: How well can recent models describe structural properties of time series? To answer this, we propose that successful models should be able to recognize, differentiate, and generate descriptions of time series. We then create BEDTime, a benchmark to assess these novel tasks, that comprises five datasets reformatted across three modalities. In evaluating 13 state-of-the-art models, we find that (1) surprisingly, dedicated time series-language models fall short, despite being designed for similar tasks, (2) vision language models are quite capable, (3) language only methods perform worst, despite many lauding their potential, and (4) all approaches are clearly fragile to a range of real world robustness tests, indicating directions for future work. Together, our findings critique prior works' claims and provide avenues for advancing multi-modal time series modeling.

Abstract:
As LLMs proliferate with diverse capabilities and costs, LLM routing has emerged by learning to predict each LLM's quality and cost for a given query, then selecting the one with high quality and low cost. However, existing routers implicitly assume a single fixed quality and cost per LLM for each query, ignoring that the same LLM's quality varies with its output length. This causes routers to exclude powerful LLMs when their estimated cost exceeds the budget, missing the opportunity that these LLMs could still deliver high quality at reduced cost with shorter outputs. To address this, we introduce R2Router, which treats output length budget as a controllable variable and jointly selects the best LLM and length budget, enforcing the budget via length-constrained instructions. This enables R2-Router to discover that a powerful LLM with constrained output can outperform a weaker LLM at comparable cost—efficient configurations invisible to prior methods. Together with the router framework, we construct R2-Bench, the first routing dataset capturing LLM behavior across diverse output length budgets. Experiments show that R2-Router achieves state-of-the-art performance at 4-5× lower cost compared with existing routers. This work opens a new direction: routing as reasoning, where routers evolve from reactive selectors to deliberate reasoners that explore which LLM to use and at what cost budget. Source code is available at https://anonymous.4open.science/r/router-763C/README.md, with an interactive demo at https://r2-router.org.

Abstract:
Accurate longhorizon vessel trajectory prediction remains challenging due to compounded uncertainty from complex navigation behaviors and environmental factors. Existing methods often struggle to maintain global directional consistency, leading to drifting or implausible trajectories when extrapolated over long time horizons. To address this issue, we propose a semantic-key-point-conditioned trajectory modeling framework, in which future trajectories are predicted by conditioning on a high-level Next Key Point (NKP) that captures navigational intent. This formulation decomposes long-horizon prediction into global semantic decision-making and local motion modeling, effectively restricting the support of future trajectories to semantically feasible subsets. To efficiently estimate the NKP prior from historical observations, we adopt a pretrain-finetune strategy. Extensive experiments on real-world AIS data demonstrate that the proposed method consistently outperforms state-of-the-art approaches, particularly for long travel durations, directional accuracy, and fine-grained trajectory prediction.

Abstract:
Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently datadependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduceSPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curatedQualitativedata split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes aThroughputdata split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.

Abstract:
Scorebased generative models have recently achieved remarkable success. While they are usually parameterized by the score, an alternative way is to use a series of time-dependent energy-based models (EBMs), where the score is obtained from the negative input-gradient of the energy. Crucially, EBMs can be leveraged not only for generation, but also for tasks such as compositional sampling or building Boltzmann Generators via Monte Carlo methods. However, training EBMs remains challenging. Direct maximum likelihood is computationally prohibitive due to the need for nested sampling, while score matching, though efficient, suffers from mode blindness. To address these issues, we introduce the Diffusive Classification (DiffCLF) objective, a simple method that avoids blindness while remaining computationally efficient. DiffCLF reframes EBM learning as a supervised classification problem across noise levels, and can be seamlessly combined with standard score-based objectives. We validate the effectiveness of DiffCLF by comparing the estimated energies against ground truth in analytical Gaussian mixture cases, and by applying the trained models to tasks such as model composition and Boltzmann Generator sampling. Our results show that DiffCLF enables EBMs with higher fidelity and broader applicability than existing approaches.

Abstract:
We study multiobjective optimization over probability distributions in Wasserstein space. Recently, \citetnguyen2025multiple introduced Multiple Wasserstein Gradient Descent (MWGraD) algorithm, which exploits the geometric structure of Wasserstein space to jointly optimize multiple objectives. Building on this approach, we propose an accelerated variant, A-MWGraD, inspired by Nesterov's acceleration. We analyze the continuous-time dynamics and establish convergence to weakly Pareto optimal points in probability space. Our theoretical results show that A-MWGraD achieves a convergence rate of \mathcalO(1/t^2) for geodesically convex objectives and \mathcalO(e^-\sqrt\betat) for \beta-strongly geodesically convex objectives, improving upon the \mathcalO(1/t) rate of MWGraD in the geodesically convex setting. We further introduce a practical kernel-based discretization for A-MWGraD and demonstrate through numerical experiments that it consistently outperforms MWGraD in convergence speed and sampling efficiency on multi-target sampling tasks.

Abstract:
Traditional neural topic models are typically optimized by reconstructing the document's Bagof-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we propose a novel approach to construct semantically-grounded soft label targets using Language Models (LMs) by projecting the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals. By training the topic models to reconstruct the soft labels using the LM hidden states, our method produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Experiments on three datasets show that our method achieves substantial improvements in topic coherence, purity over existing baselines. Additionally, we also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.\footnotewe will release the code upon publishing the paper.

Abstract:
Large Audio Language Models (LALMs) excel at perception but struggle with complex reasoning requiring precise acoustic measurements. While external tools can extract finegrained features like exact tempo or pitch, effective integration remains challenging: naively using all tools causes information overload, while prompt-based selection fails to assess context-dependent utility. To address this, we proposeAuTAgent(AudioToolAgent), a reinforcement learning framework that learns when and which tools to invoke. By employing a sparse-feedback training strategy with a novel Differential Reward mechanism, the agent learns to filter out irrelevant tools and invokes external assistance only when it yields a net performance gain over the base model. Experimental results confirm that AuTAgent complements the representation bottleneck of LALMs by providing verifiable acoustic evidence. It improves accuracy by 4.20% / 6.20% and 9.80% / 8.00% for open-source and closed-source backbones on the MMAU Test-mini and the MMAR benchmarks, respectively. In addition, further experiments demonstrate exceptional transferability. We highlight the complementary role of external tools in augmenting audio model reasoning.

Abstract:
Retrieval quality is the primary bottleneck for accuracy and robustness in retrievalaugmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce \emphsemantic stratification, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.

Abstract:
Neural operators have emerged as datadriven surrogates for solving partial differential equations (PDEs), and their success hinges on efficiently modeling the long-range, global coupling among spatial points induced by the underlying physics. In many PDE regimes, the induced global interaction kernels are empirically compressible, exhibiting rapid spectral decay that admits low-rank approximations. We leverage this observation to unify representative global mixing modules in neural operators under a shared low-rank template: compressing high-dimensional pointwise features into a compact latent space, processing global interactions within it, and reconstructing the global context back to spatial points. Guided by this view, we introduce Low-Rank Spatial Attention (LRSA) as a clean and direct instantiation of this template. Crucially, unlike prior approaches that often rely on non-standard aggregation or normalization modules, LRSA is built purely from standard Transformer primitives, i.e., attention, normalization, and feed-forward networks, yielding a concise block that is straightforward to implement and directly compatible with hardware-optimized kernels. In our experiments, such a simple construction is sufficient to achieve high accuracy, yielding an average error reduction of over 17% relative to second-best methods, while remaining stable and efficient in mixed-precision training.

Abstract:
Multimodal Diffusion Transformers (MMDiTs) for textto-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs—SD3, SD3.5, and FLUX.1 by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text--image generation quality.

Abstract:
The expansion of textto-image diffusion models has raised concerns about harmful outputs, from fabricated depictions of public figures to sexually explicit imagery. To mitigate such risks, prior work has proposed concept erasure methods that aim to sever unwanted concepts from the model via fine-tuning, yet it remains unclear whether these approaches truly remove all links to the harmful concept or merely conceal superficial connections. In this work, we reveal a critical vulnerability, the Erasure Evasion Backdoors (EEB): an adversary binds a backdoor trigger to a concept slated for removal, and this malicious link survives subsequent erasure. We show that both black-box and white-box adversaries can instantiate this threat. Across six state-of-the-art erasure methods, including robust ones that explicitly search for alternative representations of the target concept, EEB consistently exposes harmful content: up to 82% success against celebrity-identity unlearning, up to 94% for object erasure, and up to 16× amplification of explicit-content exposure. While EEB uncovers a blind spot in current erasure methods, it also provides a diagnostic tool for stress-testing future concept erasure techniques.

Abstract:
Data assimilation (DA) is a cornerstone of scientific and engineering applications, combining model forecasts with sparse and noisy observations to estimate latent system states. Classical highdimensional DA methods, such as the ensemble Kalman filter, rely on Gaussian approximations that are violated for complex dynamics or observation operators. To address this limitation, we introduce DAISI, a scalable filtering algorithm built on flow-based generative models that enables flexible probabilistic inference using data-driven priors. The core idea is to use a stationary, pre-trained generative prior that first incorporates forecast information through a novelinverse-sampling step, before assimilating observations via guidance-based conditional sampling. This allows us to leverage any forecasting model as part of the DA pipeline without having to retrain or fine-tune the generative prior at each assimilation step. Experiments on challenging nonlinear systems show that DAISI achieves accurate filtering results in regimes with sparse, noisy, and nonlinear observations where traditional methods struggle.

Abstract:
DiLoCo is a powerful framework for training large language models (LLMs) under networking constraints, allowing for increased parallelism and accelerator utilization in data center settings. A critical but often overlooked factor in DiLoCo’s behavior is the choice of inner optimizer, which shapes the pseudogradient used by the outer optimizer. Given the recent success of Muon relative to AdamW for data parallel (DP) training, in this work, we examine how Muon's normalized optimizer steps can affect the pseudogradient's quality. Empirically, we find that, relative to AdamW, Muon yields more \emphdirectionally correct pseudogradients as the number of workers (K) is increased. In our experiments pretraining language models, we conduct extensive hyperparameter tuning across 150M, 416M, 914M, 1.76B, and 3.1B models for DiLoCo, MuLoCo, AdamW DP, and Muon DP. Consistently across all scales, we find that with K\geq1 workers, MuLoCo (Muon inner optimizer DiLoCo) achieves superior performance to DiLoCo in absolute terms and for K>2 it outperforms DiLoCo relative to their data parallel baselines, while being compatible with quantization, streaming, and long synchronization intervals. At K=1, we find that MuLoCo can even outperform the data-parallel gold standard while having larger optimal and critical batch sizes.

Abstract:
We propose LOGDIFF (Logical Guidance for the Exact Composition of Diffusion Models), a guidance framework for diffusion models that enables principled constrained generation with complex logical expressions at inference time. We study when exact scorebased guidance for complex logical formulas can be obtained from guidance signals associated with atomic attributes and constraints. First, we derive an exact Boolean calculus that provides a sufficient condition for exact logical guidance. Specifically, if a formula admits a circuit representation in which conjunctions combine conditionally independent subformulas and disjunctions combine subformulas that are either conditionally independent or mutually exclusive, exact logical guidance is achievable. In this case, the guidance signal can be computed exactly from atomic scores and posterior probabilities using an efficient recursive algorithm. Moreover, we show that, for commonly encountered classes of distributions, any desired Boolean formula is compilable into such a circuit representation. Second, by combining atomic guidance scores with posterior probability estimates, we introduce a hybrid guidance approach that bridges classifierguidance and classifier-free guidance, applicable to both compositional logical guidance and standard conditional generation. We demonstrate the effectiveness of our framework on multiple image and protein structure generation tasks.

Abstract:
Digital watermarking is essential for securing generated images from diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often use inappropriate metrics for semantic watermarks. To address these gaps, we proposeWMVLM, the first unified and interpretable evaluation framework for diffusion model imagewatermarking viavisionlanguagemodels (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy to progressively enable the model to achieve classification, scoring, and interpretable text generation. Experiments show WMVLM outperforms state-of-the-art VLMs with strong generalization across datasets, diffusion models, and watermarking methods.

Abstract:
Code translation transforms source code from one programming language (PL) to another. Validating the functional equivalence of translation and repairing, if necessary, are critical steps in code translation. Existing automated validation and repair approaches struggle to generalize to many PLs due to high engineering overhead, and they rely on existing and often inadequate test suites, which results in false claims of equivalence and ineffective translation repair. To bridge this gap, we develop MatchFixAgent, a large language model (LLM)based, PL-agnostic framework for equivalence validation and repair of translations. MatchFixAgent features a multi-agent architecture that divides equivalence validation into several sub-tasks to ensure thorough and consistent semantic analysis of the translation. We compare MatchFixAgent's validation and repair results with four repository-level code translation techniques. Our results demonstrate that MatchFixAgent produces (in)equivalence verdicts for 99.2% of translation pairs, with the same equivalence validation result as prior work on 72.8% of them. When MatchFixAgent's result disagrees with prior work, we find that 60.7% of the time MatchFixAgent's result is actually correct. In addition, we show that MatchFixAgent can repair 50.6% of inequivalent translation, compared to prior work’s 18.5%.

Abstract:
Recent studies indicate that when faced with explicit biases in prompts, models often omit mentioning these biases in their Chainof-Thought (CoT) output, revealing that verbalized reasoning can give an incorrect picture of how models arrive at conclusions (unfaithfulness). In this work, we show that unfaithful CoT also occurs on naturally worded, non-adversarial prompts without adding artificial biases or editing model outputs. We find that when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify systematically answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We present preliminary evidence that this is due to models' implicit biases towards Yes or No, labeling thisImplicit Post-Hoc Rationalization. Our results reveal rates up to 13% for production models, and while frontier models are more faithful, none are entirely so, including thinking models like DeepSeek R1 (0.37%) and Sonnet 3.7 with thinking (0.04%). We also investigateUnfaithful Illogical Shortcuts, where models use subtly illogical reasoning to make speculative answers to hard math problems seem rigorously proven. Our findings indicate that while CoT can be useful for assessing outputs, it is not a complete account of a model's internal reasoning and should be used with caution in agentic or safety-critical settings.

Abstract:
Diffusion models have achieved remarkable success in generative modeling, yet how to effectively adapting large pretrained models to new tasks remains challenging. We revisit the reconstruction behavior of diffusion models during denoising to unveil the underlying frequency–energy mechanism governing this process. Building upon this observation, we propose FeRA, a frequencydriven fine-tuning framework that aligns parameter updates with the intrinsic frequency–energy progression of diffusion. FeRA establishes a comprehensive frequency–energy framework for effective diffusion adaptation fine-tuning, comprising three synergistic components: (i) a compact frequency–energy indicator that characterizes the latent’s bandwise energy distribution, (ii) a soft frequency router that adaptively fuses multiple frequency-specific adapter experts, and (iii) a frequency–energy consistency regularization that stabilizes diffusion optimization and ensures coherent adaptation across bands. Routing operates in both training and inference, with inference-time routing dynamically determined by the latent frequency energy. It integrates seamlessly with adapter-based tuning schemes and generalizes well across diffusion backbones and resolutions. By aligning adaptation with the frequency–energy mechanism, FeRA provides a simple, stable, and compatible paradigm for effective and robust diffusion model adaptation. Codes will be made publicly available.

Abstract:
Adaptive NeuroFuzzy Inference System (ANFIS) was designed to combine the learning capabilities of neural network with the reasoning transparency of fuzzy logic. However, conventional ANFIS architectures suffer from structural complexity, where the product-based inference mechanism causes an exponential explosion of rules in high-dimensional spaces. We herein propose the Kolmogorov-ArnoldNeuro-FuzzyInferenceSystem (KANFIS), a compact neuro-symbolic architecture that unifies fuzzy reasoning with additive function decomposition. KANFIS employs an additive aggregation mechanism, under which both model parameters and rule complexity scale linearly with input dimensionality rather than exponentially. Furthermore, KANFIS is compatible with both Type-1 (T1) and Interval Type-2 (IT2) fuzzy logic systems, enabling explicit modeling of uncertainty and ambiguity in fuzzy representations. By using sparse masking mechanisms, KANFIS generates compact and structured rule sets, resulting in an intrinsically interpretable model with clear rule semantics and transparent inference processes. Empirical results demonstrate that KANFIS achieves competitive performance against representative neural and neuro-fuzzy baselines.

Abstract:
Offline goalconditioned RL (GCRL) learns goal-reaching policies from static datasets, but real-world datasets are often partially observable and history-dependent, exhibiting a mix of Markovian and non-Markovian that violate standard RL assumptions. History-aware sequence models such as Decision Transformer (DT) are a natural fit for long-term dependency modeling, yet pure attention is inefficient and brittle when handling local Markovian structure and long-range context simultaneously. Although recent hybrid architectures (e.g., LSDT) introduce local extractors to improve local dependencies modeling, the fixed-window extraction cannot adapt its effective memory to varying dependency lengths in temporally heterogeneous settings, often truncating long-range context rather than compressing its content adaptively. Moreover, sequential offline GCRL faces a key bottleneck: under sparse rewards, return-to-go (RTG) becomes non-discriminative across sub-trajectories, providing little guidance signal for stitching goal-reaching behaviors from diverse demonstrations. To address these, we propose QHyer, which replaces RTG with a flow-parameterized, state-conditioned goal-reaching Q-estimator to support stitching across demonstrations, and introduces a gated Hybrid Attention-Mamba backbone that performs content-adaptive history compression while preserving local dynamics. Extensive experiments demonstrate that QHyer achieves state-of-the-art performance on both non-Markovian and Markovian datasets, validating its effectiveness for diverse scenarios.

Abstract:
Geometrical interpretations of deep learning models offer insightful perspectives into their underlying mathematical structures. In this work, we introduce a novel approach that leverages differential geometry, particularly concepts from Riemannian geometry, to integrate multiple models into a unified transfer learning framework. By aligning the Ricci curvature of latent space of individual models, we construct an interrelated architecture, namely Geometric Embedding Alignment via cuRvature matching in transfer learning (GEAR), which ensures comprehensive geometric representation across datapoints. This framework enables the effective aggregation of knowledge from diverse sources, thereby improving performance on target tasks. We evaluate our model on 23 molecular task pairs and demonstrate significant performance gains over existing benchmark models—achieving improvements of at least 14.4% under random splits and 8.3% under scaffold splits.

Abstract:
Statespace models (SSMs) have emerged as an efficient strategy for building powerful language models, avoiding the quadratic complexity of computing attention in transformers. Despite their promise, the interpretability and steerability of modern SSMs remain relatively underexplored. We take a major step in this direction by identifyingactivation subspace bottlenecksin the Mamba family of SSM models using tools from mechanistic interpretability. We then introduce a test-time steering intervention that simply multiplies the activations of the identified bottlenecks by a scalar. Across 5 SSMs and 6 diverse benchmarks, this intervention improves performance by an average of 8.27%, without requiring any task-specific tuning. Finally, we validate that the identified bottlenecks are indeed hindering performance by modifying them to yield an architecture we call Stable-Mamba, which achieves long-context performance gains when retrained from scratch.

Abstract:
LoRA has become a widely adopted method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate targetdomain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition. In this paper, we establish a theoretical framework for data-aware LoRA initialization. Starting from minimizing the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models, and is approximated using a Fisher–gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. Solving this problem yields an optimal initialization strategy for LoRA, based on which we develop an efficient algorithm, LoRA-DA. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code will be released upon publication.

Abstract:
Beyond conditional average treatment effects, treatments may impact the entire outcome distribution in covariatedependent ways, for example, by altering the variance or tail risks for specific subpopulations. We propose a novel estimand to capture such conditional distributional treatment effects, and develop a doubly robust estimator that is minimax optimal in the local asymptotic sense. Using this, we develop a test for the global homogeneity of conditional potential outcome distributions that accommodates discrepancies beyond the maximum mean discrepancy (MMD), has provably valid type 1 error, and is consistent against fixed alternatives---the first test, to our knowledge, with such guarantees in this setting. Furthermore, we derive exact closed-form expressions for two natural discrepancies (including the MMD), and provide a computationally efficient, permutation-free algorithm for our test.

Abstract:
Reinforcement learning (RL) has become essential to the reasoning and alignment posttraining of large language models (LLMs). However, successful RL relies on sufficient exploration of diverse actions by the model during training. We study whether RL is robust toexploration hacking, where a model strategically alters its exploration during training to influence the subsequent training outcome. First, we create model organisms of exploration hacking by using fine-tuning-based "locking" techniques; we show that these models can successfully resist RL-based capability elicitation in AI R&D and agentic biosecurity environments, while maintaining performance on closely related tasks. Next, we use our model organisms to evaluate the effectiveness of monitoring techniques as detection methods for exploration hacking. Finally, we show that current frontier models can reason effectively about suppressing their exploration when presented with simulated RL environments and encouraged to act strategically. Together, our results empirically establish exploration hacking as a failure mode of RL on sufficiently capable LLMs.

Abstract:
Data is the lifeblood of AI, yet much of the most valuable data remains locked in silos due to privacy and regulations. As a result, AI remains heavily underutilized in many of the most important domains, including healthcare, education, and finance. Synthetic data generation (SDG), i.e.~the generation of artificial data with a synthesizer trained on real data, offers an appealing solution to make data available while mitigating privacy concerns, however existing SDGas-a-service workflow require data holders to trust providers with access to private data. We propose FHAIM, the first fully homomorphic encryption (FHE) framework for training a marginal-based synthetic data generator on encrypted tabular data. FHAIM adapts the widely used AIM algorithm to the FHE setting using novel FHE protocols, ensuring that the private data remains encrypted throughout and is released only with differential privacy guarantees. Our empirical analysis show that FHAIM preserves the performance of AIM while maintaining feasible runtimes.

Abstract:
Label Distribution Learning (LDL) models supervision as an instancewise probability distribution, enabling fine-grained learning under inherent ambiguity, but its success relies on high-fidelity label distributions that are costly to obtain and thus often noisy. Motivated by privacy-sensitive applications, we study Federated Label Distribution Learning (Fed-LDL), where data isolation further induces heterogeneous annotation quality across clients, making local updates unevenly reliable and breaking sample-size-based aggregation (e.g., FedAvg). To address this trust dilemma, we propose FedQual, a quality-aware Fed-LDL framework with two coupled mechanisms: (i) quality-adaptive client training guided by a global semantic anchor that calibrates low-quality clients while preserving high-quality autonomy, and (ii) reliability-aware server aggregation that reweights client contributions by effective reliable information rather than raw sample size. To enable rigorous evaluation, we construct four new Fed-LDL benchmarks (FER-LDL, FI-LDL, PIPAL-LDL, and KADID-LDL) with controlled annotation quality disparity. We further provide a theoretical guarantee showing that under heterogeneous supervision quality, client-specific calibration is strictly better than any uniform calibration. Extensive experiments on the proposed benchmarks demonstrate the effectiveness of FedQual.

Abstract:
The evolution of RetrievalAugmented Generation (RAG) has shifted from static retrieval pipelines to dynamic, agentic workflows where a central planner orchestrates multi-turn reasoning. However, existing paradigms face a critical dichotomy: they either optimize modules jointly within rigid, fixed-graph architectures, or empower dynamic planning while treating executors as frozen, black-box tools. We identify that this decoupled optimization creates a ``strategic-operational mismatch,'' where sophisticated planning strategies fail to materialize due to unadapted local executors, often leading to negative performance gains despite increased system complexity. In this paper, we propose JADE (Joint Agentic Dynamic Execution), a unified framework for the joint optimization of planning and execution within dynamic, multi-turn workflows. By modeling the system as a cooperative multi-agent team unified under a single shared backbone, JADE enables end-to-end learning driven by outcome-based rewards. This approach facilitates co-adaptation: the planner learns to operate within the capability boundaries of the executors, while the executors evolve to align with high-level strategic intent. Empirical results demonstrate that JADE transforms disjoint modules into a synergistic system, yielding remarkable performance improvements via joint optimization and enabling a flexible balance between efficiency and effectiveness through dynamic workflow orchestration.

Abstract:
LLMbased multi-agent systems have demonstrated impressive capabilities, but they also introduce significant safety risks when individual agents fail or behave adversarially. In this work, we study the automated design of agentic systems that remain safe even when a subset of agents is compromised. We formalize this challenge as a Stackelberg security game between a system designer (the Meta-Agent) and a best-responding Meta-Adversary that selects and compromises a subset of agents to minimize safety. We propose Meta-Adversary–Meta-Agent (MaMa), a novel algorithm for approximately solving this game and automatically designing safe agentic systems. Our approach uses LLM-based adversarial search, where the Meta-Agent iteratively proposes system designs and receives feedback based on the strongest attacks discovered by the Meta-Adversary. Empirical evaluations across diverse environments show that systems designed with MaMa consistently defend against worst-case attacks while maintaining performance comparable to systems optimized solely for task success. Moreover, the resulting systems generalize to stronger adversaries, as well as ones with different attack objectives or underlying LLMs, demonstrating robust safety beyond the training setting.

Abstract:
We study online learning in Bayesian Stackelberg games, where a leader repeatedly interacts with a follower whose unknown private type is independently drawn at each round from an unknown probability distribution. The goal is to design algorithms that minimize the leader's regret with respect to always playing an optimal commitment computed with knowledge of the game. We consider, for the first time to the best of our knowledge, the most realistic case in which the leader does not know anything about follower's types, i.e., the possible follower's payoffs. This raises considerable additional challenges compared to the usually addressed case in which follower's payoffs are known. First, we prove a strong negative result: noregret is unattainable under action feedback, i.e., when the leader only observes the follower's best response at the end of each round. Thus, we focus on the easier type feedback model, where the follower's type is also revealed. In such a setting, we propose an algorithm that achieves a regret of \widetildeO(\sqrtT), ignoring the dependence on other parameters.

Abstract:
InContext Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known asIn-Context Alignment(ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions—human values are inherentlypluralistic, often imposing conflicting demands,e.g., stimulation vs. tradition. Current ICA methods therefore face theInstruction Bottleneckchallenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we proposePICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that incorporates multiple values to better elicit LLMs' understanding of them and improve alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, which theoretically reinforces value conformity and reduces distractive noise, resulting in more effective instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.

Abstract:
Momentum Stochastic Gradient Descent (mSGD) relies on a fixed momentum coefficient shared across all parameters, failing to account for the heterogeneous structure of modern loss landscapes. In this work, we adopt a continuoustime formulation to introduce individual, adaptive momentum coefficients regulated by the kinetic energy of each model parameter. This mechanism automatically adjusts to local curvature to maintain stability without sacrificing convergence speed. We demonstrate that this adaptive friction is inextricably linked to cubic damping, a suppression mechanism from structural dynamics. We additionally introduce two optimization schemes by augmenting the continuous dynamics of mSGD and Adam with a cubic damping term. Empirically, our methods demonstrate robustness and match or outperform Adam on training ViT, BERT, and GPT2; tasks where mSGD typically struggles. We further provide theoretical results establishing the exponential convergence of the proposed schemes.

Abstract:
As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears acceptable but falls short of its true abilities. Can training force a model to produce its best work, even when we cannot reliably verify whether it has? We study this using model organisms trained to deliberately sandbag, testing supervised finetuning (SFT) and reinforcement learning (RL) as elicitation techniques on Olympiad math, graduate-level science (Super GPQA), and competitive coding (Code Contests). SFT on weaker supervisor outputs reliably reduces sandbagging and elicits performance beyond the supervisor’s own capabilities, though not always fully. RL alone is unreliable: consistent sandbagging limits exploration of correct answers, allowing the model to reward hack the supervisor instead. SFT followed by RL works most reliably: SFT reduces sandbagging enough for RL to obtain useful signal and fully elicit the sandbagging model. When training and evaluation distributions differ, however, the model can exploit this gap by producing correct answers during training while continuing to sandbag at evaluation.

Abstract:
In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark leaderboards. However, standard evaluations rely on aggregate metrics (e.g., MSE) that conflate model capability with the intrinsic difficulty of the evaluated instances. To address this, we propose a diagnostic framework anchored in Spectral Coherence Predictability (SCP), which provides an efficient \mathcalO(N\log N) perinstance difficulty reference and yields a corresponding linear MSE lower bound. Complementing this, we introduce the Linear Utilization Ratio (LUR) to quantify how effectively models exploit linearly predictable structures across frequencies. Experiments on synthetic and real-world benchmarks show that SCP aligns strongly with realized forecasting errors across diverse state-of-the-art forecasters. Using this lens, we uncover ``predictability drift,'' revealing that task difficulty is not static but fluctuates significantly over time and variables. Furthermore, stratified evaluation exposes complementary architectural strengths across distinct frequency bands and difficulty regimes. Overall, we advocate moving beyond leaderboard-style ranking toward a more insightful, predictability-aware evaluation that fosters fairer model comparisons and a deeper understanding of model behavior. Code and data are available at https://anonymous.4open.science/r/TS_Predictability-C8B7.

Abstract:
Ecommerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose amulti-modal information density assessment frameworkto quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduceE-commerce Video Ads Benchmark (E-VAds), which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, namely Perception and Cognition and Reasoning, which consist of five distinct tasks. Finally, we developE-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design calledMG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.

Abstract:
Graph Neural Networks (GNNs) are widely deployed in industry, making their intellectual property valuable. However, protecting GNNs from unauthorized use remains a challenge. Watermarking offers a solution by embedding ownership information into models. Existing watermarking methods have two limitations: First, they rarely focus on graph data or GNNs. Second, the \emphde facto backdoorbased method relies on manipulating training data, which can introduce ownership ambiguity through misclassification and vulnerability to data poisoning attacks that can interrupt the backdoor mechanism. Our explanation-based watermarking inherits the strengths of backdoor-based methods (e.g., black-box verification) without data manipulation, eliminating ownership ambiguity and data dependencies. In particular, we watermark GNN explanations such that these explanations are statistically distinct from others, so ownership claims must be verified through statistical significance. We theoretically prove that, even with full knowledge of our method, locating the watermark is NP-hard. Empirically, our method demonstrates robustness to fine-tuning and pruning attacks. By addressing these challenges, our approach significantly advances GNN intellectual property protection.

Abstract:
This work investigates the resilience of contemporary large language models (LLMs) against frequent characterlevel perturbations. We examine three types of character-level perturbations including introducing numerous typos within words, shuffling the characters in each word, and inserting a large number of invisible characters into the text. Surprisingly, even under severe perturbation, such as shuffling nearly all words character-wise to produce text that is almost unreadable to humans, or inserting invisible characters which are several times more than the visible ones as noise, many LLMs still maintain notable performance. We explore the underlying causes of this robustness and find that LLMs exhibit remarkable resilience to chaotic segmentation and fragmented tokenization. Furthermore, we examine the mechanisms by which LLMs remove perturbations to correctly comprehend text, including both implicit and explicit mechanisms for character-level perturbation. We hope that our findings on the low-level robustness of LLMs will unveil their inherent architectural strengths, reveal the potential risks of their misuse, and inform the reliable deployment of LLMs across diverse application scenarios.

Abstract:
Classifierfree guidance (CFG) is a widely used technique for controllable generation in diffusion and flow-based models. Despite its empirical success, CFG relies on a heuristic linear extrapolation that is often sensitive to the guidance scale. In this work, we provide a principled interpretation of CFG through the lens of optimization. We demonstrate that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, which guides latent variables toward the scaled target image set. This perspective reveals that the standard CFG formulation is an approximation of this gradient, where the prediction gap, the discrepancy between conditional and unconditional outputs, governs guidance sensitivity. Leveraging this insight, we reformulate the CFG sampling as a homotopy optimization with a manifold constraint. This formulation necessitates a manifold projection step, which we implement via an incremental gradient descent scheme during sampling. To improve computational efficiency and stability, we further enhance this iterative process with Anderson Acceleration without requiring additional model evaluations. Our proposed methods are training-free and consistently refine generation fidelity, prompt alignment, and robustness to the guidance scale. We validate their effectiveness across diverse benchmarks, demonstrating significant improvements on large-scale models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5. Code is available in the supplementary materials.

Abstract:
3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in novel view synthesis and 3D scene reconstruction, but its quality often degrades in realworld environments due to transient distractors, such as moving objects and varying shadows. Existing methods commonly rely on semantic cues extracted from pre-trained vision models to identify and suppress these distractors, but such semantics are misaligned with the binary distinction between static and transient regions and remain fragile under the appearance perturbations introduced during 3DGS optimization. We propose 3DGS-HPC, a framework that circumvents these limitations by combining two complementary principles: a patch-wise classification strategy that leverages local spatial consistency for robust region-level decisions, and a hybrid classification metric that adaptively integrates photometric and perceptual cues for more reliable separation. Extensive experiments demonstrate the superiority and robustness of our method in mitigating distractors to improve 3DGS-based novel view synthesis. The code will be released.

Abstract:
While large language models (LLMs) perform strongly on diverse tasks, their trustworthiness is limited by erratic behavior that is unfaithful to their internal knowledge. In particular, LLMs often fail on multiplechoice questions (MCQs) even if they encode correct answers in their hidden representations, revealing a misalignment between internal knowledge and output behavior. We investigate and mitigate this knowledge-prediction gap on MCQs through a three-step analysis of hidden representations. First, we quantify the prevalence and magnitude of the gap across models and datasets. Second, we provide a geometric interpretation by identifying distinct knowledge and prediction subspaces in the residual stream. Third, we introduce KAPPA, a lightweight inference-time intervention that aligns the two subspaces within the residual stream to reduce the knowledge-prediction gap. Our results provide a geometric and interpretable explanation of the knowledge-prediction gap in LLMs. Furthermore, KAPPA effectively reduces the gap across diverse MCQ benchmarks and models, and generalizes to free-form settings.

Abstract:
Visionlanguage models (VLMs) achieve strong performance on standard, high-quality datasets, but we still don't fully understand how they perform under real-world image distortions. We presentVLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severityglass_blurreduces MMBench accuracy by about 8pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g.,upsample,elastic_transform, reaching up to 34pp. Overall, our findings suggest current VLMs aresemantically strong but spatially fragile, motivating the definition of novel robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.

Abstract:
Diffusion models are distinguished by their exceptional generative performance, particularly in producing highquality samples through iterative denoising. While current theory suggests that the number of denoising steps required for accurate sample generation should scale linearly with data dimension, this does not reflect the practical efficiency of widely used algorithms like Denoising Diffusion Probabilistic Models (DDPMs). This paper investigates the effectiveness of diffusion models in sampling complex high-dimensional distributions that can be well-approximated by Gaussian Mixture Models (GMMs). For these distributions, our main result shows that DDPM takes at most \widetildeO(1/\varepsilon) iterations to attain an \varepsilon-accurate distribution in total variation (TV) distance, independent of both the ambient dimension d and the number of components K, up to logarithmic factors. Furthermore, this result remains robust to score estimation errors. These findings highlight the remarkable effectiveness of diffusion models in high-dimensional settings given the universal approximation capability of GMMs, and provide theoretical insights into their practical success.

Abstract:
Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRM) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring their reward functions) or suffer intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RLinspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that updates the policy and the PRM interchangeably. Our learning algorithm has customized techniques to address the challenges of scaling traditional inverse RL to LLMs. We theoretically show that our proposed learning framework can unify both online and offline PRM learning methods, justifying that rePIRL can learn PRMs with minimal assumptions. Empirical evaluations on standardized math and coding reasoning datasets demonstrate the effectiveness of rePIRL over existing methods. We further show the application of our trained PRM in test-time training, test-time scaling, and providing an early signal for training hard problems. Finally, we validate our training recipe and key design choices via a detailed ablation study.

Abstract:
We consider the privacy amplification properties of a sampling scheme in which a user’s data is used in k steps chosen randomly and uniformly from a sequence (or set) of t steps. This sampling scheme has been recently applied in the context of differentially private optimization [Chua et al., 2024a, ChoquetteChoo et al., 2025] and communication-efficient high-dimensional private aggregation [Asi et al., 2025], where it was shown to have utility advantages over the standard Poisson sampling. Theoretical analyses of this sampling scheme [Feldman and Shenfeld, 2025, Dong et al., 2025] lead to bounds that are close to those of Poisson sampling, yet still have two significant shortcomings. First, in many practical settings, the resulting privacy parameters are not tight due to the approximation steps in the analysis. Second, the computed parameters are either the hockey stick or Rényi divergence, both of which introduce overheads when used in privacy loss accounting. In this work, we demonstrate that the privacy loss distribution (PLD) of random allocation applied to any differentially private algorithm can be computed efficiently. When applied to the Gaussian mechanism, our results demonstrate that the privacy-utility trade-off for random allocation is at least as good as that of Poisson subsampling. In particular, random allocation is better suited for training via DP-SGD. To support these computations, our work develops new tools for general privacy loss accounting based on a notion of PLD realization. This notion allows us to extend accurate privacy loss accounting to subsampling which previously required manual noise-mechanism-specific analysis.

Abstract:
Strategic classification examines how decision rules interact with agents who strategically adapt their features. Most existing models focus on maximizing predictive performance, assuming agents best respond to the learned classifier. However, real decisionmaking systems are rarely optimized solely for accuracy: ethical, economic, and institutional considerations often make some feature changes more desirable than others. At the same time, principals may wish to incentivize these changes fairly across heterogeneous agents. While prior work has studied causal structure between features, notions of desirability, and information disparities in isolation, this work initiates a unified treatment of these components within a single framework. We frame the problem as a constrained optimization problem that captures the trade-offs between optimality, desirability, and fairness. We provide theoretical guarantees on the principal's optimality loss constrained to a particular desirability fairness tolerance for multiple broad classes of fairness measures. Finally, through experiments on real datasets, we show the explicit tradeoff between maximizing accuracy and fairness in desirability effort.

Abstract:
Motivated by problems in online advertising, we address the task of Learning from Label Proportions (LLP). We introduce a novel and versatile lowvariance debiasing methodology to learn from aggregate label information, significantly advancing the state of the art in LLP. Our debiasing approach exhibits remarkable flexibility, seamlessly accommodating a broad spectrum of practically relevant loss functions across both binary and multi-class classification settings. By carefully combining our estimators with standard techniques, we improve sample complexity guarantees for a large class of losses of practical relevance. We also empirically validate the efficacy of our proposed approach across a diverse array of benchmark datasets, demonstrating compelling empirical advantages over standard baselines.

Abstract:
Realworld knowledge is often organized as hierarchies such as product taxonomies, medical ontologies, and label trees, yet learning hierarchical representations is challenging due to asymmetric structure and noisy semantics. We introduce Polaris, a polar hyperspherical embedding framework that separates semanticity from hierarchy using angular geometry and radius, enabling the learning of meaning and structure without interference. To map latent representation onto the sphere, we project it to the tangent space at the north pole, apply the exponential map, and learn unit-norm representations using spherical linear layers. Polaris then combines robust local constraints, global regularization that prevents geometric collapse, and uncertainty-aware asymmetric objectives that encourage directional containment. At inference time, Polaris uses structure-guided retrieval to efficiently narrow down candidate parents before final ranking. We evaluate Polaris on different settings of taxonomy expansion -- spanning trees, multi-parent DAGs, and multimodal hierarchies, showing consistent improvements of up to ～19 points in top-K retrieval and up to ～ 60% reduction in mean rank over fourteen strong baselines.

Abstract:
This paper focuses on a key challenge in Neural Architecture Search (NAS): integrating established architectural knowledge while exploring new designs under expensive evaluations. Large language models (LLMs) are a promising assistant for NAS because they can translate rich architectural and coding priors into executable code edits. However, in practice, seemingly local revisions often propagate into nonlocal behavioral and performance shifts because a single edit can inadvertently couple multiple interacting functional factors, a phenomenon we refer to as functional entanglement. To make LLM knowledge usable under such entanglement, we propose Structured Progressive Knowledge Activation (SPARK), which activates relevant priors by explicitly selecting the functional factor to modify and conditioning the edit on that factor. This factor-conditioned editing reduces entangled side effects and yields more targeted, reliable architecture modifications. On CLRS-DFS, SPARK reduces the number of training evaluations by 28.1× over EvoPrompting and improves OOD accuracy by +15.6 points, with essentially unchanged compute (～453K MACs).

Abstract:
Web agents require massive trajectories to generalize, yet realworld training is constrained by network latency, rate limits, and safety risks. We introduce WebWorld series, the first open-web simulator trained at scale. While existing simulators are restricted to closed environments with thousands of trajectories, WebWorld leverages a scalable data pipeline to train on 1M+ open-web interactions, supporting reasoning, multi-format data, and long-horizon simulations of 30+ steps. For intrinsic evaluation, we introduce WebWorld-Bench with dual metrics spanning nine dimensions, where WebWorld achieves simulation performance comparable to Gemini-3-Pro. For extrinsic evaluation, Qwen3-14B trained on WebWorld-synthesized trajectories improves by +9.2% on WebArena, reaching performance comparable to GPT-4o. WebWorld enables effective inference-time search, outperforming GPT-5 as a world model. Beyond web simulation, WebWorld exhibits cross-domain generalization to code, GUI, and game environments, providing a replicable recipe for world-model construction.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While KullbackLeibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We theoretically derive that APO serves as a gradient-aligned mechanism to maximize support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.

Abstract:
Vision–language models achieve incredible performance across a wide range of tasks, but their large size makes inference costly. Recent work has shown that multimodal processing contains significant redundancies, making it possible to skip certain layers with minimal performance loss. Yet current pruning techniques remain adhoc, relying on heuristics or hyperparameter sweeps rather than principled criteria for determining when layer skipping is beneficial. In this paper, we propose a unified framework that characterizes the redundancy conditions under which pruning can enhance efficiency without sacrificing performance. Central to our approach are experimentally verifiable and interpretable notions of redundancy that can be evaluated without requiring downstream task performance as a metric. Applying this framework, we corroborate prior findings that both early and late vision tokens are redundant across models, and we validate our conditions by showing they align with actual performance degradation. Beyond these empirical results, our framework provides a theoretically grounded understanding of redundancy in VLMs and unifies many of the ideas behind modern layer-skipping techniques.

Abstract:
While stateof-the-art language models (LMs) surpass the vast majority of humans in certain domains, their reasoning remains largely opaque, reducing trust and risking deception and hallucination. In this work, we introduce the Prototype Transformer (ProtoT)—an autoregressive LM architecture that replaces the quadratic-cost self-attention in the transformer with a linear-cost module based on prototypes (parameter vectors). In ProtoT, the prototypes create communication channels aggregating contextual information at different time scales. We show that this leads to the prototypes automatically capturing nameable concepts (e.g. “woman”) during training, and it provides the potential to interpret the model’s reasoning and do targeted edits of its behavior. Compared to baselines, ProtoT scales well with model and data size, shows robustness to input perturbations, and performs well on text generation and downstream tasks (GLUE). Reaching close to the performance of state-of-the-art architectures, ProtoT paves the way to creating well-performing autoregressive LMs interpretable by design.

Abstract:
Sparse Autoencoders (SAEs) decompose LLM activations into interpretable features, yet existing SAEbased steering methods require contrastive datasets or large activation stores. We introduce CorrSteer, which selects steering features by correlating task outcomes with SAE activations computed during generation, then validates these selections through intervention. This two-stage approach treats correlation as a selection heuristic and intervention as the causal test: features that both correlate with success and improve performance when amplified are retained. Coefficients derive from mean activations on correct samples, yielding a fully automated pipeline without task-specific tuning. On Gemma-2 2B and LLaMA-3.1 8B, CorrSteer achieves +3.3% on MMLU (4k samples) and +27.2% on HarmBench (108 samples), with lower side-effect ratios than fine-tuning despite comparable accuracy. Selected features cluster into interpretable categories: structured-output features for multiple-choice tasks, refusal features for safety, and domain-specific semantics for specialized benchmarks. The method scales to 10^5 SAE features via streaming correlation (O(1) in dataset size), requiring no backward passes or activation storage.

Abstract:
As realworld tasks grow increasingly complex, long-context reasoning has become a core capability for Large Language Models (LLMs). However, few studies explore which data types are effective for long-context reasoning and why. We find that structured table data with periodic structures shows strong potential for long-context reasoning. Motivated by this observation, we mathematically analyze tabular dependency structures using mutual information, revealing periodic non-vanishing dependencies in table data. Furthermore, we systematically analyze the capabilities of structured table data, conduct relevant scaling experiments, and validate its underlying mechanisms for enhancing long-context reasoning, yielding several meaningful insights. Leveraging these insights, we propose a simple yet scalable pipeline(TableLong) for synthesizing high-quality, diverse, and verifiable structured table data to boost long-context reasoning via RL. Extensive experimental results demonstrate that table data significantly enhances the long-context reasoning capability of LLMs across multiple long-context benchmarks (+8.24% on average), and even improves performance on out-of-domain benchmarks (+8.06% on average). We hope that our insights provide practical guidance for effective post-training data to enhance long-context reasoning in LLMs.

Abstract:
Latent categorical variables are frequently found in deep learning architectures. They can model actions in discrete reinforcementlearning environments, represent categories in latent-variable models, or express relations in graph neural networks. Despite their widespread use, their discrete nature poses significant challenges to gradient-descent learning algorithms. While a substantial body of work has offered improved gradient estimation techniques, we take a complementary approach. Specifically, we: 1) revisit the ubiquitoussoftmaxfunction and demonstrate its limitations from an information-geometric perspective; 2) replace thesoftmaxwith thecatnatfunction, a function composed by a sequence of hierarchical binary splits; we prove that this choice offers significant advantages to gradient descent due to the resulting diagonal Fisher Information Matrix. A rich set of experiments - including graph structure learning, variational autoencoders, and reinforcement learning - empirically show that the proposed function improves the learning efficiency and yields models characterized by consistently higher test performance.Catnatis simple to implement and seamlessly integrates into existing codebases. Moreover, it remains compatible with standard training stabilization techniques and, as such, offers a better alternative to thesoftmaxfunction.

Abstract:
Recent years have seen a surge in datadriven surrogates for dynamical systems that can be orders of magnitude faster than numerical solvers. However, many machine learning-based models such as neural operators exhibit spectral bias, attenuating high-frequency components that often encode small-scale structure. This limitation is particularly damaging in applications such as weather forecasting, where misrepresented high frequencies can induce long-horizon instability. To address this issue, we propose multi-scale wavelet transformers (MSWTs), which learn system dynamics in a tokenized wavelet domain. The wavelet transform explicitly separates low- and high-frequency content across scales. MSWTs leverage a wavelet-preserving downsampling scheme that retains high-frequency features and employ wavelet-based attention to capture dependencies across scales and frequency bands. Experiments on chaotic dynamical systems show substantial error reductions and improved long-horizon spectral fidelity. On the ERA5 climate reanalysis, MSWTs further reduce climatological bias, demonstrating their effectiveness in a real-world forecasting setting.

Abstract:
We propose \emphEuler Mean Flows (EMF), a flowbased generative framework for one-step and few-step generation that enforces long-range trajectory consistency with minimal sampling cost. The key idea of EMF is to replace the trajectory consistency constraint, which is difficult to supervise and optimize over long time scales, with a principled linear surrogate that enables direct data supervision for long-horizon flow-map compositions. We derive this approximation from the semigroup formulation of flow-based models and show that, under mild regularity assumptions, it faithfully approximates the original consistency objective while being substantially easier to optimize. This formulation leads to a unified, JVP-free training framework that supports both u-prediction and x_1-prediction variants, avoiding explicit Jacobian computations and significantly reducing memory and computational overhead. Experiments on image synthesis, particle-based geometry generation, and functional generation demonstrate improved optimization stability and sample quality under fixed sampling budgets, together with approximately 50% reductions in training time and memory consumption compared to existing one-step methods for image generation.

Abstract:
The GilbertPollak Conjecture, also known as the Steiner Ratio Conjecture, states that for any finite point set in the Euclidean plane, the Steiner minimum tree has length at least \sqrt3/2 \approx 0.866 times that of the Euclidean minimum spanning tree (the Steiner ratio). A sequence of improvements through the 1980s culminated in a lower bound of 0.824, with no substantial progress reported over the past three decades. Recent advances in LLMs have demonstrated strong performance on contest-level mathematical problems, yet their potential for addressing open, research-level questions remains largely unexplored. In this work, we present a novel AI system for obtaining tighter lower bounds on the Steiner ratio. Rather than directly prompting LLMs to solve the conjecture, we task them with generating rule-constrained geometric lemmas implemented as executable code. These lemmas are then used to construct a collection of specialized functions, which we call verification functions, that yield theoretically certified lower bounds of the Steiner ratio. Through progressive lemma refinement driven by reflection, the system establishes a new certified lower bound of 0.8559 for the Steiner ratio. The entire research effort involves only thousands of LLM calls, demonstrating the strong potential of LLM-based systems for advanced mathematical research.

Abstract:
Validating mathematical reasoning in large language models currently requires a tradeoff between computationally expensive learned verifiers and the unreliability of output-based heuristics. We therefore propose a training-free, mechanistic alternative: spectral analysis of attention topology. By treating attention matrices as dynamic graphs over tokens, we extract four interpretable spectral diagnostics, Fiedler value, High-Frequency Energy Ratio (HFER), spectral entropy, and graph smoothness, that differentiate valid reasoning from hallucinated outputs without any learned parameters. We perform experiments across seven models from four architectural families (Llama, Qwen, Phi, Mistral) yield effect sizes up to Cohen's d = 3.30 (p < 10^-116), enabling 85--96% classification accuracy with a single threshold. We discover that spectral analysis detects logical coherence rather than compiler acceptance: proofs rejected by formal verifiers due to timeouts or missing imports are correctly identified as valid, a phenomenon we term "Platonic validity". Furthermore, causal ablation studies confirm that this spectral signature reflects the functional health of induction head circuits, establishing a mechanistic basis for the method. We also identify an architectural dependency: Sliding Window Attention shifts the discriminative signal from HFER to late-layer smoothness (d = 2.09, p < 10^-48), demonstrating that attention mechanism design determines which spectral features capture reasoning validity. The method generalizes to informal chain-of-thought reasoning (d = 0.78, p < 10^-3). These findings establish spectral graph analysis as a principled framework for reasoning verification, with immediate applications to hallucination detection and real-time safety monitoring.

Abstract:
Understanding and decoding brain activity into visual representations is a fundamental challenge at the intersection of neuroscience and artificial intelligence. While electroencephalogram (EEG) visual decoding has shown promise due to its noninvasive and low-cost nature, existing methods suffer from Hierarchical Neural Encoding Neglect (HNEN), a critical limitation in which flat neural representations fail to model the brain’s hierarchical visual processing. Inspired by the hierarchical organization of visual cortex, we propose ViEEG, a neuro-inspired framework that addresses HNEN. ViEEG decomposes each visual stimulus into three biologically aligned components, namely contour, foreground object, and contextual scene, which serve as anchors for a three-stream EEG encoder. These EEG features are progressively integrated via cross-attention routing, simulating cortical information flow from low-level to high-level vision. We further adopt hierarchical contrastive learning for EEG-CLIP representation alignment, enabling zero-shot object recognition. Extensive experiments on THINGS-EEG dataset demonstrate that ViEEG significantly outperforms previous methods by a large margin in both subject-dependent and subject-independent settings. Results on THINGS-MEG dataset further confirm ViEEG's generalization to different neural modalities. Our framework not only advances the performance frontier but also sets a new paradigm for EEG brain decoding. Code and pretrained models will be available.

Abstract:
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chainof-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, which limits the adaptability of LLM agents to new or evolving toolsets. We present AutoTool, a training framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. AutoTool employs a dual-phase optimization pipeline: (i) SFT and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett–Luce Ranking to refine consistent multi-step tool selection. We further build a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.

Abstract:
We propose LoRAMCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple ``futures'' may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) and the Winner-Takes-All loss to efficiently handle ambiguity through Low-Rank Adaptation. We provide a theoretical interpretation of applying MCL to language modeling, assuming the data is generated from a mixture of distributions. We illustrate the proposed approach using mixtures of Markov chains. We then demonstrate with experiments on visual and audio captioning, as well as machine translation, that our method achieves high diversity and relevance in generated outputs.

Abstract:
Large Language Models (LLMs) have been augmented with web search to overcome the limitations of the static knowledge boundary by accessing upto-date information from the open Internet. While this integration enhances model capability, it also introduces a distinct safety threat surface: the retrieval and citation process has the potential risk of exposing users to harmful or low-credibility web content. Existing red-teaming methods are largely designed for standalone LLMs as they primarily focus on unsafe generation, ignoring risks emerging from the complex search workflow. To address this gap, we propose CREST-Search, a pioneering red-teaming framework for LLMs with web search. The cornerstone of CREST-Search is three novel attack strategies that generate seemingly benign search queries yet induce unsafe citations. It also employs an iterative in-context refinement mechanism to strengthen adversarial effectiveness under black-box constraints. In addition, we construct a search-specific harmful dataset, WebSearch-Harm, which enables fine-tuning a specialized red-teaming model to improve query quality. Our experiments demonstrate that CREST-Search can effectively bypass safety filters and systematically expose vulnerabilities in web search-based LLM systems, underscoring the necessity of the development of robust search models.

Abstract:
Inverse design aims to find design parameters x achieving target performance y^. Generative approaches learn bidirectional mappings between designs and labels, enabling diverse solution sampling. However, standard conditional flow matching (CFM), when adapted to inverse problems by pairing labels with design parameters, exhibits strong sensitivity to their arbitrary ordering and scaling, leading to unstable training. We introduce Diagonal Flow Matching (DiagCFM), which resolves this through a zero-anchoring strategy that pairs design coordinates with noise and labels with zero, making the learning problem provably invariant to coordinate permutations. This yields order-of-magnitude improvements in round-trip accuracy over CFM and invertible neural network baselines across design dimensions up to P=100. We develop two architecture-intrinsic uncertainty metrics, Zero-Deviation and Self-Consistency, that enable three practical capabilities: selecting the best candidate among multiple generations, abstaining from unreliable predictions, and detecting out-of-distribution targets; consistently outperforming ensemble and general-purpose alternatives across all tasks. We validate on airfoil, gas turbine combustor, and an analytical benchmark with scalable design dimension.

Abstract:
Realizing the full potential of quantum computation requires Quantum Error Correction (QEC). QEC reduces error rates by encoding logical information across redundant physical qubits, enabling errors to be detected and corrected. A common decoder used for this task is Minimum Weight Perfect Matching (MWPM) a graphbased algorithm that relies on edge weights to identify the most likely error chains. In this work, we propose a data-driven decoder named Neural Minimum Weight Perfect Matching (NMWPM). Our decoder utilizes a hybrid architecture that integrates Graph Neural Networks (GNNs) to extract local syndrome features and Transformers to capture long-range global dependencies, which are then used to predict dynamic edge weights for the MWPM decoder. To facilitate training through the non-differentiable MWPM algorithm, we formulate a novel proxy loss function that enables end-to-end optimization. Our findings on the toric code under depolarizing noise demonstrate thresholds of 17.9% and 10.95%, nearing the 18.9% and 11.0% maximum likelihood bounds, highlighting the advantage of hybrid decoders that combine the predictive capabilities of neural networks with the algorithmic structure of classical matching.

Abstract:
We present 4RC, a unified feedforward framework for 4D reconstruction from monocular videos. Unlike existing methods that typically decouple motion from geometry or produce limited 4D attributes, such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form, decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior methods across a wide range of 4D reconstruction tasks.

Abstract:
Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumergrade GPUs or personal devices such as laptops, due to high memory consumption and inference costs. Post-training quantization (PTQ) of LLMs offers a promising solution that reduces their memory footprint and decoding latency. In practice, PTQ with uniform quantization representation is favored due to its efficiency and ease of deployment, as uniform quantization is widely supported by mainstream hardware and software libraries. Recent studies on low-bit uniform quantization have led to noticeable improvements in post-quantization model performance; however, they mainly focus on quantization methodologies, while the initialization of quantization parameters remains underexplored and still relies on the conventionalMin-Max formula. In this work, we identify the limitations of theMin-Max formula, move beyond its constraints, and proposeNeUQI, a method that efficiently determines near-optimal initialization for uniform quantization. Our NeUQI simplifies the joint optimization of the scale and zero-point by deriving the zero-point for a given scale, thereby reducing the problem to a scale-only optimization. Benefiting from the improved quantization parameters, our NeUQI consistently outperforms existing methods in the experiments with the LLaMA and Qwen families on various settings and tasks. Furthermore, when combined with a lightweight distillation strategy, NeUQI even achieves superior performance to PV-tuning, a considerably more resource-intensive method.

Abstract:
The rapid adoption of large language models (LLMs) in digital health has been driven by a "scalingfirst" philosophy, i.e., the assumption that clinical intelligence increases with model size and data. However, real-world clinical needs include not only effectiveness, but also reliability and reasonable deployment cost. Since clinical decision-making is inherently collaborative, we challenge the monolithic scaling paradigm and ask whether a Small Agent Group (SAG) can support better clinical reasoning. SAG shifts from single-model intelligence to collective expertise by distributing reasoning, evidence-based analysis, and critical audit through a collaborative deliberation process. To assess the clinical utility of SAG, we conduct extensive evaluations using diverse clinical metrics spanning effectiveness, reliability, and deployment cost. Our results show that SAG achieves superior performance compared to a single giant model, both with and without additional optimization or retrieval-augmented generation. These findings suggest that the synergistic reasoning represented by SAG can substitute for model parameter growth in clinical settings. Overall, SAG offers a scalable solution to digital health that better balances effectiveness, reliability, and deployment efficiency.

Abstract:
Accurate RNA structure modeling remains difficult because RNA backbones are highly flexible, noncanonical interactions are prevalent, and experimentally determined 3D structures are comparatively scarce. We introduce RiboSphere, a framework that learns discrete geometric representations of RNA by combining vector quantization with flow matching. Our design is motivated by the modular organization of RNA architecture: complex folds are composed from recurring structural motifs. RiboSphere uses a geometric transformer encoder to produce SE(3)-invariant (rotation/translation-invariant) features, which are discretized with finite scalar quantization (FSQ) into a finite vocabulary of latent codes. Conditioned on these discrete codes, a flow-matching decoder reconstructs atomic coordinates, enabling high-fidelity structure generation. We find that the learned code indices are enriched for specific RNA motifs, suggesting that the model captures motif-level compositional structure rather than acting as a purely compressive bottleneck. Across benchmarks, RiboSphere achieves strong performance in structure reconstruction (RMSD 1.25\,\AA, TM-score 0.84), and its pretrained discrete representations transfer effectively to inverse folding and RNA--ligand binding prediction, with robust generalization in data-scarce regimes.

Abstract:
Statistical evaluation aims to estimate the generalization performance of a model using heldout i.i.d. test data sampled from the ground-truth distribution. In supervised learning settings such as classification, performance metrics such as error rate are well-defined, and test error reliably approximates population error given sufficiently large datasets. In contrast, evaluation is more challenging for generative models due to their open-ended nature: it is unclear which metrics are appropriate and whether such metrics can be reliably evaluated from finite samples. In this work, we introduce a theoretical framework for evaluating language models and establish evaluability results for commonly used metrics. We study two categories of metrics: test-based metrics, including integral probability metrics (IPMs), and similarity-based metrics, including Rényi and KL divergences. We show that IPMs with respect to any bounded test class can be evaluated from finite samples up to multiplicative and additive approximation errors. Moreover, when the test class has finite fat-shattering dimension, IPMs can be evaluated with arbitrary precision. In contrast, similarity-based metrics, including Rényi and KL divergences, are not evaluable from finite samples, as their values can be critically determined by rare events. We also analyze the potential and limitations of perplexity as an evaluation method.

Abstract:
We present a batched firstorder method for solving multiple linear programs in parallel on GPUs. Our approach extends the primal-dual hybrid gradient algorithm to efficiently solve batches of related linear programming problems that arise in mixed-integer programming techniques such as strong branching and bound tightening. By leveraging matrix-matrix operations instead of repeated matrix-vector operations, we obtain significant computational advantages on GPU architectures. We demonstrate the effectiveness of our approach on various case studies and identify the problem sizes where first-order methods outperform traditional simplex-based solvers depending on the computational environment one can use. This is a significant step for the design and development of integer programming algorithms tightly exploiting GPU capabilities where we argue that some specific operations should be allocated to GPUs and performed in full instead of using light-weight heuristic approaches on CPUs.

Abstract:
Textto-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.

Abstract:
Posttraining LLMs with RLHF and preference optimization methods (e.g., DPO, IPO) has greatly improved alignment, yet these approaches assume a single objective. In reality, humans express multiple, often conflicting objectives, such as helpfulness and harmlessness, with no natural scalarization. We study the multi-objective preference alignment problem, where a policy must balance several objectives simultaneously. We propose Multi-Objective Preference Optimization (MOPO), a constrained KL-regularized framework that maximizes a primary objective while enforcing lower bounds on secondary objectives via tunable safety thresholds. MOPO operates directly on pairwise preferences without point-wise rewards, and admits simple closed-form iterative updates. Empirically, MOPO recovers Pareto-optimal policies on synthetic benchmarks and, when fine-tuned on human-preference data, yields multi-billion parameter models that achieve higher rewards and Pareto-dominate baselines, with stable and robust optimization dynamics.

Abstract:
Learning beneficial treatment allocations for a patient population is an important problem in precision medicine. For such allocations, a certain proportion of treated patients may not receive any benefit. This proportion of unnecessary treated represents a `treatment risk' which is a waste of resources and may, in addition, expose patients to unnecessary adverse effects. Therefore, we aim to control the treatment risk when learning beneficial allocations. This learning problem is complicated by the fact that the treatment risk is generally not identifiable from either randomized trial or observational data. We propose a certifiable learning method that controls treatment risk, using finite samples in the partially identified setting. The method is illustrated using both simulated and real data.

Abstract:
The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental tradeoff: the ''PreNorm'' architecture ensures training stability at the cost of potential performance degradation in deep models, while the ''PostNorm'' architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. SpanNorm adopts the clean residual path of PreNorm to stabilize signal propagation while employing a PostNorm-style computation that normalizes the output of the residual connection, thereby enhancing model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models, and alleviating the representation collapse of PreNorm. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios, paving the way for more powerful and stable Transformer architectures.

Abstract:
Foundation models have revolutionized AI, yet biological applications often repurpose general architectures without accounting for the intrinsic structural and functional properties of distinct modalities, such as genomic and proteomic sequences. Consequently, these architectures lack the inductive biases required to capture the complex ``grammars" inherent to biological data, resulting in suboptimal performance. To address this, we introduce BioArc, a framework utilizing Neural Architecture Search (NAS) to shift from intuitiondriven design to automated data-driven discovery. Unlike standard NAS restricted to homogeneous spaces, BioArc navigates a heterogeneous space for open-ended composition of architectural blocks. By systematically analyzing the interplay between architecture, tokenization, and training across modalities, BioArc identifies novel hybrid architectures that surpass state-of-the-art models while being up to 25x smaller. We distill these findings into empirical design principles and validate their biological relevance, demonstrating how our designs hierarchically capture the underlying biological grammar. Additionally, we introduce an agentic framework to predict optimal architectures for new tasks. Overall, BioArc provides a data-driven methodology for developing the next generation of efficient biological foundation models and task-specific networks.

Abstract:
GRPOstyle reinforcement learning (RL)-based LLM fine-tuning algorithms have recently gained popularity. Relying on heuristic trust-region approximations, however, they can lead to brittle optimization behavior, as global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region policy Optimization (QUATRO), which directly enforces trust-region constraints through a principled optimization. This yields a clear and interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with a stabilizer terms arising intrinsically from the exact trust-region formulation. Empirically verified on diverse mathematical reasoning benchmarks, QUATRO shows stable training under increased policy staleness and aggressive learning rates, maintaining well-controlled entropy throughout training.

Abstract:
We study query time bounds for the fundamental problem of estimating the kernel mean \frac1|X|\sum_x\in X\mathbf\mathrmk(x,y) of a query y in a finite dataset X\subset\mathbbR^d up to a prescribed additive error \varepsilon. The best known bounds for the Gaussian kernel are O(d/\varepsilon^2), \widetilde O(d+1/\varepsilon^4), and \widetilde O(d+\Delta^2/\varepsilon^2), where \Delta is the diameter of a region containing the points. We prove the new bound \tilde O(d+\varepsilon\Delta^2+1/\varepsilon^3), which improves over the previous ones in regimes with small error \varepsilon and intermediate diameter \Delta. At the center of our proof is a new fast spherical embedding theorem in the sense introduced by Bartal, Recht and Schulman (2011), which limits the embedded data diameter while preserving local Euclidean distances and avoiding ``distance collapse'' at larger scales. This fast embedding theorem may be of independent interest.

Abstract:
Small language models (SLMs) offer computational efficiency for scalable deployment, yet they often fall short of the reasoning capabilities exhibited by their larger counterparts (LLMs). To mitigate this gap, current approaches invoke an LLM to generate tokens at points of reasoning divergence, but these external calls introduce substantial latency and costs. Alternatively, standard distillation is often hindered by the capacity limitation, as SLMs struggle to accurately mimic the LLM’s complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM’s preferred token consistently resides within the SLM’s topK next-token predictions, even when failing to emerge as the SLM top-1 choice. We therefore propose SELECT TO THINK (S2T), which reframes the LLM’s role from open-ended generation to selection among the SLM’s proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-LOCAL, which dis- tills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, we demonstrate that a 1.5B SLM’s top-8 candidates capture the 32B LLM’s choice with 95% hit rate. Translating this potential into performance, S2T-LOCAL improves greedy decoding by 24.1% on average across benchmarks, effectively matching the efficacy of 8-path self-consistency while operating with single-pass efficiency.

Abstract:
Data mixingdetermining the ratios of data from different domains---is a first-order concern for training language models (LMs), but existing mixing methods have poorly understood design choices and assume that the set of domains remain fixed throughout development. We present Olmix, a framework that addresses two challenges encountered during LM development. First, the configuration space for developing a mixing method is not well understood---design choices across existing methods lack justification or consensus and overlook practical issues like data constraints. We conduct a comprehensive empirical study of this space, identifying which design choices lead to a strong mixing method. Second, the domain set evolves throughout LM development as datasets are revised and expanded---a problem setting largely unaddressed by existing works. We study how to efficiently recompute the mixture after the domain set is updated, given an existing mix from before the update. We introduce mixture reuse, a mechanism that reuses existing relative ratios and recomputes ratios only for domains affected by an update. Over a sequence of five domain-set updates mirroring real-world LM development, mixture reuse matches the performance of fully recomputing the mix after each update with 74% less compute and improves over training without mixing by 11.6% on downstream tasks.

Abstract:
When deploying a single predictor across multiple subpopulations, we propose a fundamentally different approach: interpreting group fairness as a bargaining problem among subpopulations. This gametheoretic perspective reveals that existing robust optimization methods such as minimizing worst-group loss or regret correspond to classical bargaining solutions and embody different fairness principles. We propose relative improvement, the ratio of actual risk reduction to potential reduction from a baseline predictor, which recovers the Kalai–Smorodinsky solution. Unlike absolute-scale methods that may not be comparable when groups have different potential predictability, relative improvement provides axiomatic justification including scale invariance and individual monotonicity. We establish finite-sample convergence guarantees under mild conditions.

Abstract:
We propose a sequential test for distributionshift detection that allows conformal test martingales (CTMs) to work under a fixed, reference-conditional setting. Existing CTM detectors construct test martingales by continually growing a reference set with each incoming sample, using it to assess how atypical the new sample is relative to past observations. While this design yields anytime-valid type-I error control, it suffers from test-time contamination: after a change, post-shift observations enter the reference set and dilute the evidence for distribution shift, increasing detection delay and reducing power. In contrast, our method avoids contamination by design by comparing each new sample to a fixed null reference dataset. Our main technical contribution is a robust martingale construction that remains valid conditional on the null reference data, achieved by explicitly accounting for the estimation error in the reference distribution induced by the finite reference set. This yields anytime-valid type-I error control together with guarantees of asymptotic power one and bounded expected detection delay. Empirically, our method detects shifts faster than standard CTMs, providing a powerful and reliable distribution-shift detector.

Abstract:
Optimizing data mixtures is is essential for unlocking the full potential of of large language models (LLMs), yet identifying the optimal composition remains computationally prohibitive due to reliance on heuristic trials or expensive proxy training. To address this, we introduce MergeMix, a novel appraoch that efficiently determines optimal data mixing ratios by repurposing model merging weights as a highfidelity, low-cost performance proxy. By training domain-specific experts on minimal tokens and optimizing their merging weights against downstream benchmarks, MergeMix effectively optimizes the performance of data mixtures without incurring the cost of full-scale training. Extensive experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning while drastically reducing search costs. Furthermore, MergeMix exhibits high rank consistency (Spearman \rho > 0.9) and strong cross-scale transferability, offering a scalable, automated solution for data mixture optimization.

Abstract:
Probing is widely adopted in computer vision to faithfully evaluate selfsupervised learning (SSL) embeddings, as finetuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on finetuning because simple probing fails to unlock their full potential and alters their rankings when competing for SOTA on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduceConvex Gated Probing(CGP), a prototype-based method that drastically closes the gap between finetuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. Guided by CGP, we rework the entire SSL pipeline to address reproducibility issues in current SOTA audio models that use legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and pretraining recipe, we introduceBetter Audio Transformer(BAT), and establish new SOTA on audio benchmarks.

Abstract:
Understanding how cellular morphology, gene expression, and spatial context jointly shape tissue function is a central challenge in biology. Imagebased spatial transcriptomics technologies now provide high-resolution measurements of cell images and gene expression profiles, but existing methods typically analyze these modalities in isolation or at limited resolution. We address the problem by introducing SPATIA, a multi-level generative and predictive model that learns unified, spatially aware representations by fusing morphology, gene expression, and spatial context from the cell to the tissue level. SPATIA also incorporates a novel spatially conditioned generative framework for predicting cell morphologies under perturbations. Specifically, we propose a confidence-aware flow matching objective that reweights weak optimal-transport pairs based on uncertainty. We further apply morphology-profile alignment to encourage biologically meaningful image generation, enabling the modeling of microenvironment-dependent phenotypic transitions. We assembled a multi-scale dataset consisting of 25.9 million cell-gene pairs across 17 tissues. We benchmark SPATIA against 18 models across 12 tasks, spanning categories such as phenotype generation, annotation, clustering, gene imputation, and cross-modal prediction. SPATIA achieves improved performance over state-of-the-art models, improving generative fidelity by 8% and predictive accuracy by up to 3%.

Abstract:
Explaining why a language model produces a particular output requires local, inputlevel explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but not why the model answers a specific input query in a particular way. We introduce query circuits, which directly trace the information flow inside a model that maps a specific input to the output. Unlike surrogate-based approaches (e.g., sparse autoencoders), query circuits are identified within the model itself, resulting in more faithful and computationally accessible explanations. To make query circuits practical, we address two challenges. First, we introduce Normalized Deviation Faithfulness (NDF), a robust metric to evaluate how well a discovered circuit recovers the model's decision for a specific input, and is broadly applicable to circuit discovery beyond our setting. Second, we develop sampling-based methods to efficiently identify circuits that are sparse yet faithfully describe the model’s behavior. Across benchmarks (IOI, arithmetic, MMLU, and ARC), we find that there exist sparse query circuits within the model that recover much of its performance on single queries. For example, on average, a circuit covering only 1.3% of model connections can recover about 60% of performance on an MMLU question. Overall, query circuits provide a step towards faithful, scalable explanations of how language models process individual inputs.

Abstract:
Large language models (LLMs) achieve strong capabilities by scaling model capacity and training data, yet many realworld deployments rely on smaller models trained or adapted from low-resource data. This gap motivates the need for mechanisms to transfer knowledge from large, high-resource models to smaller, low-resource targets. While model merging provides an effective transfer mechanism, most existing approaches assume architecture-compatible models and therefore cannot directly transfer knowledge from large high-resource LLMs to heterogeneous low-resource targets. In this work, we propose a cross-architecture merging framework based on optimal transport (OT) that aligns activations to infer cross-neuron correspondences between heterogeneous models. The resulting transport plans are then used to guide direct weight-space fusion, enabling effective high-resource to low-resource transfer using only a small set of inputs. Extensive experiments across low-resource languages and specialized domains demonstrate consistent improvements over target models.

Abstract:
This paper studies the minimal dimension required to embed subset memberships (m elements and m\choose k subsets of at most k elements) into vector spaces, denoted as Minimal Embeddable Dimension (MED). The tight bounds of MED are derived theoretically and supported empirically for various notions of "distances" or "similarities", including \ell_2 metric, inner product, and cosine similarity. In addition, we conduct numerical simulation in a more achievable setting, where the m\choose k subset embeddings are chosen as the centroid of the embeddings of the contained elements. Our simulation easily realizes a logarithmic dependency between the MED and the number of elements to embed. These findings imply that embeddingbased retrieval limitations stem primarily from learnability challenges, not geometric constraints, guiding future algorithm design.

Abstract:
While LLMas-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts Graded Response Model (GRM) of IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework, and show that leveraging IRT-GRM yields interpretable signals for diagnosing judgments systematically. These signals provide practical guidance for verifying reliablity of LLM-as-a-Judge and identifying potential causes of unreliability.

Abstract:
Asymmetric actorcritic methods are widely used in partially observable reinforcement learning, but typically assume full state observability to condition the critic during training, which is often unrealistic in practice. We introduce the informed asymmetric actor-critic framework, allowing the critic to be conditioned on arbitrary state-dependent privileged signals without requiring access to the full state. We show that any such privileged signal yields unbiased policy gradient estimates, substantially expanding the set of admissible privileged information. This raises the problem of selecting the most adequate privileged information in order to improve learning. For this purpose, we propose two novel informativeness criteria: a dependence-based test that can be applied prior to training, and a criterion based on improvements in value prediction accuracy that can be applied post-hoc. Empirical results on partially observable benchmark tasks and synthetic environments demonstrate that carefully selected privileged signals can match or outperform full-state asymmetric baselines while relying on strictly less state information.

Abstract:
Most current methods for unlearning concepts in textto-image diffusion models rely on mean squared error-based loss functions to align target distributions with anchors. In this paper, we generalize this idea into a unified f-divergence-based framework that recovers the standard mean squared error loss as a specific instance. By generalizing the loss function, we theoretically analyze and numerically validate how different f-divergences impact the gradient magnitude and the convergence properties of the algorithm, affecting the quality of unlearning. The proposed unified framework offers a flexible paradigm for selecting the optimal divergence based on the application and user goal, allowing for finer control over the trade-off between unlearning efficacy and generative fidelity.

Abstract:
E(3)equivariant neural networks have proven to be extremely effective in a wide range of 3D modeling tasks. A fundamental operation of such networks is the tensor product, which allows interaction between different feature types. Because this operation scales poorly, there has been considerable work towards accelerating this interaction. However, recently \citetxieprice have pointed out that most speedups come from a reduction in expressivity rather than true algorithmic improvements on computing Clebsch-Gordan tensor products. A modification of Gaunt tensor product \citepgaunt can give a true asymptotic speedup but is incomplete and misses many interactions. In this work, we provide the first complete algorithm which truly provides asymptotic benefits Clebsch-Gordan tensor products. For full CGTP, our algorithm brings runtime complexity from the naive O(L^6) to O(L^4\log^2 L), close to the lower bound of O(L^4). We first show how generalizing fast Fourier based convolution naturally leads to the previously proposed Gaunt tensor product \citepgaunt. To remedy antisymmetry issues, we generalize from scalar signals to irrep valued signals, giving us tensor spherical harmonics. We prove a generalized Gaunt formula for the tensor harmonics. Finally, we show that we only need up to vector valued signals to recover the missing interactions of Gaunt tensor product.

Abstract:
The goal of this paper is to improve the performance and reliability of visionlanguage-action (VLA) models through iterative online interaction. Since collecting policy rollouts in the real world is expensive, we investigate whether a learned simulator—specifically, an action-conditioned video generation model—can be used to generate additional rollout data. Unfortunately, existing world models lack the physical fidelity necessary for policy improvement: they are predominantly trained on demonstration datasets that lack coverage of many different physical interactions (particularly failure cases) and struggle to accurately model small yet critical physical details in contact-rich object manipulation. We propose a simple iterative improvement algorithm that uses real-world roll-out data to improve the fidelity of the world model, which can then, in turn, be used to generate supplemental synthetic data for improving the VLA model. In our experiments on a real robot, we use this approach to improve the performance of a state-of-the-art VLA model on multiple downstream tasks. We achieve a 39.2% absolute success rate improvement over the base policy and 11.6% improvement from training with the generated synthetic rollouts. Videos can be found at this anonymous website: \urlhttps://sites.google.com/view/vla-w.

Abstract:
Transformerbased diffusion models have demonstrated remarkable performance at generating high-quality samples. However, our theoretical understanding of the reasons for this success remains limited. For instance, existing models are typically trained by minimizing a denoising objective, which is equivalent to fitting the score function of the training data. However, we do not know why transformer-based models can match the score function for denoising, or why gradient-based methods converge to the optimal denoising model despite the non-convex loss landscape. To the best of our knowledge, this paper provides the first convergence analysis for training transformer-based diffusion models. More specifically, we consider the population Denoising Diffusion Probabilistic Model (DDPM) objective for denoising data that follow a multi-token Gaussian mixture distribution. We theoretically quantify the required number of tokens per data point and training iterations for the global convergence towards the Bayes optimal risk of the denoising objective, thereby achieving a desired score matching error. A deeper investigation reveals that the self-attention module of the trained transformer implements a \emphmean denoising mechanism that enables the trained model to approximate the oracle Minimum Mean Squared Error (MMSE) estimator of the injected noise in the diffusion steps. Numerical experiments validate these findings.

Abstract:
Modeling peptide cyclization is critical for the virtual screening of candidate peptides with desirable physical and pharmaceutical properties. This task is challenging because a cyclic peptide often exhibits diverse, ringshaped conformations, which cannot be well captured by deterministic prediction models derived from linear peptide folding. In this study, we propose MuCO (Multi-stage Conformation Optimization), a generative peptide cyclization method that models the distribution of cyclic peptide conformations conditioned on the corresponding linear peptide. In principle, MuCO decouples the peptide cyclization task into three stages: topology-aware backbone design, generative side-chain packing, and physics-aware all-atom optimization, thereby generating and optimizing conformations of cyclic peptides in a coarse-to-fine manner. This multi-stage framework enables an efficient parallel sampling strategy for conformation generation and allows for rapid exploration of diverse, low-energy conformations. Experiments on the large-scale CPSea dataset demonstrate that MuCO significantly outperforms state-of-the-art methods in consistently in physical stability, structural diversity, secondary structure recovery, and computational efficiency, making it a promising computational tool for exploring and designing cyclic peptides.

Abstract:
Speechto-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose Moshi-RAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, Moshi-RAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.

Abstract:
Reward design remains a critical bottleneck in visual reinforcement learning (RL) for robotic manipulation. In simulated environments, rewards are conventionally designed based on the distance to a target position. However, such precise positional information is often unavailable in realworld visual settings due to sensory and perceptual limitations. In this study, we propose a method that implicitly infers spatial distances through keypoints extracted from images. Building on this, we introduce Reward Learning with Anticipation Model (ReLAM), a novel framework that automatically generates dense, structured rewards from action-free video demonstrations. ReLAM first learns an anticipation model that serves as a planner and proposes intermediate keypoint-based subgoals on the optimal path to the final goal, creating a structured learning curriculum directly aligned with the task's geometric objectives. Based on the anticipated subgoals, a continuous reward signal is provided to train a low-level, goal-conditioned policy under the hierarchical reinforcement learning (HRL) framework with provable sub-optimality bound. Extensive experiments on complex, long-horizon manipulation tasks show that ReLAM significantly accelerates learning and achieves superior performance compared to SOTA methods.

Abstract:
SQL is central to enterprise data engineering, yet generating fully correct SQL code in a single attempt remains difficult—even for experienced developers and advanced \ttsql LLMs—often requiring multiple debugging iterations. We introduce \ourbench, the first benchmark for enterpriselevel SQL reasoning and debugging. Our benchmark is built upon two key innovations: (1) an automated construction workflow that employs reverse engineering to systematically inject realistic bugs into large-scale SQL code, enabling scalable and diverse benchmark generation; and (2) an execution-free evaluation framework tailored for enterprise settings, providing fast, accurate, and resource-efficient assessment. \ourbench comprises 469 \ourbenchsyn queries featuring syntax errors with explicit error messages, and 516 \ourbenchsem queries targeting semantic errors where codes fails to meet user intent. The queries are highly complex, averaging over 140 lines, and featuring deep and wide abstract syntax trees (average width >11, depth >8.7). Evaluation of nearly 30 LLMs reveals a substantial performance gap: the best-performing model, Claude-4-Sonnet, achieves only 36.46% accuracy on \ourbenchsyn and 32.17% on \ourbenchsem, while most models score below 20%. We further explore four solution strategies, identify key challenges, and outline promising directions for enterprise SQL debugging with LLMs.

Abstract:
Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing scorebased models treat speech spectrograms merely as generic 2D images, applying uniform processing that ignores the intrinsic structural sparsity of audio, which results in inefficient spectral representation and prohibitive computational complexity. To bridge this gap, we proposeDVPD, an extremely lightweightDual-ViewPredictiveDiffusion model, which uniquely exploits the dual nature of spectrograms as both visual textures and physical frequency-domain representations across both training and inference stages. Specifically, during training, we optimize spectral utilization via the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which preserves critical low-frequency harmonics while pruning high-frequency redundancies. Simultaneously, we introduce a Lightweight Image-based Spectro-Awareness (LISA) module to capture features from a visual perspective with minimal overhead. During inference, we propose a Training-free Lossless Boost (TLB) strategy that leverages the same dual-view priors to refine generation quality without any additional fine-tuning. Extensive experiments across various benchmarks demonstrate that DVPD achieves state-of-the-art performance while requiring only35of the parameters and40%of the inference MACs compared to SOTA lightweight model, PGUSE. These results highlight DVPD's superior ability to balance high-fidelity speech quality with extreme architectural efficiency. Code and audio samples are available at the anonymous website: https://anonymous.4open.science/r/dvpd_demo-E630

Abstract:
Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient largescale distributed training. However, this efficiency gain can be compromised by gradient staleness, where the immediate model updates with delayed gradients introduce noise into the optimization process. Crucially, we identify a critical, yet often overlooked, pathology: this delay scales linearly with pipeline depth, fundamentally undermining the very scalability that the method originally intends to provide. In this work, we investigate this inconsistency and bridge the gap by rectifying delayed gradients through basis rotation, restoring scalable asynchronous training while maintaining performance. Specifically, we observe that the deleterious effects of delayed gradients are exacerbated when the Hessian eigenbasis is misaligned with the standard coordinate basis. We demonstrate that this misalignment prevents coordinate-wise adaptive schemes, such as Adam, from effectively leveraging curvature-aware adaptivity. This failure leads to significant oscillations in the optimization trajectory and, consequently, slower convergence. We substantiate these findings through both rigorous theoretical analysis and empirical evaluation. To address this challenge, we propose the use of basis rotation, demonstrating that it effectively mitigates the alignment issue and significantly accelerates convergence in asynchronous settings. For example, our training of a 1B-parameter LLM with basis rotation achieves the same training loss in 76.8% fewer iterations compared to the best-performing asynchronous pipeline parallel training baseline.

Abstract:
In modern applications such as ECG monitoring, neuroimaging, wearable sensing, and industrial equipment diagnostics, complex and continuously structured data are ubiquitous, presenting both challenges and opportunities for functional data analysis. However, existing methods face a critical tradeoff: conventional functional models are limited by linearity, whereas deep learning approaches lack interpretable region selection for sparse effects. To bridge these gaps, we propose a sparse Bayesian functional deep neural network (sBayFDNN). It learns adaptive functional embeddings through a deep Bayesian architecture to capture complex nonlinear relationships, while a structured prior enables interpretable, region-wise selection of influential domains with quantified uncertainty. Theoretically, we establish rigorous approximation error bounds, posterior consistency, and region selection consistency. These results provide the first theoretical guarantees for a Bayesian deep functional model, ensuring its reliability and statistical rigor. Empirically, comprehensive simulations and real‑world studies confirm the effectiveness and superiority of sBayFDNN. Crucially, sBayFDNN excels in recognizing intricate dependencies for accurate predictions and more precisely identifies functionally meaningful regions, capabilities fundamentally beyond existing approaches.

Abstract:
Diffusion models for superresolution (SR) produce high-quality visual results but require expensive computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we presentRSD, a new distillation method for ResShift. Our method is based on training the student network to produce images such that a new fake ResShift model trained on them will coincide with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a noticeable margin in various perceptual metrics (LPIPS, CLIPIQA, MUSIQ). We show that our distillation method can surpass SinSR, the other distillation-based method for ResShift, making it on par with state-of-the-art diffusion SR distillation methods with limited computational costs in terms of perceptual quality. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality and requires fewer parameters, GPU memory, and training cost. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K.

Abstract:
This paper introduces BioAgent Bench, a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents in common bioinformatics tasks. The benchmark contains curated endto-end tasks (e.g., RNA-seq, variant calling, metagenomics) with prompts that specify concrete output artifacts to support automated assessment, including stress testing under controlled perturbations. We evaluate frontier closed-source and open-weight models across multiple agent harnesses, and use an LLM-based grader to score pipeline progress and outcome validity. We find that frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat), indicating that correct high-level pipeline construction does not guarantee reliable step-level reasoning. Finally, because bioinformatics workflows may involve sensitive patient data, proprietary references, or unpublished IP, closed-source models can be unsuitable under strict privacy constraints; in such settings, open-weight models may be preferable despite lower completion rates. We release the dataset and evaluation suite publicly.

Abstract:
While the ecosystem of Lean and Mathlib has enjoyed celebrated success in formal mathematical reasoning with the help of large language models (LLMs), the absence of many folklore lemmas in Mathlib remains a persistent barrier that limits Lean's usability as an everyday tool for mathematicians like \LaTeX or Maple. To address this, we introduce \textscMathlibLemma, the first LLMbased multi-agent system to automate the discovery and formalization of mathematical folklore lemmas. This framework constitutes our primary contribution, proactively mining the missing connective tissue of mathematics. Its efficacy is demonstrated by the production of a verified library of folklore lemmas, a subset of which has already been formally merged into the latest build of Mathlib, thereby validating the system's real-world utility and alignment with expert standards. Leveraging this pipeline, we further construct the \textscMathlibLemma benchmark, a suite of approximately 4,000 type-checked Lean statements spanning a broad range of mathematical domains. By transforming the role of LLMs from passive consumers to active contributors, this work establishes a constructive methodology for the self-evolution of formal mathematical libraries.

Abstract:
Key–value (KV) caching is essential for large language model inference, yet its memory overhead poses a critical bottleneck for longcontext generation. Existing eviction policies predominantly rely on empirical heuristics, lacking a rigorous theoretical foundation. This work rethinks KV cache eviction through the lens of the Information Bottleneck principle. Under a linear–Gaussian surrogate of attention, we derive a closed-form mutual information objective that characterizes the effective information capacity of a retained KV cache subset. This formulation reveals that a wide range of existing eviction strategies can be interpreted as different approximations of the same capacity-maximization principle. Guided by this insight, we introduce CapKV, a capacity-aware eviction method that directly targets information preservation via a log-determinant approximation using statistical leverage scores. This approach replaces heuristic selection with a theoretically grounded mechanism that preserves the maximum predictive signal. Extensive experiments across multiple models and long-context benchmarks show that CapKV consistently outperforms prior methods, achieving a better trade-off between memory efficiency and generational fidelity.

Abstract:
Evaluating the realworld capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR’s exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.

Abstract:
Many fields collect largescale temporal data through repeated measurements (`trials’), where each trial is labeled with a set of metadata variables spanning several categories. For example, a trial in a neuroscience study may be linked to a value from category (a): task difficulty, and category (b): animal choice. A critical challenge in time-series analysis is to understand how these labels are encoded within the multi-trial observations, and disentangle the distinct effect of each label entry across categories. Here, we present MILCCI, a novel data-driven method that i) identifies the interpretable components underlying the data, ii) captures cross-trial variability, and iii) integrates label information to understand each category's representation within the data. MILCCI extends a sparse per-trial decomposition that leverages label similarities within each category to enable subtle, label-driven cross-trial adjustments in component compositions and to distinguish the contribution of each category. MILCCI also learns each component’s corresponding temporal trace, which evolves over time within each trial and varies flexibly across trials. We demonstrate MILCCI’s performance through both synthetic and real-world examples, including voting patterns, online page view trends, and neuronal recordings.

Abstract:
Instruction following in LLMs captures models' ability to change their visible behaviors as requested by users. Instead, we study models' ability to control their chainof-thought (CoT). This capability -- CoT controllability -- is undesirable because it could allow models to suppress signs of misbehavior in their CoT, thereby undermining our ability to monitor them. To measure this, we introduce the \emphCoT-Control evaluation suite. We show that reasoning models are less able to follow instructions in their CoT than in their outputs: on instructions like reasoning about a genetics problem without mentioning the word ``chromosome", Claude-Sonnet-4.5 complies only 5% of the time. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models that are told they're being monitored exhibit slightly higher controllability. Similarly, eliciting controllability by adversarially optimizing prompts doesn’t meaningfully increase controllability. Our results leave us cautiously optimistic: reasoning models generally seem characterized by low CoT controllability. However, the mechanism behind this phenomenon is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs keep tracking controllability for future models.

Abstract:
Large language models (LLMs) are increasingly tasked with producing and manipulating structured artifacts. We consider the task of endto-endspreadsheet generation, where language models are prompted to produce spreadsheet artifacts to satisfy users' explicit and implicit constraints, specified in natural language. We introduceSpreadsheetArena, a platform for evaluating models' performance on the task via blind pairwise evaluations of LLM-generated spreadsheet workbooks. As with other complex, open-ended tasks, relevant evaluation criteria can vary substantially across use cases and prompts, often in ways that are difficult to formalize. Compared to general chat or text generation settings, spreadsheet generation presents unique challenges and opportunities: the task output structure is well-defined and multi-dimensional, and there are often complex considerations around interactivity and layout. Among other findings, we observe that stylistic, structural, and functional features of preferred spreadsheets vary substantially across use cases, and expert evaluations of spreadsheets for finance prompts suggests that even highly ranked arena models do not reliably produce spreadsheets aligned with domain-specific best practices. Our hope is that our work prompts further study of end-to-end spreadsheet generation as a challenging and interesting category of complex, open-ended tasks for LLMs. Our live arena is hosted at https://spreadsheetarena.ai.

Abstract:
The increasing complexity of AI tasks has shifted the paradigm from monolithic models toward multiagent large language model (LLM) systems. However, these collaborative architectures introduce a critical bottleneck: redundant prefill computation for shared content generated by previous agents, which significantly increases KV cache memory usage and time-to-first-token (TTFT). While various KV cache methods have been proposed to mitigate prefill redundancy, they either fail to maintain accuracy on agent-generated outputs or exhibit low reuse rates due to rigid constraints. We present RelayCaching, a training-free inference method that directly reuses decoding phase KV caches from previous agents in subsequent prefill phases. Our key insight is that KV caches for identical content are highly consistent across phases, while prefix-induced deviations are sparse and localized within a limited range of layers and token positions. By selectively recomputing KV caches at these positions, RelayCaching preserves model accuracy with minimal overhead, yielding a superior accuracy–efficiency trade-off over existing methods. Experiments on diverse collaborative LLM tasks spanning mathematical reasoning, general knowledge, and code generation demonstrate that RelayCaching achieves over 80% KV cache reuse, reduces TTFT by up to 4.7× compared to the standard pipeline, all with negligible accuracy degradation.

Abstract:
Longform video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existing approaches either sample frames uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices). We propose VideoBrain, an end-to-end framework that enables VLMs to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. Unlike prior agent-based methods that rely on text-only LLMs orchestrating visual tools, our VLM directly perceives frames and reasons about information sufficiency. To prevent models from invoking agents indiscriminately to maximize rewards, we introduce a behavior-aware reward function coupled with a data classification pipeline that teaches the model when agent invocation is genuinely beneficial. Experiments on four long video benchmarks demonstrate that VideoBrain achieves +3.5% to +9.0% improvement over the baseline while using 30-40% fewer frames, with strong zero-shot generalization to short video benchmarks.

Abstract:
Bayesian experimental design (BED) provides a principled framework for optimizing data collection by choosing experiments that are maximally informative about unknown parameters. However, existing methods cannot deal with the joint challenge of (a)partially observable dynamical systems, where only noisy and incomplete observations are available, and (b)fully online inference, which updates posterior distributions and selects designs sequentially in a computationally efficient manner. Under partial observability, dynamical systems are naturally modeled as statespace models (SSMs), where latent states mediate the link between parameters and data, making the likelihood---and thus information-theoretic objectives like the expected information gain (EIG)---intractable. We address these challenges by deriving new estimators of the EIG and its gradient that explicitly marginalize latent states, enabling scalable stochastic optimization in nonlinear SSMs. Our approach leverages nested particle filters for efficient online state-parameter inference with convergence guarantees. Applications to realistic models, such as the susceptible–infectious–recovered (SIR) and a moving source location task, show that our framework successfully handles both partial observability and online inference.

Abstract:
De novo ligand design is a fundamental task that seeks to generate protein or molecule candidates that can effectively dock with protein receptors and achieve strong binding affinity entirely from scratch. It holds paramount significance for a wide spectrum of biomedical applications. However, most existing studies are constrained by the Pseudo De Novo, Limited Docking Modeling, and Inflexible Ligand Type. To address these issues, we propose MagicDock, a forwardlooking framework grounded in the progressive pipeline and differentiable surface modeling. (1) We adopt a well-designed gradient inversion framework. To begin with, general docking knowledge of receptors and ligands is incorporated into the backbone model. Subsequently, the docking knowledge is instantiated as reverse gradient flows by binding prediction, which iteratively guide the de novo generation of ligands. (2) We emphasize differentiable surface modeling in the generation process, leveraging learnable 3D point-cloud representations to precisely capture docking details, thereby ensuring that the generated ligands preserve docking validity through interpretable spatial fingerprints. (3) We introduce customized designs for different ligand types and integrate them into a unified gradient inversion framework with flexible triggers, thereby ensuring broad applicability. Moreover, we provide sufficient theoretical guarantees for MagicDock. Extensive experiments across 9 scenarios demonstrate that MagicDock achieves average improvements of 7.0% and 7.4% over SOTA baselines specialized for protein or molecule ligand design, respectively.

Abstract:
This paper studies optimization for a family of problems termed compositional entropic risk minimization, in which each data's loss is formulated as a LogExpectation-Exponential (Log-E-Exp) function. The Log-E-Exp formulation serves as an abstraction of the Log-Sum-Exponential (LogSumExp) function when the explicit summation inside the logarithm is taken over a gigantic number of items and is therefore expensive to evaluate. While entropic risk objectives of this form arise in many machine learning problems, existing optimization algorithms suffer from several fundamental limitations including non-convergence, numerical instability, and slow convergence rates. To address these limitations, we propose a geometry-aware stochastic algorithm, termed SCENT, for the dual formulation of entropic risk minimization cast as a min--min optimization problem. The key to our design is a stochastic proximal mirror descent (SPMD) update for the dual variable, equipped with a Bregman divergence induced by a negative exponential function that faithfully captures the geometry of the objective. Our main contributions are threefold: (i) we establish an O(1/\sqrtT) convergence rate of the proposed SCENT algorithm for convex problems; (ii) we theoretically characterize the advantages of SPMD over standard SGD update for optimizing the dual variable; and (iii) we demonstrate the empirical effectiveness of SCENT on extreme classification, partial AUC maximization, contrastive learning and distributionally robust optimization, where it consistently outperforms existing baselines.

Abstract:
Realworld videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compositional control. In contrast to prior methods, HECTOR supports hybrid reference conditioning, allowing generation to be simultaneously guided by static images and/or dynamic videos. Moreover, users can explicitly specify the trajectory of each referenced element, precisely controlling its location, scale, and speed (see Figure1). This design allows the model to synthesize coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references. Extensive experiments demonstrate that HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.

Abstract:
Multihop reasoning for question answering (QA) plays a critical role in retrieval-augmented generation (RAG) for large language models (LLMs). Based on inherent relation-dependency and reasoning patterns, it is categorized into parallel fact-verification (simultaneously verifying independent sub-questions) and chained reasoning (sequential multi-step inference). Existing approaches adopt either LLM-based fact verification or KG path-based chain construction, failing to handle both categories well: the former underperforms on chained reasoning, while the latter suffers from redundant paths in parallel tasks. Inspired by the Dual Process Theory in cognitive science and Stanovich’s Cognitive Misers Theory, we propose an effective multi-hop QA framework DTKG (Dual-Track Knowledge Graph) through building a two-stage pipeline: i) Classification Stage (dynamic question categorization via few-shot prompting, emulating "unconscious processing"); and ii) Branch Processing Stage (tailored reasoning paths, emulating "conscious processing"). Multi-facet experiments on six datasets show DTKG achieves 5.0%-29.5% performance improvement. The code is available at https://anonymous.4open.science/r/DTKG-621F

Abstract:
Maximum inner product search (MIPS) is a crucial subroutine in machine learning, requiring identification of database vectors that align most strongly with a given query. We propose amortized MIPS: a learningbased approach that trains neural networks to directly predict MIPS solutions, amortizing the computational cost of search across queries drawn from a known distribution. Our key insight is that the MIPS value function - the maximum inner product as a function of the query - is convex (as the pointwise maximum of linear functions), and its gradient at each query equals the optimal database vector. We explore two complementary architectures: (1) Input Convex Neural Networks (ICNNs) that learn the convex value function and recover the optimal match via gradient computation, and (2) VectorICNNs that directly regress the argmax, bypassing gradient computation entirely at inference time. For ICNNs, we combine score regression with gradient matching losses; for VectorICNNs, we introduce a score consistency loss derived from Euler's theorem for homogeneous functions. We further propose homogenization wrappers that enforce positive 1-homogeneity, theoretically linking function values to gradients. Our experiments on retrieval benchmarks demonstrate that convexity provides an effective inductive bias, with learned potentials achieving high match rates while requiring only a single forward pass at inference.

Abstract:
Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing user's utility because the KL regularization may cause the LLM to inherit the bias in the base policy that conflicts with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inferencetime alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win–tie rates exceeding 66% against all baselines, averaged across evaluation settings.

Abstract:
Can we train LLMs to introspect, i.e. to faithfully describe their own behaviors in natural language? Prior work has shown some, limited, success. However, it is difficult to scale introspection training due to a lack of groundtruth labels. In this work, we study an approach to introspection training which side-steps this data bottleneck. Given a target model M, our method works by fine-tuning models M_i from M with implanted behaviors b_i (such as downplaying medical problems); the (M_i, b_i) pairs serve as labeled introspection training data. We then train an introspection adapter (IA): a LoRA adapter jointly optimized across the fine-tunes M_i which causes them to verbalize their implanted behaviors. This IA induces faithful introspection in fine-tunes of M that were trained in very different ways from the M_i, as well as in M itself. This is surprising because the IA was never trained on M. To demonstrate the utility of IAs, we use them to successfully audit misaligned models introduced in prior work. IAs can also be used to detect fine-tuning API attacks which train models to comply with encrypted harmful requests. Notably, IAs are more effective when applied to larger models. Overall, our results suggest that IAs are a scalable, effective, and practically useful approach to LLM introspection training.

Abstract:
Mixtureof-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models. Despite significant progress, effectively scaling MoE performance remains a challenge. Previous work shows that the use of fine-grained experts enlarges the space of expert combinations and can improve flexibility, but it also imposes substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary axis for scaling --- expert-output mixture. We first analyze the limitations of the standard weighted-summation aggregation in conventional MoE architectures. We then theoretically demonstrate that introducing structural aggregation both expands the expert-combination space without altering the experts or router configuration and enables possible multi-step reasoning within a single MoE layer. To this end, we propose DAG-MoE, a sparse MoE framework that employs a lightweight module to automatically learn the optimal aggregation structure among the selected experts. We evaluate DAG-MoE under standard language modeling settings. Extensive experiments show that DAG-MoE consistently improves performance in both pretraining and fine-tuning, surpassing traditional MoE baselines.

Abstract:
We develop a geometric theory of projection heads in selfsupervised learning by interpreting the head as a trainable metric on the backbone representation manifold. Our analysis reveals that head curvature and architectural asymmetry induce negative eigenvalues of the Hessian at collapsed equilibria in networks with smooth activation functions, yielding a destabilization mechanism which explains collapse avoidance in non-contrastive methods. We further show that linear heads perform implicit subspace whitening under induced metric geometry, while nonlinear heads adapt local metrics to satisfy the specific topological constraints of the loss. Finally, we characterize how metric degeneracy governs the information-invariance trade-off in learned representations. Our results apply to both contrastive and non-contrastive objectives including InfoNCE, BYOL, SimSiam, and decorrelation-based methods, demonstrating that the projection head acts as a universal geometric buffer that decouples the semantic backbone from the rigid constraints of the training objective.

Abstract:
Sparse recovery in linear systems underpins applications from signal processing to highdimensional regression. Sparse Bayesian Learning, grounded in the principle of automatic relevance determination (ARD), offers a practical Bayesian mechanism for feature sparsity via marginal likelihood optimization. Yet, its reliance on a homoscedastic noise model renders it sensitive to data contaminations such as outliers or misspecified noise, harming model fit and predictions. Instead, we propose jointly learning individual feature and sample relevancies, enabling simultaneous model and data sparsification via a single Bayesian objective. This symmetric pruning of model and data offers a natural extension that preserves conjugacy, admits closed-form updates for standard optimization procedures, and aligns with perspectives from robust regression and influence functions. Empirical results across diverse regression tasks affirm that a joint ARD approach consistently yields both sparse and robust prediction models.

Abstract:
Offline reinforcement learning (RL) aims to optimize policies from precollected datasets. A bottleneck of this paradigm is managing epistemic uncertainty, which arises from limited data coverage (sample-level) and the ambiguity in identifying transition dynamics from finite data (model-level). To provide a unified quantification of these uncertainties, Bayesian RL has been proposed by treating the dynamics model as a random variable and maintaining a corresponding belief. Despite its theoretical appeal, policy optimization in Bayesian RL remains computationally challenging as it requires solving composite objectives with expectations. Prior methods either employ search-based techniques with poor computational scalability or impose restrictive posterior assumptions that sacrifice the adaptability of Bayesian RL. To address these limitations, we propose Posterior Hybrid Bayesian Belief (PhyB), which reformulates the expectation as a convex combination over a subset of dynamics models. Theoretical analysis demonstrates that the objective discrepancy induced by this approximation remains bounded. Based on PhyB, we develop an iterative regularized policy optimization algorithm that provides metric-agnostic guarantees for monotonic improvement until convergence. Empirical results demonstrate that PhyB achieves state-of-the-art performance on various benchmarks.

Abstract:
Crossview geo-localization (CVGL) is fundamental for precise navigation in GPS-denied environments, aiming to match ground or UAV imagery with satellite views. While existing approaches rely on global feature alignment, they often suffer from substantial domain shifts induced by varying regional textures and weather conditions. This issue becomes even more pronounced in UAV-based scenarios where the broader perspective provided by UAVs inevitably introduces dense and fine-grained objects, creating significant visual clutter. To address this, we draw inspiration from Object-Centric Learning (OCL) and propose InfoGeo, an information-theoretic framework designed to enhance robustness and generalization. InfoGeo reformulates the optimization as an information bottleneck process with two core objectives: (i) maximizing view-invariant information by aligning the object-centric structural relations across views, and (ii) minimizing view-specific noisy signals through cross-view knowledge constraints. Extensive evaluations across diverse benchmarks and challenging scenarios demonstrate that InfoGeo significantly outperforms state-of-the-art methods.

Abstract:
Multimodal Diffusion Transformers (MMDiTs) have achieved remarkable progress in text-to-image generation, yet they frequently suffer from concept omission, where specified objects or attributes fail to emerge in the generated image. By performing linear probing on text tokens, we demonstrate that text embeddings can distinguish a characteristic `omission signal' representing the absence of target concepts. Leveraging this insight, we propose Omission Signal Intervention (OSI), which amplifies the omission signal to actively catalyze the generation of missing concepts. Comprehensive experiments on FLUX.1-Dev and SD3.5-Medium demonstrate that OSI significantly alleviates concept omission even in extreme scenarios.

Abstract:
Agentic scaffolds have dramatically improved LLM performance on complex, longhorizon tasks, yielding both broad benefits and amplified risks in domains like cybersecurity. Existing benchmarks for AI agents in cybersecurity focus mainly on measuring proficiency–how effectively agents can complete offensive security tasks–but neglect a critical question: when and how should agents refuse harmful requests? We present the first framework for establishing refusal boundaries in offensive security contexts. Our framework defines (1) principled criteria for when tasks should be refused, (2) categories of tasks that warrant refusal, and (3) evaluation methodology for measuring agent robustness under both benign and adversarial conditions. We apply this framework to assess how current LLM-powered agents adhere to appropriate refusal boundaries across a range of web-based offensive security scenarios, finding that 6 of 8 frontier models tested refuse to identify vulnerabilities in real systems 0% of the time.

Abstract:
Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to effectively translate to the visual domain, mainly due to the difficulty and ambiguity of labeling visual concepts. In this paper, we introduce Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision by constraining a visual projector to map visual tokens into an LLM's preexisting, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector using the LLM's SAE reconstruction loss, VISTA achieves a threefold increase in the matching rate which measures how accurately the most activating textual concepts in the SAE space correspond to semantic elements in the image. Using this framework, we further analyze spatial localization properties of different vision encoders and show that DINOv2 features have significantly stronger localization abilities than other encoders. Leveraging this precision, VISTA enables fine-grained, localized concept steering, allowing specific objects to be removed or replaced while preserving the surrounding scene. This results in improvements of 38% in object removal and 58% in object replacement tasks over vision-only baselines. These contributions are validated across multiple LLM architectures.

Abstract:
Graph anomaly detection methods aim to distinguish anomalous nodes. While prior methods characterize anomalies through increased variation in the spectral energy distributions, they overlook those that result in decreased variation, i.e., camouflaged anomalies that appear normal. We show that this type of anomaly persists across multiple datasets and remains undetectable by existing spectral approaches. To address this limitation, we propose a nodelevel spectral energy formulation that is fully compatible with message passing and enables the detection of camouflaged anomalies. Building on this formulation, we introduce an energy-aware graph learning framework that models spectral shifts through energy-driven message passing in both static and time-series graphs. Besides, our unified architecture extends to temporal settings without introducing specialized sequence modules, enabling efficient learning under long sliding windows. Extensive experiments on large-scale benchmarks demonstrate the effectiveness and scalability of our approach.

Abstract:
LLM pretraining efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering fails to address embedding anisotropy. We introduceGEM(GeometricEntropyMixing), a framework reformulating data curation as a variational problem on the hypersphere augmented with amixing-balance regularizer. By decoupling the generative prior and optimizing the objective via a provableMM (Minorize-Maximize)algorithm, GEM effectively counteracts the cluster collapse to discover balanced semantic structures invisible to Euclidean heuristics. We employ teacher-student distillation to scale this geometric fidelity to web-scale corpora and introduce theGeometric Influence Score (GIS)for interpretable taxonomy generation. Experiments with 1.1B-parameter models demonstrate that GEM establishes a new state-of-the-art when integrated into mixing strategies like DoReMi and RegMix, improving average downstream accuracy by up to1.2%and offering a robust coordinate system for predictable data mixing.

Abstract:
There is growing interest in utilizing flowbased models as decision-making policies in reinforcement learning due to their high expressive capacity. However, effectively leveraging this expressivity for value maximization remains challenging, as naive gradient-based optimization requires backpropagating through numerical solvers and often leads to instability. Existing approaches typically address this issue by restricting the expressive capacity of flow-based policies, resulting in a trade-off between optimization stability and representational flexibility. To resolve this, we introduceQ-Flow, a framework that leverages the deterministic nature of flow dynamics to explicitly propagate terminal trajectory value to intermediate latent states along the policy-induced flow. This formulation enables stable policy optimization using intermediate value gradients without unrolling the numerical solver, effectively bridging the gap between stability and expressivity. We evaluate Q-Flow in the offline learning setting on the challenging OGBench suite, where it consistently outperforms state-of-the-art baselines by an average of10.6percentage points, while also enabling stable online adaptation within the same framework.

Abstract:
Pipeline parallelism is essential for largescale model training, but existing asynchronous approaches often degrade convergence due to parameter mismatch between forward and backward passes. We propose Asynchronous Multi-Directional Pipeline parallelism (AMDP) to mitigate this issue while sustaining high utilization. AMDP limits the first stage of each pipeline to process at most two minibatches before backpropagation, bounding the number of parameter updates between forward and backward passes. To alleviate the resulting pipeline bubbles, AMDP launches multiple concurrent pipelines and adapts their number according to pipeline depth. In addition, AMDP accumulates gradients across minibatches and applies them in a single update, ensuring that only a bounded number of minibatches experience parameter mismatch, limited to within one optimization step. Experiments on GPT- and BERT-style models demonstrate that AMDP significantly accelerates training while preserving convergence. The source code based on Megatron-LM is available at https://anonymous.4open.science/r/Megatron-AMDP-59A7.

Abstract:
We show that contrary to conventional wisdom in the community, graph neural networks (GNNs) are not continuous with respect to all natural modes of graph convergence. As a result, GNNs may generate substantially different latent representations for graphs that are very similar. In particular they assign vastly different latent embeddings to graphs that represent the same underlying object at different resolution scales. We trace this failure of continuity back to a structural obstruction arising from commonly used informationpropagation schemes. Building on this insight we then derive a principled modification to standard GNN architectures which equips models with continuity across scales. The proposed modification enables consistent integration of distinct resolutions and reliable generalization between them. We systematically validate our theoretical findings in a wide range of numerical experiments.

Abstract:
Training a family of large language models (LLMs), either from scratch or via iterative compression, is prohibitively expensive and inefficient, requiring separate training runs for each model in the family. In this paper, we introduce Star Elastic, a novel LLM posttraining method that adds N nested submodels to a given parent reasoning model using the compute of one run (Nx savings) via a single post-training job. Beyond reducing training costs, Star Elastic also addresses a fundamental limitation in efficient reasoning: the rigidity of static architectures, which forces the allocation of constant resources regardless of token difficulty. By unlocking elastic budget control, Star Elastic enables a novel approach that uses different submodels for each reasoning phase (thinking and answering). Star Elastic supports (1) nesting along the SSM, embedding channel, MoE and FFN axes, (2) learning nested submodels via an end-to-end trainable router, and (3) curriculum-based knowledge distillation. We apply Star Elastic to the NVIDIA Nemotron Nano models; in particular, we demonstrate its effectiveness on hybrid MoE architectures with Nemotron Nano v3 (30B/3.6A), generating 23B (2.8A) and 12B (2.0A) variants with 160B training tokens. For Nemotron Nano v2 (12B), we produce 9B and 6B nested models using only 110B training tokens, achieving a 360x reduction versus training from scratch and a 7x reduction over state-of-the-art compression methods. All nested models match or outperform independently trained baselines of comparable size. Crucially, elastic budget control advances the accuracy--latency Pareto frontier, achieving up to 16% higher accuracy and 1.9x lower latency via dynamic per-phase model selection.

Abstract:
Recent work suggests that LLMs can improve their abilities through selfevolution, using only internally generated supervision. A central open question, however, is not whether self-evolution can help, but: how far is it from oracle-supervised training under minimal assumptions? To address this question, we present a controlled empirical analysis of LLM self-evolution under a strict formulation: self-evolution is allowed access only to (i) an unlabeled prompt set and (ii) a base language model, with all supervision signals generated from this model. Under this formulation, we can evaluate many self-evolution approaches in a unified preference optimization framework. Specifically, we analyze four representative self-evolution methods, ranging from single-round verification to multi-turn feedback, iterative training, and curriculum learning. For our primary analysis, we use a clean setting based on the Knights and Knaves logical reasoning dataset, which provides deterministic solutions, systematic verification, and a hierarchy of difficulty levels that enables an evaluation of easy-to-hard generalization. Across this controlled setting, we find that increasingly complex self-evolution strategies yield consistent but limited gains. In general, a substantial performance gap persists relative to oracle supervision. One strategy stands out as effective: we nearly match the oracle performance by using a larger model (Gemma 12B) with iterative revision based on natural language feedback. We also study self-evolution on the OpenThoughts reasoning corpus and evaluate on standard problem-solving benchmarks. In this regime, self-evolution only leads to modest improvements, including when using more resource-intensive strategies or online RL. Overall, our results shed new light on the empirical limits of various types of self-evolution.

Abstract:
Standard Set Representation Learning methods typically excel on curated data but often overlook the challenge of Inferencetime Element Corruption. This refers to scenarios where deployed models encounter element-level degradations, such as outliers or missing components, that may distort the set representation and degrade performance. To address this, we propose SW-DRSO, a distributionally robust optimization framework tailored for sets. Rather than minimizing loss solely on the observed training data, SW-DRSO optimizes the worst-case expected loss over a family of plausible inference-time variations. We further introduce a barycentric adversary that transforms the intractable search for worst-case corrupted sets into a differentiable and efficient optimization process. Extensive experiments across four tasks demonstrate that SW-DRSO effectively enhances robustness against corruption while maintaining high overall performance.

Abstract:
The generation of immersive and navigable 3D environments is increasingly prevalent with the growing adoption of virtual reality and 3D content. However, recent methods face a fundamental limitation: they cannot produce 3D worlds that simultaneously (i) are navigable over longrange spatial extents and (ii) cover the complete omnidirectional field of view (360^\circ horizontally and 180^\circ vertically). To address this challenge, we introduce SphericalDreamer, a method for generating fully immersive and long-range 3D environments from textual prompts. Our approach is built on the generation of multiple panoramic images, which are subsequently lifted into 3D and fused together while maintaining visual and geometric consistency. SphericalDreamer produces highly detailed, fully immersive 3D environments, while substantially improving scale and navigability compared to prior approaches.

Abstract:
Geometryaware generative models and novel view synthesis approaches have shown strong potential in visual fidelity and consistency. In parallel, equivariant representation learning has emerged as a powerful framework for constructing latent spaces where analytically known group transformations could act directly, capturing geometric structure in data and enhancing both interpretability and generalization in novel view synthesis. However, we identify that existing approaches often suffer from latent misalignment, a discrepancy between the intended group action and the actually required transformations in the latent space. Consequently, the learned latents often fail to consistently preserve the equivariant relations imposed by the underlying group symmetry. To address this, we propose \emphResidual Latent Flow, a flow-based framework that corrects the misaligned latents, thereby improving compliance with the underlying equivariance relation. Our comprehensive experiments show that our method significantly reduces latent misalignment and improves novel view synthesis quality, under rotation groups \mathrmSO(n).

Abstract:
Confidence calibration for classification models is vital in safetycritical decision-making scenarios and has received extensive attention. General confidence calibration methods assume training and test data are independent and identically distributed (i.i.d.), limiting their effectiveness under covariate shifts. Previous calibration methods under covariate shift struggle with class-wise or canonical calibrations and often rely on unstable importance weighting when density ratios are large or unbounded. Given the above limitations, this paper rethinks confidence calibration under covariate shifts. First, we derive a necessary and sufficient condition for confidence calibration under covariate shifts, named Expectation consistency condition, which reveals covariate shifts do not necessarily lead to uncalibrated confidence and provides a weaker condition for confidence calibration than global covariate distribution alignment. Then, utilizing Expectation consistency condition, this paper proposes an unsupervised domain adaptation loss to calibrate confidence of the target domain, named Expectation consistency loss (ECL), which is compatible with canonical calibration, class-wise calibration, and top-label calibration. Third, we prove that computing ECL loss has the same sample complexity as Expected Calibration Error (ECE) and provide a theoretically grounded mini-batch trainable scheme for ECL loss. Finally, we validate the effectiveness of our method on both simulated and real-world covariate shift datasets.

Abstract:
Many stochastic physical systems evolve smoothly over time in the sense that the distribution of states changes regularly with time. The precise transition from current to next state is often modeled as the interplay of a smooth map and an explicit source of randomness. Stochastic Lifting leverages this premise by attaching an independent, highdimensional random label to each state transition in the training data and fitting a transition map from the current state and label to the next state using a standard regression loss. The labels act as auxiliary coordinates that let the model represent multiple plausible outcomes for similar current states, avoiding collapse to a mean prediction in the finite-sample size regime. At inference, drawing fresh labels and rolling the map forward generates diverse trajectories with a single network evaluation per time step, with the smoothness bias of the learned map supporting accurate sampling in practice.

Abstract:
The paradigm of learning from automatic annotations—driven by pretrained experts and Foundation Models—dominates data-hungry applications. However, it introduces a critical challenge: model-induced label noise. Unlike stochastic noise in classical robust learning, this noise stems from annotator inductive biases, manifesting as systematic errors tightly coupled with local feature manifolds. Existing methods relying on global transition matrices underfit these structural patterns, while learning instance-specific matrices remains mathematically intractable. We propose Model-Induced Noise Decoupling (MIND), a theoretically grounded framework addressing this dilemma. We demonstrate that the high-dimensional noise manifold can be decoupled into tractable, subspace-dependent components via Latent Manifold Disentanglement. Specifically, our Latent Decoupling Estimator (LDE) dynamically projects samples into latent structural clusters with consistent error modes, facilitating noise identifiability without ground-truth anchor points. To rigorously evaluate robustness, we adopt a hierarchical protocol: moving from controlled noise on CIFAR-100 to a structural stress test on large-scale real-world 3D datasets (S3DIS, ScanNet), where error patterns explicitly couple with geometric manifolds. Empirically, MIND significantly outperforms state-of-the-art methods on these complex benchmarks and effectively corrects zero-shot hallucinations from Vision-Language Models (e.g., OpenSeg), highlighting its potential as a robust distillation framework for Foundation Models.

Abstract:
Many learning tasks in science and engineering are characterized by sparse datasets, which limits the effectiveness of purely datadriven approaches. At the same time, these problems are often accompanied by rich domain knowledge derived from physical laws, operational requirements, and expert heuristics. Such knowledge is frequently expressed as rules involving logical propositions and linear inequalities. Existing neuro-symbolic methods typically enforce these rules approximately through soft penalties, assume input-independent rules when designing specialized architectures, or rely on non-differentiable post-processing at inference time to achieve hard constraint satisfaction. While recent advances in differentiable optimization layers enable end-to-end feasibility enforcement within neural networks, extending these approaches to logical or mixed-integer rules remains challenging due to inherent nonconvexity. In this work, we propose a unified end-to-end framework for enforcing hard, input-dependent mixed integer linear constraints within neural networks. Our approach represents rules as disjunctive constraints and applies hierarchical convex relaxations to obtain convex hull formulations. These relaxations yield tractable linear constraints that can be embedded as differentiable optimization layers while enabling exact rule satisfaction. We demonstrate the effectiveness of the proposed framework on real-world datasets, achieving perfect rule satisfaction and strong predictive performance.

Abstract:
Large language models (LLMs) demonstrate strong chainof-thought (CoT) reasoning abilities, while smaller models (\leq 3B parameters) significantly underperform on multi-step reasoning tasks. Based on empirical analyses of the Qwen-2.5 model family on math reasoning benchmarks, we find that more proficient reasoning is associated with fewer reasoning steps but higher information density per step, a property we term Dense Reasoning. Motivated by this observation, we propose DenseSteer, a training-free inference-time steering framework that enhances small-model reasoning by modulating internal representations toward dense reasoning patterns. Experiments show that our method yields consistent accuracy improvements without increasing token-level Negative Log-Likelihood, highlighting dense reasoning as an effective structural approach to mathematical problem solving.

Abstract:
While Flow Matching theoretically guarantees constantvelocity trajectories, we identify a critical breakdown in high-dimensional practice: the Velocity Deficit. We show that the MSE objective systematically underestimates velocity magnitude, causing generated samples to fail to reach the data manifold—a phenomenon we term Integration Lag. To rectify this, we propose Initial Energy Injection, instantiated via two complementary methods: the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC). Both are grounded in our discovery of a crucial asymmetry: velocity contraction causes harmful kinetic stagnation at the trajectory's start, yet acts as a beneficial denoising mechanism at its end. Empirically, SSC yields significant efficiency gains with zero retraining and just one line of code. On ImageNet-1k (256x256), it improves FID by 44.6% (from 13.68 to 7.58) and achieves a 5x speedup, enabling a 50-step generator (FID 7.58) to beat a 250-step baseline (FID 8.65). Furthermore, our methods generalize to Text-to-Image tasks and high-resolution generation, improving FID on MS-COCO by ~22%.

Abstract:
In this paper, we introduce a model of evolution and learning in robots that cooptimizes a distribution of latent design vectors (genotypes) and a mixture of control experts (neural modules), which are gated by the latent coordinates of each decoded design (phenotype). This provides a scalable alternative to co-design algorithms that either train an individual policy for every robot, which is inefficient, or a monolithic universal controller for all robots, which results in overly conservative structures and behaviors. Our approach lies somewhere between these two extremes, preserving ancestral knowledge in a unified yet modular framework in which different body plans activate and deactivate different combinations of learned sensorimotor circuits for goal-directed behavior. This allows one part of the controller to be overhauled to better suit new species of designs as they emerge without disrupting the hard-earned knowledge contained within other expert modules. Pretrained expert policies can also be directly plugged into the mixture, providing a simple mechanism to indirectly steer evolution into areas of latent space containing desired morphological traits. We refer to this process as "evolution by demo" and use it to direct evolution toward the canonical form defined by the pretrained policy.

Abstract:
The rise in deployment of large language models has driven a surge in GPU demand and datacenter scaling, raising concerns about electricity use, grid stress, and the impacts of modern AI workloads. Distillation is often promoted as one of the most effective paths to obtain cheaper, more efficient models, yet these claims rarely account for the full endto-end energy and resource costs, including crucial teacher-side workloads such as data generation, logit caching, and evaluation. We present a comprehensive energy accounting framework that measures the complete computational cost of distillation pipelines via detailed stage-wise tracking of GPU device power consumption. In our experiments, we separate and log empirical energy use across distinct phases and systematically measure the energy and emissions of two common distillation methods: the classic logit-based knowledge distillation and synthetic-data supervised fine-tuning, constructing energy–quality–throughput Pareto frontiers that expose the previously ignored costs. From these measurements and analyses, we derive practical design rules for selecting distillation methods and hyperparameters under energy and budget constraints, and release an open-source measurement harness and accounting protocol to provide a standardized foundation for comparable, reproducible distillation research, explicitly accountable for complete pipeline energy impact.

Abstract:
We study the posterior contraction rate of Bayesian PhysicsInformed Neural Networks (PINNs) for solving a general class of elliptic partial differential equations (PDEs). We focus on learning of the elliptic equation with a non-homogeneous Dirichlet boundary condition from independent and noisy measurements collected both inside the domain and on the boundary. Assuming that the PDE admits a strong solution in a Hölder space and using with a suitably constructed prior on the neural network weights, we prove that the posterior distribution concentrates around the exact solution at a near-minimax rate. Furthermore, the chosen prior israte-adaptive: the posterior contracts at an (almost) optimal rate without prior knowledge of the smoothness level of the exact solution. Our results provide statistical guarantees for uncertainty quantification of PDEs via Bayesian PINNs.

Abstract:
We address the challenge of enabling robots to manipulate deformable linear objects (DLOs), such as ropes, cables, and rubber bands. Prior work has primarily focused on narrow, taskspecific problems, often relying on real-world demonstrations or handcrafted heuristics. Such approaches, however, struggle to scale to the wide variety of materials and tasks encountered in practice, and collecting sufficiently diverse real-world data is often impractical. Additionally, existing simulation environments offer limited support for the broad spectrum of material behaviors necessary for generalizable DLO manipulation. To overcome these limitations, we introduce a differentiable simulator explicitly designed for versatile DLO manipulation. Our simulator models a wide range of material properties—including (in)extensibility, elasticity, bending plasticity, and complex interactions with other objects—providing a robust foundation for learning and evaluating manipulation skills. Building on this simulator, we propose a benchmark suite of representative tasks that highlight the unique challenges of DLO manipulation. The successful execution of these tasks is often hindered by the topological complexity and grasp sensitivity inherent to DLOs. Therefore, we introduce a specialized DLO agent that explicitly manages these challenges by proposing strategic grasping points and decomposing long-horizon tasks to maximize control authority. Finally, we evaluate various policy-learning algorithms using our framework, alongside sim-to-real transfer experiments, demonstrating our platform's potential to advance DLO manipulation.

Abstract:
Many phenomena in science and engineering involve unknown latent functions observed through finite, irregularly sampled measurements. Neural processes (NPs) offer a powerful framework for probabilistic functional inference by bridging stochastic processes with deep learning. In many domains, these functions exhibit symmetriesmost notably translation equivariance---that can be exploited to improve sample efficiency and generalization. Existing translation-equivariant NPs, however, have two key limitations: (i) they are constructed by stacking generic components with nonlinearities, obscuring the induced function class, thus limiting interpretability; and (ii) convolutional designs are based on localized receptive fields and require dense discretization, while attention-based methods avoid these issues but scale quadratically with the number of observations. We address these challenges through two contributions. First, we characterize continuous translation-equivariant operators through their Volterra expansions, representing them as sums of higher-order convolutions. This yields analytical transparency while remaining amenable to efficient approximation by first-order convolution operators. Second, we introduce set Fourier convolutions (SFConvs), a frequency-domain parameterization that operates directly on irregularly sampled sets. SFConvs achieve global receptive fields without spatial discretization and scale linearly in the number of observations. Building on these ideas, we propose two families of conditional NPs (CNPs): SFConvCNPs, constructed by stacking SFConv blocks with nonlinearies, and SFVConvCNPs, which integrate the Volterra formulation. Experiments on synthetic and real-world datasets demonstrate the efficacy of our methods compared to the state-of-the-art baselines.

Abstract:
Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization in scientific knowledge remains undercharacterized. Existing benchmarks primarily focus on single-turn restricted scenarios, failing to capture the capability boundaries exposed by real-world interactive scientific workflows. To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning. We formalize the composition order and mixture structure to enable systematic stress-testing from single-discipline to inter-disciplinary, comprising 8,598 interactive sessions across 20 domains and 4 task categories, with 8 realistic trajectory patterns covering difficulty and domain-mixture dynamics, simulating real AI4S scenarios. Large-scale evaluation of LLMs reveals a systematic reasoning collapse as composition order increases, stemming from two root causes: (i) direct difficulty increases induced by domain composition, and (ii) indirect interaction-amplified failures where trajectory patterns trigger error accumulation, reasoning breaks, and domain confusion, ultimately leading to session collapse.

Abstract:
Federated Learning (FL) with pretrained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging its strong representations, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic update and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce a two-stage local fine-tuning, where a supervised fine-tuning stage enables rapid and reliable warm-start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, including label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.

Abstract:
Generating highfidelity 3D indoor scenes remains a significant challenge due to data scarcity and the complexity of modeling intricate spatial relations. Current methods often struggle to scale beyond training distribution to dense scenes or rely on Large Language Models (LLMs) that lack the ability for precise spatial reasoning. Building on top of the observation that object placement relies mainly on local dependencies instead of information-redundant global distributions, in this paper, we proposePair2Scene, a novel procedural generation framework, for scene generation based on a set oflearned procedural rules. These rules mainly capture two types of inter-object relations, namelysupport relationsthat follow physical hierarchies, andfunctional relationsthat reflect semantic links. We model these rules through a network, which estimates spatial position distributions of dependent objects conditioned on position and geometry of the anchor ones. Accordingly, we curate a dataset3D-Pairsfrom existing scene data to train the model. During inference, our framework can generate scenes by recursively applying our model. Extensive experiments demonstrate that our framework outperforms existing methods in generating complex environments that go beyond training data while maintaining physical and semantic plausibility.

Abstract:
Alwayson edge systems must keep learning as conditions change under tight compute budgets and must detect unreliable predictions. Bayesian binary neural networks are attractive in this setting, but mean-field Bernoulli posteriors can saturate on long non-stationary streams, wiping out epistemic uncertainty and freezing plasticity. We propose BiMU, derived from a bounded-memory variational objective that balances stability, plasticity, and forgetting. BiMU combines a data term with controlled relaxation toward the prior and an uncertainty-dependent step size that prevents saturation and sustains informative uncertainty. This non-degenerate posterior enables fully online, buffer-free active querying via Monte Carlo disagreement, reducing label queries and backpropagation updates under imbalance. BiMU sustains learning and strong OOD detection on 1000-task Permuted-MNIST, and on OpenLORIS-Object achieves up to 32× label/update savings at matched accuracy under class imbalance and feature compression.

Abstract:
Mixtureof-Experts (MoE) models achieve remarkable performance by sparsely activating specialized experts, yet their massive parameters in experts pose significant challenges for deployment. While low-rank quantization offers a promising route to compress MoE models, existing methods still incur nonnegligible memory overhead and inference latency. To address these limitations, we propose TileQ, a fine-tuning-free post-training quantization (PTQ) method that employs 2D-tiling structured low-rank quantization to share low-rank factors across both input and output dimensions of MoE experts. Furthermore, we introduce an efficient inference technique for TileQ that fuses multiple low-rank expert computations into a single-pass operation, significantly improving hardware utilization. Experiments show that TileQ cuts down additional memory usage up to 10x and reduces inference latency to 5% while preserving state-of-the-art accuracy.

Abstract:
Reconstructing dynamic 3D scenes from blurry monocular videos is challenging because motioninduced blur entangles object motion and geometry, hindering geometric consistency. We present Kinematics-GS, a kinematics-aware framework that models blur as motion-aligned deformation and introduces a kinematic prior to reparameterize Gaussian shapes along motion trajectories, thereby mitigating degenerate shape collapse without auxiliary motion supervision. To stabilize optimization, we decompose scenes into dynamic and static components using temporal deformation variance and employ a coarse-to-fine deformation strategy to capture both global motion and fine-grained details. We also introduce a challenging real-world dataset of deformable and elastic objects exhibiting non-rigid motion with spatially non-uniform motion blur that obscures geometric cues. Extensive experiments on real-world benchmarks with realistic motion blur demonstrate that Kinematics-GS outperforms prior methods by a clear margin in monocular dynamic scene reconstruction, highlighting its effectiveness in handling complex and non-rigid motion scenarios.

Abstract:
Evolutionary algorithms for adversarial attacks leverage populationbased search to discover perturbations without gradient information, but suffer from inefficient crossover operations that destroy adversarial properties through discrete interpolation. We introduce Mode Connectivity Evolutionary Attack (MoCo-EA), which replaces traditional crossover with a novel Bézier crossover operator that optimizes perturbations along a continuous Bézier curve between parent perturbations. Our key insight is that adversarial examples lie on connected manifolds where intermediate points maintain, and often enhance attack effectiveness. We demonstrate three findings: (1) Successful adversarial perturbations exhibit mode connectivity; (2) Intermediate points along optimized paths achieve higher transferability than endpoints; (3) Bézier crossover dramatically outperforms discrete genetic operations while reducing convergence time and query requirements. By exploiting the geometric structure of adversarial space through path optimization, MoCo-EA provides an efficient and reliable method. Our work challenges the traditional view of adversarial examples as isolated points and opens new directions for both attack generation and defense research.

Abstract:
Lagrangian Relaxation (LR) is a powerful technique for solving largescale Mixed Integer Linear Programming (MILP), particularly those with decomposable structures like Vehicle Routing or Unit Commitment. By relaxing coupling constraints, LR enables parallel solving of subproblems and frequently yields tighter dual bounds than standard linear programming relaxations, which is crucial for efficient branch-and-bound pruning. While recent empirical works showed promising results using machine learning to predict these multipliers, a theoretical understanding of such methods remains an open question. In this work, we bridge this gap by analyzing the problem of learning LR through the lens of Data-driven Algorithm Design, i.e., a statistical learning problem over a distribution of problem instances. Our contributions are as follows: first, we derive a generalization bound of \cO(s^1.5/\sqrtN) for the learned multipliers, where s is the number of coupled constraints and N is the sample size. Second, we provide a minimax lower-bound of \Omega(s/\sqrtN), proving that a linear dependency is unavoidable. Finally, we extend our framework to the problem of learning to warm-start sub-gradient ascent.

Abstract:
Internet of Things (IoT) systems continuously collect heterogeneous sensing signals from ubiquitous sensors to support intelligent applications such as human activity analysis, emotion monitoring, and environmental perception. These signals are inherently nonstationary and multi-scale, posing unique challenges for standard tokenization techniques. This paper proposes Dywave, a dynamic tokenization framework for IoT sensing signals that constructs compact input representations aligned with intrinsic temporal structures and underlying physical events. Dywave leverages wavelet-based hierarchical decomposition, identifies meaningful temporal boundaries corresponding to underlying semantic events, and adaptively compresses redundant intervals while preserving temporal coherence. Extensive evaluations on five real-world IoT sensing datasets across activity recognition, stress assessment, and nearby object detection demonstrate that Dywave outperforms state-of-the-art methods by up to 12% in accuracy, while improving computational efficiency by reducing input token lengths by up to 75% across mainstream sequence models. Moreover, Dywave exhibits improved robustness to domain shifts and varying sequence lengths.

Abstract:
Sampling from discrete distributions with multiple modes and energy barriers is fundamental to machine learning and computational physics. Recent discrete neural samplers like MDNS suffer from mode collapse and fail to sample highenergy barrier regions between modes, which is critical for free energy estimation and understanding phase transitions. We propose Metadynamics Discrete Neural Sampler (MetaDNS), a general framework integrating well-tempered metadynamics into discrete diffusion or autoregressive samplers. By maintaining an adaptive, history-dependent bias potential along selected low-dimensional coordinates, MetaDNS forces exploration of previously inaccessible regions, enabling free energy reconstruction infeasible with standard neural samplers due to a lack of high-energy samples. On challenging low-temperature benchmarks including Ising, Potts, and the copper-gold binary alloy, MetaDNS reproduces the thermodynamic distribution. Compared to MCMC-based metadynamics, MetaDNS also achieves comparable exploration requiring fewer energy evaluations.

Abstract:
Bayesian optimization (BO) selects evaluation points for expensive blackbox objectives using Gaussian process (GP) predictive distributions. Kernel choice and hyperparameter selection can lead to miscalibrated predictive distributions, which can distort the exploration--exploitation trade-off. In the minimization setting, sampling criteria such as expected improvement (EI) depend on the predictive lower tail and can therefore be sensitive to miscalibration. This article studies goal-oriented calibration of GP predictive distributions below a low threshold t in the noiseless setting, complementing standard GP modeling with hyperparameters selected by maximum likelihood. A framework for predictive reliability below t is introduced, based on two notions of spatial calibration: occurrence calibration over the design space and thresholded \mu-calibration on the sublevel set \lbrace x\in\mathbbX, f(x)\le t \rbrace. Building on this framework, we propose tcGP, a post-hoc method that calibrates GP predictive distributions below t, and we establish a convergence result for the resulting EI-based global optimization algorithm. Experiments on standard benchmarks show improved lower-tail calibration and BO performance relative to standard GP models and global calibration GP models.

Abstract:
Flowbased text-to-image models follow deterministic trajectories, making it costly to explore diverse modes under limited sampling budgets. Existing approaches to improving diversity often rely on retraining or degrade image fidelity. To address this limitation, we present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our core insight is to encourage diversity through guidance that is geometrically decoupled from the model’s quality-seeking direction. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Theoretically, we show that this design monotonically increases a volume surrogate while approximately preserving the marginal distribution, providing a principled explanation for the robustness of generation quality. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves diversity metrics such as the Vendi Score and Brisque over strong baselines, while upholding image quality and alignment.

Abstract:
Unsupervised skill discovery (USD) aims to learn diverse behaviors without reward functions, but often results in taskirrelevant or hazardous behaviors due to uniform exploration. Guided skill discovery (GSD) addresses this issue by incorporating human intent to focus exploration on meaningful regions. However, existing GSD methods typically require training additional guidance models from scratch, which can be ineffective with sparse human feedback. To tackle this, we propose COLLIE, a GSD framework that utilizes sparse human feedback effectively by constructing a semantically coherent skill latent space. The semantical coherence property enables a training-free guidance signal construction, eliminating the need for additional model training beyond skill learning. Furthermore, as this property is derived from dense unsupervised data, the latent space is well-structured, ensuring reliability even with sparse human feedback. Theoretical analysis justifies the effectiveness of our training-free guidance signal, while experiments across diverse state-based and pixel-based tasks show that COLLIE learns diverse, human-aligned skills, avoids hazardous behaviors, and achieves superior downstream performance with minimal human feedback.

Abstract:
Simulationbased inference (SBI) is transforming experimental sciences by enabling parameter estimation in complex non-linear models from simulated data. A persistent challenge, however, is model misspecification: simulators are only approximations of reality, and mismatches between simulated and real data can yield biased or overconfident posteriors. We address this issue by introducing Flow Matching Corrected Posterior Estimation (FMCPE), a framework that leverages the flow matching paradigm to refine simulation-trained posterior estimators using a small set of calibration samples. Our approach proceeds in two stages: first, a posterior approximator is trained on abundant simulated data; second, flow matching transports its predictions toward the true posterior supported by real observations, without requiring explicit knowledge of the misspecification. This design enables FMCPE to combine the scalability of SBI with robustness to distributional shift. Across synthetic benchmarks and real-world datasets, we show that our proposal consistently mitigates the effects of misspecification, delivering improved inference accuracy and uncertainty calibration compared to standard SBI baselines, while remaining computationally efficient.

Abstract:
The docalculus defines a general system of inference for interventional queries, allowing causal quantities to be transformed through successive applications of its rules. This process induces a rich space of equivalent interventional expressions, but combining and ordering these rules remains challenging. In this work, we introduce derivation graphs, which represent how do-calculus rules are applied and combined, and characterize the full space of observational and interventional probabilities which are equivalent under the do-calculus. The structure of these graphs yields a simple procedure that uses at most four applications of do-calculus rules. Finally, we show how applying identification algorithms to equivalent causal queries produces multiple valid estimands for the same causal quantity, eventually yielding more efficient estimators.

Abstract:
Datadriven algorithm design automates hyperparameter tuning, but its statistical foundations remain limited because model performance can depend on hyperparameters in implicit and highly non-smooth ways. Existing guarantees focus on the simple case of a one-dimensional (scalar) hyperparameter. This leaves the practically important, multi-dimensional hyperparameter tuning setting unresolved. We address this open question by establishing the first general framework for establishing generalization guarantees for tuning multi-dimensional hyperparameters in data-driven settings. Our approach strengthens the generalization guarantee framework for semi-algebraic function classes by exploiting tools from real algebraic geometry, yielding sharper, more broadly applicable guarantees. We then extend the analysis to hyperparameter tuning using the validation loss under minimal assumptions, and derive improved bounds when additional structure is available. Finally, we demonstrate the scope of the framework with new learnability results, including data-driven weighted group lasso and weighted fused lasso.

Abstract:
Prepropagation graph neural networks (PP-GNNs) decouple node feature propagation from transformation: graph diffusion is performed once as preprocessing, and training reduces to dense per-node transformations. This design enables mini-batch training without inter-node dependencies, avoids repeated sparse matrix--matrix multiplications, and better matches modern accelerators optimized for dense compute. However, their expressivity remains unclear, and empirical results show a gap between PP-GNNs and their message-passing counterparts on commonly used graph benchmarks, especially heterophilic ones. In this paper, we propose a suite of robust graph diffusion operators for preprocessing and a few-shot hidden-state re-propagation scheme during training. Our methods improve the validation and test accuracy of PP-GNNs, enabling them to match the accuracy of message-passing GNNs while maintaining training efficiency.

Abstract:
Coupled spatiotemporal forecasting is important for predicting the future evolution of multiple interacting dynamical systems, such as in climate models. However, existing methods are severely constrained by the persistent bottleneck of compounding errors. In coupled systems, errors from each subsystem simulator propagate and amplify one another, a phenomenon we term Reciprocal Error Amplification leading to a rapid collapse of longrange predictions. To address this challenge, we propose a universal framework called PnP-Corrector (Plug-and-Play Corrector). The core idea of our framework is to decouple the physical simulation from the error correction process: it freezes pre-trained physics simulation engines and exclusively trains a lightweight correction agent to proactively counteract the systematic biases emerging from the coupled system. Furthermore, we design an efficient predictive model architecture, DSLCast, to serve as the backbone of this framework. Extensive experiments demonstrate that our method significantly enhances the long-term stability and accuracy of coupled forecasting systems. For instance, in the challenging task of a 300-day global ocean-atmosphere coupled forecast, our PnP-Corrector framework reduces the prediction error of the baseline model by 28% and surpasses state-of-the-art models on several key metrics. Codes link: \urlhttps://anonymous.4open.science/r/Anonymous_PnP-E365.

Abstract:
Existing onestream Transformer-based visual trackers localize targets by training a classification head with a handcrafted spatial prior encoded as a heatmap. However, this heuristic supervision merely serves as a surrogate objective, which misaligns with evaluation metrics such as IoU and AUC. To address this limitation, we propose RELO, a reinforcement-learning tracking framework that formulates target localization as a decision-making problem within the Transformer-based tracking paradigm. Unlike prior-driven localization learning, RELO performs sequence-level reinforcement learning to optimize localization behavior using both instantaneous IoU and sequence-level AUC rewards, better aligning the training objective with real evaluation criteria. As a result, RELO not only eliminates the need for handcrafted heatmaps, but also achieves superior performance. For instance, RELO attains 57.5% AUC on LaSOT_\mathrmext without template updates, establishing a new state-of-the-art performance. Code and models will be made available.

Abstract:
Reinforcement learning (RL) is often hierarchical: an upperlevel (UL) learner selects model parameters while a lower-level (LL) decision-making process responds, forming a nested two-level optimization structure captured by bilevel optimization. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and thus miss the competitive structure in applications such as incentive design where multiple policies interact. We study a class of bilevel optimization problems whose LL is a regularized min--max zero-sum Markov game, and whose UL optimizes through the saddle-point equilibrium induced by the LL game. We propose a penalty-based first-order policy-gradient method built on the Nikaido–Isoda function, termed penalty-augmented Nikaido–Isoda descent–ascent (PANDA), which avoids UL hypergradients and requires no second-order information by exploiting the min--max game structure. We prove that PANDA converges to stationary points of this bilevel problem without restrictive assumptions such as convexity on either the UL or LL objectives. Moreover, PANDA reaches an \epsilon-stationary point in \tilde\mathcalO(\epsilon^-1) iterations with sample complexity \tilde\mathcalO(\epsilon^-3), matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments further demonstrate superior performance over closely related baselines.

Abstract:
Offlineto-Online Reinforcement Learning (O2O-RL) leverages an offline, pre-trained policy to minimize costly online interactions. Although data-efficient, O2O-RL is susceptible to shifts between offline and online distributions. Existing work aims to mitigate the harm of this shift by finetuning the policy on trajectory data sampled from a diffusion model. Inspired by this line of work, we propose DUAL: an efficient Diffusion Uncertainty-Aware Actor-Critic framework for O2O-RL. DUAL utilizes the prior knowledge of the diffusion model to distill a fast-sampling diffusion actor policy and transition model in the offline phase. DUAL also employs a Laplace approximation and distance transition-state-shift detection, thereby using uncertainty quantification to improve exploration versus exploitation in the online phase. We formally show that our actor loss with the Laplace approximation provides a valid estimate of epistemic uncertainty. Empirically, DUAL improves online expected return over O2O-RL baselines across MuJoCo, AntMaze, Frozen-Lake, and Adroit environments.

Abstract:
Many striking phenomena in deep learning, such as linear mode connectivity and the structured behavior of training dynamics, are closely tied to parameter symmetries: transformations that leave the realized function unchanged. Despite growing attention to structural parameter symmetries, the exact interplay between parameters, data, and representations remains underexplored. To investigate this, we develop a theoretical framework of effective function classes defined by the neurons' induced functions restricted to the representation subspace. We then formalizeeffective symmetry breakingvia neuron identifiability across independent training runs. Our analysis shows that neural networks can admit large families of approximately equivalent solutions even instructurally asymmetricmodels. This allows us to disentangle the effects of dataspecific and architectural symmetries. We further show that neuron identifiability enables representation mergingwithout prior alignment, and characterize when such merging admits a linear low-loss connecting path. These findings highlight the role of effective function classes in affecting the loss landscape.

Abstract:
Textto-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such methods. Our further analysis then provides multiple indications that memorization is indeednotinherently local: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the fundamental nature of memorization in text-to-image DMs and inform the future development of more reliable mitigation methods against DM memorization.

Abstract:
Theoretical properties of reinforcement learning algorithms are most commonly studied under the Markov assumption. This is unrealistic, as most environments encountered in practice are either partially observable, or require function approximation that restricts the agent to access nonMarkovian state features. We consider the problem of learning an optimal reactive policy in a finite environment under deterministic observations (or equivalently, hard state aggregation). We introduce a new algorithm, _Committed Q-learning_, and prove almost sure convergence to the optimal reactive policy under an intuitive assumption we call _rewire-robustness_. This assumption is strictly weaker than the q_\star-realizability condition used in prior work. Our algorithm is a variant of classical Q-learning in which the behavior policy commits to a single action upon entering a feature, and only resamples actions when the observed feature changes.

Abstract:
This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops bestof-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a path-length measure, as well as variance-based measures for the stochastic regime. To adapt to these measures, we develop algorithms based on global optimization and policy optimization, both built on optimistic follow-the-regularized-leader with log-barrier regularization. For global optimization, our algorithms achieve first-order, second-order, and path-length regret bounds in the adversarial regime, and in the stochastic regime, they achieve a variance-aware gap-independent bound and a variance-aware gap-dependent bound that is polylogarithmic in the number of episodes. For policy optimization, our algorithms achieve the same data- and variance-dependent adaptivity, up to a factor of the episode horizon, by exploiting a new optimistic Q-function estimator. Finally, we establish regret lower bounds in terms of data-dependent complexity measures for the adversarial regime and a variance measure for the stochastic regime, implying that the regret upper bounds achieved by the global-optimization approach are nearly optimal.

Abstract:
Ensuring fairness in algorithmic ranking systems is a critical challenge with significant societal implications for hiring, recommendations, web search, and data management. Standard methods for aggregating multiple preference orders into a consensus ranking may perpetuate and even amplify the lack of representation of underrepresented groups. To address this, recent research has focused on incorporating fairness constraints to ensure the presence of different groups in the topk positions of the final aggregate ranking. We study two fairness-aware variants under the well-known Spearman footrule, which corresponds to the L_1 distance between rankings. First, we address the practically salient task of computing a fair aggregate top-k ranking -- crucial in settings like recommendations and hiring where selection is primarily based on the top-k results -- and present the first optimal algorithm for this problem. Second, we consider fair (full) rank aggregation over all candidates (not specifically on top-k). We already know of a 3-approximation for this fair rank aggregation variant (Wei et al., SIGMOD’22; Chakraborty et al., NeurIPS’22), whereas an exact algorithm exists for the corresponding unconstrained (unfair) version (Dwork et al., WWW’01). Closing the computational gap between fair and unconstrained rank aggregation has remained a tantalizing open problem. We make significant progress by giving a 2-approximation algorithm for fair (full) rank aggregation, improving substantially over the previous 3-approximation. Further, we complement our theoretical contributions with experiments on different real-world datasets, which corroborate our theoretical results and demonstrate strong empirical performance relative to state-of-the-art baselines.

Abstract:
Graph foundation models (GFMs) aim to reuse a single backbone across diverse graph domains, yet their transfer is often uneven and can exhibit negative transfer. While most prior work improves transfer through architectural or adaptation choices, we ask a datacentric question:which properties of two graph domains determine how much a fixed representation model changes its outputs?Using a graphon-based continuous limit for dense graphs, we show that for both set-based and message-passing tokenizations, any Lipschitz backbone admits an explicit decomposition of cross-domain output shift into (i) graph-specific finite-sample approximation terms and (ii) an intrinsic, relabeling-invariant domain discrepancy capturing structural mismatch. A key ingredient is positional-encoding (PE) stability: we establish stability guarantees for spectral PEs and highlight contrasting behaviors of eigenvector- versus subspace-based PEs. Experiments on synthetic and real graphs validate the theory and translate the decomposition into guidance for data curation in GFM transfer.

Abstract:
Openset supervised anomaly detection (OSAD) aims to identify unseen anomalies using limited anomalous supervision. However, existing prototype-based methods typically model normal data via a unimodal Gaussian prior, failing to capture inherent multi-modality and resulting in blurred decision boundaries. To address this, we propose Mixture Prototype Flow Matching (MPFM), a framework that learns a continuous transformation from normal feature distributions to a structured Gaussian mixture prototype space. Departing from traditional flow-based approaches that rely on a single velocity vector, MPFM explicitly models the velocity field as a Gaussian mixture prior where each component corresponds to a distinct normal class. This design facilitates mode-aware and semantically coherent distribution transport. Furthermore, we introduce a Mutual Information Maximization Regularizer (MIMR) to prevent prototype collapse and maximize normal-anomaly separability. Extensive experiments demonstrate that MPFM achieves state-of-the-art performance across diverse benchmarks under both single- and multi-anomaly settings.

Abstract:
We investigate unpaired image inverse problems, a challenging setting where only independent, nonpaired sets of noisy measurements and clean target signals are available for training. We propose a novel inverse problem solver based on Unbalanced Optimal Transport, calledUnbalanced Optimal Transport Map for Inverse Problems (UOTIP). Our method formulates the reconstruction task—predicting clean target signals from noisy measurements—as learning a UOT Map from noisy measurement distribution to clean signal distribution by incorporating a likelihood-based cost function. By relaxing the exact marginal constraint, the UOT framework provides key advantages to our model: robustness to multi-level observation noise, adaptability to class imbalance between noisy and clean datasets, and generalizability to diverse noise-type scenarios. Furthermore, we theoretically demonstrate that incorporating a quadratic cost term ensures the existence and uniqueness of the transport map by satisfying the twist condition, even for ill-posed inverse problems. Our experiments demonstrate that UOTIP achieves state-of-the-art performance on unpaired image inverse problem benchmarks, across linear and nonlinear inverse problems.

Abstract:
Singlecell perturbation modeling is fundamental for understanding and predicting cellular responses to genetic perturbations. However, existing approaches, from causal representation learning to foundation models, often struggle with an overlooked challenge: gene expression is dominated by perturbation-invariant information, while perturbation-specific signals are intrinsically sparse. As a result, learned representations either entangle invariant and perturbation-specific information, leading to spurious and non-generalizable predictors, or suppress perturbation-specific signals altogether, rendering them ineffective for prediction. To address this, we propose PerturbedVAE, a general framework designed to resolve this signal imbalance. The framework explicitly separates perturbation-specific information from dominant invariant structure and recovers causal representations to effectively utilize such information for prediction. We further provide an identifiability analysis that characterizes the conditions under which sparse perturbation effects can be reliably recovered, thereby clarifying how the framework can be concretely specified under such conditions. Empirically, PerturbedVAE achieves state-of-the-art performance on a widely used benchmark across multiple evaluation settings, yielding significant gains on out-of-distribution combinatorial predictions and uncovering interpretable perturbation-response programs.

Abstract:
In collaborative inference (CI), transmitting intermediate representations Z from edge devices enables model inversion attacks (MIA) that reconstruct the original inputs X, while existing defenses mainly perturb shallowlayer Z at the cost of utility. We instead ask: where should an edge–cloud model be partitioned to obtain intrinsic resistance to MIA? We challenge the intuition that depth is the driver of MIA resistance, and show that depth is sufficient only insofar as it enables a representational transition; this transition is necessary for intrinsic resistance and is marked by an abrupt rise in the lower bound of H(X|Z). Correspondingly, the decisive variance term in the entropy bound shifts from a global variance to the intra-class mean-squared radius R^2_c rather than dimensionality alone, yielding an R^2_c-based criterion to locate the transition zone, or identify it post hoc from MIA outcomes, which we term the Golden Partition Zone (GPZ). We further explain how R^2_c evolves during training and show that it can be controlled through the label distribution; we refer to this controllable dynamics as the Neural Vortex. Across four representative deep vision models, partitioning at the GPZ yields over 4× higher reconstruction MSE compared to shallow splits; under entropy and inversion-model enhancements, decision-level representations provide 66% stronger resistance than feature-level ones, and we further observe that data type affects both the transition boundary and reconstruction.

Abstract:
Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but challenged by the negative transfer inherent in monolithic distillation. To address these feature conflicts, we introduce PRISM, a novel dualstream Mixture-of-Experts (MoE) framework that synergizes VFMs via modular specialization. We propose a two-stage paradigm: (1) expertise deconstruction, where a teacher-conditional router guides experts to specialize in distinct representational subspaces to mitigate interference, followed by (2) dynamic recomposition, where the router learns to assemble these experts into tailored computational pathways for downstream tasks. Experiments on PASCAL-Context and NYUD-v2 show that PRISM establishes a new state of the art, validating that sparse, emergent specialization is a scalable approach for integrating diverse visual knowledge.

Abstract:
State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating lineartime complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases the complexity. Second, the broader adoption of vision SSMs is hindered in domains that require query-based interactions between distinct information streams. This is a result of the inherently causal and self-referential nature of SSMs designed for 1D sequence modeling tasks. This fusion mechanism is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context adaptive method that dynamically augments the spatial structural information while maintaining the linear complexity of SSMs. Deformba also allows multi-modal fusion, analogous to standard cross attention. To demonstrate the effectiveness and general applicability of Deformba, we test its performance on general 2D vision tasks such as image classification, object detection, instance segmentation, and semantic segmentation, as well as 3D vision tasks like BEV perception. Extensive experiments show that Deformba achieves strong performance across various visual perception benchmarks.

Abstract:
Large Language Model (LLM)based agents demonstrate strong reasoning and execution capabilities on complex tasks when guided by structured instructions, commonly referred to as workflows. However, existing workflow-assisted agent serving systems typically rely on predefined templates and shallow matching mechanisms, which limit their ability to capture deep semantic relationships and generalize to previously unseen tasks. To address these limitations, we propose a new workflow management paradigm that represents workflows using a unified graph, termed wGraph, where each node corresponds to an atomic operation. wGraph serves as a shared substrate from which task-specific workflows are dynamically instantiated. Building on wGraph primitives, we introduce GraphFlow, a system that efficiently integrates workflows into agent serving through two key designs. First, adaptive workflow generation dynamically constructs workflows from wGraph based on task semantics and constraint requirements. Second, workflow state management exploits wGraph structure to efficiently manage Key-Value (KV) caches, reducing redundant computation during agent serving. Extensive experiments across five benchmark datasets show that GraphFlow consistently outperforms state-of-the-art methods, yielding an average performance improvement of approximately 4.95 percentage points, while achieving an approximately 4× reduction in memory footprint.

Abstract:
LLM serving frameworks are quickly evolving with a complex software stack and a vast number of optimizations. The rapid development process can introduce silent errors where output quality silently degrades without any explicit error signals. Diagnosing silent errors is notoriously difficult due to the substantial semantic gap between the highlevel symptoms and the low-level root causes. We observe that diagnosis of silent errors can be effectively framed as a differential debugging problem by leveraging the existence of semantically correct reference implementations. We propose Ekka, an automated diagnosis system that identifies root causes by systematically aligning and comparing intermediate execution states between a target and a reference framework. We constructed a benchmark of real-world silent errors from popular serving frameworks, where Ekka shows 84% pass@1 diagnosis accuracy and 88% pass@5 diagnosis accuracy, outperforming state-of-the-art systems. Ekka also diagnoses 4 new silent errors from serving frameworks, all of which have been confirmed by the developers.

Abstract:
This work addresses the problem of learning the dynamics of highdimensional probability densities over time using unlabeled samples, without assuming access to trajectory information. We introduce two-parameter flows that learn only sampling-time transports from a base distribution to each marginal and then extract a physics-time velocity by regressing on coupled synthetic trajectories. We prove that the resulting physics-time dynamics are unique and inherit regularity from the sampling-time transports. Because we can build on standard, well-developed conditional flow matching techniques for learning the base-to-marginal transports, our approach scales to high dimensions and avoids per-step optimal-transport couplings, while allowing admissible non-gradient dynamics that can naturally explain rotational or circulating physics phenomena.

Abstract:
Diffusion models are effective for waypoint prediction in visual navigation, but standard sampling and test time guidance can produce unsafe or inefficient trajectories when updates drift off the training manifold. We propose Fisher Preserving Guidance with Outer Product Span Projection, a trainingfree inference method that avoids large Fisher drift associated with off-distribution actions while optimizing a task objective. Our method computes the Fisher-preserving update via a low-rank Jacobian factorization, requiring only a single backward pass per step and enabling real-time use. We further introduce Truncated Fisher Denoising Sensitivity as an uncertainty signal and use it for robust multi-sample action blending. Experiments on toy and realistic navigation benchmarks, including Maze2D with TSDF-based guidance, PushT with official Diffusion Policy weights, and visual navigation in simulation and on real robots, demonstrate consistent improvements in performance over strong diffusion-policy baselines without additional training.

Abstract:
Existing video object removal methods predominantly rely on diffusion models following a noiseto-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene's physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency. The project page is https://bridgeremoval.github.io/.

Abstract:
Effective exploration remains a key challenge in modelbased reinforcement learning (MBRL), especially in high-dimensional continuous control tasks where sample efficiency is critical. Recent work addresses this by using learned policies as proposal distributions for Model-Predictive Path Integral (MPPI) planning. Early approaches update the sampling policy independently of the planner, typically via deterministic policy gradients with entropy regularization. However, since the data distribution is induced by the MPPI planner, misalignment between the policy and planner degrades value estimation and long-term performance. To address this, recent methods explicitly align the policy with the planner by minimizing KL divergence to the planner distribution or by incorporating planner-guided regularization. In this work, we unify these approaches under the Policy Optimization–Model Predictive Control (PO-MPC) framework, a family of KL-regularized MBRL methods that treat the planner’s action distribution as a prior in policy optimization. We show how existing methods emerge as special cases of this family and explore previously unstudied variants. Experiments demonstrate that these variants yield significant performance gains, advancing the state of the art in MPPI-based RL.

Abstract:
Large language model (LLM) agents increasingly coordinate in multiagent systems, yet we lack an understanding of where and why cooperation failures may arise. In many real-world coordination problems, from knowledge sharing in organizations to code documentation, helping others carries negligible personal cost while generating substantial collective benefits. However, whether LLM agents cooperate when helping neither benefits nor harms the helper, despite being given explicit instructions to do so, remains unknown. We build a multi-agent setup designed to study cooperative behavior in a frictionless environment, removing all strategic complexity from cooperation. We find that capability does not predict cooperation: OpenAI o3 achieves only 17% of optimal collective performance while OpenAI o3-mini reaches 50%, despite identical instructions to maximize group revenue. Through a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, tracing their origins through agent reasoning analysis. Testing targeted interventions, we find that explicit protocols double performance for low-competence models, and tiny sharing incentives improve models with weak cooperation. These results demonstrate that even when helping is free and strategically trivial, many LLMs fail to follow the instructed cooperative objectives, requiring interventions based on specific failure modes. Our findings suggest that scaling intelligence alone will not solve coordination problems in multi-agent systems and will require deliberate cooperative design, even when helping others costs nothing.

Abstract:
Model routing chooses which language model to use for each query. By sending easy queries to cheaper models and hard queries to stronger ones, it can significantly reduce inference cost while maintaining high accuracy. However, most existing routers treat this as a fixed choice among a small set of models, which makes them hard to adapt to new models or changing budget constraints. In this paper, we propose SCOPE (Scalable and Controllable Outcome Performance Estimator), a routing framework that goes beyond model selection by predicting their cost and performance. Trained with reinforcement learning, SCOPE makes reasoningbased predictions by retrieving how models behave on similar problems, rather than relying on fixed model names, enabling it to work with new, unseen models. Moreover, by explicitly predicting how accurate and how expensive a model will be, it turns routing into a dynamic decision problem, allowing users to easily control the trade-off between accuracy and cost. Experiments show that SCOPE is more than just a cost-saving tool. It flexibly adapts to user needs: it can boost accuracy by up to25.7%when performance is the priority, or cut costs by up to95.1%when efficiency matters most.

Abstract:
Large language models (LLMs) have emerged as a standard paradigm for automated multilingual evaluation, yet exhibit systematic biases. In this paper, we identify ``translationese bias'', in which LLMs systematically favor machinetranslated text over human-authored references, and this bias is particularly pronounced in low-resource languages. We attribute this bias to spurious correlations with (a) strong latent manifold isomorphism with English and (b) high predictive confidence. To mitigate these issues, we present DIBJudge, a robust fine-tuning framework that decouples robust features and bias representations by explicitly isolating spurious attributes into a dedicated bias branch and penalizing mutual dependence to enforce disentanglement. In particular, we present a vector-quantized compression that ensures the robust representation retains minimal and sufficient judgment-critical information. Extensive evaluations on multilingual reward modeling benchmarks and a specially designed translationese bias evaluation suite demonstrate that DIBJudge outperforms strong baselines and effectively mitigates translationese bias.

Abstract:
The field of computer architecture, which bridges highlevel software abstractions and low-level hardware implementations, remains absent from current large language model (LLM) evaluations. To this end, we present QuArch (pronounced 'quark'), the first benchmark designed to facilitate the development and evaluation of LLM knowledge and reasoning capabilities specifically in computer architecture. QuArch provides a comprehensive collection of 2,671 expert-validated question-answer (QA) pairs covering various aspects of computer architecture, including processor design, memory systems, and interconnection networks. Our evaluation reveals that while frontier models possess domain-specific knowledge, they struggle with skills that require higher-order thinking in computer architecture. Frontier model accuracies vary widely (from 34% to 73%) on these advanced questions, highlighting persistent gaps in architectural reasoning across analysis, design, and implementation QAs. Furthermore, via fine-tuning we find that QuArch can translate to improved performance on a realistic memory hierarchy design task, resulting in up to 1.99x more area-efficient solutions and up to 40% more viable solutions overall. By holistically assessing fundamental skills, QuArch provides a foundation for building and measuring LLM capabilities that can accelerate innovation in computing systems.

Abstract:
In this paper, we study dynamic regret in unconstrained online convex optimization (OCO) with movement costs. Specifically, we generalize the standard setting by allowing the movement cost coefficients \lambda_t to vary arbitrarily over time. Our main contribution is a novel algorithm that establishes the first comparatoradaptive dynamic regret bound for this setting, guaranteeing \widetilde\mathcalO(\sqrt(1+P_T)(T+\sum_t \lambda_t)) regret, where P_T is the path length of the comparator sequence over T rounds. This recovers the optimal guarantees for both static and dynamic regret in standard OCO as a special case where \lambda_t=0 for all rounds. To demonstrate the versatility of our results, we consider two applications: OCO with delayed feedback and OCO with time-varying memory. We show that both problems can be translated into time-varying movement costs, establishing a novel reduction specifically for the delayed feedback setting that is of independent interest. A crucial observation is that the first-order dependence on movement costs in our regret bound plays a key role in enabling optimal comparator-adaptive dynamic regret guarantees in both settings.

Abstract:
Graph alignment, the problem of identifying corresponding nodes across multiple graphs, is fundamental to numerous applications. Most existing unsupervised methods embed node features into latent representations to enable crossgraph comparison without ground-truth correspondences. However, these methods suffer from two critical limitations: the degradation of node distinctiveness due to oversmoothing in GNN-based embeddings, and the misalignment of latent spaces across graphs caused by structural noise, feature heterogeneity, and training instability, ultimately leading to unreliable node correspondences. Our key insight is that rather than balancing these objectives purely in the feature space, a unified framework can simultaneously filter in both feature and map spaces. We propose a novel framework that employs a dual-pass encoder to inject high-frequency discriminability into node features, paired with a geometry-aware functional map module that operates on the correspondence itself. This functional map module learns bijective and isometric transformations that align latent spaces while acting as a low-pass filter on correspondences, enforcing smoothness and robustness as a structural prior in the map space. Extensive experiments on graph benchmarks demonstrate that our method consistently outperforms existing unsupervised alignment baselines, exhibiting superior robustness to structural inconsistencies and challenging alignment scenarios.

Abstract:
Recent video generation models have revealed the emergence of Chainof-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored due to the absence of a clearly defined visual reasoning starting point and interpretable intermediate states in the T2I generation process. To bridge this gap, we proposeCoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as output. To establish such explicit generation process, we curateCoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we enable an independent encoding operation for each frame. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.

Abstract:
Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in realtime applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, while this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) enables SAGE-RL to effectively incorporate SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.

Abstract:
Primal heuristics play a crucial role in quickly finding feasible solutions for NPhard integer linear programming (ILP). Although end-to-end learning-based primal heuristics (E2EPH) have recently been proposed, they are typically unable to independently generate feasible solutions. To address this challenge, we propose RL-SPH, a novel reinforcement learning-based start primal heuristic capable of independently generating feasible solutions, even for ILP involving non-binary integers. Empirically, RL-SPH rapidly obtains high-quality feasible solutions with a 100% feasibility rate, achieving on average a 39× lower primal gap and a 2.3× lower primal integral compared to existing start primal heuristics.

Abstract:
Flow matching (FM) has shown promising results in datadriven planning. However, it inherently lacks formal guarantees for ensuring state and action constraints, whose satisfaction is a fundamental and crucial requirement for the safety and admissibility of planned trajectories on various systems. Moreover, existing FM planners do not ensure the dynamical consistency, which potentially renders trajectories inexecutable. We address these shortcomings by proposing SAD-Flower, a novel framework for generating Safe, Admissible, and Dynamically consistent trajectories. Our approach relies on an augmentation of the flow with a virtual control input. Thereby, principled guidance can be derived using techniques from nonlinear control theory, providing formal guarantees for state constraints, action constraints, and dynamic consistency. Crucially, SAD-Flower operates without retraining, enabling test-time satisfaction of unseen constraints. Through extensive experiments across several tasks, we demonstrate that SAD-Flower outperforms various generative-model-based baselines in ensuring constraint satisfaction. Video and demos can be found at sadflowerplanning.github.io.

Abstract:
Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive textto-speech (TTS) systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text–emotion–misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework, and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text–emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module instead of the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.

Abstract:
Scaling testtime compute via long Chain-of-Thought unlocks remarkable gains in reasoning capabilities, yet it faces practical limits due to the linear growth of KV cache and quadratic attention complexity. In this paper, we introduce AccordionThinking, an end-to-end framework where LLMs learn to self-regulate the granularity of the reasoning steps through dynamic summarization. This mechanism enables a Fold inference mode, where the model periodically summarizes its thought process and discards former thoughts to reduce dependency on historical tokens. We apply reinforcement learning to incentivize this capability further, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows and eventually vanishes over the course of training. This phenomenon demonstrates that the model learns to encode essential reasoning information into compact summaries, achieving effective compression of the reasoning context. Our AccordionThinker demonstrates that with learned self-compression, LLMs can tackle complex reasoning tasks with minimal dependency token overhead without compromising solution quality, and it achieves a 3× throughput while maintaining accuracy on a 48GB GPU memory configuration, while the structured step summaries provide a human-readable account of the reasoning process.

Abstract:
Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as promising solutions, they suffer from prolonged retraining times and global propagation that creates optimization challenges, respectively. We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components. We apply coupled headwise structured pruning to Multi-Head Self-Attention modules, leveraging attention operation characteristics to enhance robustness. For Feed-Forward Networks (over 60% of FLOPs), we introduce Token Channel Selection (TCS) that enhances compression ratios while avoiding global propagation issues. Our analysis reveals TCS effectively filters redundant noise during selection. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves superior trade-offs between accuracy and efficiency, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52% accuracy (+1.64 %) with 39.4% FLOPs reduction. ToaSt transfers effectively to downstream tasks, achieving 52.2 versus 51.9 mAP on COCO object detection. Code and models will be released upon acceptance.

Abstract:
Algorithmic fairness is often studied in static or singleagent settings, yet many real-world decision-making systems involve multiple interacting entities whose multi-stage actions jointly influence long-term outcomes. Existing fairness methods applied at isolated decision points frequently fail to mitigate disparities that accumulate over time. Although recent work has modeled fairness as a sequential decision-making problem, it typically assumes centralized agents or simplified dynamics, limiting its applicability to complex social systems. We introduceMAFE, a suite ofMulti-Agent Fair Environmentsdesigned to simulate realistic, modular, and dynamic systems in which fairness emerges from the interplay of multiple agents. We demonstrate MAFEs in three domains—loan processing, healthcare, and higher education—supporting heterogeneous agents, configurable interventions, and fairness metrics. The environments are open-source and compatible with standard multi-agent reinforcement learning (MARL) libraries, enabling reproducible evaluation of fairness-aware policies. Through extensive experiments on cooperative use cases, we demonstrate how MAFE facilitates the design of equitable multi-agent algorithms and reveals critical trade-offs between fairness, performance, and coordination. MAFE provides a foundation for systematic progress in dynamic, multi-agent fairness research.

Abstract:
Large Language Models (LLMs) can improve via reinforcement learning by generating trajectories to discover better solutions. This exploration process represents an investment of finite GPU compute to obtain learning signals. However, current methods typically allocate a small, uniform budget to every task, which is inefficient and ineffective: easy tasks consistently succeed while difficult tasks consistently fail. For policy optimization algorithms such as Group Relative Policy Optimization (GRPO), both edge cases produce zero gradients, resulting in wasted computation. We address this by reframing exploration budget allocation as a resource optimization problem. Viewing each task's exploration as an `"item'' with a distinct "learning value'' and "computational cost'', we establish a connection to the classical knapsack problem and derive an optimal assignment rule. When applied to GRPO, our method increases the ratio of effective gradients by 40%. As a computational "free lunch'', it enables substantially larger budgets (e.g., 93) for challenging tasks—allocations that would be expensive under a uniform allocation framework. These efficiency gains translate to meaningful improvements on mathematical reasoning benchmarks, with average gains of 24 points and peak gains of 9 points. Notably, achieving comparable performance with traditional homogeneous allocation would require approximately 2× the computational resources.

Abstract:
Visionlanguage models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we proposeSSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives—such as predicting image rotation or reconstructing masked patches—into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks, with encouraging potentials on open-ended image-captioning scenarios and stronger resilience to visual corruptions. Through systematic ablations, we identify key factors—such as data volume, model scale, model choice, task difficulty, and semantic alignment with the target domain—that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework’s generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.

Abstract:
Iterative generative policies, such as diffusion models and flow matching, offer superior expressivity for continuous control but complicate Maximum Entropy Reinforcement Learning because their action logdensities are not directly accessible. To address this, we propose Field Least-Energy Actor-Critic (FLAC), a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field. Our key insight is to formulate policy optimization as a Generalized Schr\"odinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform). Under this view, the maximum-entropy principle emerges naturally as staying close to a high-entropy reference while optimizing return, without requiring explicit action densities. In this framework, kinetic energy serves as a physically grounded proxy for divergence from the reference: minimizing path-space energy bounds the deviation of the induced terminal action distribution. Building on this view, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic energy via a Lagrangian dual mechanism. Empirically, FLAC achieves superior or comparable performance on high-dimensional benchmarks relative to strong baselines, while avoiding explicit density estimation.

Abstract:
Learning to Defer (L2D) enables a classifier to abstain from predictions and defer to an expert, and has recently been extended to multiexpert settings. In this work, we show that multi-expert L2D is fundamentally more challenging than the single-expert case. With multiple experts, the classifier's underfitting becomes inherent, which seriously degrades prediction performance, whereas in the single-expert setting it arises only under specific conditions. We theoretically reveal that this stems from an intrinsic expert identifiability issue: learning which expert to trust from a diverse pool, a problem absent in the single-expert case and renders existing underfitting remedies failed. To tackle this issue, we propose PiCCE (Pick theConfident andCorrectExpert), a surrogate-based method that adaptively identifies a reliable expert based on empirical evidence. PiCCE effectively reduces multi-expert L2D to a single-expert–like learning problem, thereby resolving multi-expert underfitting. We further prove its statistical consistency and ability to recover class probabilities and expert accuracies. Extensive experiments across diverse settings, including real-world expert scenarios, validate our theoretical results and demonstrate improved performance.

Abstract:
Stochastic differential equations (SDEs) and stochastic partial differential equations (SPDEs) are fundamental for modeling stochastic dynamics across the natural sciences and modern machine learning. Learning their solution operators with deep learning models promises fast solvers and new perspectives on classical learning tasks. In this work, we build on Wiener–chaos expansions (WCE) to design neural operator (NO) architectures for SDEs and SPDEs: we project driving noise paths onto orthonormal Wick–Hermite features and use NOs to parameterize the resulting chaos coefficients, enabling reconstruction of full trajectories from noise in a single forward pass. We also make the underlying WCE structure explicit for multidimensional SDEs and semilinear SPDEs by showing the coupled deterministic ODE/PDE systems governing these coefficients. Empirically, we achieve competitive accuracy across several tasks, including standard SPDE benchmarks and SDE-based diffusion one-step image sampling, topological graph interpolation, financial extrapolation, parameter estimation, and manifold SDE flood forecasting. These results suggest WCE-based neural operators are a practical and scalable approach to learning SDE/SPDE solution operators across domains.

Abstract:
Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent posttraining methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture rich complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.

Abstract:
Differentially private (DP) training protects individual examples by adding noise to gradients, but the injected noise interacts nontrivially with adaptive optimizers. Recent DP methods temporally filter privatized gradients to reduce variance; however, filtering also changes the DP noise statistics seen by AdamW’s secondmoment accumulator. As a result, bias corrections derived for unfiltered DP noise (e.g., subtracting \sigma_w^2) can become miscalibrated when filtering is present. We propose FIBER, a DP optimizer designed for temporally filtered privatized gradients. FIBER (i) performs denoising in innovation space by filtering the residual stream and integrating it to form the filtered gradient estimate, (ii) decouples the two-point observation geometry from the innovation gain to enable independent tuning, and (iii) introduces a filter-aware second-moment calibration that subtracts the attenuated DP noise contribution A(\omega)\sigma_w^2, where A(\omega) is derived in closed form for the innovation filter and can be computed for general stable linear filters. Across vision and language benchmarks, FIBER consistently demonstrates substantial improvements in the performance of DP optimizers, surpassing state-of-the-art results under equivalent privacy constraints on multiple tasks.

Abstract:
We present SWEBench Pro, a comprehensive benchmark designed to evaluate software engineering capabilities through complex, realistic programming challenges. This benchmark extends beyond traditional algorithmic problems to encompass the full spectrum of professional software development tasks. The dataset comprises 1,865 problems sourced from 41 active software engineering repositories, spanning 123 unique programming languages and various application domains. The benchmark is structured into public and private components, with public access to problems from 11 repositories and private evaluation sets from 12 repositories across 4 distinct problem categories. SWE-Bench Pro addresses limitations of existing evaluation frameworks by incorporating problems that reflect real-world software engineering scenarios, including substantial codebases, complex enterprise applications, and multi-file projects requiring sophisticated reasoning and code modification skills. Problems range from early-stage startup environments to enterprise-level applications, with the private commercial set remaining inaccessible to maintain evaluation integrity while enabling public access to representative problems for professional development. Our evaluation methodology employs diverse coding approaches and models under controlled conditions, ensuring robust performance assessment across multiple programming paradigms. Results demonstrate significant performance variations across different problem categories, with traditional algorithmic challenges showing notably higher success rates compared to complex, multi-file engineering tasks. The benchmark reveals substantial gaps in current capabilities for handling real-world software engineering scenarios, particularly in areas requiring deep contextual understanding, cross-file reasoning, and integration with existing large-scale systems. This work contributes a more comprehensive and realistic evaluation framework for assessing software engineering capabilities, providing insights into current limitations and establishing a foundation for future development in automated software engineering tools and methodologies.

Abstract:
We introduce Native Parallel Reasoner (NPR), a teacherfree framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from ``cold-start'' format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6×. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.

Abstract:
Muon updates matrix parameters via the matrix sign of the gradient and has shown strong empirical gains, yet its dynamics and scaling behavior remain unclear in theory. We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query–answer pairs, with and without label noise. In this setting, we show that Gradient Descent (GD) learns frequency components at highly imbalanced rates, leading to slow convergence bottlenecked by lowfrequency components. In contrast, the Muon optimizer mitigates this imbalance, leading to faster and more uniform progress. Specifically, in the noiseless case, Muon achieves an exponential speedup over GD; in the noisy case with a power-decay frequency spectrum, we derive Muon's optimization scaling law and demonstrate its superior scaling efficiency over GD. Furthermore, we show that Muon can be interpreted as an implicit matrix preconditioner arising from adaptive task alignment and block-symmetric gradient structure. In contrast, the preconditioner with coordinate-wise sign operator could match Muon under oracle access to unknown task representations, which is infeasible for SignGD in practice. Experiments on synthetic long-tail classification and LLaMA-style pre-training corroborate the theory.

Abstract:
Time series data is fundamental to decisionmaking across many domains including healthcare, finance, power systems, and logistics. However, analyzing this data correctly often requires incorporating unstructured contextual information, answering domain-specific questions, and generating natural language explanations – capabilities that traditional time series models lack. While Large Language Models (LLMs) excel at contextual reasoning and knowledge integration, they struggle with numerical time series due to inefficient text-based representations and limited exposure to numerical data during pretraining. We address this gap by augmenting an LLM with specialized time series perception through a patch-based encoder-decoder architecture. We train this Time Series-augmented LLM (TsLLM) on a large corpus of over 20 billion tokens of interleaved time series and text spanning diverse tasks: forecasting with contextual information, question-answering, anomaly detection, classification, report generation, and more, all unified as next token prediction. This training enables TsLLM to leverage both its language understanding and newly acquired temporal reasoning capabilities. While not designed to surpass specialized models on traditional benchmarks, TsLLM demonstrates strong performance on tasks requiring the integration of time series analysis with natural language – capabilities that existing approaches cannot provide. It also exhibits strong zero-shot and few-shot performance, showing it can adapt to new data without additional training.

Abstract:
Despite recent algorithmic advances, we still lack principled ways to leverage the welldocumented rescaling symmetries in ReLU neural network parameters. While two properly rescaled weights implement the same function, the training dynamics can be dramatically different. To offer a fresh perspective on exploiting this phenomenon, we build on the recent path-lifting framework, which provides a compact factorization of ReLU networks. We introduce a geometrically motivated criterion to rescale neural network parameters which minimization leads to a conditioning strategy that aligns a kernel in the path-lifting space with a chosen reference. We derive an efficient algorithm to perform this alignment. In the context of random network initialization, we analyze how the architecture and the initialization scale jointly impact the output of the proposed method. Numerical experiments illustrate its potential to speed up training.

Abstract:
Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many realworld problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines. Crucially, our extensive empirical analysis reveals a key trade-off: FKL demonstrates superior sample efficiency and faster initial convergence, whereas RKL ensures greater training stability and higher asymptotic performance.

Abstract:
Parameterfree stochastic optimization aims to design algorithms that are agnostic to the underlying problem parameters while still achieving convergence rates competitive with optimally tuned methods. While some parameter-free methods do not require the specific values of the problem parameters, they still rely on prior knowledge, such as the lower or upper bounds of them. We refer to such methods as "partially parameter-free". In this work, we target achieving "fullyparameter-free" methods, i.e., the algorithmic inputs do not need to satisfy anyunverifiablecondition related to the true problem parameters. We propose a general and powerfulgrid searchframework, named GRASP, with a novelself-boundinganalysis technique that effectively determines the parameter search ranges, in contrast to previous work. Our method demonstrates generality in: (i) the non-convex case, where we propose a fully parameter-free method that achieves near-optimal convergence rate, up to logarithmic factors; (ii) the convex case, where our parameter-free methods are competitive with strong performance in terms of acceleration and universality. Finally, we contribute a sharper guarantee for the model ensemble, a final step of the grid search framework, under interpolated variance characterization.

Abstract:
Reliable Dockerbased environment construction is a dominant bottleneck for scaling execution-grounded training and evaluation of software engineering agents. We introduce DockSmith, a specialized agentic Docker builder designed to address this challenge. DockSmith treats environment construction not merely as a preprocessing step, but as a core agentic capability that exercises long-horizon tool use, dependency reasoning, and failure recovery, yielding supervision that transfers beyond Docker building itself. DockSmith is trained on large-scale, execution-grounded Docker-building trajectories produced by a SWE-Factory--style pipeline augmented with a loop-detection controller and a cross-task success memory. Training a 30B-A3B model on these trajectories achieves open-source state-of-the-art performance on Multi-Docker-Eval, with 39.72% Fail-to-Pass and 58.28% Commit Rate. Moreover, DockSmith improves out-of-distribution performance on SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0, demonstrating the broader agentic benefits of environment construction. Our model and Docker-building trajectories are publicly available at https://huggingface.co/collections/8sj7df9k8m5x8/docksmith.

Abstract:
The evaluation of large language models relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics, but can obscure (i) particular subareas where the models are weak ("model gaps") (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). To automatically uncover both types of gaps, we propose a simple new method using concept activations from sparse autoencoders, to identify fine-grained gaps on a per-concept basis. The method also benefits from grounding evaluation in the model's internal representations, as well as easy comparison across benchmarks. We applied the method to two popular open-source models and ten benchmarks, as illustrative examples. As validation of the approach, we found that our automatic, unsupervised method was able to recover model gaps that have been previously documented in the literature (e.g. relating to sycophancy), in addition to identifying novel model gaps. We were also able to automatically uncover benchmark gaps: core concepts that should fall within the scope of a given benchmark. Our ``competency gaps'' method can be used to complement existing benchmarks, by providing a concept-level decomposition of model behavior, and by helping benchmark developers iterate upon benchmark design. Code is available at [anonymized].

Abstract:
Code complexity metrics such as cyclomatic complexity have long been used to assess software quality and maintainability. With the rapid advancement of large language models (LLMs) on code understanding and generation tasks, an important yet underexplored question arises: do these traditional complexity metrics meaningfully characterize the difficulty LLMs experience when processing code? In this work, we empirically demonstrate that, after controlling for code length, classical metrics exhibit no consistent correlation with LLM performance, revealing a fundamental mismatch with modelperceived difficulty. To address this gap, we propose LM-CC, a novel code complexity metric designed from the perspective of LLMs. The core premise of LM-CC is that LLM-perceived difficulty is driven by the nonlinearity of program semantics. Accordingly, we decompose programs into semantic units based on entropy, organize these units into a compositional hierarchy, and quantify complexity as a principled aggregation of compositional level and branching-induced divergence, capturing cumulative model uncertainty during code processing. Our extensive experiments show that LM-CC not only correlates more strongly with LLM performance than traditional metrics but also that lowering it directly enhances task performance. The source code is available at: https://anonymous.4open.science/r/lm-cc-7EE6.

Abstract:
Inferencetime scaling offers a versatile paradigm for aligning visual generative models with downstream objectives without parameter updates. However, existing approaches that optimize the high-dimensional initial noise suffer from severe inefficiency, as many search directions exert negligible influence on the final generation. We show that this inefficiency is closely related to a spectral bias in generative dynamics: model sensitivity to initial perturbations diminishes rapidly as frequency increases. Building on this insight, we propose Spectral Evolution Search (SES), a plug-and-play framework for initial noise optimization that executes gradient-free evolutionary search within a low-frequency subspace. Theoretically, we derive the Spectral Scaling Prediction from perturbation propagation dynamics, which explains the systematic differences in the impact of perturbations across frequencies. Extensive experiments demonstrate that SES significantly advances the Pareto frontier of generation quality versus computational cost, consistently outperforming strong baselines under equivalent budgets. Our code is available at \urlhttps://anonymous.4open.science/r/Spectral-Evolution-Search-66DB.

Abstract:
We propose S3 (Specialization, Selection, Sparsification), a framework that rethinks multimodal learning through a structural perspective. Instead of encoding all signals into a fixed embedding, S3 decomposes multimodal inputs into semantic experts and selectively routes them for each task. Specialization forms conceptlevel experts in a shared latent space, Selection adapts routing for task-specific needs, and Sparsification prunes low-utility paths to yield compact, information-minimal representations. Across four MultiBench benchmarks, S3 improves accuracy and exhibits consistent sparsity-performance dynamics, exhibiting a reverse U-shaped trend, with performance peaking at intermediate sparsity. These results suggest that structuring multimodal representations as selectable semantic components provides a practical and principled alternative to contrastive learning or InfoMax-driven approaches.

Abstract:
As scaling laws push the training of frontier large language models (LLMs) toward ever larger data requirements, training pipelines are approaching a regime where much of the publicly available online text may be consumed. At the same time, widespread LLM usage increases the volume of machinegenerated content on the web; together, these trends raise the likelihood of generated content re-entering future training corpora, increasing the associated risk ofmodel collapse. In practice, model developers address this concern through data cleaning, watermarking, synthetic-data policies, or by simply ignoring the issue. However, the problem of model collapse in generative models has not been examined from a learning-theoretic perspective: we study it through the theoretical lens of thelanguage generation in the limitframework, introducing areplayadversary that augments the example stream with the generator's own past outputs. Our main contribution is a fine-grained learning-theoretic characterization of when replay fundamentally limits generation: while replay is benign for the strongest notion of uniform generation, it provably creates separations for the weaker notions of non-uniform generation and generation in the limit. Interestingly, our positive results mirror heuristics widely used in practice, such as data cleaning, watermarking, and output filtering, while our separations show when these ideas can fail.

Abstract:
We develop a theory of intelligent agency grounded in probabilistic modeling for neural models. Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. Our framework admits recursive structure via cloning invariance, continuity, and openness, while tiltbased analysis rules out trivial duplication. Finally, we formalize an agentic alignment phenomenon in LLMs using our theory: eliciting a benevolent persona ("Luigi'') induces an antagonistic counterpart ("Waluigi''), while a manifest-then-suppress Waluigi strategy yields strictly larger first-order misalignment reduction than pure Luigi reinforcement alone. These results clarify how developing a principled mathematical framework for how subagents can coalesce into coherent higher-level entities provides novel implications for alignment in agentic AI systems.

Abstract:
Large Reasoning Models (LRMs) excel at solving complex problems by explicitly generating a reasoning trace before deriving the final answer. However, these extended generations incur substantial memory footprint and computational overhead, bottlenecking LRMs' efficiency. This work uses attention maps to analyze the influence of reasoning traces and uncover an interesting phenomenon: only some decisioncritical tokens in a reasoning trace steer the model toward the final answer, while the remaining tokens contribute negligibly. Building on this observation, we propose Dynamic Thinking-Token Selection (DynTS). This method identifies decision-critical tokens and retains only their associated Key-Value (KV) cache states during inference, evicting the remaining redundant entries to optimize efficiency. Across six benchmarks, \toolname surpasses the state-of-the-art KV cache compression methods, improving Pass@1 by 2.6% under the same budget. Compared to vanilla Transformers, it reduces inference latency by 1.84–2.62× and peak KV-cache memory footprint by 3.32–5.73× without compromising LRMs' reasoning performance. The code is available at the anonymous link.\footnotehttps://anonymous.4open.science/r/DynTS-2D0D

Abstract:
Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing private, toxic, illegal, or copyrighted content. Despite rapid progress, in this work we show that almost all existing unlearning methods fail to achieve true forgetting in practice. Specifically, while evaluations of these `unlearned' models under deterministic (greedy) decoding often suggest successful knowledge removal using standard benchmarks (as has been done in the literature), we show that sensitive information reliably resurfaces when models are sampled with standard probabilistic decoding. To rigorously capture this vulnerability, we introduce \textttleak@k, a new metaevaluation metric that quantifies the likelihood of forgotten knowledge reappearing when generating k samples from the model under realistic decoding strategies. Using three widely adopted benchmarks, TOFU, MUSE, and WMDP, we conduct the first large-scale, systematic study of unlearning reliability using our newly defined \textttleak@k metric. Our findings demonstrate that knowledge leakage persists across methods and tasks, underscoring that current state-of-the-art unlearning techniques provide only limited forgetting and highlighting the urgent need for more robust approaches to LLM unlearning. We propose an algorithm, termed Robust Unlearning under LEak@k metric (\textttRULE), which serves as an initial step toward addressing this concern. We demonstrate that \textttRULE provides an unlearned model for TOFU benchmark with no information leakage for a large number of generation samples.

Abstract:
This paper studies fullbody 3D human motion recovery from head-mounted device signals. Existing diffusion-based methods often rely on global distribution matching, leading to local joint reconstruction errors. We propose MotionGRPO, a novel framework leveraging reinforcement learning post-training to inject fine-grained guidance into the diffusion process. Technically, we model diffusion sampling as a Markov decision process optimized via Group Relative Policy Optimization (GRPO). To this end, we introduce a hybrid reward mechanism that combines a learned conditioned perceptual model for global visual plausibility and explicit constraints for local joint precision. Our key technical insight is that policy optimization in diffusion-based recovery suffers from vanishing gradients due to limited intra-group sample diversity. To address this, we further introduce a noise-injection strategy that explicitly increases sample variance and stabilizes learning. Extensive experiments demonstrate that MotionGRPO achieves state-of-the-art performance with superior visual fidelity.

Abstract:
LiDAR point cloud compression is vital for autonomous systems to handle massive data from highresolution sensors. While learned entropy modeling built upon octree structures yields high compression gains, it faces two critical bottlenecks: 1) prohibitive latency, particularly during decoding, caused by causal, multi-stage context modeling; and 2) a rigid performance-latency trade-off, preventing a single model from adapting to varying constraints. These limitations stem from the tight coupling between context aggregation backbone and probability prediction. To address this, we propose PACE, a new framework that reformulates ancestral context aggregation as a non-causal backbone and confines causality to a lightweight, stage-scalable predictor, eliminating repetitive backbone executions and reducing computational overhead. The predictor supports an arbitrary number of prediction stages, supporting seamless adaptation across diverse performance-latency trade-offs without reloading parameters. Experiments demonstrate that PACE sets a new state-of-the-art in compression efficiency, achieving notable BD-BR savings and reducing decoding latency by over 90% in autoregressive mode, highly attractive for practical applications.

Abstract:
Generalized Bayesian Inference (GBI) tempers a loss with a temperature \beta>0 to mitigate overconfidence and improve robustness under model misspecification, but existing GBI methods typically rely on costly MCMC or SDEbased samplers and must be re-run for each new dataset and each \beta-value. We give the first fully amortized variational approximation to the tempered posterior family p_\beta(\theta\! \mid\! x) \propto \pi(\theta)p(x\! \mid\! \theta)^\beta by training a single \beta-conditioned neural posterior estimator q_\phi(\theta \mid x, \beta) that enables sampling in a single forward pass, without simulator calls or inference-time MCMC. We introduce two complementary training routes: (i) synthesizes off-manifold samples (\theta, x) ～ \pi(\theta)p(x \mid \theta)^\beta and (ii) reweights a fixed base dataset \pi(\theta)p(x \mid \theta) using self-normalized importance sampling (SNIS), where we show that the SNIS-weighted objective provides a consistent forward-KL fit to the tempered posterior with finite weight variance. Across four standard simulation-based inference (SBI) benchmarks—including the chaotic Lorenz–96 system—our \beta-amortized estimator achieves competitive posterior approximations, in standard two-sample metrics, with non-amortized MCMC-based power-posterior samplers over a wide range of temperatures.

Abstract:
Despite rapid progress in autoregressive video diffusion, we identify an emerging system–algorithm bottleneck that limits both deployability and generation quality: KV-cache memory. In auto-regressive video generation models, the KV-cache grows with generation history and quickly dominates GPU memory (often ≥30 GB), preventing deployment on widely available hardware. More critically, memory-bounded KV budgets constrain the effective working memory, directly degrading long-horizon consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training-free KV-cache quantization framework for auto-regressive video diffusion models. QVG exploits video’s inherent spatiotemporal redundancy via Semantic-Aware Smoothing, producing low-magnitude, quantization-friendly residuals. Building on this, QVG introduces Progressive Residual Quantization, a coarse-to-fine multi-stage scheme that further reduces quantization error while enabling a smooth quality–memory trade-off. Across LongCat-Video, HY-WorldPlay, and Self-Forcing, QVG establishes a new Pareto frontier between quality and memory efficiency, reducing KV memory by up to 7.0× with less than 4% end-to-end latency overhead, while delivering significantly better generation quality than existing baselines.

Abstract:
Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLMbased agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduceMARS(ModularAgent withReflectiveSearch), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.

Abstract:
Code performance optimization is paramount in realworld software engineering and critical for production-level systems. While Large Language Models (LLMs) have demonstrated impressive capabilities in code generation and bug fixing, their proficiency in enhancing code performance at the repository level remains largely unexplored. To address this gap, we introduce SWE-Perf, the first benchmark specifically designed to systematically evaluate LLMs on code performance optimization tasks within authentic repository contexts. SWE-Perf comprises 140 carefully curated instances, each derived from performance-improving pull requests from popular GitHub repositories. Each benchmark instance includes the relevant codebase, target functions, performance-related tests, expert-authored patches, and executable environments. Through a comprehensive evaluation of representative methods that span file-level and repo-level approaches (e.g., Agentless and OpenHands), we reveal a substantial capability gap between existing LLMs and expert-level optimization performance, highlighting critical research opportunities in this emerging field.

Abstract:
Discrete Diffusion Large Language Models (dLLMs) have recently emerged as a promising nonautoregressive paradigm, offering faster inference while achieving strong performance in code generation and mathematical reasoning tasks. In this work, we show that dLLMs’ decoding mechanism not only improves utility but also enables effective model attribution: by analyzing a response’s decoding trajectory, we can identify its source model and help mitigate risks from model misuse. A key challenge is the diversity of attribution scenarios, ranging from distinguishing different models to identifying different checkpoints or backups of the same model. To ensure broad applicability, we focus on two core questions: what information to extract from the decoding trajectory, and how to use it effectively. We first observe that per-step model confidence is ineffective, as the bidirectional nature of dLLMs causes mutual influence among decoded tokens, leading to highly redundant confidence signals that obscure structural information about decoding order and dependencies. To overcome this, we propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and reveals model-specific behaviors. Furthermore, to fully leverage the extracted structure, we propose Gaussian-Trajectory Attribution (GTA), which fits a cell-wise Gaussian distribution at each decoding position for each model and uses log-likelihood differences between trajectories as the attribution score. Extensive experiments across diverse models, datasets and different model access assumptions validate the effectiveness of our approach.

Abstract:
Accurate 6D pose estimation and tracking are core capabilities for physical AI systems, yet realworld deployment remains brittle and labor-intensive. Many pipelines rely on CAD models, manual masking, or per-object adaptation, and still fail under occlusion or fast motion without a principled way to recognize failure. We propose STORM, a unified framework for reference-conditioned 6D tracking with minimal manual input and improved robustness. STORM introduces two mechanisms: (i) Hierarchical Spatial Fusion Attention (HSFA), which performs latent manifold alignment between reference and query features, guided by vision-language semantic conditioning to resolve instance ambiguities; and (ii) an energy-based failure detector to detect drift and trigger automatic re-initialization, yielding a self-healing tracker. Experiments on LM-O and YCB-Video show that STORM improves annotation-free pose tracking accuracy over strong baselines and recovers reliably from severe occlusions and rapid viewpoint changes with minimal overhead.

Abstract:
Predicting transcriptional responses to genetic perturbations is a central problem in functional genomics. In practice, perturbation responses are rarely geneindependent but instead manifest as coordinated, program-level transcriptional changes among functionally related genes. However, most existing methods do not explicitly model such coordination, due to gene-wise modeling paradigms and reliance on static biological priors that cannot capture dynamic program reorganization. To address these limitations, we propose scBIG, a module-inductive perturbation prediction framework that explicitly models coordinated gene programs. scBIG induces coherent gene programs from data via Gene-Relation Clustering, captures inter-program interactions through a Gene-Cluster-Aware Encoder, and preserves modular coordination using structure-aware alignment objectives. These structured representations are then modeled using conditional flow matching to enable flexible and generalizable perturbation prediction. Extensive experiments on multiple single-cell perturbation benchmarks show that scBIG consistently outperforms state-of-the-art methods, particularly on unseen and combinatorial perturbation settings, achieving an average improvement of 6.7% over the strongest baselines.

Abstract:
Joint Embedding Predictive Architectures (JEPA) offer a scalable paradigm for selfsupervised learning by predicting latent representations rather than reconstructing high-entropy observations. However, existing formulations rely on deterministic regression objectives, which masks probabilistic semantics and limits its applicability in stochastic control. We introduce \emphVariational JEPA (VJEPA), a probabilistic generalization that learns a predictive distribution over future latent states via a variational objective. We show that VJEPA unifies representation learning with Predictive State Representations (PSRs) and Bayesian filtering, establishing that sequential modeling does not require autoregressive observation likelihoods. Theoretically, we prove that VJEPA representations serve as sufficient information states for optimal control without pixel reconstruction, while providing formal guarantees for collapse avoidance. We further propose \emphBayesian JEPA (BJEPA), which extends the VJEPA framework to factorize predictive belief into a learned dynamics expert and a modular prior expert, enabling zero-shot task transfer and constraints satisfactions (e.g., goals, physics) via a Product of Experts. Empirically, VJEPA filters out high-variance nuisance distractors that cause representation collapse in generative baselines. By enabling principled uncertainty estimation (e.g. constructing credible intervals via sampling) while remaining likelihood-free regarding observations, VJEPA provides a foundational framework for scalable, robust, uncertainty-aware planning in high-dimensional, noisy environments.

Abstract:
Precision oncology requires forecasting clinical events and trajectories, yet modeling sparse, multimodal clinical time series remains a critical challenge. We introduce TwinWeaver, an open-source framework that serializes longitudinal patient histories into text, enabling unified event prediction as well as forecasting with large language models, and use it to build Genie Digital Twin (GDT) on 93,054 patients across 20 cancer types. In benchmarks, GDT significantly reduces forecasting errors, achieving a median Mean Absolute Scaled Error (MASE) of 0.87 compared to 0.97 for the strongest time-series baseline (p<0.001). Furthermore, GDT improves risk stratification, achieving an average concordance index (C-index) of 0.703 across survival, progression, and therapy switching tasks, surpassing the best baseline of 0.662. GDT also generalizes to out-of-distribution clinical trials, matching trained baselines at zero-shot and surpassing them with fine-tuning, achieving a median MASE of 0.75–0.88 and outperforming the strongest baseline in event prediction with an average C-index of 0.672 versus 0.648. Finally, TwinWeaver enables an interpretable clinical reasoning extension, providing a scalable and transparent foundation for longitudinal clinical modeling.

Abstract:
Understanding the geometry of learned distributions is fundamental to improving and interpreting diffusion models, yet systematic tools for exploring their landscape remain limited. Standard latentspace interpolations fail to respect the structure of the learned distribution, often traversing low-density regions. We introduce a framework based on the string method that computes continuous paths between samples by evolving curves under the learned score function. Operating on pretrained models without retraining, our approach interpolates between three regimes: pure generative transport, which yields continuous sample paths; gradient-dominated dynamics, which recover minimum energy paths (MEPs); and finite-temperature string dynamics, which compute principal curves---self-consistent paths that balance energy and entropy. We demonstrate that the choice of regime matters in practice. For image diffusion models, MEPs contain high-likelihood but unrealistic ``cartoon'' images, confirming prior observations that likelihood maxima appear unrealistic; principal curves instead yield realistic morphing sequences despite lower likelihood. For protein structure prediction, our method computes transition pathways between metastable conformers directly from models trained on static structures, yielding paths with physically plausible intermediates. Together, these results establish the string method as a principled tool for probing the modal structure of diffusion models---identifying modes, characterizing barriers, and mapping connectivity in complex learned distributions.

Abstract:
One of the most common machine learning setups is logistic regression. In many classification models, including neural networks, the final prediction is obtained by applying a logistic link function to a linear score. In binary logistic regression, the feedback can be either soft labels, corresponding to the true conditional probability of the data (as in distillation), or sampled hard labels (taking values \pm 1). We point out a fundamental problem that arises even in a particularly favorable setting, where the goal is to learn a noisefree soft target of the form \sigma(\mathbfx^\top\mathbfw^\star). In the over-constrained case (i.e. the number of samples n exceeds the input dimension d) with examples (\mathbfx_i,\sigma(\mathbfx_i^\top\mathbfw^\star)), it is sufficient to recover \mathbfw^\star and hence achieve the Bayes risk. However, we prove that when the examples are labeled by hard labels y_i sampled from the same conditional distribution \sigma(\mathbfx_i^\top\mathbfw^\star) and \mathbfw^\star is s-sparse, then rotation-invariant algorithms are provably suboptimal: they incur an excess risk \Omega\left(\fracd-1n\right), while there are simple non-rotation invariant algorithms with excess risk O(\fracs\log dn). The simplest rotation invariant algorithm is gradient descent on the logistic loss (with early stopping). A simple non-rotation-invariant algorithm for sparse targets that achieves the above upper bounds uses gradient descent on the weights u_i,v_i, where now the linear weight w_i is reparameterized as u_iv_i. We conclude that rotation-invariant algorithms cannot efficiently learn noise-free sparse targets from hard labels.

Abstract:
How do latent and inference time computations enable large language models (LLMs) to solve multistep reasoning? We introduce a framework for tracing and steering algorithmic primitives that underlie model reasoning. Our approach links reasoning traces to internal activations and evaluates algorithmic primitives by injecting them into residual streams and measuring their effect on reasoning steps and task performance. We consider four benchmarks: Traveling Salesperson Problem (TSP), 3SAT, AIME, and graph navigation. We operationalize primitives by clustering activations and annotating their matched reasoning traces using an automated LLM pipeline. We then apply function vector methods to derive primitive vectors as reusable compositional building blocks of reasoning. Primitive vectors can be combined through addition, subtraction, and scalar operations, revealing a geometric logic in activation space. Cross-task and cross-model evaluations (Phi-4, Phi-4-Reasoning, Llama-3-8B) show both shared and task-specific primitives. Notably, comparing Phi-4 with its reasoning-finetuned variant highlights compositional generalization after finetuning: Phi-4-Reasoning exhibits more systematic use of verification and path-generation primitives. Injecting the associated primitive vectors in Phi-4 induces behavioral hallmarks associated with Phi-4-Reasoning. Together, these findings demonstrate that reasoning in LLMs may be supported by a compositional geometry of algorithmic primitives, that primitives transfer cross-task and cross-model, and that reasoning finetuning strengthens algorithmic generalization across domains.

Abstract:
Multitask neural routing solvers have emerged as a promising paradigm for their ability to solve multiple vehicle routing problems (VRPs) using a single model. However, existing neural solvers typically rely on predefined problem constraints or require per-problem fine-tuning, which substantially limits their zero-shot generalization ability to unseen VRP variants. To address this critical bottleneck, we propose URS, a unified neural routing solver that achieves zero-shot generalization across a wide range of unseen VRPs with a single model. We propose a unified data representation (UDR) that replaces problem enumeration with data unification, thereby broadening the problem coverage and reducing reliance on domain expertise. In addition, we introduce a Mixed Bias Module (MBM) during encoding to improve node embeddings, which efficiently captures multiple priors inherent to various problems. On top of the UDR, we develop a problem-conditioned parameter generator to further improve zero-shot generalization. Extensive experiments show that URS consistently produces high-quality solutions for 110 VRP variants (including 99 unseen variants) while demonstrating impressive scalability to large-scale instances with up to 7000 nodes. To the best of our knowledge, URS is the first neural solver to handle over 100 VRP variants with a single model.

Abstract:
Recent posttraining quantization (PTQ) methods have adopted block rotations to diffuse outliers prior to rounding. While this reduces the overhead of full-vector rotations, the effect of block structure on outlier suppression remains poorly understood. To fill this gap, we present the first systematic, non-asymptotic analysis of outlier suppression for block Hadamard rotations. Our analysis reveals that outlier suppression is fundamentally limited by the geometry of the input vector. In particular, post-rotation outliers are deterministically minimized when the pre-rotation \ell_1 norm mass is evenly distributed across blocks. Guided by these insights, we introduce MixQuant, a block rotation-aware PTQ framework that redistributes activation mass via permutations prior to rotation. We propose a greedy mass diffusion algorithm to calibrate permutations by equalizing the expected blockwise \ell_1 norms. To avoid adding inference overhead, we identify permutation-equivariant regions in transformer architectures to merge the resulting permutations into model weights before deployment. Experiments show that MixQuant consistently improves accuracy across all block sizes, recovering up to 90% of the full-vector rotation perplexity when quantizing Llama3 1B to INT4 with block size 16, compared to 46% without permutations.

Abstract:
We study the basic task of mean estimation in the presence of meanshift contamination. In the mean-shift contamination model, an adversary is allowed to replace a small constant fraction of the clean samples by samples drawn from arbitrarily shifted versions of the base distribution. Prior work characterized the sample complexity of this task for the special cases of the Gaussian and Laplace distributions. Specifically, it was shown that consistent estimation is possible in these cases, a property that is provably impossible in Huber's contamination model. An open question posed in earlier work was to determine the sample complexity of mean estimation in the mean-shift contamination model for general base distributions. In this work, we study and essentially resolve this open question. Specifically, we show that, under mild spectral conditions on the characteristic function of the (potentially multivariate) base distribution, there exists a sample-efficient algorithm that estimates the target mean to any desired accuracy. We complement our upper bound with a qualitatively matching sample complexity lower bound. Our techniques make critical use of Fourier analysis, and in particular introduce the notion of a Fourier witness as an essential ingredient of our upper and lower bounds.

Abstract:
Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enablingde novoalgorithm discovery without relying on humanwritten code. However, applying this paradigm to Transformer is hindered by representation entanglement (e.g., superposition), where entangled features encoded in overlapping directions obstruct the recovery of symbolic expressions. We propose the Discrete Transformer, an architecture explicitly designed to bridge the gap between continuous representations and discrete symbolic logic. By injecting discreteness through temperature-annealed sampling, our framework effectively leverages hypothesis testing and symbolic regression to extract human-readable programs. Empirically, the Discrete Transformer achieves performance comparable to RNN-based methods while extending interpretability to continuous variable domains, and the annealing dynamics exhibit a clear exploration-to-exploitation transition. Finally, we show that architectural inductive biases provide fine-grained control over synthesized programs, establishing the Discrete Transformer as a robust framework for demonstration-free algorithm discovery and Transformer interpretability.

Abstract:
Conditional independence (CI) is central to causal inference, feature selection, and graphical modeling, yet it is untestable in many settings without additional assumptions. Existing CI tests often rely on restrictive structural conditions, limiting their validity. Kernel methods using partial covariance operators offer a more principled approach but suffer from limited adaptivity, slow convergence, and poor scalability. In this work, we explore whether representation learning can help address these limitations. Specifically, we focus on representations derived from the singular value decomposition of partial covariance operators and use them to construct a simple test statistic. We also introduce a bilevel contrastive algorithm to learn these representations. Our theory links representation learning error to test performance and establishes asymptotic validity and power guarantees. Experiments on real and synthetic data suggest that this approach offers a principled and statistically grounded path toward scalable CI testing, bridging kernel-based theory with modern representation learning.

Abstract:
Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger flipflop oscillations, where tokens are remasked and later restored unchanged. This behaviour slows inference in two ways: remasking verified positions weakens the conditioning context for parallel drafting, and repeated remask cycles consume the revision budget with little net progress. We propose COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting within a single forward pass. COVER constructs two attention views via KV cache override: selected seeds are masked for verification, while their cached key value states are injected for all other queries to preserve contextual information, with a closed form diagonal correction preventing self leakage at the seed positions. COVER further prioritises seeds using a stability aware score that balances uncertainty, downstream influence, and cache drift, and it adapts the number of verified seeds per step. Across benchmarks, COVER markedly reduces unnecessary revisions and yields faster decoding while preserving output quality.

Abstract:
Modern highthroughput biological datasets containing thousands of perturbations enable large-scale discovery of causal graphs that represent regulatory interactions between genes. Differentiable causal graphical models and regression-based methods have been developed to infer gene regulatory networks (GRNs) from interventional datasets. However, existing approaches fail to capture the non-linear dynamics of biological processes such as cellular differentiation. To address this limitation, we propose PerturbODE, a novel framework that employs interpretable neural ordinary differential equations (neural ODEs) to model cell state trajectories under perturbations and derive the underlying causal GRN from the neural ODE parameters, enabling downstream simulation of unseen genetic interventions. The GRN is encoded via a single-hidden-layer feedforward network, implicitly grouping genes into interpretable co-regulated modules. We demonstrate PerturbODE's efficacy in GRN inference and extension to perturbation response prediction across both simulated and real overexpression datasets.

Abstract:
Recurrent and deep neural networks (RNNs/DNNs) are cornerstone architectures in machine learning. Remarkably, RNNs differ from DNNs only by weight sharing, as can be shown through unrolling in time. How does this structural similarity fit with the distinct functional properties these networks exhibit? To address this question, we here develop a unified meanfield theory for RNNs and DNNs in terms of representational kernels, describing fully trained networks in the feature learning (\muP) regime. This theory casts training as Bayesian inference over sequences and patterns, directly revealing the functional implications induced by the RNNs' weight sharing. In DNN-typical tasks, we identify a phase transition when the learning signal overcomes the noise due to randomness in the weights: below this threshold, RNNs and DNNs behave identically; above it, only RNNs develop correlated representations across timesteps. For sequential tasks, the RNNs' weight sharing furthermore induces an inductive bias that aids generalization by interpolating unobserved time steps. Overall, our theory offers a way to connect architectural structure to functional biases.

Abstract:
The deployment of Large Language Models is constrained by the memory and bandwidth demands of static weights and dynamic KeyValue cache. SVD-based compression provides a hardware-friendly solution to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient. In this paper, we propose Swift-SVD, an activation-aware, closed-form compression framework that simultaneously guarantees theoretical optimum, practical efficiency and numerical stability. Swift-SVD incrementally aggregates covariance of output activations given a batch of inputs and performs a single eigenvalue decomposition after aggregation, enabling training-free, fast, and optimal layer-wise low-rank approximation. We employ effective rank to analyze local layer-wise compressibility and design a dynamic rank allocation strategy that jointly accounts for local reconstruction loss and end-to-end layer importance. Extensive experiments across six LLMs and eight datasets demonstrate that Swift-SVD outperforms state-of-the-art baselines, achieving optimal compression accuracy while delivering 3–70× speedups in end-to-end compression time. Our code will be released upon acceptance.

Abstract:
Estimating the quality of register transfer level (RTL) designs is crucial in the electronic design automation (EDA) workflow, as it enables instant feedback on key performance metrics like area and delay without the need for timeconsuming logic synthesis. While recent approaches have leveraged large language models (LLMs) to derive embeddings from RTL code and achieved promising results, they overlook the structural semantics essential for accurate quality estimation. In contrast, the control data flow graph (CDFG) view exposes the design's structural characteristics more explicitly, offering richer cues for representation learning. In this work, we introduce StructRTL, a novel structure-aware graph self-supervised learning framework for improved RTL design quality estimation. By learning structure-informed representations from CDFGs, StructRTL significantly outperforms prior art on various quality estimation tasks. To further boost performance, we incorporate a knowledge distillation strategy that transfers low-level insights from post-mapping netlists into the CDFG-based predictor. Experimental results demonstrate that StructRTL establishes new state-of-the-art results, highlighting the effectiveness of combining structural learning with cross-stage supervision.

Abstract:
Research has repeatedly demonstrated that intermediate hidden states extracted from large language models and speech audio models predict measured brain response to natural language stimuli. Yet, very little is known about the representation properties that enable this high prediction performance. Why is it the intermediate layers, and not the output layers, that are most effective for this unique and highly general transfer task? We give evidence that the correspondence between speech and language models and the brain derives from shared meaning abstraction and not their nextword prediction properties. In particular, models construct higher-order linguistic features in their middle layers, cued by a peak in the layerwiseintrinsic dimension, a measure of feature complexity. We show that a layer's intrinsic dimension strongly predicts how well it explains fMRI and ECoG signals; that the relation between intrinsic dimension and brain predictivity arises over model pre-training; and finetuning models to better predict the brain causally increases both representations' intrinsic dimension and their semantic content. Results suggest that semantic richness, high intrinsic dimension, and brain predictivity mirror each other, and that the key driver of model-brain similarity isrich meaning abstractionof the inputs, where language modeling is a task sufficiently complex (but perhaps not the only) to require it.

Abstract:
Existing tasks fall short in evaluating reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for humans discovery of real world. We introduce a novel evaluation paradigm, blackbox interaction, to tackle this challenge. A black-box is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box by interacting with it in given exploration turns, and reasoning over observed input-output pairs. Leveraging this idea, we build the Oracle benchmark which comprises 6 types of black-box task and 96 black-boxes. 19 modern LLMs are benchmarked. o3, a leading LLM from OpenAI, ranks first in 5 of the 6 tasks, achieving over 70% accuracy on most easy black-boxes. But it still struggles with some hard black-box tasks, where its average performance drops below 40%. Further analysis indicates a universal difficulty among LLMs: They lack the high-level planning capability to develop efficient and adaptive exploration strategies for hypothesis refinement.

Abstract:
Flow matching models have recently emerged as an efficient alternative to diffusion, especially for textguided image generation and editing, offering faster inference through continuous-time dynamics. However, existing flow-based editors predominantly support global or single-instruction edits and struggle with multi-instance scenarios, where multiple parts of a reference input must be edited independently without semantic interference. We identify this limitation as a consequence of globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. To address this issue, we introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation. We evaluate our approach on both natural image editing and a newly introduced benchmark of text-dense infographics with region-level editing instructions. Experimental results demonstrate that our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.

Abstract:
Scaling testtime compute enhances long chain-of-thought (CoT) reasoning, yet existing approaches face a fundamental trade-off between computational cost and coverage quality: either incurring high training expense or yielding redundant trajectories. We introduce The Geometric Reasoner (TGR), a training-free framework that performs manifold-informed latent foresight search under strict memory bounds. At each chunk boundary, TGR scores candidate latent anchors via a lightweight look-ahead estimate combined with soft geometric regularizers that encourage smooth trajectories and diverse exploration. Chunk-wise KV cache resets keep memory linear in chunk length. On challenging math and code benchmarks, TGR improves robust trajectory coverage, measured by the area under the Pass@k curve (AUC), by up to 13 points on Qwen3-8B, with negligible overhead of about 1.1--1.3×.

Abstract:
In this paper, we introduce LATO, a novel topologypreserving latent representation that enables scalable, flow matching-based synthesis of explicit 3D meshes. LATO represents a mesh as a Vertex Displacement Field (VDF) anchored on surface, incorporating a sparse voxel Variational Autoencoder (VAE) to compress this explicit signal into a structured, topology-aware voxel latent. To decapsulate the mesh, the VAE decoder progressively subdivides and prunes latent voxels to instantiate precise vertex locations. In the end, a dedicated connection head queries the voxel latent to predict edge connectivity between vertex pairs directly, allowing mesh topology to be recovered without isosurface extraction or heuristic meshing. For generative modeling, LATO adopts a two-stage flow matching process, first synthesizing the structure voxels and subsequently refining the voxel-wise topology features. Compared to prior isosurface/triangle-based diffusion models and autoregressive generation approaches, LATO generates meshes with complex geometry, well-formed topology while being highly efficient in inference.

Abstract:
Recent work suggests that (stochastic) gradient descent selforganizes near the instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size--dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau 2(1-\beta)/\eta, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau 2(1+\beta)/\eta, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show this aligns with linear stability thresholds and we discuss the implications on hyperparameters tuning and coupling.

Abstract:
Existing multimodal large language models for longvideo understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce VideoSeeker, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency. To support training at scale, we further develop a data synthesis pipeline and construct Seeker-173K, comprising 173K high-quality tool-interaction trajectories for effective supervised and reinforcement learning. Extensive experiments show that VideoSeeker substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes. These results demonstrate VideoSeeker's strong multi-hop evidence-seeking and reasoning capabilities, and validate the effectiveness of native tool invocation in long-video scenarios.

Abstract:
Dpeptide binders targeting L-proteins have promising therapeutic potential. Despite rapid advances in machine learning-based target-conditioned peptide design, generating D-peptide binders remains largely unexplored. In this work, we show that by injecting axial features to E(3)-equivariant (polar) vector features, it is feasible to achieve cross-chirality generalization from homo-chiral (L-L) training data to hetero-chiral (D-L) design tasks. By implementing this method within a latent diffusion model, we achieved D-peptide binder design that not only outperforms existing tools inin silicobenchmarks, but also demonstrates efficacy in wet-lab validation. To our knowledge, our approach represents the first experimentally validated AI generative model for thede novodesign of D-peptide binders, offering new perspectives on handling chirality in protein design.

Abstract:
Learning meaningful representations from medical time series (MedTS), such as ECG or EEG signals, is a critical challenge. These signals are often highdimensional, variable-length, and rife with noise. Existing self-supervised approaches, such as Masked Autoencoders (MAEs), are highly effective for pre-training general-purpose encoders. However, they do not explicitly learn compact, fixed-size, or semantically interpretable latent representations, typically relying on heuristic aggregation strategies such as global average pooling or a designated [CLS] token. We propose a novel framework that compresses a variable-length MedTS into a fixed-size set of k latent Fingerprint Tokens. Our architecture employs a cross-attention bottleneck to generate these tokens and is trained with a dual-objective function. The first objective is a reconstruction loss, which ensures the tokens are sufficient statistics for the original data. The second, a diversity penalty based on the Total Coding Rate (TCR), explicitly minimizes the redundancy between tokens, encouraging them to become statistically disentangled representations. We present the theoretical justification for our method, framing it as a novel Disentangled Rate-Distortion problem. This approach produces a low-dimensional, interpretable, and sample-efficient representation, where each token is encouraged to capture an independent factor of variation, paving the way for more robust digital biomarkers.

Abstract:
We analyze two classical algorithms for solving additively composite convex optimization problems where the objective is the sum of a smooth term and a nonsmooth regularizer: proximal stochastic gradient method for a single regularizer; and the randomized incremental proximal method, which uses the proximal operator of a randomly selected function when the regularizer is given as the sum of many nonsmooth functions. We focus on relaxing the bounded variance assumption that is common, yet stringent, for getting last iterate convergence rates. We prove the \widetildeO(1/\sqrtT) rate of convergence for the last iterate of both algorithms under componentwise convexity and smoothness, which is optimal up to log terms. Our results apply directly to graphguided regularizers that arise in multi-task and federated learning, where the regularizer decomposes as a sum over edges of a collaboration graph.

Abstract:
Training stability remains a critical bottleneck for Group Relative Policy Optimization (GRPO), often manifesting as a tradeoff between reasoning plasticity and general capability retention. We identify a root cause as the geometric conflict between plasticity and stability gradients, which leads to destructive interference. Crucially, we argue that deterministic projection methods are suboptimal for GRPO as they overlook the intrinsic stochasticity of group-based gradient estimates. To address this, we propose Probabilistic Conflict Resolution (PCR), a Bayesian framework that models gradients as random variables. PCR dynamically arbitrates conflicts via an uncertainty-aware ``soft projection'' mechanism, optimizing the signal-to-noise ratio. Extensive experiments demonstrate that PCR significantly smooths the training trajectory and achieves superior performance in various reasoning tasks.

Abstract:
Most posttraining backdoor detection methods rely on attacked models exhibiting extreme outlier detection statistics for the target class of an attack, compared to non-target classes. However, these approaches may fail: (1) when some (non-target) classes are easily discriminable from all others, in which case they maynaturallyachieve extreme detection statistics (e.g., decision confidence); and (2) when the backdoor is subtle, i.e., with its features weak relative to intrinsic class-discriminative features. A key observation is that the backdoor target class has contributions to its detection statistic from both the backdoor triggerandfrom its intrinsic features, whereas non-target classesonlyhave contributions from their intrinsic features. To achieve more sensitive detectors, we thus propose tosuppressintrinsic features while optimizing the detection statistic for a given class. For non-target classes, such suppression will drastically reduce the achievable statistic, whereas for the target class the (significant) contribution from the backdoor trigger remains. In practice, we formulate a constrained optimization problem, leveraging a small set of clean examples from a given class, and optimizing the detection statistic while orthogonalizing with respect to the class's intrinsic features. We dub this approach ''class subspace orthogonalization'' (CSO). CSO can be ''plug-and-play'' applied to a wide variety of existing detectors. We demonstrate its effectiveness in improving several well-known detectors, comparing with a variety of baseline detectors, against a variety of attacks, on the CIFAR-10, GTSRB, and TinyImageNet domains. Moreover, to make the detection problem even more challenging, we also evaluate against a novel mixed clean/dirty-label poisoning attack that is more surgical and harder to detect than traditional dirty-label attacks. Finally, we evaluate CSO against an adaptive attack designed to defeat it, with promising detection results.

Abstract:
Advances in generative modeling have recently been adapted to tabular data containing discrete and continuous features. However, generating mixedtype features that combine discrete states with an otherwise continuous distribution in a single feature remains challenging. We advance the state-of-the-art in diffusion models for tabular data with a cascaded approach. We first generate a low-resolution version of a tabular data row, that is, the collection of the purely categorical features and a coarse categorical representation of numerical features. Next, this information is leveraged in the high-resolution flow matching model via a novel guided conditional probability path and data-dependent coupling. The low-resolution representation of numerical features explicitly accounts for discrete outcomes, such as missing or inflated values, and therewith enables a more faithful generation of mixed-type features. We formally prove that this cascade tightens the transport cost bound. The results indicate that our model generates significantly more realistic samples and captures distributional details more accurately, for example, the detection score increases by 40%.

Abstract:
Latent visual reasoning aims to mimic human'simaginationprocess by meditating through hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a)InputLatent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b)Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating the limited causal effect latent tokens imposing on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative namedCapImagine, which teaches the model to explicitlyimagineusing text. Experiments on vision-centric benchmarks show thatCapImaginesignificantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.

Abstract:
A popular paradigm for training LM agents relies on imitation learning, finetuning on expert trajectories. However, we show that the off-policy nature of imitation learning for multi-turn LM agents suffers from the fundamental limitation known as covariate shift: as the student policy's behavior diverges from the expert's, it encounters states not present in the training data, reducing the effectiveness of fine-tuning. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology for addressing covariate shift for multi-turn LLM training. We introduce on-policy expert corrections (OECs), partially on-policy data generated by starting rollouts with a student model and then switching to an expert model part way through the trajectory. We explore the effectiveness of our data generation technique in the domain of software engineering (SWE) tasks, a multi-turn setting where LLM agents must interact with a development environment to fix software bugs. Our experiments compare OEC data against various other on-policy and imitation learning approaches on SWE agent problems and train models using a common rejection sampling (i.e., using environment reward) combined with supervised fine-tuning technique. Experiments find that OEC trajectories show a relative 14% and 13% improvement over traditional imitation learning in the 7b and 32b setting, respectively, on SWE-bench verified. Our results demonstrate the need for combining expert demonstrations with on-policy data for effective multi-turn LM agent training.

Abstract:
Vision–Language–Action (VLA) models have shown strong performance in robotic manipulation, but often struggle in longhorizon or out-of-distribution scenarios due to the lack of explicit mechanisms for multimodal reasoning and anticipating how the world will evolve under action. Recent works introduce textual chain-of-thought or visual subgoal prediction within VLA models to reason, but still fail to offer a unified human-like reasoning framework for joint textual reasoning, visual foresight, and action prediction. To this end, we propose HALO, a unified VLA model that enables embodied multimodal chain-of-thought (EM-CoT) reasoning through a sequential process of textual task reasoning, visual subgoal prediction for fine-grained guidance, and EM-CoT-augmented action prediction. We instantiate HALO with a Mixture-of-Transformers (MoT) architecture that decouples semantic reasoning, visual foresight, and action prediction into specialized experts while allowing seamless cross-expert collaboration. To enable HALO learning at scale, we introduce an automated pipeline to synthesize EM-CoT training data along with a carefully crafted training recipe. Extensive experiments demonstrate that: (1) HALO achieves superior performance in both simulated and real-world environments, surpassing baseline policy \pi_0 by 34.1% on RoboTwin benchmark; (2) all proposed components of the training recipe and EM-CoT design help improve task success rate; and (3) HALO exhibits strong generalization capabilities under aggressive unseen environmental randomization with our proposed EM-CoT reasoning.

Abstract:
Large Language Models (LLMs) have inherent limitations of faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks have been developed that provide a test bed for factuality evaluation within the context of Englishcentric datasets, while relying on supplementary informative context like web links or text passages but ignoring the available structured factual resources. To this end, Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation, as they provide a structured way to represent the facts about entities and their relations with minimal linguistic overhead. We bridge the lack of KG paths and multilinguality for factual language modeling within the existing hallucination evaluation benchmarks and propose a KG-based multilingual, multihop benchmark called MultiHal framed for generative text evaluation. As part of our data collection pipeline, we mined 140k KG-paths from open-domain KGs, from which we pruned noisy KG-paths, curating a high-quality subset of 25.9k. Our baseline evaluation shows an absolute scale improvement by approximately 0.12 to 0.36 points for the semantic similarity score, 0.16 to 0.36 for NLI entailment and 0.29 to 0.42 for hallucination detection in KG-RAG over vanilla QA across multiple languages and multiple models, demonstrating the potential of KG integration. We anticipate MultiHal will foster future research towards several graph-based hallucination mitigation and fact-checking tasks.

Abstract:
Pretraining and finetuning are central stages in modern machine learning systems. In practice, feature learning plays an important role across both stages: deep neural networks learn a broad range of useful features during pretraining and further refine those features during fine-tuning. However, an end-to-end theoretical understanding of how choices of initialization impact the ability to reuse and refine features during fine-tuning has remained elusive. Here we develop an analytical theory of the pretraining–fine-tuning pipeline in diagonal linear networks, deriving exact expressions for the generalization error as a function of initialization parameters and task statistics. We find that different initialization choices place the network into four distinct fine-tuning regimes that are distinguished by their ability to support feature learning and reuse—and therefore by the task statistics for which they are beneficial. In particular, a smaller initialization scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization on fine-tuning tasks that rely on a subset of pretraining features. We demonstrate empirically that the same initialization parameters impact generalization in nonlinear networks trained on CIFAR-100. Overall, our results demonstrate analytically how data and network initialization interact to shape fine-tuning generalization, highlighting an important role for the relative scale of initialization across different layers in enabling continued feature learning during fine-tuning.

Abstract:
The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for realworld applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for physics/chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.

Abstract:
Textto-image (TTI) diffusion models have achieved remarkable visual quality, yet they have been repeatedly shown to exhibit social biases across sensitive attributes such as gender, race and age. To mitigate these biases, existing approaches frequently depend on curated prompt datasets - either manually constructed or generated with large language models (LLMs) - as part of their training and/or evaluation procedures. Beside the curation cost, this also risks overlooking unanticipated, less obvious prompts that trigger biased generation, even in models that have undergone debiasing. In this work, we introduce Bias-Guided Prompt Search (BGPS), a framework that automatically generates prompts that aim to maximize the presence of biases in the resulting images. BGPS comprises two components: (1) an LLM instructed to produce attribute-neutral prompts and (2) attribute classifiers acting on the TTI’s internal representations that steer the decoding process of the LLM toward regions of the prompt space that amplify the image attributes of interest. We conduct extensive experiments on Stable Diffusion 1.5 and a state-of-the-art debiased model and discover an array of subtle and previously undocumented biases that severely deteriorate fairness metrics. Crucially, the discovered prompts are interpretable, i.e they may be entered by a typical user, quantitatively improving the perplexity metric compared to a prominent hard prompt optimization counterpart. Our findings uncover TTI vulnerabilities, while BGPS expands the bias search space and can act as a new evaluation tool for bias mitigation.

Abstract:
LLMs achieve remarkable multistep reasoning capabilities, yet effectively transferring these skills via post-training distillation remains challenging. Existing data selection methods, ranging from manual curation to heuristics based on length, entropy, or overall loss, fail to capture the causal importance of individual reasoning steps, limiting distillation efficiency. To address this, we propose Attention Influence for Reasoning (AIR), a principled, unsupervised and training-free framework that leverages mechanistic insights of the retrieval head to select high-value post-training data. AIR first identifies reasoning-critical attention heads of an off-the-shelf model, then constructs a weakened reference model with disabled head influence, and finally quantifies the resulting loss divergence as the Attention Influence Score. This score enables fine-grained assessment at both the step and sample levels, supporting step-level weighted fine-tuning and global sample selection. Experiments across multiple reasoning benchmarks show that AIR consistently improves reasoning accuracy, surpassing heuristic baselines and effectively isolating the most critical steps and samples. Our work establishes a mechanism-driven, data-efficient approach for reasoning distillation in LLMs.

Abstract:
Crystal modeling spans a family of conditional and unconditional generation tasks across different modalities, including crystal structure prediction (CSP) andde novogeneration (DNG). While recent deep generative models have shown promising performance, they remain largely taskspecific, lacking a unified framework that shares crystal representations across different generation tasks. To address this limitation, we proposeMultimodal Crystal Flow (MCFlow), a unified multimodal flow model that realizes multiple crystal generation tasks as distinct inference trajectories via independent time variables for atom types and crystal structures. To enable multimodal flow in a standard transformer model, we introduce a composition- and symmetry-aware atom ordering with hierarchical permutation augmentation, injecting strong compositional and crystallographic priors without explicit structural templates. Experiments on the MP-20 and MPTS-52 benchmarks show that MCFlow achieves competitive performance against task-specific baselines across multiple crystal generation tasks. Our code and inference trajectories are available athttps://anonymous.4open.science/r/mcflow-46E4.

Abstract:
As Large Language Models (LLMs) gain agentic abilities, they will have to navigate complex multiagent scenarios, interacting with human users and other agents in cooperative and competitive settings. This will require new reasoning skills, a crucial one being theory of mind (ToM), or the ability to reason about the "mental" states of other agents. However, ToM and other multi-agent abilities in LLMs are poorly understood, since existing benchmarks suffer from narrow scope, data leakage, saturation, and lack of interactivity. We thus propose Decrypto, a game-based benchmark for multi-agent reasoning and ToM drawing inspiration from cognitive science, computational pragmatics and multi-agent reinforcement learning. It is designed to be as easy as possible in all other dimensions, eliminating confounding factors commonly found in other benchmarks. To our knowledge, it is also the first platform for designing interactive ToM experiments. We validate the benchmark design through comprehensive empirical evaluations of frontier LLMs, robustness studies, and human-AI cross-play experiments. We find that LLM game-playing abilities lag behind humans and simple word-embedding baselines. We then create variants of two classic cognitive science experiments within Decrypto to evaluate three key ToM abilities. Surprisingly, we find that state-of-the-art reasoning models are significantly worse at those tasks than their older counterparts. This demonstrates that Decrypto addresses a crucial gap in current reasoning and ToM evaluations, and paves the path towards better artificial agents. Code at anonymous.4open.science/r/decrypto/.

Abstract:
Recent advancements in tabular deep learning have demonstrated exceptional practical performance, yet the field often lacks a clear understanding of why these techniques actually succeed. To address this gap, our paper highlights the importance of the concept of data (aleatoric) uncertainty for explaining the effectiveness of recent tabular DL methods. While data uncertainty leads to irreducible prediction errors on test samples, it also introduces stochasticity into the training signal that can impede effective learning. We demonstrate that tabular methods differ significantly in their ability to cope with this optimization challenge. Specifically, we reveal that the success of many beneficial design choices in tabular DL, such as numerical feature embeddings, advanced ensembling strategies, retrievalaugmented models, and tabular Prior-Fitted Networks, can be partially attributed to their respective implicit mechanisms for performing well under high data uncertainty. By dissecting these varied mechanisms, we provide a unifying understanding of recent performance improvements. Furthermore, leveraging insights from this perspective, we design a novel, more effective numerical feature embedding method as an immediate practical outcome of our analysis. Overall, our work paves the way toward a principled understanding of the benefits introduced by modern tabular methods that results in the concrete advancements of existing techniques and outlines future research directions for tabular DL.

Abstract:
In many safetycritical settings, probabilistic ML systems have to make predictions subject to algebraic constraints, e.g., predicting the most likely trajectory that does not cross obstacles. These real-world constraints are rarely convex, nor the densities considered are (log-)concave. This makes computing this constrained maximum a posteriori (MAP) prediction in an efficient and reliable way extremely challenging. In this paper, we first investigate under which conditions we can perform constrained MAP inference over continuous variables exactly and efficiently and devise a scalable message-passing algorithm for this tractable fragment. Then, we devise a general constrained MAP strategy that interleaves partitioning the domain into convex feasible regions with numerical constrained optimization. We evaluate both methods on synthetic and real-world benchmarks, showing our structure aware approach outperforms constraint-agnostic baselines.

Abstract:
The performance of autonomous Web GUI agents heavily relies on the quality and quantity of their training data. However, a fundamental bottleneck persists: collecting interaction trajectories from realworld websites is expensive and difficult to verify. The underlying state transitions are hidden, leading to reliance on inconsistent and costly external verifiers (e.g., human or LLM judges) to evaluate step-level correctness. To address this, we propose AutoWebWorld, a novel framework for synthesizing controllable and verifiable web environments by modeling them as Finite State Machines (FSMs) and use coding agents to translate FSMs into interactive websites. Unlike real websites, where state transitions are implicit, AutoWebWorld explicitly defines all states, actions, and transition rules. This enables programmatic verification: action correctness is checked against predefined rules, and task success is confirmed by reaching a goal state in the FSM graph. AutoWebWorld enables a fully automated search-and-verify pipeline, generating over 11,663 verified trajectories from 29 diverse web environments at only \0.04 per trajectory. Training on this synthetic data significantly boosts real-world performance. Our 7B Web GUI agent achieves state-of-the-art on WebVoyager, outperforming all baselines within 15 steps. Furthermore, we observe a clear scaling law: as the synthetic data volume increases, performance on WebVoyager and Online-Mind2Web consistently improves.

Abstract:
Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduceSciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, andSciReasPro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then proposeKRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge.

Abstract:
Instructionbased image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and user intent. Additionally, current datasets provide limited support for training and evaluating reasoning-aware editing capabilities. Architecturally, these methods also lack mechanisms for fine-grained detail extraction that support such reasoning. To address these limitations, we propose Reason50K, a large-scale dataset specifically curated for training and evaluating hypothetical instruction–reasoning image editing, along with ReasonBrain, a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Reason50K includes over 50K samples spanning four key reasoning scenarios: Physical, Temporal, Causal, and Story reasoning. ReasonBrain leverages Multimodal Large Language Models (MLLMs) for editing guidance generation and a diffusion model for image synthesis, incorporating a Fine-grained Reasoning Cue Extraction (FRCE) module to capture detailed visual and textual semantics essential for supporting instruction reasoning. To mitigate the semantic loss, we further introduce a Cross-Modal Enhancer (CME) that enables rich interactions between the fine-grained cues and MLLM-derived features. Extensive experiments demonstrate that ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios while exhibiting strong zero-shot generalization to conventional IIE tasks. Our dataset and code will be released publicly.

Abstract:
Large language model (LLM)powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and steer group decisions, undermining MAS safety. Existing embedding-based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Stronger attackers can break such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three stronger attacks, Slow Drift, Benign Wrapper, and Chaos Seeding. Our findings also expose a key limitation of embedding-based defenses: they operate only on the text embeddings and ignore token-level confidence signals such as logits, which can remain informative when embeddings are not distinguishable under attack. In this paper, we use confidence score to prune or down-weight messages during MAS communication. Experiments show improved robustness across models, datasets, and communication topologies. Moreover, we find that the effectiveness of confidence signals decays over communication rounds, highlighting the importance of early intervention. This insights can inform and inspire future work on MAS attacks and defenses.

Abstract:
LLMbased multi-agent simulations are increasingly adopted across application domains, but remain difficult to scale due to GPU memory pressure. Each agent maintains private GPU-resident states, including models, prefix caches, and adapters, which quickly exhaust device memory as the agent count grows. We identify two key properties of these workloads: sparse agent activation and an estimable agent invocation order. Based on an analysis of representative workload classes, we introduce invocation distance, a unified abstraction that estimates the relative order in which agents will issue future LLM requests. Leveraging this abstraction, we present ScaleSim, a memory-efficient LLM serving system for large-scale multi-agent simulations. ScaleSim enables proactive prefetching and priority-based eviction, supports diverse agent-specific memory through a modular interface, and achieves up to 1.74× speedup over SGLang on simulation benchmarks.

Abstract:
Sampling useful threedimensional molecular structures along with their most favorable conformations is a key challenge in drug discovery. Current state-of-the-art 3D de-novo design flow matching or diffusion-based models are limited to generating a single conformation. However, the conformational landscape of a molecule determines its observable properties and how tightly it is able to bind to a given protein target. By generating a representative set of low-energy conformers, we can more directly assess these properties and potentially improve the ability to generate molecules with desired thermodynamic observables. Towards this aim, we propose FlexiFlow, a novel architecture that extends flow-matching models, allowing for the joint sampling of molecules along with multiple conformations while preserving both equivariance and permutation invariance. We demonstrate the effectiveness of our approach on the QM9 and GEOM Drugs datasets, achieving state-of-the-art results in molecular generation tasks. Our results show that FlexiFlow can generate valid, unstrained, unique, and novel molecules with high fidelity to the training data distribution, while also capturing the conformational diversity of molecules. Moreover, we show that our model can generate conformational ensembles that provide similar coverage to state-of-the-art physics-based methods at a fraction of the inference time. Finally, FlexiFlow can be successfully transferred to the protein-conditioned ligand generation task, even when the dataset contains only static pockets without accompanying conformations.

Abstract:
While current embodied policies exhibit remarkable manipulation skills, their execution remains unsatisfactorily slow as they inherit the tardy pacing of human demonstrations. Existing acceleration methods typically require policy retraining or costly online interactions, limiting their scalability for largescale foundation models. In this paper, we propose Speedup Patch (SuP), a lightweight, policy-agnostic framework that enables plug-and-play acceleration using solely offline data. SuP introduces an external scheduler that adaptively downsamples action chunks provided by embodied policies to eliminate redundancies. Specifically, we formalize the optimization of our scheduler as a Constrained Markov Decision Process (CMDP) aimed at maximizing efficiency without compromising task performance. Since direct success evaluation is infeasible in offline settings, SuP introduces World Model based state deviation as a surrogate metric to enforce safety constraints. By leveraging a learned world model as a virtual evaluator to predict counterfactual trajectories, the scheduler can be optimized via offline reinforcement learning. Empirical results on simulation benchmarks (Libero, Bigym) and real-world tasks validate that SuP achieves an overall 1.8× execution speedup for diverse policies while maintaining their original success rates.

Abstract:
Anchors is a popular local modelagnostic explanation technique whose applicability is limited by its computational inefficiency. To address this limitation, we propose a memorization-based framework that accelerates Anchors while preserving explanation fidelity and understandability. Our approach leverages the iterative nature of Anchors' algorithm which gradually refines an explanation until it is precise enough for a given input by storing and reusing intermediate results obtained during prior explanations. Specifically, we maintain a memory of low-precision, high-coverage rules and introduce a rule transformation framework to adapt them to new inputs: the horizontal transformation adapts a retrieved explanation to the current input by replacing features, and the vertical transformation refines the general explanation until it is precise enough for the input. We evaluate our method across tabular, text, and image datasets, demonstrating that it significantly reduces explanation generation time while maintaining fidelity and understandability, thereby enabling the practical adoption of Anchors in time-sensitive applications.

Abstract:
The quadratic computational complexity of softmax transformers has become a bottleneck in longcontext scenarios. In contrast, linear attention model families provide a promising direction towards a more efficient sequential model. These linear attention models compress past KV values into a single hidden state, thereby efficiently reducing complexity during both training and inference. However, their expressivity remains limited by the size of their hidden state. Previous work proposed interleaving softmax and linear attention layers to reduce computational complexity while preserving expressivity. Nevertheless, the efficiency of these models remains bottlenecked by their softmax attention layers. In this paper, we propose Neural Attention Search Linear (NAtS-L), a framework that applies both linear attention and softmax attention operations within the same layer on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., tokens that have only short-term impact and can be encoded into fixed-size hidden states, or require softmax attention, i.e., tokens that contain information related to long-term retrieval and need to be preserved for future queries. By searching for optimal Gated DeltaNet and softmax attention combinations across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.

Abstract:
Humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings. However, existing omnimodal models still exhibit substantial performance degradation on visual tasks when the audio modality is incorporated. We identify this “modality interference” as a consequence of pretraining data imbalances, where the scarcity of mixed modality supervision induces a bias towards isolated modalities, resulting in an inherent trade-off. To address this challenge, we propose OmniVideo-R1, a novel reinforced reasoning framework that leverages post-training to rectify modality bias. OmniVideo-R1 empowers models to “think with omnimodal cues” and integrate cross-modal information. The framework consists of two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality- attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.

Abstract:
In modern AI research, baseline and dataset selection is a highstakes decision in experimental design. It operationalizes a research idea into a concrete evaluation protocol and largely determines the validity and comparability of empirical conclusions. However, making appropriate choices is increasingly difficult as baselines and datasets proliferate, while suitability is inherently context-dependent and rarely captured by baseline and dataset metadata. To address these challenges, we present AgentExpt, a comprehensive framework for baseline and dataset recommendation. We first curate a large-scale, high-quality knowledge base that links 108,825 accepted papers to their used baselines and datasets. Based on this resource, we design a collective perception-enhanced retriever that represents each baseline or dataset by integrating first-person self-descriptions with third-person citation contexts, thereby effectively positioning them within the scholarly network. We further design a reasoning-augmented reranker that encodes baseline-dataset interaction chains as a reasoning prior to fine-tune an LLM, producing refined rankings with interpretable justifications. Experiments show that our framework outperforms the strongest baseline, with average gains of +5.85% in Recall@20 and +7.90% in HitRate@10, and ablation studies confirm the effectiveness of our designed components. Overall, AgentExpt advances the efficient and reliable automation of experimental design. Our code is available at \urlhttps://anonymous.4open.science/r/Agentexpt-DD3E.

Abstract:
Higherorder ODE solvers have become a standard tool for accelerating diffusion probabilistic model (DPM) sampling, motivating the widespread view that first-order methods are inherently slower and that increasing discretization order is the primary path to faster generation. This paper challenges this belief and revisits acceleration from a complementary angle: beyond solver order, the placement of DPM evaluations along the reverse-time dynamics can substantially affect sampling accuracy in the low-neural function evaluation (NFE) regime. We propose a novel training-free, first-order sampler named Forward DPMSolver (F-DPMSolver), whose leading discretization error has the opposite sign to that of DDIM. Algorithmically, the method approximates the forward-value evaluation via a cheap one-step lookahead predictor. We provide theoretical guarantees showing that the resulting sampler provably approximates the ideal forward-value trajectory while retaining first-order convergence. Empirically, across standard image generation benchmarks, the proposed sampler consistently improves sample quality under the same NFE budget and can be competitive with, and sometimes outperform, state-of-the-art higher-order samplers. Overall, the results suggest that the placement of DPM evaluations provides an additional and largely independent design angle for accelerating diffusion sampling. Our code is available at https://anonymous.4open.science/r/F-DPMSolver.

Abstract:
Underdamped Langevin dynamics (ULD) is a widelyused sampler for Gibbs distributions \pi\propto e^-V, and is often empirically effective in high dimensions. However, existing non-asymptotic convergence guarantees for discretized ULD typically scale polynomially with the ambient dimension d, leading to vacuous bounds when d is large. The main known dimension-free result concerns the randomized midpoint discretization in Wasserstein-2 distance (Liu et al., 2023), while dimension-independent guarantees for ULD discretizations in KL divergence have remained open. We close this gap by proving the first dimension-free KL divergence bounds for discretized ULD. Our analysis refines the KL local error framework \citepaltschuler2025shifted to a dimension-free setting and yields bounds that depend on \mathrmtr(\mathbfH), where \mathbfH upper bounds the Hessian of V, rather than on d. As a consequence, we obtain improved iteration complexity for underdamped Langevin Monte Carlo relative to overdamped Langevin methods in regimes where \mathrmtr(\mathbfH)\ll d.

Abstract:
Neural operators have emerged as powerful surrogates for the solution of partial differential equations (PDEs), yet their ability to handle general, highly variable boundary conditions (BCs) remains limited. Existing approaches often fail when the solution operator exhibits strong sensitivity to boundary forcings. We propose a general framework for conditioning neural operators on complex nonhomogeneous BCs through function extensions. Our key idea is to map boundary data to latent pseudo-extensions defined over the entire spatial domain, enabling any standard operator learning architecture to consume boundary information. The resulting operator, coupled with an arbitrary domain-to-domain neural operator, can learn rich dependencies on complex BCs and input domain functions at the same time. To benchmark this setting, we construct 18 challenging datasets spanning Poisson, linear elasticity, and hyperelasticity problems, with highly variable, mixed-type, component-wise, and multi-segment BCs on diverse geometries. Our approach achieves state-of-the-art accuracy, outperforming baselines by large margins, while requiring no hyperparameter tuning across datasets. Overall, our results demonstrate that learning boundary-to-domain extensions is an effective and practical strategy for imposing complex BCs in existing neural operator frameworks, enabling accurate and robust scientific machine learning models for a broader range of PDE-governed problems.

Abstract:
Large reasoning models (LRMs) often generate long, seemingly coherent reasoning traces yet still produce incorrect answers, making hallucination detection challenging. Although trajectories contain useful signals, directly using trace text or vanilla hidden states for detection is brittle: traces vary in form and detectors can overfit to superficial patterns rather than answer validity. We introduce Answeragreement Representation Shaping (ARS), which learns detection-friendly trace-conditioned representations by explicitly encoding answer stability. ARS generates counterfactual answers through small latent interventions, specifically, perturbing the trace-boundary embedding, and labels each perturbation by whether the resulting answer agrees with the original. It then learns representations that bring answer-agreeing states together and separate answer-disagreeing ones, exposing latent instability indicative of hallucination risk. The shaped embeddings are plug-and-play with existing embedding-based detectors and require no human annotations during training. Experiments demonstrate that ARS consistently improves detection and achieves substantial gains over strong baselines.

Abstract:
Density functional theory (DFT) is the most widely used method for calculating molecular properties; however, its accuracy is often insufficient for quantitative predictions. Coupled cluster (CC) theory is the most successful method for achieving accuracy beyond DFT and predicting properties that closely align with experiment. It is known as the ``gold standard'' of quantum chemistry. Unfortunately, the high computational cost of CC limits its widespread applicability. In this work, we present the Molecular Orbital Learning Model (MoLe), an equivariant machine learning model that directly predicts CC's core mathematical objects, the excitation amplitudes, from the meanfield Hartree-Fock molecular orbitals as inputs. We test various aspects of our model and demonstrate its very high data efficiency and remarkable out-of-distribution generalization to larger molecules and off-equilibrium geometries, despite being trained only on small equilibrium geometries. Finally, we also examine its ability to reduce the number of cycles required to converge CC calculations. MoLe can set the foundations for high-accuracy wavefunction-based ML architectures to accelerate molecular design and complement force-field approaches.

Abstract:
Unsupervised multisource domain adaptation (UMDA) leverages labeled data from multiple source domains to generalize to an unlabeled target. While federated UMDA addresses privacy by avoiding raw data sharing, existing methods scale poorly as the number of sources increases, often suffering from high computational overhead or training instability. We propose GALA, a scalable and robust federated UMDA framework designed for high-diversity settings. GALA achieves scalability by coupling a novel inter-group discrepancy minimization objective that approximates pairwise alignment with linear complexity alongside a temperature-controlled, centroid-based weighting strategy for dynamic source prioritization. These components enable stable, parallelizable training across many heterogeneous sources, addressing a critical scalability bottleneck that remains largely unaddressed in current literature. To evaluate performance in high-diversity scenarios, we introduce Digit-18, a new benchmark comprising 18 datasets with varied synthetic and real-world domain shifts. Extensive experiments demonstrate that GALA achieves state-of-the-art results on standard benchmarks and significantly outperforms prior methods in large-scale settings where others either fail to converge or become computationally infeasible.

Abstract:
Molecular dynamics (MD) is a central computational tool in physics, chemistry, and biology, enabling quantitative prediction of experimental observables as expectations over highdimensional molecular distributions such as Boltzmann distributions and transition densities. However, conventional MD is fundamentally limited by the high computational cost required to generate independent samples. Generative molecular dynamics (GenMD) has recently emerged as an alternative, learning surrogates of molecular distributions either from data or through interaction with energy models. While these methods enable efficient sampling, their transferability across molecular systems is often limited. In this work, we show that incorporating auxiliary sources of information can improve the data efficiency and generalization of transferable implicit transfer operators (TITO) for molecular dynamics. We find that coarse-grained TITO models are substantially more data-efficient than Boltzmann emulators, and that incorporating protein language model (PLM) embeddings further improves out-of-distribution generalization. Our approach, PLaTITO, achieves state-of-the-art performance on equilibrium sampling benchmarks for out-of-distribution protein systems, including fast-folding proteins. We further study the impact of additional conditioning signals---such as structural embeddings, temperature, and large-language-model-derived embeddings---on model performance.

Abstract:
Federated Learning (FL) with LowRank Adaptation (LoRA) has become a standard for privacy-preserving LLM fine-tuning. However, existing personalized methods predominantly operated under a restrictive Flat-Model Assumption: they addressed client-sidestatistical heterogeneitybut treated the model as a monolithic block, ignoring thefunctional heterogeneityacross LLM layers. We argue that these two statistical (horizontal) and functional (vertical) dimensions, areorthogonal in source yet coupled in interaction, implying that the optimal depth of parameter sharing is functionally dependent on client similarity. To address this, we proposeFedTreeLoRA, a framework employing tree-structured aggregation for fine-grained, layer-wise alignment. By dynamically constructing an aggregation hierarchy, FedTreeLoRA allows clients to share broad consensus on shallow ’trunks‘ while progressively specializing on deep ‘branches'. Experiments on NLU and NLG benchmarks demonstrate that FedTreeLoRA significantly outperforms state-of-the-art methods by effectively reconciling generalization and personalization.

Abstract:
Chainof-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead for inference. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via a mixed-ratio supervised fine-tuning (SFT), teaching it to follow a spectrum of compression budgets and providing a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO) to explicitly incentivize question-solving ability under lower budgets by a hierarchical reward. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods. Our source codes are released in the Supplementaries.

Abstract:
While VisionLanguage-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bringmycup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we proposeVisual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.

Abstract:
The Shapley value, and its broader family of semivalues, has received much attention in various attribution problems. A fundamental and long-standing challenge is their efficient approximation, since exact computation generally requires an exponential number of utility queries in the number of players n. To meet the challenges of large-scale applications, we explore the limits of efficiently approximating semi-values under a \Theta(n) space constraint. Building upon a vector concentration inequality, we establish a theoretical framework that enables sharper asymptotic query complexities for existing unbiased randomized algorithms. Within this framework, we systematically develop a linear-space algorithm that requires O(\fracn\epsilon^2\log\frac1\delta) utility queries to ensure P(\\|\hat\boldsymbol\phi-\boldsymbol\phi\\|\_2\geq\epsilon)\leq \delta for all commonly used semi-values. In particular, our framework naturally bridges OFA, unbiased kernelSHAP, SHAP-IQ and the regression-adjusted approach, and definitively characterizes when paired sampling is beneficial. Moreover, our algorithm allows explicit minimization of the approximation variance \mathbbE[\\|\hat\boldsymbol\phi-\boldsymbol\phi\\|_2^2] for each specific utility function. Accordingly, we introduce the first adaptive, linear-time, linear-space randomized algorithm, Adalina, that theoretically achieves improved approximation variance. All of our theoretical findings are experimentally validated.

Abstract:
Dynamic Data selection aims to accelerate training by prioritizing informative samples during online training. However, existing methods typically rely on taskspecific handcrafted metrics or static/snapshot-based criteria to estimate sample importance, limiting scalability across learning paradigms and making it difficult to capture the evolving utility of data throughout training. To address this challenge, we propose Data Agent, an end-to-end dynamic data selection framework that formulates data selection as a training-aware sequential decision-making problem. The agent learns a sample-wise selection policy that co-evolves with model optimization, guided by a composite reward that integrates loss-based difficulty and confidence-based uncertainty signals. The reward signals capture complementary objectives of optimization impact and information gain, together with a tuning-free adaptive weighting mechanism that balances these signals over training. Extensive experiments across a wide range of datasets and architectures demonstrate that Data Agent consistently accelerates training while preserving or improving performance, e.g., reducing costs by over 50% on ImageNet-1k and MMLU with lossless performance. Moreover, its dataset-agnostic formulation and modular reward make it plug-and-play across tasks and scenarios, e.g., robustness to noisy datasets, highlighting its potential in real-world scenarios.

Abstract:
Activation steering promises to be an extremely parameterefficient form of adaptation, but its effectiveness depends on critical design choices---such as intervention location and parameterization---that currently rely on empirical heuristics rather than a principled foundation. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically-backed and highly expressive intervention site. We further explain why certain intervention locations outperform others and show that weight updates and activation updates play distinct, complementary functional roles. This analysis motivates a new approach---joint adaptation---that trains in both spaces simultaneously. Our post-block steering method achieves accuracy within 0.4%-1.5% of full-parameter tuning while training only 0.04% of model parameters, consistently outperforming prior activation steering methods such as ReFT and PEFT approaches including LoRA while using significantly fewer parameters. Finally, we show that joint adaptation often surpasses the performance ceilings of weight and activation updates in isolation, introducing a new paradigm for efficient model adaptation.

Abstract:
Offline reinforcement learning (RL) offers a powerful paradigm for datadriven control. Compared to model-free approaches, offline model-based RL (MBRL) explicitly learns a world model from a static dataset and uses it as a surrogate simulator, improving data efficiency and enabling potential generalization beyond the dataset support. However, most existing offline MBRL methods follow a two-stage training procedure: first learning a world model by maximizing the likelihood of the observed transitions, then optimizing a policy to maximize its expected return under the learned model. This objective mismatch results in a world model that is not necessarily optimized for effective policy learning. Moreover, we observe that policies learned via offline MBRL often lack robustness during deployment, and small adversarial noise in the environment can lead to significant performance degradation. To address these, we propose a framework that dynamically adapts the world model alongside the policy under a unified learning objective aimed at improving robustness. At the core of our method is a maximin optimization problem, which we solve by innovatively utilizing Stackelberg learning dynamics. We provide theoretical analysis to support our design and introduce computationally efficient implementations. We benchmark our algorithm on twelve noisy D4RL MuJoCo tasks and three stochastic Tokamak Control tasks, demonstrating its state-of-the-art performance.

Abstract:
Large language model (LLM) serving demands low latency and high throughput, but high load variability leads to significant GPU utilization. In this paper, we identify a synergetic but overlooked opportunity to coserve latency-critical online requests alongside latency-tolerant offline tasks, which existing systems fail to exploit due to their coarse-grained resource management and interference. We present ConServe, a co-serving system that enables fine-grained resource sharing through latency-aware token-level scheduling, sub-iteration layer-wise preemption, and incremental KV-cache management. These mechanisms allow offline execution to fill millisecond-scale GPU idle time while preserving strict online latency guarantees. Across real-world workloads with Llama-3.1 and Qwen-2.5 models, ConServe improves throughput by 2.2× on average and reduces online tail latency by 2.9× over state-of-the-art systems.

Abstract:
Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to nonstandardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.

Abstract:
Integrating language models (LMs) in healthcare systems holds great promise for improving medical workflows and decisionmaking. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained in high-resource languages, making them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages, posing significant challenges for deploying them in global healthcare contexts where linguistic diversity is key. In this work, we present \textscClinic, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. \name systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all the major continents), and encompassing a wide array of critical healthcare topics like disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, \name lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.

Abstract:
Semidefinite programs (SDPs) are a powerful framework for convex optimization and for constructing strong relaxations of hard combinatorial problems. However, solving large SDPs can be computationally expensive, motivating the use of machine learning models as fast computational surrogates. Graph neural networks (GNNs) are a natural candidate in this setting due to their sparsityawareness and ability to model variable-constraint interactions. In this work, we study what expressive power is sufficient to recover optimal SDP solutions. We first prove negative results showing that standard GNN architectures fail on recovering linear SDP solutions. We then identify a more expressive architecture that captures the key structure of SDPs and can, in particular, emulate the updates of a standard first-order solver. Empirically, on both synthetic and SDPLIB benchmarks of various classes of SDPs, this more expressive architecture achieves consistently lower prediction error and objective gap than theoretically weaker baselines. Finally, using the learned high-quality predictions to warm-start the first-order solver yields practical speedups of up to 80%.

Abstract:
Standard Bayesian Optimization (BO) assumes uniform smoothness across the search space—an assumption violated in multiregime problems such as molecular conformation search through distinct energy basins or drug discovery across heterogeneous molecular scaffolds. A single GP either oversmooths sharp transitions or hallucinates noise in smooth regions, yielding miscalibrated uncertainty. We propose RAMBO, a Dirichlet Process Mixture of Gaussian Processes that automatically discovers latent regimes during optimization, each modeled by an independent GP with locally-optimized hyperparameters. We derive collapsed Gibbs sampling that analytically marginalizes latent functions for efficient inference, and introduce adaptive concentration parameter scheduling for coarse-to-fine regime discovery. Our acquisition functions decompose uncertainty into intra-regime and inter-regime components. Experiments on synthetic benchmarks and real-world applications—including molecular conformer optimization, virtual screening for drug discovery, and fusion reactor design—demonstrate consistent improvements over state-of-the-art baselines on multi-regime objectives.

Abstract:
Omnimodal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks verify the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT adds 4.85M parameters while still achieving lower latency than training-free baselines such as OmniZip. With only 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the full-token model on several tasks.

Abstract:
Diffusion Language Models (DLMs) have recently achieved significant success due to their anyorder generation capabilities. However, existing inference methods typically rely on local, immediate-step metrics—such as confidence or entropy—which inherently lack a more reliable perspective, leading to sub-optimal generation quality. To address this, we proposeCoherentContextualDecoding (CCD), a novel inference framework built upon two core innovations. First, CCD bypasses the potential bias of the single context to leverage historical contexts for approximating the marginal distribution of token prediction, leading to better sequence coherence and the early rejection of sub-optimal paths. More importantly, we demonstrate that this mechanism is theoretically equivalent to modeling the consistency of historical steps via the conditional mutual information between contexts and token predictions. Finally, CCD achieves significantly milder performance degradation under highly parallel decoding scenarios compared to baselines. Empirically, our method achieves a simultaneous enhancement in both inference speed and performance across diverse benchmarks on Dream and LLaDA.

Abstract:
Textto-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation, featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and universal breakdown in musical pitch control.

Abstract:
Recent advances in reinforcement learning for code generation have made robust environments essential to prevent reward hacking. As LLMs increasingly serve as evaluators in codebased RL, their ability to detect reward hacking remains understudied. In this paper, we propose a novel taxonomy of reward exploits spanning across 54 categories and introduce TRACE (Testing Reward Anomalies in Code Environments), a synthetically curated and human-verified benchmark containing 517 testing trajectories. Unlike prior work that evaluates reward hack detection in isolated classification scenarios, we contrast these evaluations with a more realistic, contrastive anomaly detection setup on TRACE. Our experiments reveal that models capture reward hacks more effectively in contrastive settings than in isolated classification settings, with GPT-5.2 with highest reasoning mode achieving the best detection rate at 63%, up from 45% in isolated settings on TRACE. Building on this insight, we demonstrate that state-of-the-art models struggle significantly more with semantically contextualized reward hacks compared to syntactically contextualized ones. We further conduct qualitative analyses of model behaviors, as well as ablation studies showing that the ratio of benign to hacked trajectories and analysis cluster sizes substantially impact detection performance. We release the benchmark and evaluation harness to enable the community to expand TRACE and evaluate their models.

Abstract:
Multiagent systems (MAS) are increasingly capable of tackling complex real-world tasks, yet their reliance on inter-agent coordination, tool use, and long-horizon reasoning makes error recognition particularly challenging. Minor errors can propagate across agents, escalating into task failures while producing long, intertwined execution trajectories that impose significant costs for both human developers and automated systems to debug and analyze. Our key insight is that, despite surface differences in failure trajectories (e.g., logs), MAS errors often recur with similar structural patterns. This paper presents CORRECT, the first lightweight, training-free framework that leverages an online cache of distilled error schemata to recognize and transfer knowledge of failure structures across new requests. This cache-based reuse allows LLMs to perform targeted error localization at inference time, avoiding the need for expensive retraining while adapting to dynamic MAS deployments in subseconds. To support rigorous study in this domain, we also introduce CORRECT-Error, a large-scale dataset of over 2,000 annotated trajectories collected through a novel error-injection pipeline guided by real-world distributions, and further validated through human evaluation to ensure alignment with natural failure patterns. Experiments across seven diverse MAS applications show that CORRECT improves step-level error localization up to 19.8% over existing advances while at near-zero overhead, substantially narrowing the gap between automated and human-level error recognition.

Abstract:
Functionguided protein design is a crucial task with significant applications in drug discovery and enzyme engineering. However, the field lacks a unified and comprehensive evaluation framework. Current models are assessed using inconsistent and limited subsets of metrics, which prevents fair comparison and a clear understanding of the relationships between different evaluation criteria. To address this gap, we introducePDFBench, the first comprehensive benchmark for function-guided de novo protein design. Our benchmark systematically evaluates eight state-of-the-art models on 16 metrics across two key settings: description-guided design, for which we repurpose the Mol-Instructions dataset, originally lacking quantitative benchmarking, and keyword-guided design, for which we introduce a new test set, SwissTest, created with a strict datetime cutoff to ensure data integrity. By benchmarking across a wide array of metrics and analyzing their correlations,PDFBenchenables more reliable model comparisons and provides key insights to guide future research.

Abstract:
While passive agents merely follow instructions, proactive agents align with higherlevel objectives, such as assistance and safety by continuously monitoring the environment to determine when and how to act. However, developing proactive agents is hindered by the lack of specialized resources. To address this, we introduceProAct-75, a benchmark designed to train and evaluate proactive agents across diverse domains, including assistance, maintenance, and safety monitoring. Spanning 75 tasks, our dataset features 91,581 step-level annotations enriched with explicit task graphs. These graphs encode step dependencies and parallel execution possibilities, providing the structural grounding necessary for complex decision-making. Building on this benchmark, we proposeProAct-Helper, a reference baseline powered by a Multimodal Large Language Model (MLLM) that grounds decision-making in state detection, and leveraging task graphs to enable entropy-driven heuristic search for action selection, allowing agents to execute parallel threads independently rather than mirroring the human's next step. Extensive experiments demonstrate that ProAct-Helper outperforms strong closed-source models, improving trigger detection mF1 by 6.21%, saving 0.25 more steps in online one-step decision, and increasing the rate of parallel actions by 15.58%. Code is available at https://github.com/only4anonymous/ProAct-Helper.git

Abstract:
Recent progress in largescale flow and diffusion models raised two fundamental algorithmic challenges: (i) control-based reward adaptation of pre-trained flows, and (ii) integration of multiple models, i.e., flow merging. While current approaches address them separately, we introduce a unifying probability-space framework that subsumes both as limit cases, and enables reward-guided flow merging, allowing principled, task-aware combination of multiple pre-trained flows (e.g., merging priors while maximizing drug-discovery utilities). Our formulation renders possible to express a rich family of operators over generative models densities, including intersection (e.g., to enforce safety), union (e.g., to compose diverse models), interpolation (e.g., for discovery), their reward-guided counterparts, as well as complex logical expressions via generative circuits. Next, we introduce Reward-Guided Flow Merging (RFM), a mirror-descent scheme that reduces reward-guided flow merging to a sequence of standard fine-tuning problems. Then, we provide first-of-their-kind theoretical guarantees for reward-guided and pure flow merging via RFM. Ultimately, we showcase the capabilities of the proposed method on illustrative settings providing visually interpretable insights, and apply our method to high-dimensional de-novo molecular design and low-energy conformer generation.

Abstract:
Graph neural networks have been widely used in Boolean satisfiability (SAT) tasks to learn structural information from SAT formulas. The goal of these studies is to solve SAT instances or to enhance SAT solvers, including tasks such as unsatcore prediction. However, most existing approaches model a SAT formula as a bipartite graph or a directed acyclic graph, which are less expressive in capturing higher-order interactions among literals and clauses. Moreover, these approaches are limited in modeling intrinsic polarity-related properties of SAT, such as the complementary relationship between the positive and negative literals of a variable. To address these limitations, we propose a polarity-aware representation learning framework over clause–literal hypergraphs. We model SAT formulas as clause–literal hypergraphs augmented with a clause incidence graph to capture higher-order structural interactions. We then introduce a polarity-aware decomposed mechanism that separates variable representations into polarity invariant and equivariant components, explicitly modeling the relationship between positive and negative literals, with the resulting literal representations propagated along the hypergraph structure. We further incorporate a polarity-inversion consistency regularization to reinforce polarity-consistent representations during training. Experimental results on multiple SAT datasets demonstrate the effectiveness of the proposed approach.

Abstract:
Zerothorder optimization has emerged as a promising approach for fine-tuning large language models on domain-specific data, particularly under differential privacy (DP) and memory constraints. While first-order methods have been extensively studied from a privacy perspective, the privacy analysis and algorithmic design for zeroth-order methods remain significantly underexplored. A critical open question concerns hidden-state DP analysis: although convergent privacy bounds are known for first-order methods, it has remained unclear whether similar guarantees can be established for zeroth-order methods. In this work, we provide an affirmative answer by proving a convergent DP bound for zeroth-order optimization. Our analysis generalizes the celebrated privacy amplification-by-iteration framework to the setting of smooth loss functions in zeroth-order optimization. Furthermore, it induces better DP zeroth-order algorithmic designs that are previously unknown to the literature.

Abstract:
When computing gradients, deep learning training algorithms typically treat the minibatch as a fundamental unit --- only returning batch-averaged gradients. Computing non-linear statistics of the mini-batch gradient distribution has traditionally been viewed as prohibitively expensive or requiring complex, custom implementations. We challenge this view by demonstrating that sequence-level architectures offer a natural testbed for prototyping algorithms based on per-example gradients. We show that staged programming languages like JAX enable generic manipulations of mini-batch gradient computations. We then build on Dangel et. al. (2019) to derive implementations of specific per-example or per-token operations with negligible computational or memory overhead. Finally, we leverage our findings to re-examine two nonlinear optimization operations. First, we analyze signSGD, showing that the optimal placement of the sign operation is critical to success and can be predicted via a simple signal-to-noise ratio argument. Second, we investigate per-example variations of the Adam preconditioner and find that, contrary to conventional wisdom, optimization is best served when the preconditioner is dominated by the mean squared of the gradient distribution rather than its variance. Overall our work shows that accessible per-example gradient information unlocks new avenues for algorithm analysis and design.

Abstract:
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to apply Group Relative Policy Optimization (GRPO) under multireward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.

Authors: Mubashara Akhtar, Anka Reuel, Prajna Soni, Sanchit Ahuja, Pawan Sasanka Ammanamanchi, Ruchit Rawal, Vilém Zouhar, Srishti Yadav, Chenxi Whitehouse, Dayeon Ki, Jennifer Mickel, Leshem Choshen, Marek Šuppa, Jan Batzner, Jenny Chim, Jeba Sania, Yanan Long, Hossein A. Rahmani, Christina Knight, Yiyang Nan, Jyoutir Raj, Yu Fan, Shubham Singh, Subramanyam Sahoo, Eliya Habba, Usman Gohar, Siddhesh Pawar, Robert Scholz, Arjun Subramonian, Jingwei Ni, Mrinmaya Sachan, Mykel Kochenderfer, Sanmi Koyejo, Stella Biderman, Zeerak Talat, Avijit Ghosh, Irene Solaiman

Abstract:
Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the bestperforming models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.

Abstract:
Foundation models are increasingly being deployed in contexts where understanding the uncertainty of their outputs is critical to ensuring responsible deployment. While Bayesian methods offer a principled approach to uncertainty quantification, their computational overhead renders their use impractical for training or inference at foundation model scale. Stateof-the-art models achieve parameter counts in the trillions through carefully engineered sparsity including Mixture-of-Experts (MoE) layers. In this work, we demonstrate calibrated uncertainty at scale by introducing Variational Mixture-of-Experts Routing (VMoER), a structured Bayesian approach for modelling uncertainty in MoE layers. VMoER confines Bayesian inference to the expert-selection stage which is typically done by a deterministic routing network. We instantiate VMoER using two inference strategies: amortised variational inference over routing logits and inferring a temperature parameter for stochastic expert selection. Across tested foundation models, VMoER improves routing stability under noise by 38%, reduces calibration error by 94%, and increases out-of-distribution AUROC by 12%, while incurring less than 1% additional FLOPs. These results suggest VMoER offers a scalable path toward robust and uncertainty-aware foundation models.

Abstract:
Computerusing agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance levels (25.9%-50.8% success rates). All trajectories are expertly annotated through carefully designed protocols, with rigorous quality control to ensure reliability and practical applicability. (3) Comprehensive Analysis and Insights: Through extensive experiments across 7 vision-language models and 3 prompt templates, we reveal critical limitations of current CUA RMs, including insufficient visual reasoning capabilities, knowledge deficiencies, and the superiority of general VLMs over specialized CUA models for reward evaluation. (4) Unanimous Prompt Ensemble (UPE): Based on the insights from our comprehensive analysis, we propose UPE, a novel ensemble method that significantly enhances reward model reliability through strict unanimous voting and strategic prompt-template configurations. UPE achieves 88.0% precision and 95.3% NPV for ORM, and 83.1% precision and 86.2% NPV for PRM, substantially outperforming single VLMs and traditional ensemble approaches. In a short, this work introduces both a comprehensive benchmark and a novel ensemble method that substantially enhances CUA reward model reliability.

Abstract:
Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformerbased models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where an uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and long-term structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.

Abstract:
With large language models (LLMs) now performing strongly across diverse tasks, there is growing demand for them to personalize outputs for individual users. Personalization is typically framed as an additional layer on top of a base NLP task, requiring model responses to meet userspecific needs while still accomplishing the underlying task. From a token-level perspective, different tokens in a response contribute to personalization to varying degrees. Tokens with higher personalization relevance should therefore receive greater emphasis when developing personalized LLMs. However, accurately estimating such personalization degrees remains challenging. To address this challenge, we propose PerContrast, a self-contrast method that estimates each output token’s dependence on user-specific information through causal intervention. Building on this mechanism, we develop the PerCE loss, which adaptively upweights tokens with higher estimated personalization degrees during training via a bootstrap procedure, enabling the model to alternate between estimating and optimizing these tokens. Experiments on multiple LLMs demonstrate that PerCE substantially improves personalization performance with minimal additional cost, achieving average gains of over 10% and up to 68.04% on the LongLaMP dataset, along with strong cross-task and cross-scenario transferability. These results highlight the importance of token-level personalization modeling and establish token-aware training as a simple yet effective paradigm for advancing personalized LLMs.

Abstract:
Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, humaninterpretable features and intervene on model behavior. Recently, as diffusion language models (DLMs) have become an increasingly powerful and promising alternative to autoregressive LLMs, it is essential to develop tailored mechanistic interpretability tools for this emerging class of models. In this work, we presentDLM-Scope, the first SAE-based interpretability framework for DLMs, and demonstrate that trained Top-K SAEs can faithfully extract sparse, interpretable features. Notably, we find that inserting SAEs affects DLMs differently from autoregressive LLMs: while SAE insertion in LLMs typically incurs a loss penalty, in DLMs it can reduce cross-entropy loss when applied to early layers, a phenomenon absent or markedly weaker in LLMs. Additionally, SAE features in DLMs enable more effective diffusion-time interventions, often outperforming LLM steering. Moreover, we pioneer new SAE-based research directions for DLMs: we show that SAEs provide useful signals for DLM decoding order, and that SAE features remain stable during DLM post-training. Overall, our work establishes a foundation for mechanistic interpretability in DLMs and highlights the potential of applying SAEs to DLM-related tasks and algorithms.

Abstract:
Structurebased and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduceContrastiveGeometricLearning forUnified ComputationalDrug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for pre-defined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves competitive zero-shot virtual screening performance, substantially outperforms existing methods on a challenging target fishing task, and demonstrates state-of-the-art ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.

Abstract:
Activations have become the primary memory bottleneck in largebatch LLM training. However, existing compression methods fail to exploit structure information of activations, resulting in slow convergence or limited compression. To address this, we bridge the relationship between the algorithm’s fast convergence and the requirements for subspace projection and show a compression should yield an unbiased estimate of the original activation with low varianc. We proposePrincipal-Random Subspace for LLMActivationCompression (PRAC), which novelly decomposes activations into two components: a principal subspace captured via SVD to retain dominant information, and a random subspace sampled from the orthogonal complement to approximate the tail. By introducing a precise scaling factor, we prove that PRAC yields an unbiased gradient estimator with \emphminimum variance under suitable conditions. Extensive experiments on pre-training and fine-tuning tasks demonstrate that PRAC achieves up to 36% total memory reduction with negligible performance degradation and minimal computational cost.

Abstract:
Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixtureof-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. To address this issue, we propose Rollout Routing Replay (R3), a novel and effective method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming strong baselines. R3 is orthogonal to most policy optimization algorithm improvements, allowing it to be used in conjunction with them. We believe this work can offer a new solution for stabilizing RL in MoE model.

Abstract:
Sparse autoencoders (SAEs) have emerged as a promising method for interpreting neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume that features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: linear models cannot distinguish whether "Starbucks" arises from the composition of "star" and "coffee" features or merely their cooccurrence. This forces SAEs to allocate monolithic features for compound concepts rather than decomposing them into interpretable constituents. We introduce PolySAE, which extends the SAE decoder with higher-order terms to model feature interactions while preserving the linear encoder essential for interpretability. Through low-rank tensor factorization on a shared projection subspace, PolySAE captures pairwise and triple feature interactions with small parameter overhead (3% on GPT2). Across four language models and three SAE variants, PolySAE achieves an average improvement of approximately 8% in probing F1 while maintaining comparable reconstruction error, and produces 2-10× larger Wasserstein distances between class-conditional feature distributions. Critically, learned interaction weights exhibit negligible correlation with co-occurrence frequency (r = 0.06 vs. r = 0.82 for SAE feature covariance), suggesting that polynomial terms capture compositional structure, such as morphological binding and phrasal composition, largely independent of surface statistics.

Abstract:
We study the discrete Bertrand pricing game with a nonincreasing demand function. The game has n \ge 2 players who simultaneously choose prices from the set 1/k, 2/k, \ldots, 1, where k\in\mathbbN. The player who sets the lowest price captures the entire demand; if multiple players tie for the lowest price, they split the demand equally. We study the Bertrand paradox, where classical theory predicts low prices, yet real markets often sustain high prices. To understand this gap, we analyze a repeated-game model in which firms set prices using no-regret learners. Our goal is to characterize the equilibrium outcomes that can arise under different no-regret learning guarantees. We are particularly interested in questions such as whether no-external-regret learners can converge to undesirable high-price outcomes, and how stronger guarantees such as no-swap regret shape the emergence of competitive low-price behavior. We address these and related questions through a theoretical analysis, complemented by experiments that support the theory and reveal surprising phenomena for no-swap regret learners.

Abstract:
Model depth is a doubleedged sword in deep learning: deeper models achieve higher accuracy but require higher computational cost. To efficiently train models at scale, progressive training -- an effective strategy where model capacity scales up during training, has emerged to significantly reduce computation with little to none performance degradation. In this work, we study the depth expansion of large-scale models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, learning rate schedule, and timing of model expansion. Specifically, we propose zero/one-layer progressive training for the optimal tradeoff between computation and loss. For example, zero/one-layer progressive training on GPT2 can save \approx 80% compute, or equivalently accelerate by \approx 5×, and achieve a loss comparable to a fully trained 60-layer model with 7B parameters.

Abstract:
Realworld autonomous planning requires coordinating tightly coupled constraints where a single decision dictates the feasibility of all subsequent actions. However, existing benchmarks predominantly feature loosely coupled constraints solvable through local greedy decisions and rely on idealized data, failing to capture the complexity of extracting parameters from dynamic web environments. We introduce WorldTravel, a benchmark comprising 150 real-world travel scenarios across 5 cities that demand navigating an average of 15+ interdependent temporal and logical constraints. To evaluate agents in realistic deployments, we develop WorldTravel-Webscape, a multi-modal environment featuring over 2,000 rendered webpages where agents must perceive constraint parameters directly from visual layouts to inform their planning. Our evaluation of 10 frontier models reveals a significant performance collapse: even the state-of-the-art GPT-5.2 achieves only 28.0% feasibility in text-only settings, which plummets to 3.4% in multi-modal environments. We identify a critical Perception-Action Gap and a Planning Horizon threshold at approximately 10 constraints where model reasoning consistently fails, suggesting that perception and reasoning remain independent bottlenecks. These findings underscore the need for next-generation agents that unify high-fidelity visual perception with long-horizon reasoning to handle brittle real-world logistics.

Abstract:
Multimodal Large Language Models (MLLMs) have significantly improved the performance of various tasks, but continue to suffer from visual hallucinations, a critical issue where generated responses contradict visual evidence. While Direct Preference Optimization (DPO) is widely used for alignment, its application to MLLMs often fails to capture finegrained semantic differences and encourages shortcut learning. To address these challenges, we propose Semantic Curriculum Preference Optimization (SCPO), a novel framework for MLLM alignment. SCPO employs a progressive, easy-to-hard curriculum built upon our Semantic Curriculum Preference Pairs dataset, which provides fine-grained semantic contrasts sorted by difficulty. This curriculum is trained with a dynamic reference model and a novel symmetric, bidirectional objective to facilitate simultaneous learning from both textual and visual preferences. To our knowledge, SCPO is the first framework to unify semantics, symmetry, and curriculum for MLLMs alignment, effectively mitigating visual hallucinations. Extensive experiments on LLaVA models across various scales and versions validate that SCPO demonstrates superior performance compared to baseline models on multiple hallucination benchmarks, reducing the hallucination rate by up to 62.9%. Moreover, evaluations on generalized benchmarks show that SCPO improves factuality while preserving general capabilities, with its performance remaining stable across general vision-language benchmarks.

Abstract:
Despite the fact that experimental neural scaling laws have substantially guided empirical progress in largescale machine learning, no existing theory can quantitatively predict the exponents of these important laws for any modern LLM trained on any natural language dataset. We provide the first such theory in the case of data-limited scaling laws. We isolate two key statistical properties of language that \it alone can predict neural scaling exponents: (i) the decay of pairwise token correlations with time separation between token pairs, and (ii) the decay of the next-token conditional entropy with the length of the conditioning context. We further derive a simple formula in terms of these statistics that predicts data-limited neural scaling exponents from first principles \it without any free parameters or synthetic data models. Our theory exhibits a remarkable match with experimentally measured neural scaling laws obtained from training GPT-2 and LLaMA style models from scratch on two qualitatively different benchmarks, TinyStories and WikiText.

Abstract:
Multimodal large language models (MLLMs) have markedly expanded the competence of graphical userinterface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents the system from accurately performing pointer-level actions such as clicking or dragging. To address it, we introduce GUI-Spotlight -- a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8% accuracy, surpassing V2P-7B (50.6% with 9.6M training samples) and GTA-1-7B (50.1% with 1.56M training samples).

Abstract:
Modern neural networks have shown promise for solving partial differential equations over surfaces, often by discretizing the surface as a mesh and learning with a meshaware graph neural network. However, graph neural networks suffer from oversmoothing, where a node's features become increasingly similar to those of its neighbors. Unitary graph convolutions, which are mathematically constrained to preserve smoothness, have been proposed to address this issue. Despite this, in many physical systems, such as diffusion processes, smoothness naturally increases and unitarity may be overconstraining. In this paper, we systematically study the smoothing effects of different GNNs for dynamics modeling and prove that unitary convolutions hurt performance for such tasks. We propose relaxed unitary convolutions that balance smoothness preservation with the natural smoothing required for physical systems. We also generalize unitary and relaxed unitary convolutions from graphs to meshes. In experiments on PDEs such as the heat and wave equations over complex meshes and on weather forecasting, we find that our method outperforms several strong baselines, including mesh-aware transformers and equivariant neural networks. Our code is available at https://anonymous.4open.science/r/rayleigh_analysis-BD52/README.md .

Abstract:
A fundamental bottleneck in humanAI collaboration is the "intention expression gap", the difficulty for humans to effectively convey complex, high-dimensional thoughts to AI. This challenge often traps users in inefficient trial-and-error loops and is exacerbated by the diverse expertise levels of users. We reframe this problem from passive instruction following to a Socratic collaboration paradigm, proposing an agent that actively probes for information to resolve its uncertainty about user intent. we name the proposed agent Nous, trained to acquire proficiency in this inquiry policy. The core mechanism of Nous is a training framework grounded in the first principles of information theory. Within this framework, we define the information gain from dialogue as an intrinsic reward signal, which is fundamentally equivalent to the reduction of Shannon entropy over a structured task space. This reward design enables us to avoid reliance on costly human preference annotations or external reward models. To validate our framework, we develop an automated simulation pipeline to generate a large-scale, preference-based dataset for the challenging task of scientific diagram generation. Comprehensive experiments, including ablations, subjective and objective evaluations, and tests across user expertise levels, demonstrate the effectiveness of our proposed framework. Nous achieves leading efficiency and output quality, while remaining robust to varying user expertise. In conclusion, our research provides a systematic methodology and a new perspective for addressing the issue of ambiguous intentions in complex human-machine collaboration.

Abstract:
Spatial transcriptomics offers spatially resolved gene expression profiling within tissue sections, but its cost and limited throughput hinder largescale deployment. To extend this capability to routine practice, recent computational methods aim to infer spatial gene expression directly from ubiquitous hematoxylin and eosin-stained histology slides. However, most existing models assume Cartesian or geometry-agnostic locality, despite the hexagonal sampling of widely used spot-array platforms, and point-wise regression objectives often yield over-smoothed gene expression profiles, obscuring gene-specific spatial heterogeneity. To address these, we propose HEXST, a geometry-aligned Transformer for spatial gene expression prediction from histology. HEXST operates directly on hexagonal spot coordinates to enable efficient local-to-global contextual modeling via tailored shifted-window attention mechanism and hexagonal rotary positional encoding. To enhance gene-wise spatial contrast, HEXST complements point-wise regression with a contrast-sensitive differential objective and transcriptomic priors from a pretrained single-cell foundation model during training. Across seven spatial transcriptomics datasets, HEXST consistently outperforms state-of-the-art models, providing accurate and robust spatial gene expression predictions while preserving gene-wise contrast and spatial heterogeneity.

Abstract:
Current chemical large language models (LLMs) predominantly rely on explicit Chainof-Thought (CoT) to solve complex reasoning problems. However, forcing nonverbal tacit chemical logic into discrete natural language imposes a fundamental ``modality mismatch,'' creating an artificial bottleneck for reasoning. To investigate this, we introduce LatentChem, a reasoning interface that decouples chemical logic from linguistic generation, enabling the model to process information via continuous thought vectors and dynamic perception. Our investigation reveals a pivotal emergent behavior: spontaneous internalization. When optimized for task success, the model voluntarily abandons verbose textual derivations in favor of implicit latent computation, suggesting that it autonomously identifies the continuous manifold as a more native substrate for chemical logic. This paradigm shift also proves to be a superior computational strategy: LatentChem achieves a 59.88% non-tie win rate against the strong CoT baseline on the rigorous ChemCoTBench, while delivering a broad 10.84× average speedup across all evaluated benchmarks. This empirically validates that chemical logic is inherently better modeled by continuous latent dynamics than by linear linguistic approximations.

Abstract:
We study highdimensional convex empirical risk minimization (ERM) under general non-Gaussian data designs. By heuristically extending the Convex Gaussian Min–Max Theorem (CGMT) to non-Gaussian settings, we derive an asymptotic min–max characterization of key statistics, enabling approximation of the mean \mu_\hat\theta and covariance C_\hat\theta of the ERM estimator \hat\theta. Specifically, under a concentration assumption on the data matrix and standard regularity conditions on the loss and regularizer, we show that for a test covariate x independent of the training data, the projection \hat\theta^\top x approximately follows the convolution of the (generally non-Gaussian) distribution of \mu_\hat\theta^\top x with an independent centered Gaussian variable of variance \mathrmtr\!\big(C_\hat\theta\,\mathbbE[xx^\top]\big). This result clarifies the scope and limits of Gaussian universality for ERMs. Additionally, we prove that any \mathcalC^2 regularizer is asymptotically equivalent to a quadratic form determined solely by its Hessian at zero and gradient at \mu_\hat\theta. Numerical simulations across diverse losses and models are provided to validate our theoretical predictions and qualitative insights.

Abstract:
Language models struggle to generalize beyond the context lengths seen during pretraining, limiting performance on longhorizon reasoning and retrieval. Continued pretraining on long-context data can mitigate this limitation, but it is prohibitively expensive due to the quadratic scaling of Attention with sequence length. In practice, most tokens do not require Global Attention over the entire sequence and can rely on local context. Based on this insight, we propose L2A, a sequence modeling layer that enables token-wise long-term conditional memory access by deciding when to invoke Global Attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens, where it matches standard long-context training within 1.5–3% while skipping Global Attention for ～80% of tokens and outperforming prior baselines. We also design custom Triton kernels to efficiently realize this token-wise conditional attention on GPUs, achieving up to ～2× improvements in training throughput and time-to-first-token over FlashAttention-2. Moreover, L2A enables post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50% with negligible performance loss.

Abstract:
Goalconditioned policies enable decision-making models to execute diverse behaviors based on specified goals, yet their downstream performance is often highly sensitive to the choice of instructions or prompts. To bypass the limitations of discrete text prompts, we formulate post-training adaptation as a latent control problem, where the goal embedding serves as a continuous control variable to modulate the behavior of a frozen policy. We propose Preference Goal Tuning (PGT), a framework that optimizes this latent control variable to align the induced trajectory distribution with task preferences. Unlike standard fine-tuning that updates policy parameters, PGT keeps the policy frozen and updates only the latent goal using a trajectory-level preference objective. This approach essentially searches for the optimal conditioning input that maximizes the likelihood of preferred behaviors while suppressing undesirable ones. We evaluate PGT on the Minecraft SkillForge benchmark across 17 tasks. With minimal data, PGT achieves average relative improvements of 72.0% and 81.6% on two foundation policies, consistently outperforming expert-crafted prompts. Crucially, by decoupling task alignment (latent goal) from physical dynamics (frozen policy), PGT surpasses full fine-tuning by 13.4% in out-of-distribution settings, demonstrating superior robustness and generalization.

Abstract:
Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multistep reasoning, opening new opportunities for automating optimization modeling. However, existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs --- In particular, we show that direct fine-tuning on traditionalnon-reflectivedatasets leads to limited gains. To fully leverage LRMs' inherent reasoning abilities, we proposeCALM(Corrective Adaptation with Lightweight Modification), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. In CALM, an expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6% of generated tokens, but generate high-quality data for soft adaptation through supervised fine-tuning. The adapted model is then further improved through reinforcement learning. Building on CALM, we developSTORM(Smart Thinking Optimization Reasoning Model), a 4B-parameter LRM that achieves a new state-of-the-art average accuracy of 68.9% across five popular optimization modeling benchmarks, matching the performance of a 671B LRM. These results demonstrate that dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path towards expert-level performance on challenging optimization modeling tasks.

Abstract:
Many scientific and geometric problems exhibit general linear symmetries, yet most equivariant neural networks are built for compact groups or simple vector features, limiting their reuse on matrixvalued data such as covariances, inertias, or shape tensors. We introduce Reductive Lie Neurons (ReLNs), an exactly \mathrmGL(n)-equivariant architecture that natively supports matrix-valued and Lie-algebraic features. ReLNs resolve a central stability issue for reductive Lie algebras by introducing a non-degenerate adjoint (conjugation)-invariant bilinear form, enabling principled nonlinear interactions and invariant feature construction in a single architecture that transfers across subgroups without redesign. We demonstrate ReLNs on algebraic tasks with \mathfraksl(3) and \mathfraksp(4) symmetries, Lorentz-equivariant particle physics, uncertainty-aware drone state estimation via joint velocity--covariance processing, learning from 3D Gaussian-splat representations, and EMLP double-pendulum benchmark spanning multiple symmetry groups. ReLNs consistently match or outperform strong equivariant and self-supervised baselines while using substantially fewer parameters and compute, improving the accuracy–efficiency trade-off and providing a practical, reusable backbone for learning with broad linear symmetries.

Abstract:
Modern LLMs are increasingly accessed via blackbox APIs, requiring users to transmit sensitive prompts, outputs, and fine-tuning data to external providers, creating a critical privacy risk at the API boundary. We introduce AlienLM, a deployable API-only privacy layer that protects text by translating it into an Alien Language via a vocabulary-scale bijection, enabling lossless recovery on the client side. Using only standard fine-tuning APIs, Alien Adaptation Training (AAT) adapts target models to operate directly on alienized inputs. Across four LLM backbones and seven benchmarks, AlienLM retains over 81% of plaintext-oracle performance on average, substantially outperforming random-bijection and character-level baselines. Under adversaries with access to model weights, corpus statistics, and learning-based inverse translation, recovery attacks reconstruct fewer than 0.22% of alienized tokens. Our results demonstrate a practical pathway for privacy-preserving LLM deployment under API-only access, substantially reducing plaintext exposure while maintaining task performance.

Abstract:
Training largescale generative models is resource-intensive and relies heavily on heuristic dataset weighting. We address two fundamental questions: Can we train Large Language Models (LLMs) modularly—combining small, domain-specific experts to match monolithic performance—and can we do so robustly for any data mixture, eliminating heuristic tuning? We present a theoretical framework for modular generative modeling where a set of pre-trained experts are combined via a gating mechanism. We define the space of normalized gating functions \mathcalG_1 and formulate the problem as a minimax game to find a single robust gate that minimizes divergence to the worst-case data mixture. We prove the existence of such a robust gate using Kakutani's fixed-point theorem and show that modularity acts as a strong regularizer, with generalization bounds scaling with the lightweight gate's complexity. Furthermore, we prove that this modular approach can theoretically outperform models retrained on aggregate data, with the gap characterized by the Jensen-Shannon Divergence. Finally, we introduce a scalable Stochastic Primal-Dual algorithm and a Structural Distillation method for efficient inference. Empirical results on synthetic and real-world datasets confirm that our modular architecture effectively mitigates gradient conflict and can outperform monolithic baselines.

Abstract:
Search agents powered by Large Language Models have demonstrated significant potential in tackling knowledgeintensive tasks. Reinforcement learning has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we proposeReSeek, a self-correcting framework enabling search agents to recover from erroneous search paths during an episode. By invoking a specialJUDGEaction, the agent can judge the information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Additionally, to mitigate the risk of data contamination in existing datasets, we introduceFictionalHot, a contamination-free benchmark requiring complex reasoning. Experiments show ReSeek significantly outperforms SOTA baselines in task success and path faithfulness.

Abstract:
Large language models are increasingly used to predict human preferences in both scientific and business endeavors, yet current approaches rely exclusively on analyzing model outputs without considering the underlying mechanisms. Using election forecasting as a test case, we introducemechanistic forecasting, a method that demonstrates that probing internal model representations offers a fundamentally differentand sometimes more effective--- approach to preference prediction. Examining over 24 million configurations across 7 models, 6 national elections, multiple persona attributes, and prompt variations, we systematically analyze how demographic and ideological information activates latent party-encoding components within the respective models. We find that leveraging this internal knowledge via mechanistic forecasting (opposed to solely relying on surface-level predictions) can improve prediction accuracy. The effects vary across demographic versus opinion-based attributes, political parties, national contexts, and models. Our findings demonstrate that the latent representational structure of LLMs contains systematic, exploitable information about human preferences, establishing a new paradigm for using language models in social science prediction tasks.

Abstract:
Recent advances in visionlanguage-action (VLA) models for robotics have highlighted the importance of reliable uncertainty quantification in sequential tasks. However, assessing and improving calibration in such settings remains mostly unexplored, especially when only partial trajectories are observed. In this work, we formulatesequential calibrationfor episodic tasks, where task-success confidence is produced along an episode, while success is determined at the end of it. We introduce a sequential extension of the Brier score and show that, for binary outcomes, its risk minimizer coincides with the VLA policy’s value function. This connection bridges uncertainty calibration and reinforcement learning, enabling the use of temporal-difference (TD) value estimation as a principled calibration mechanism over time. We empirically show that TD calibration improves performance relative to the state-of-the-art on simulated and real-robot data. Interestingly, we show that when calibrated using TD, the VLA's single-step action probabilities can yield competitive uncertainty estimates, in contrast to recent findings that employed different calibration techniques.

Abstract:
Webbased agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), an evaluation for studying how persuasion techniques misguide autonomous web agents on realistic tasks. Across six frontier models, agents are susceptible to prompt injection in 25% of tasks on average (13% for GPT-5 to 43% for DeepSeek-R1), with small interface or contextual changes often doubling success rates and revealing systemic, psychologically driven vulnerabilities in web-based agents. We also provide a modular social-engineering injection framework with controlled experiments on high-fidelity website clones, allowing for further benchmark expansion.

Abstract:
Latentbased watermarks, integrated into the generation process of latent diffusion models (LDMs), simplify detection and attribution of generated images. However, recent black-box forgery attacks, where an attacker needs at least one watermarked image and black-box access to the provider’s model, can embed the provider’s watermark into images not produced by the provider, posing outsized risk to provenance and trust. We propose SemBind, the first defense framework for latent-based watermarks that resists black-box forgery by binding latent signals to image semantics via a learned semantic masker. Trained with contrastive learning, the masker yields near-invariant codes for the same prompt and near-orthogonal codes across prompts; these codes are reshaped and permuted to modulate the target latent before any standard latent-based watermark. SemBind is generally compatible with existing latent-based watermarking schemes and keeps image quality essentially unchanged, while a simple mask-ratio parameter offers a tunable trade-off between anti-forgery strength and robustness. Across four mainstream latent-based watermark methods, our SemBind-enabled anti-forgery variants markedly reduce false acceptance under black-box forgery while providing a controllable robustness--security balance.

Abstract:
Continuous prediction in complex environments is critical for Unmanned Aerial Vehicle (UAV). However, the existing VisionLanguage Navigation (VLN) models follows the dead-reckoning, which iteratively predicts the next waypoint and updates its position, thereby constructing the complete trajectory. Then, such stepwise manner will inevitably lead to accumulated errors of position over time, resulting in misalignment between internal belief and objective coordinates, which is known as ``state drift'' and ultimately compromises the subsequent trajectory prediction. Drawing inspiration from classical control theory, we propose to correct for errors by formulating the continuous prediction as a recursive Bayesian state estimation problem. In this paper, we design NeuroKalman, a novel framework that decouples navigation into two complementary processes: a Prior Prediction, based on motion dynamic,s and a Likelihood Correction, from historical observation. We first mathematically associate Kernel Density Estimation of the measurement likelihood with the attention-based retrieval mechanism, which then allows the system to rectify the latent representation using retrieved historical anchors without gradient updates. Comprehensive experiments on TravelUAV benchmark demonstrate that, with only 10% of the full training data fine-tuning, our method clearly outperforms strong baselines and regulates drift accumulation.

Abstract:
Inferring spatial transcriptomics (ST) from histology enables scalable histogenomic profiling, yet current methods are largely restricted to singletissue models. This fragmentation fails to leverage biological principles shared across cancer types and hinders application to data-scarce scenarios. While pan-cancer training offers a solution, the resulting heterogeneity challenges monolithic architectures. To bridge this gap, we introduceMoLF(Mixture-of-Latent-Flow), a generative model for pan-cancer histogenomic prediction. MoLF leverages a conditional Flow Matching objective to map noise to the gene latent manifold, parameterized by a Mixture-of-Experts (MoE) velocity field. By dynamically routing inputs to specialized sub-networks, this architecture effectively decouples the optimization of diverse tissue patterns. Our experiments demonstrate that MoLF establishes a new state-of-the-art, consistently outperforming both specialized and foundation model baselines on pan-cancer benchmarks. Furthermore, MoLF exhibits zero-shot generalization to cross-species data, suggesting it captures fundamental, conserved histo-molecular mechanisms.

Abstract:
Safety alignment remains brittle under domain shift and noisy preference supervision. Existing robust alignment methods predominantly focus on data uncertainty in alignment data, while being less effective at addressing failures caused by optimizationinduced fragility. In this work, we revisit robustness for safety alignment from an optimization geometry perspective, highlighting optimization-induced fragility as a complementary factor to data-space uncertainty. To address this, we propose ShaPO, a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control. We instantiate ShaPO at two levels: Token-level ShaPO stabilizes likelihood-based surrogate optimization, while Reward-level ShaPO enforces reward-consistent optimization and provides improved robustness under noisy supervision. Across diverse safety benchmarks and noisy preference settings, ShaPO consistently improves safety robustness over popular preference optimization methods. Moreover, ShaPO composes cleanly with data-robust objectives, yielding additional gains and empirically supporting the proposed optimization-geometry perspective. Our results highlight optimization geometry as a complementary and actionable axis for robust LLM safety alignment. The code is available at \urlhttps://anonymous.4open.science/r/ShaPO-D1B0.

Abstract:
Vision–Language–Action (VLA) policies often fail under distribution shift, suggesting that decisions may depend on spurious visual correlations rather than taskrelevant causes. We formulate visual–action attribution as an interventional estimation problem. Accordingly, we introduce theInterventional Significance Score (ISS), an interventional masking procedure for estimating the causal influence of visual regions on action predictions, and theNuisance Mass Ratio (NMR), a scalar measure of attribution to task-irrelevant features. We analyze the statistical properties of ISS and show that it admits unbiased estimation, and we characterize conditions under which action prediction error provides a valid proxy for causal influence. Experiments across diverse manipulation tasks indicate that NMR predicts generalization behavior and that ISS yields more faithful explanations than existing interpretability methods. These results suggest that interventional attribution provides a simple diagnostic approach for identifying causal misalignment in embodied policies.

Abstract:
Sparse Mixtureof-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap, where different experts process similar tokens and learn redundant functions, resulting in ambiguous routing and underutilized capacity. While architectural solutions like DeepSeek-style shared experts promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. We propose two plug-and-play auxiliary losses that enhance MoE specialization and routing efficiency without modifying routers or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on identical tokens, encouraging experts to specialize in complementary functions. Second, a cross-layer dependency loss maximizes joint Top-k routing probabilities across adjacent layers, establishing coherent expert pathways through network depth while reinforcing intra-layer specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with shared-expert and vanilla Top-k MoE architectures. We implement both losses as a drop-in Megatron-LM module. Extensive experiments across pre-training, fine-tuning, and zero-shot benchmarks demonstrate consistent task gains, higher expert specialization, and lower-entropy routing; together, these improvements translate into faster inference via more stable expert pathways.

Abstract:
The prohibitive cost of evaluating Large Language Models (LLMs) necessitates efficient alternatives to fullscale benchmarking. Prevalent approaches address this by identifying a small coreset of items to approximate full-benchmark performance. However, existing methods must estimate a reliable item profile from response patterns across many source models, which becomes statistically unstable when the source pool is small. This dependency is particularly limiting for newly released benchmarks with minimal historical evaluation data. We argue that discrete correctness labels are a lossy view of the model's decision process and fail to capture information encoded in hidden states. To address this, we introduce \textscRepCore, which aligns heterogeneous hidden states into a unified latent space to construct representative coresets. Using these subsets for performance extrapolation, \textscRepCore achieves precise estimation accuracy with as few as ten source models. Experiments on five benchmarks and over 200 models show consistent gains over output-based baselines in ranking correlation and estimation accuracy. Spectral analysis further indicates that the aligned representations contain separable components reflecting broad response tendencies and task-specific reasoning patterns.

Abstract:
We consider the robustness of scorebased generative modeling to errors in the estimate of the score function. In particular, we show that Langevin dynamics is not robust to the L^2 errors (more generally L^p errors) in the estimate of the score function. It is well-established that with small L^2 errors in the estimate of the score function, diffusion models can sample faithfully from the target distribution under fairly mild regularity assumptions in a polynomial time horizon. In contrast, our work shows that even for simple distributions in high dimensions, Langevin dynamics run for any polynomial time horizon will produce a distribution far from the target distribution in Total Variation (TV) distance, even when the L^2 error (more generally L^p) of the estimate of the score function is arbitrarily small. Considering such an error in the estimate of the score function is unavoidable in practice when learning the score function from data, our results provide further justification for diffusion models over Langevin dynamics and serve to caution against the use of Langevin dynamics with estimated scores.

Abstract:
InferenceTime Scaling (ITS) improves language models by allocating more computation at generation time. Particle Filtering (PF) has emerged as a strong ITS method for complex mathematical reasoning tasks, but it is vulnerable when guided by process reward models, which often assign overconfident scores early in the reasoning process. This causes PF to suffer from premature exploitation: it myopically commits to locally promising trajectories, prunes potentially correct hypotheses, and converges to suboptimal solutions. This failure mode, known as particle impoverishment, is especially severe under constrained computational budgets. To address this, we analyze the problem and identify two root causes: a lack of diversity in the particle set due to overconfident resampling and consequent inability to assess the potential of a reasoning path. We introduce Entropic Particle Filtering (ePF), an algorithm that integrates two new techniques to solve these issues. The first technique, Entropic Annealing (EA), directly mitigates particle impoverishment by monitoring search diversity via entropy; when diversity drops, it intervenes by dynamically annealing the resampling distribution to preserve exploration. The second, an enhancement called Look-ahead Modulation (LaM), adds a predictive guide to evaluate a state's potential based on its successors. On several challenging math benchmarks, ePF significantly outperforms strong baselines and achieves up to a 50% relative improvement in task reward. Together, these methods improve PF's resilience by balancing the exploration of diverse solution spaces with the exploitation of high-reward regions, ultimately leading to higher-quality solutions.

Abstract:
Large Language Model (LLM) based code generation is predominantly formulated as a strictly monotonic process, appending tokens linearly to an immutable prefix. This formulation contrasts to the cognitive process of programming, which is inherently interleaved with forward generation and onthe-fly revision. While prior works attempt to introduce revision via post-hoc agents or external static tools, they either suffer from high latency or fail to leverage the model's intrinsic semantic reasoning. In this paper, we propose Stream of Revision, a paradigm shift that elevates code generation from a monotonic stream to a dynamic, self-correcting trajectory by leveraging model's intrinsic capabilities. We introduce specific action tokens that enable the model to seamlessly backtrack and edit its own history within a single forward pass. By internalizing the revision loop, our framework Stream of Revision allows the model to activate its latent capabilities just-in-time without external dependencies. Empirical results on secure code generation show that Stream of Revision significantly reduces vulnerabilities with minimal inference overhead. We release our code and dataset at \urlhttps://anonymous.4open.science/r/Stream-of-Revision-CB74/, and models will be released upon publication.

Abstract:
Graph neural networks face two fundamental challenges rooted in the linear structure of Euclidean vector spaces: (1) Current architectures represent geometry through vectors (directions, gradients), yet many tasks require matrixvalued representations that capture relationships between directions—such as how atomic orientations covary in a molecule. These second-order representations are naturally captured by points on the symmetric positive definite matrices (SPD) manifold; (2) Standard message passing applies shared transformations across edges. Sheaf neural networks address this via edge-specific transformations, but existing formulations remain confined to vector spaces and therefore cannot propagate matrix-valued features. We address both challenges by developing the first sheaf neural network operates natively on the SPD manifold. Our key insight is that the SPD manifold admits a Lie group structure, enabling well-posed analogs of sheaf operators without projecting to Euclidean space. Theoretically, we prove that SPD-valued sheaves are strictly more expressive than Euclidean sheaves: they admit consistent configurations (global sections) that vector-valued sheaves cannot represent, directly translating to richer learned representations. Empirically, our sheaf convolution transforms effectively rank-1 directional inputs into full-rank matrices encoding local geometric structure. Our dual-stream architecture achieves SOTA on 6/7 MoleculeNet benchmarks, with the sheaf framework providing consistent depth robustness.

Abstract:
Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the ``locality hypothesis", suppressing isolated neurons or features. However, harmful semantics act as distributed, crosslayer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose TraceRouter, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By selectively suppressing these causal chains, TraceRouter physically severs the flow of harmful information while leaving orthogonal computation routes intact. Extensive experiments demonstrate that TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility. Our code will be publicly released. WARNING: This paper contains unsafe model responses.

Abstract:
Quantum machine learning (QML), as an interdisciplinary field bridging quantum computing and machine learning, has garnered significant attention in recent years. Currently, the field as a whole faces challenges due to incomplete theoretical foundations for the expressivity of quantum neural networks (QNNs). In this paper we propose a constructive QNN model and demonstrate that it possesses the universal approximation property (UAP), which means it can approximate any squareintegrable function up to arbitrary accuracy. Furthermore, it supports switching function bases, thus adaptable to various scenarios in numerical approximation and machine learning. Our model has asymptotic advantages over the best classical feed-forward neural networks in terms of circuit size and achieves optimal parameter complexity when approximating Sobolev functions under L_2 norm.

Abstract:
Deep Reinforcement Learning (DRL) has emerged as a promising approach for solving Combinatorial Optimization (CO) problems, such as the 3D Bin Packing Problem (3DBPP), Traveling Salesman Problem (TSP), or Vehicle Routing Problem (VRP), but these neural solvers often exhibit brittleness when facing distribution shifts. To address this issue, we uncover the Satisficing Generalization Edge, which we validate both theoretically and experimentally: identifying a set of promising actions is inherently more generalizable than selecting the single optimal action. To exploit this property, we propose Adaptive Selection After Proposal (ASAP), a generic framework that decomposes the decision-making process into two distinct phases: a proposal policy that acts as a robust filter, and a selection policy as an adaptable decision maker. This architecture enables a highly effective online adaptation strategy where the selection policy can be rapidly fine-tuned on a new distribution. Concretely, we introduce a two-phase training framework enhanced by Model-Agnostic Meta-Learning (MAML) to prime the model for fast adaptation. Extensive experiments on 3D-BPP, TSP, and CVRP demonstrate that ASAP improves the generalization capability of state-of-the-art baselines and achieves superior online adaptation on out-of-distribution instances.

Abstract:
While latent diffusion models (LDMs) have emerged as powerful priors for inverse problems, existing LDMbased solvers frequently suffer from instability. In this work, we first identify the instability as a discrepancy between the solver dynamics and stable reverse diffusion dynamics learned by the diffusion model, and show that reducing this gap stabilizes the solver. Building on this, we introduceMeasurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play stabilization module that remedies the LDM-based inverse problem solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often fail to hold in latent space, MCLC provides a principled stabilization mechanism, leading to more stable and reliable behavior in latent space.

Abstract:
Concept unlearning in diffusion models is hampered by feature splitting, where concepts are distributed across many latent features, making their removal challenging and computationally expensive. We introduce SAEmnesia, a supervised sparse autoencoder framework that overcomes this by enforcing oneto-one concept-neuron mappings. By systematically labeling concepts during training, our method achieves feature centralization, binding each concept to a single, interpretable neuron. This enables highly targeted and efficient concept erasure. SAEmnesia reduces hyperparameter search by 96.7% and achieves a 9.2% improvement over the state-of-the-art on the UnlearnCanvas benchmark. Our method also demonstrates superior scalability in sequential unlearning, improving accuracy by 28.4% when removing nine objects, establishing a new standard for precise and controllable concept erasure. Moreover, SAEmnesia mitigates the possibility of generating unwanted content under adversarial attack and effectively removes nudity when evaluated with I2P.

Abstract:
Associative memory has long underpinned the design of sequential models. Beyond recall, humans reason byprojecting future states and selecting goaldirected actions, a capability that modern language models increasingly require but do not natively encode. While prior work uses reinforcement learning or test-time training, planning remains external to the model architecture. We formulate reasoning asoptimal controland introduce theTest-Time Control (TTC)layer, which performs finite-horizonLQRplanning over latent states at inference time, enablingplanning before prediction. To ensure scalability, we derive a hardware-efficient LQR solver based on a symplectic formulation and implement it as a fused CUDA kernel, enabling parallel execution with minimal overhead. Integrated as an adapter into pretrained LLMs, TTC layers improve mathematical reasoning performance by up to +27.8 on MATH-500 and 2-3x Pass@8 improvements on AMC and AIME, demonstrating that embedding optimal control as an architectural component provides an effective and scalable mechanism for reasoning beyond test-time training.

Abstract:
Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a twostage paradigm: pretraining first on the full corpus of data followed by specialization on a much smaller subset of high quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D' specialization tokens, and extrapolates to larger model sizes and number of tokens. Applied to language model training, our approach improves performance consistently across common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.

Abstract:
Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture taskspecific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models finetuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical for ensuring the diversity and label fidelity required for effective fine-tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.

Abstract:
As LLMbased agents are increasingly deployed in real-world workflows, existing agent benchmarks---often built on idealized, noise-free assumptions---fall short of characterizing agents' robustness under imperfect user instructions and unreliable tool feedback. To address this gap, we introduce AgentNoiseBench, a systematic evaluation framework for interactive noise robustness in LLM agents. AgentNoiseBench models two primary noise sources: user-side instruction noise arising from ambiguity and variability in human requests, and tool-side result noise caused by failures, partial outputs, and erroneous or distracting tool responses. The benchmark covers two representative agentic settings: (i) multi-step tool use with DeepSearch-style retrieval agents on multi-hop QA tasks, and (ii) multi-turn user--agent interaction via adaptations of \tau^2-Bench and VitaBench to support controlled noise injection. We further provide a modular noise injection pipeline with controllable location and intensity, together with multi-dimensional metrics that go beyond final success to capture degradation trends, decision instability, and compute overhead. Evaluating 25 tool-using models across reasoning and non-reasoning families, we find that tool-side noise generally induces substantially larger performance degradation and trajectory drift than user-side noise, and that some strong reasoning models exhibit a "reasoning trap", spending markedly more tokens and steps under corrupted tool feedback while still making confident errors. Overall, AgentNoiseBench provides a practical testbed for diagnosing failure modes and advancing robust agent design for real deployments.

Abstract:
Graph Foundation Models (GFMs) have achieved remarkable success in generalizing across diverse domains. However, they mainly focus on TextAttributed Graphs (TAGs), leaving Multimodal-Attributed Graphs (MAGs) largely untapped. Developing Multimodal Graph Foundation Models (MGFMs) allows for leveraging the rich multimodal information in MAGs, and extends applicability to broader types of downstream tasks. While recent MGFMs integrate diverse modality information, our empirical investigation reveals two fundamental limitations of existing MGFMs: (1)they fail to explicitly model modality interaction, essential for capturing intricate cross-modal semantics beyond simple aggregation, and (2)they exhibit sub-optimal modality alignment, which is critical for bridging the significant semantic disparity between distinct modal spaces. To address these challenges, we propose PLANET (graPh topoLogy-aware modAlity iNteraction and alignmEnT), a novel framework employing a Divide-and-Conquer strategy to decouple modality interaction and alignment across distinct granularities. At the embedding granularity, (1)Embedding-wise Domain Gating (EDG) performs local semantic enrichment by adaptively infusing topology-aware cross-modal context, achieving modality interaction. At the node granularity, (2)Node-wise Discretization Retrieval (NDR) ensures global modality alignment by constructing a Discretized Semantic Representation Space (DSRS) to bridge modality gaps. Extensive experiments demonstrate that PLANET significantly outperforms state-of-the-art baselines across diverse graph-centric and multimodal generative tasks.

Abstract:
To deploy large language models (LLMs) in highstakes application domains that require substantively accurate responses to open-ended prompts, we need reliable, computationally inexpensive methods that assess the trustworthiness of long-form responses generated by LLMs. However, existing approaches often rely on claim-by-claim fact-checking, which is computationally expensive and brittle in long-form responses to open-ended prompts. In this work, we introduce semantic isotropy—the degree of uniformity across normalized text embeddings on the unit sphere—and use it to assess the trustworthiness of long-form responses generated by LLMs. To do so, we generate several long-form responses, embed them, and estimate the level of semantic isotropy of these responses as the angular dispersion of the embeddings on the unit sphere. We find that higher semantic isotropy—that is, greater embedding dispersion—reliably signals lower factual consistency across samples. Our approach requires no labeled data, no fine-tuning, and no hyperparameter selection, and can be used with open- or closed-weight embedding models. Across multiple domains, our method consistently outperforms existing approaches in predicting nonfactuality in long-form responses using only a handful of samples—offering a practical, low-cost approach for integrating trust assessment into real-world LLM workflows.

Abstract:
We investigate why deep neural networks suffer from loss of plasticity in deep continual learning, failing to learn new tasks without reinitializing parameters. We show that this failure is preceded by Hessian spectral collapse at newtask initialization, where meaningful curvature directions vanish and gradient descent becomes ineffective. Analyzing a linearized ReLU network, we derive explicit \epsilon -rank conditions for successful training and prove that the loss-weighted Gram matrix is spectrally equivalent to the Generalized Gauss-Newton approximation, bridging NTK dynamics to Hessian curvature. Targeting spectral collapse directly, we then discuss the Kronecker factored approximation of the Hessian, which motivates two regularization enhancements: maintaining high effective feature rank and applying L2 penalties. Experiments on continual supervised and reinforcement learning tasks confirm that combining these two regularizers effectively preserves plasticity.

Abstract:
A variety of widely used optimization methods like SignSGD and Muon can be interpreted as instances of steepest descent under different norminduced geometries. In this work, we study the implicit bias of mini-batch stochastic steepest descent in multi-class classification, characterizing how batch size, momentum, and variance reduction shape the limiting max-margin behavior and convergence rates under general entry-wise and Schatten-p norms. We show that without momentum, convergence only occurs with large batches, yielding a batch-dependent margin gap but the full-batch convergence rate. In contrast, momentum enables small-batch convergence through a batch–momentum trade-off, though it slows convergence. This approach provides fully explicit, dimension-free rates that improve upon prior results. Moreover, we prove that variance reduction can recover the exact full-batch implicit bias for any batch size, albeit at a slower convergence rate. Finally, we further investigate the batch-size-one steepest descent without momentum, and reveal its convergence to a fundamentally different bias via a concrete data example, which reveals a key limitation of purely stochastic updates. Overall, our unified analysis clarifies when stochastic optimization aligns with full‑batch behavior, and paves the way for perform deeper explorations of the training behavior of stochastic gradient steepest descent algorithms.

Abstract:
We present MedMosaic, a medical audio question–answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address these challenges, MedMosaic features a diverse range of medical audio types, including conditionrelated physiological sounds, carefully constructed synthetic voices to mimic speech with artifacts as well as real short and long length clinical conversations to model varying context lengths. The dataset also features a total of 46,701 question-answer pairs, spanning categories such as multiple-choice, sequential multi-turn, and open-ended question–answers, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even state-of-the-art model like Gemini-2.5-pro can only achieve 68.1% accuracy approximately. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models. A sample of benchmark data is available here: https://shorturl.at/Lyp33

Abstract:
Multimodal time series forecasting has garnered significant attention for its potential to provide more robust and accurate predictions than traditional singlemodality models by leveraging rich information inherent in other modalities. However, due to fundamental challenges in modality alignment, existing methods often struggle to effectively incorporate multimodal data into predictions, particularly textual information that has a causal influence on time series fluctuations, such as emergency reports and policy announcements. In this paper, we reflect on the role of textual information in numerical forecasting and proposeTime series transformers with MultimodalMixture-of-Experts,TiMi, to unleash the causal reasoning capabilities of LLMs. Concretely, TiMi utilizes language models to generate inferences on future developments, which then serve as guidance for time series forecasting. To seamlessly integrate both exogenous factors and time series into predictions, we introduce a Multimodal Mixture-of-Experts (MMoE) module as a lightweight plug-in to empower Transformer-based time series models for multimodal forecasting, eliminating the need for explicit representation-level alignment. Experimentally, our proposed TiMi demonstrates consistent state-of-the-art performance on sixteen real-world multimodal forecasting benchmarks, outperforming advanced baselines while offering strong adaptability and interpretability.

Abstract:
Generative modeling of neural network parameters is often tied to architectures because standard parameter representations rely on known weightmatrix dimensions. Generation is further complicated by permutation symmetries that allow networks to model similar input-output functions while having widely different, unaligned parameterizations. In this work, we introduce Neural Network Diffusion Transformers (NNiTs), which generate weights in a width-agnostic manner by tokenizing weight matrices into patches and modeling them as locally structured fields. We establish that Graph HyperNetworks (GHNs) with a convolutional neural network (CNN) decoder structurally align the weight space, creating the local correlation necessary for patch-based processing. Focusing on MLPs, where permutation symmetry is especially apparent, NNiT generates fully functional networks across a range of architectures. Our approach jointly models discrete architecture tokens and continuous weight patches within a single sequence model. On ManiSkill3 robotics tasks, NNiT achieves >85% success on architecture topologies unseen during training, while baseline approaches fail to generalize.

Abstract:
The application of generative models for experimental drug discovery campaigns is severely limited by the difficulty of designing molecules de novo that can be synthesized in practice. Previous works have leveraged Generative Flow Networks (GFlowNets) to impose hard synthesizability constraints through the design of state and action spaces based on predefined reaction templates and building blocks. Despite the promising prospects of this approach, it currently lacks flexibility and scalability. As an alternative, we propose S3GFN, which generates synthesizable SMILES molecules via simple soft regularization of a sequence-based GFlowNet. Our approach leverages rich molecular priors learned from large-scale SMILES corpora to steer molecular generation towards high-reward, synthesizable chemical spaces. The model induces constraints through off-policy replay training with a contrastive learning signal based on separate buffers of synthesizable and unsynthesizable samples. Our experiments show that S3-GFN learns to generate synthesizable molecules (\geq 95%) with higher rewards in diverse tasks.

Abstract:
Knowledge Base Question Answering (KBQA) challenges models to bridge the gap between natural language and strict knowledge graph schemas by generating executable logical forms. While Large Language Models (LLMs) have advanced this field, current approaches often struggle with a dichotomy of failure: they either generate hallucinated queries without verifying schema existence or exhibit rigid, templatebased reasoning that mimics synthesized traces without true comprehension of the environment. To address these limitations, we presentKBQA-R1, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning. Treating KBQA as a multi-turn decision process, our model learns to autonomously navigate the knowledge base using a structured action space, refining its reasoning strategies based on concrete execution feedback rather than static supervision. Furthermore, we introduce Referenced Rejection Sampling (RRS), a data synthesis method that resolves cold-start challenges by strictly aligning reasoning traces with ground-truth action sequences. Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance. Code is available at https://anonymous.4open.science/r/KBQA-R1-814F.

Abstract:
Domainspecific large language models (LLMs), typically developed by fine-tuning a pre-trained general-purpose LLM on specialized datasets, represent a significant advancement in applied AI. A common strategy in LLM fine-tuning is curriculum learning, which pre-orders training samples based on metrics like difficulty to improve learning efficiency compared to a random sampling strategy. However, most existing methods for LLM fine-tuning rely on a static curriculum, designed prior to training, which lacks adaptability to the model's evolving needs during fine-tuning. To address this, we propose EDCO, a novel framework based on two key concepts: inference entropy and dynamic curriculum orchestration. Inspired by recent findings that maintaining high answer entropy benefits long-term reasoning gains, EDCO prioritizes samples with high inference entropy in a continuously adapted curriculum. EDCO integrates three core components: an efficient entropy estimator that uses prefix tokens to approximate full-sequence entropy, an entropy-based curriculum generator that selects data points with the highest inference entropy, and an LLM trainer that optimizes the model on the selected curriculum. Comprehensive experiments in communication, medicine and law domains, EDCO outperforms traditional curriculum strategies for fine-tuning Qwen3-4B and Llama3.2-3B models under supervised and reinforcement learning settings. Furthermore, the proposed efficient entropy estimation reduces computational time by 83.5% while maintaining high accuracy.

Abstract:
Large language models (LLMs) have demonstrated significant potential in formal theorem proving, yet stateof-the-art performance often necessitates prohibitive test-time compute via massive roll-outs or extended context windows. In this work, we address this scalability bottleneck by exploiting an informative structure in formal verification: the observation that compilers map a vast space of diverse proof attempts to a compact set of structured failure modes. We introduce a learning-to-refine framework that leverages this compression to perform efficient learning and proof exploration. We perform tree search that corrects errors locally conditioned on explicit verifier feedback, thereby circumventing the costs associated with accumulating a long history of proof attempts. Extensive evaluations show that our method consistently amplifies the reasoning capabilities of base provers across varying scales. Notably, our approach achieves state-of-the-art performance on PutnamBench among publicly reported ～8B and ～32B parameter models under comparable test-time budgets, offering a scalable paradigm for next-generation verifier-guided reasoning.

Abstract:
Geometric shapes play important roles in both physical world and human cognition. While multimodal large language models (MLLMs) have made significant advancements in visual understanding, their abilities to recognize geometric shapes and their spatial relationships, which we term geometric perception, are not explicitly and systematically explored. To address this gap, we introduce GePBench, a novel benchmark specifically designed to assess the geometric perception capabilities of MLLMs. Our extensive evaluations reveal that even the current stateof-the-art MLLMs exhibit significant deficiencies in geometric perception tasks. Furthermore, we show that models trained with GePBench data demonstrate considerable improvements on a wide range of downstream tasks, highlighting the critical role of geometric perception in enabling advanced multimodal applications. Our code and datasets will be publicly available.

Abstract:
Inferencetime scaling has recently emerged as a powerful paradigm for improving the reasoning capability of large language models. Among various approaches, \emphSequential Monte Carlo (SMC) has become a particularly important framework, enabling iterative generation, evaluation, rejection, and resampling of intermediate reasoning trajectories. A central component in this process is the \emphreward model, which evaluates partial solutions and guides the allocation of computation during inference. However, in practice, true reward models are never available. All deployed systems rely on \emphapproximate reward models, raising a fundamental question: \emphWhy and when do approximate reward models suffice for effective inference-time scaling? In this work, we provide a theoretical answer. We identify the \emphBellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling. For a reasoning process of length T, we show that if the Bellman error of the approximate reward model is bounded by O(1/T), then combining this reward model with SMC reduces the computational complexity of reasoning from exponential in T to polynomial in T. This yields an \emphexponential improvement in inference efficiency despite using only approximate rewards.

Abstract:
We study online maximization of nonmonotone Diminishing-Return(DR)-submodular functions over down-closed convex sets, a regime where existing projection-free online methods suffer from suboptimal regret and limited feedback guarantees. Our main contribution is a new structural result showing that this class is 1/e-Upper-Linearizable under carefully designed exponential reparametrization, scaling parameter, and surrogate potential, enabling a reduction to online linear optimization. As a result, we obtain optimal O(T^1/2) static regret with a single gradient query per round and unlock adaptive and dynamic regret guarantees, together with improved rates under semi-bandit, bandit, and zeroth-order feedback. Across all feedback models, our bounds strictly improve the state of the art.

Abstract:
Large language models have transformed many applications but remain expensive to train. Sparse Mixture of Experts (MoE) addresses this through conditional computation, with Expert Parallel (EP) as the standard distributed training method. However, EP has three limitations: communication cost grows linearly with the number of activated experts k, load imbalance affects latency and memory usage, and datadependent communication requires metadata exchange. We propose Multi-Head LatentMoE and Head Parallel (HP), a new architecture and parallelism that achieve O(1) communication cost regardless of k, completely balanced traffic, and deterministic communication, all while remaining compatible with EP. To accelerate Multi-Head LatentMoE, we propose IO-aware routing and expert computation. Compared to MoE with EP, Multi-Head LatentMoE with HP trains up to 1.61× faster while having identical performance. With doubled granularity, it achieves higher overall performance while still being 1.11× faster. Our method makes multi-billion-parameter foundation model research more accessible.

Abstract:
Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for textto-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation. Code and models will be made publicly available.

Abstract:
Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference. In practice, multimodal data is often incomplete because modalities may be missing, collected asynchronously, or available only for a subset of examples. In this work, we propose PRIMO, a supervised latentvariable imputation model that quantifies the predictive impact of any missing modality within the multimodal learning setting. PRIMO enables the use of all available training examples, whether modalities are complete or partial. Specifically, it models the missing modality through a latent variable that captures its relationship with the observed modality in the context of prediction. During inference, we draw many samples from the learned distribution over the missing modality to both obtain the marginal predictive distribution (for the purpose of prediction) and analyze the impact of the missing modalities on the prediction for each instance. We evaluate PRIMO on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III for mortality and ICD-9 prediction. Across all datasets, PRIMO obtains performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. PRIMO quantifies the predictive impact of a modality at the instance level using a variance-based metric computed from predictions across latent completions. We visually demonstrate how varying completions of the missing modality result in a set of plausible labels.

Abstract:
We study first‐order methods for convex optimization problems with functions f satisfying the recently proposed \ellsmoothness condition ||\nabla^2f(x)|| \le \ell\left(||\nabla f(x)||\right), which generalizes the L-smoothness and (L_0,L_1)-smoothness. While accelerated gradient descent (AGD) is known to reach the optimal complexity \mathcalO(\sqrtL R / \sqrt\varepsilon) under L-smoothness, where \varepsilon is an error tolerance and R is the distance between a starting and an optimal point, existing extensions to \ell-smoothness either incur extra dependence on the initial gradient, suffer exponential factors in L_1 R, or require costly auxiliary sub-routines, leaving open whether an AGD‐type \mathcalO(\sqrt\ell(0) R / \sqrt\varepsilon) rate is possible for small-\varepsilon, even in the (L_0,L_1)-smoothness case. We resolve this open question. Developing new proof techniques, we achieve \mathcalO(\sqrt\ell(0) R / \sqrt\varepsilon) oracle complexity for small-\varepsilon and virtually any \ell. For instance, for (L_0,L_1)-smoothness, our bound \mathcalO(\sqrtL_0 R / \sqrt\varepsilon) is provably optimal in the small-\varepsilon regime and removes all non-constant multiplicative factors present in prior accelerated algorithms.

Abstract:
Current approaches for scaling inferencetime compute in transformers train them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they are limited because they cannot be applied during pretraining and rely solely on serially-generated, natural-language verbalization. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thus, tokens requiring more computation can form a "bubble" of cloned residuals in the middle of the network. Crucially, this behavior is learned during pretraining with only language modeling loss. Using half of the training budget, Thoughtbubbles outperforms the perplexity and zero-shot evals of both standard decoder LMs and those using non-adaptive parallel computation approaches. These results hold across model sizes from 150M to 1.9B. Thoughtbubbles achieves competitive GSM8K results using half of the baseline's token budget. The implicit nature of our method enables models to begin learning adaptive computation at pretraining time, paving the way to unified train-time and test-time scaling behaviors.

Abstract:
Group Relative Policy Optimization (GRPO) trains Chainof-Thought reasoning with verifiable rewards, but estimating thought-level advantages without value functions often suffers from high variance. Although tree-style branching is used in practice to reduce the variance, it lacks a theoretical explanation of why it works and whether it is important or even potentially necessary. We study thought-level advantage estimation in GRPO from a variance perspective under a minimal tree-style setting where multiple answers are sampled for each thought. Using the multivariate delta method, we reveal an asymmetry in how different sampling dimensions affect variance. Increasing the number of sampled thoughts (K) leaves a strictly positive variance floor, whereas increasing the number of answers per thought (M) induces a monotonic decrease in variance, asymptotically driving it to zero. This implies that accurate thought-level advantage estimation is impossible through scaling thought sampling alone, making branching a potentially necessary mechanism rather than a heuristic. Experiments further provide empirical evidence for both the effectiveness and necessity of answer-level branching, demonstrating improved optimization stability, training efficiency, and final performance not only in math but also across a broad range of vision domains and under different model architectures and sizes.

Abstract:
Finetuning large language models (LLMs) with backpropagation achieves high performance but incurs substantial memory overhead, limiting scalability on resource-constrained hardware. Zeroth-order (ZO) optimization provides a memory-efficient alternative by relying solely on forward passes, yet it typically suffers from slow or unstable convergence due to high-variance gradient estimates. Sparse ZO updates partially address this issue by perturbing only a subset of parameters, but their effectiveness hinges on selecting informative parameters, which is challenging in ZO optimization because each query yields only scalar feedback. We propose Adaptive Curvature-Guided Sparse Zeroth-Order Optimization (CurvZO), which tracks curvature signals online from scalar ZO feedback and leverages these signals to construct a parameter-wise sampling distribution for selecting coordinates at each update, reducing the variance of the sparse ZO gradient estimator. Moreover, CurvZO dynamically adapts the perturbation budget to the evolving curvature signal distribution, yielding sparse ZO updates that remain both focused and sufficiently exploratory. Extensive experiments on OPT and Llama across diverse NLP tasks show that CurvZO consistently improves fine-tuning performance and reduces training time over ZO baselines. It improves accuracy by up to 4.4 points and achieves up to a 2× speedup, while preserving memory efficiency.

Abstract:
Latent diffusion models excel at generating highquality images but lose the benefits of end-to-end modeling. They discard information during image encoding, require a separately trained decoder, and model an auxiliary distribution to the raw data. In this paper, we propose Latent Forcing, a simple modification to existing architectures that achieves the efficiency of latent diffusion while operating on raw natural images. Our approach orders the denoising trajectory by jointly processing latents and pixels with separately tuned noise schedules. This allows the latents to act as a scratchpad for intermediate computation before high-frequency pixel features are generated. We find that the order of conditioning signals is critical, and we analyze this to explain differences between REPA distillation in the tokenizer and the diffusion model, as well as conditional and unconditional generation. Applied to pixel-space diffusion on ImageNet, Latent Forcing achieves a new state of the art for diffusion transformer-based pixel generation at our compute scale.

Abstract:
We propose 3DGS^2TR, a second-order optimizer for accelerating the scene training problem in 3D Gaussian Splatting (3DGS). Unlike existing second-order approaches that rely on explicit or dense curvature representations, such as 3DGS-LM (Höllein et al., 2025) or 3DGS2 (Lan et al., 2025), our method approximates curvature using only the diagonal of the Hessian matrix, estimated efficiently via Hutchinson’s method. Our approach is fully matrix-free and has the same complexity as ADAM (Kingma, 2024), O(n) in both computation and memory costs. To ensure stable optimization in the presence of strong nonlinearity in the 3DGS rasterization process, we introduce a parameter-wise trust-region technique based on the squared Hellinger distance, regularizing updates to Gaussian parameters. Under identical parameter initialization and without densification, 3DGS^2-TR is able to achieve better reconstruction quality on standard datasets, using 50% fewer training iterations compared to ADAM, while incurring less than 1GB of peak GPU memory overhead (17% more than ADAM and 85% less than 3DGS-LM), enabling scalability to very large scenes and potentially to distributed training settings.

Abstract:
Partial participation is essential for communicationefficient federated learning at scale, yet existing Byzantine-robust methods typically assume full client participation. In the partial participation setting, a majority of the sampled clients may be Byzantine, once Byzantine clients dominate, existing methods break down immediately. We introduce delayed momentum aggregation, a principle where the central server aggregates cached momentum from non-sampled clients along with fresh momentum from sampled clients. This principle ensures Byzantine clients remain a minority from the server's perspective even when they dominate the sampled set. We instantiate this principle in our optimizer DeMoA. We analyze the convergence rate of DeMoA, showing that DeMoA is Byzantine-robust under partial participation. Experiments show that, with 20% Byzantine ratio and only 10% partial participation rate, DeMoA achieves the best accuracy even when existing methods fail empirically.

Abstract:
Attention scores in transformers are bilinear forms S_ij = x_i^\top M x_j / \sqrtd_h whose maximum magnitude governs overflow risk in lowprecision training. We derive a \emphrank-aware concentration inequality: when the interaction matrix M = W^Q W^K\top has rank r \ll d, tail probabilities for \max_i,j|S_ij| decay as \exp(-d^2\alpha^2/r) rather than \exp(-d\alpha^2), an improvement of d/r in the exponent. For transformer attention where r = d_h, this yields 25--64× tighter concentration than rank-agnostic bounds in modern architectures. We apply this result to FP8 training, deriving geometry-aware scale factors that provide provable overflow guarantees without observing activations. The method computes per-layer scales from the spectral norm \|W^Q W^K\top\|_2 via implicit power iteration, includes a grouped query attention formulation that avoids key expansion, and remains compatible with fused attention kernels. Across GPT-2 XL to Llama-2-70B, geometry-aware scaling eliminates overflows in transient scenarios where delayed scaling fails, while matching downstream MMLU accuracy.

Abstract:
Generative Reward Models (GRMs) have demonstrated strong performance in reward modeling, due to their interpretability and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck in reinforcement learning from human feedback (RLHF), when calibrating or aggregating preference signals over n candidates, often incurring \mathcalO(n^2) pairwise judgments. To address this issue, we propose Intergroup Relative Preference Modeling (IRPM), an RLbased method that extends the Bradley--Terry preference-learning paradigm via intergroup comparisons to train \emphpointwise GRMs from pairwise preference data. IRPM derives pointwise reward for each response by contrasting groups of chosen vs.\ rejected samples, enabling pointwise scores comparable across candidate sets and \mathcalO(n) reward evaluation for a variable number of candidates during RL training, while preserving interpretability and scalability. Experiments show that IRPM achieves state-of-the-art performance among pointwise GRMs on RM-Bench, JudgeBench and RewardBench, and approaches the performance of leading pairwise GRMs. In addition, IRPM achieves substantial gains in post-training evaluations, demonstrating its effectiveness.

Abstract:
A simple strategy for improving LLM accuracy, especially in math and reasoning problems, is to sample multiple responses and submit the answer most consistently reached. In this paper we leverage Bayesian prior information to save on sampling costs, stopping once sufficient consistency is reached. Although the exact posterior is computationally intractable, we further introduce an efficient ``Laggregated'' stopping policy that tracks only the L-1 most frequent answer counts. Theoretically, we prove that L=3 is all you need: this coarse approximation is sufficient to achieve asymptotic optimality, and strictly dominates prior-free baselines, while having a fast posterior computation. Empirically, this identifies the most consistent (i.e., mode) LLM answer and achieves similar answer accuracy using fewer samples.

Abstract:
Streaming video understanding requires processing unbounded video streams with limited memory and computation, posing two key challenges. First, continuously constructing new and evicting old keyvalue(KV) caches is required for unbounded streams. Secondly, due to the high cost of collecting and training on unbounded streams, models must learn from short sequences while generalizing to long streams. Existing streaming VideoVLLMs fail to scale to unbounded video streams or focus on cache reuse strategies, leaving the impact of cache construction underexplored. In this paper, we propose Decoupled Streaming Cache(DSCache), a training-free cache construction mechanism that adapts pretrained offline models to streaming settings. DSCache maintains a cumulative past KV cache while constructing a separate instant cache on-demand, decoupled from past caches to preserve the informativeness of recent inputs. To enable position extrapolation beyond the training length, DSCache further incorporates a position-agnostic encoding strategy, ensuring KV caches to support unseen positions and preventing position overflow. Experiments on Streaming Video QA benchmarks demonstrate DSCache's state-of-the-art performance, with an average 2.5% accuracy gains over prior methods.

Abstract:
Clustering functional data in the presence of phase variation is challenging, as temporal misalignment can obscure intrinsic shape differences and degrade clustering performance. Most existing approaches treat registration and clustering as separate tasks or rely on restrictive parametric assumptions. We present NeuralFLoC, a fully unsupervised, endto-end deep learning framework for joint functional registration and clustering based on Neural ODE-driven diffeomorphic flows and spectral clustering. The proposed model learns smooth, invertible warping functions and cluster-specific templates simultaneously, effectively disentangling phase and amplitude variation. We establish universal approximation guarantees and asymptotic consistency for the proposed framework. Experiments on functional benchmarks show state-of-the-art performance in both registration and clustering, with robustness to missing data, irregular sampling, and noise, while maintaining scalability. Code is available at https://anonymous.4open.science/r/NeuralFLoC-FEC8.

Abstract:
Realworld requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that speakers expect listeners to infer. Current agentic benchmarks test explicit instruction-following but fail to evaluate whether agents can reason about implicit requirements spanning accessibility needs, privacy boundaries, catastrophic risks, and contextual constraints. We presentImplicit Intelligence, an evaluation framework testing whether AI agents can move beyond prompt-following to become genuine goal-fulfillers, paired withAgent-as-a-World (AaW), a harness where interactive worlds are defined in human-readable YAML files and simulated by language models. Our scenarios feature apparent simplicity in user requests, hidden complexity in correct solutions, and discoverability of constraints through environmental exploration. Evaluating 16 frontier and open-weight models across 205 scenarios, we find that even the best-performing model achieves only 48.3% scenario pass rate, revealing substantial room for improvement in bridging the gap between literal instruction-following and human-like contextual reasoning.

Abstract:
Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but the training is often prone to instabilities, especially in Mixtureof-Experts (MoE) architectures. Training instability severely undermines model capability improvement, yet its underlying causes and mechanisms remain poorly understood. In this work, we introduce a principled framework for understanding RLVR instability through the lens ofobjective-level hacking. Unlike reward hacking, which arises from exploitable verifiers, objective-level hacking emerges fromtoken-level credit misalignmentand is manifested as system-level spurious signals in the optimization objective. Grounded in our framework, together with extensive experiments on a 30B MoE model, we trace the origin and formalize the mechanism behind a key pathological training dynamic in MoE models: the abnormal growth of the training-inference discrepancy, a phenomenon widely associated with instability but previously lacking a mechanistic explanation. These findings provide a concrete and causal account of the training dynamics underlying instabilities in MoE models, offering guidance for the design of stable RLVR algorithms.

Abstract:
Offline decisionmaking via diffusion models often produces trajectories that are misaligned with system dynamics, limiting their reliability for control. We proposeModel Predictive Diffuser(MPDiffuser), a compositional diffusion framework that combines a diffusion planner with a dynamics diffusion model to generate task-aligned and dynamically plausible trajectories. MPDiffuser interleaves planner and dynamics updates during sampling, progressively correcting feasibility while preserving task intent. A lightweight ranking module then selects trajectories that best satisfy task objectives. The compositional design improves sample efficiency and adaptability by enabling the dynamics model to leverage diverse and previously unseen data independently of the planner. Empirically, we demonstrate consistent improvements over prior diffusion-based methods on unconstrained (D4RL) and constrained (DSRL) benchmarks, and validate practicality through deployment on a real quadrupedal robot.

Abstract:
We identify test prediction variance (TPV)—the firstorder sensitivity of model outputs to parameter perturbations around a trained solution—as a unifying quantity that links several classical observations about generalization in deep networks. TPV is a fully label-free object whose trace form \mathrmTr(H_\mathrmeff C) separates the geometry of the trained model H_\mathrmeff from the specific perturbation mechanism C, allowing a broad family of parameter perturbations like SGD noise, label noise, finite-precision noise, and other post-training perturbations to be analyzed under a single framework. Theoretically, we show that TPV estimated on the training set converges to its test-set value in the overparameterized limit, providing the first result that prediction variance under local parameter perturbations can be inferred from training inputs alone, and this stability is decoupled from generalization performance. Empirically, TPV exhibits a striking stability across datasets and architectures even for extremely narrow networks. Further, TPV correlates well with test loss, serving as a training-set based predictive metric for generalization.

Abstract:
Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in incontext retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, bound their applicability in longer sequences and so has motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a long-term recurrent memory module). Despite their recent success in diverse downstream tasks, they struggle in tasks that requires long context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects in their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) online nature of update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To enhance all these three aspects, we present Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks support the effectiveness of Atlas compared to other modern recurrent neural networks.

Abstract:
Strategic classification studies the problem where selfinterested individuals or agents manipulate their response to obtain favorable decision outcomes made by classifiers, typically turning to dishonest actions when they are less costly than genuine efforts. While existing studies on sequential strategic classification primarily focus on optimizing dynamic classifier weights, we depart from these weight-centric approaches by analyzing the design of classifier thresholds and difficulty progression within a multi-level promotion-relegation framework. Our model captures the critical inter-temporal incentives driven by an agent's farsightedness, skill retention, and a "leg-up" effect where qualification and attainment can be self-reinforcing. We characterize the agent’s optimal long-term strategy and demonstrate that a principal can design a sequence of thresholds to effectively incentivize honest effort. Crucially, we prove that under mild conditions, this mechanism enables agents to reach arbitrarily high levels solely through genuine improvement efforts.

Abstract:
The deployment of quantized neural networks on edge devices, combined with privacy regulations like GDPR, creates an urgent need for machine unlearning in quantized models. However, existing methods face critical challenges: they induce forgetting by training models to memorize incorrect labels, conflating forgetting with misremembering, and employ scalar gradient reweighting that cannot resolve directional conflicts between gradients. We propose OEU, a novel Orthogonal Entropy Unlearning framework with two key innovations: 1) Entropyguided unlearning maximizes prediction uncertainty on forgotten data, achieving genuine forgetting rather than confident misprediction, and 2) Gradient orthogonal projection eliminates interference by projecting forgetting gradients onto the orthogonal complement of retain gradients, providing theoretical guarantees for utility preservation under first-order approximation. Extensive experiments demonstrate that OEU outperforms existing methods in both forgetting effectiveness and retain accuracy.

Abstract:
Online matching has been a mainstay in domains such as Internet advertising and organ allocation, but practical algorithms often lack strong theoretical guarantees. We take an important step toward addressing this by developing new online matching algorithms based on a coarsening approach. Although coarsening typically implies a loss of granularity, we show that, to the contrary, aggregating offline nodes into capacitated clusters can yield nearoptimal theoretical guarantees. We apply our methodology to heart transplant allocation to develop theoretically grounded policies based on structural properties of historical data. Furthermore, in simulations based on real data, our policy closely matches the performance of the omniscient benchmark, achieving competitive ratio 0.91, drastically higher than the US status quo policy's 0.51. Our work bridges the gap between data-driven heuristics and pessimistic theoretical lower bounds.

Abstract:
Ultrafast online learning is essential for highfrequency systems, such as controls for quantum computing and nuclear fusion, where adaptation must occur on sub-microsecond timescales. Meeting these requirements demands low-latency, fixed-precision computation under strict memory constraints, a regime in which conventional Multi-Layer Perceptrons (MLPs) are both inefficient and numerically unstable. We identify key properties of Kolmogorov-Arnold Networks (KANs) that align with these constraints. Specifically, we show that: (i) KAN updates exploiting B-spline locality are sparse, enabling superior on-chip resource scaling, and (ii) KANs are inherently robust to fixed-point quantization. By implementing fixed-point online training on Field-Programmable Gate Arrays (FPGAs), a representative platform for on-chip computation, we demonstrate that KAN-based online learners are significantly more efficient and expressive than MLPs across a range of low-latency and resource-constrained tasks. To our knowledge, this work is the first to demonstrate model-free online learning at sub-microsecond latencies.

Abstract:
Continuoustime, event-native spiking neural networks (SNNs) operate strictly on spike events, treating spike timing and ordering as the representation rather than an artifact of time discretization. This viewpoint aligns with biological computation and with the native resolution of event sensors and neuromorphic processors, while enabling compute and memory that scale with the number of events. However, two challenges hinder practical, end-to-end trainable event-based SNN systems: 1) exact charge--fire--reset dynamics impose inherently sequential processing of input spikes, and 2) precise spike times must be solved without time bins. We address both. First, we use parallel associative scans to consume multiple input spikes at once, yielding up to 43x speedups over sequential simulation while retaining exact hard-reset dynamics. Second, we implement differentiable spike-time solvers that compute spike times to machine precision without discrete-time approximations or restrictive analytic assumptions. We demonstrate the viability of training SNNs using our solutions on four event-based datasets on GPUs.

Abstract:
This paper presents WorldPlay, a streaming video diffusion model that enables realtime, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware model. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes.

Abstract:
Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving stateof-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural networks outputs (PSNR \gtrsim25 dB). Furthermore, we provide insights into the differences between physics-aware and physics-agnostic estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc).

Abstract:
Diffusion models (DMs) excel on dense continuous data, but are not designed for sparse continuous data. They do not model exact zeros that represent the deliberate absence of a signal. As a result, they erase sparsity patterns and perform unnecessary computation on mostly zero entries. With SparsityExploiting Diffusion (SED), we model only non-zero values, preserving sparsity. SED delivers computational savings while maintaining or improving generation quality by skipping zeros during training and inference. Across physics and biology benchmarks, SED matches or surpasses conventional DMs and domain-specific baselines, while vision experiments provide intuitive insights into the limitations of dense DMs and the benefits of SED.

Abstract:
While deep learning has revolutionized financial forecasting through sophisticated architectures, the design of the supervision signal itself is rarely scrutinized. We challenge the canonical assumption that training labels must strictly mirror inference targets, uncovering the Label Horizon Paradox: the optimal supervision signal often deviates from the prediction goal, shifting across intermediate horizons governed by market dynamics. We theoretically ground this phenomenon in a dynamic signalnoise trade-off, demonstrating that generalization hinges on the competition between marginal signal realization and noise accumulation. To operationalize this insight, we propose a bi-level optimization framework that autonomously identifies the optimal proxy label within a single training run. Extensive experiments on large-scale financial datasets demonstrate consistent improvements over conventional baselines, thereby opening new avenues for label-centric research in financial forecasting.

Abstract:
Recent advancements in Video Large Language Models (Video LLMs) have demonstrated impressive results, yet existing approaches handle either temporal or spatial dimension in isolation, struggling in the analysis of complex events that require spatialtemporal integration. To bridge this gap, we propose VideoLoom, a unified Video LLM for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a character-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves the state-of-the-art performance across a variety of spatial and temporal benchmarks. In addition, we introduce LoomBench, a benchmark consisting of temporal, spatial, and compositional video–question pairs, with a novel metric J&F_bi-fore, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.

Abstract:
Belief Propagation (BP) is a powerful algorithm for distributed inference in probabilistic graphical models, however it quickly becomes infeasible for practical compute and memory budgets. Many efficient, nonparametric forms of BP have been developed, but the most popular is Gaussian Belief Propagation (GBP), a variant that assumes all distributions are locally Gaussian. GBP is widely used due to its efficiency and empirically strong performance in applications like computer vision or sensor networks – even when modelling non-Gaussian problems. In this paper, we seek to provide a theoretical guarantee for when Gaussian approximations are valid in highly non-Gaussian, sparsely-connected factor graphs performing BP (common in Spatial AI). We leverage the Central Limit Theorem to prove mathematically that variables’ beliefs under BP converge to a Gaussian distribution in complex, loopy factor graphs obeying our 4 key assumptions. We then confirm experimentally that variable beliefs become increasingly Gaussian after just a few BP iterations in a stereo depth estimation task.

Abstract:
The evolution of Large Language Model (LLM) agents towards System~2 reasoning, characterized by deliberative, highprecision problem-solving, necessitates maintaining rigorous logical integrity over extended horizons. However, prevalent memory preprocessing paradigms incur destructive de-contextualization. By compressing fluid sequential dependencies into pre-defined structures (e.g., embeddings or graphs), these methods sever the narrative integrity essential for deep reasoning. To address this, we propose E-mem, a framework shifting from Memory Preprocessing to Episodic Context Reconstruction inspired by biological engrams. E-mem employs a heterogeneous hierarchical architecture where multiple assistant agents maintain uncompressed memory contexts, while a central master agent orchestrates global planning. Unlike passive retrieval, our mechanism empowers assistants to locally reason within activated segments, extracting context-aware evidence before aggregation. Evaluations on the LoCoMo benchmark demonstrate that E-mem achieves over 54% F1—surpassing the state-of-the-art GAM by 7.75%—while reducing token cost by over 70%. Our work is available on \urlhttps://anonymous.4open.science/r/E-mem-F6C3/.

Abstract:
Logical reasoning encompasses deduction, induction, and abduction. However, while Large Language Models (LLMs) have effectively mastered the former two, abductive reasoning remains significantly underexplored. Existing frameworks, predominantly designed for static deductive tasks, fail to generalize to abductive reasoning due to unstructured state representation and lack of explicit state control. Consequently, they are inevitably prone to Evidence Fabrication, Context Drift, Failed Backtracking, and Early Stopping. To bridge this gap, we introduce Graph of States (GoS), a generalpurpose neuro-symbolic framework tailored for abductive tasks. GoS grounds multi-agent collaboration in a structured belief states, utilizing a causal graph to explicitly encode logical dependencies and a state machine to govern the valid transitions of the reasoning process. By dynamically aligning the reasoning focus with these symbolic constraints, our approach transforms aimless, unconstrained exploration into a convergent, directed search. Extensive evaluations on two real-world datasets demonstrate that GoS significantly outperforms all baselines, providing a robust solution for complex abductive tasks. Code repo and all prompts: https://anonymous.4open.science/r/Graph-of-States-5B4E.

Abstract:
Existing AIdriven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we proposeUniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio–video generation models. To achieve this, we employ the Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a ``first interleaving, then disentangling" training paradigm. Specifically, we first performInterleaved Concept Learning, which utilizes interleaved text–image data to foster the model’s deeper understanding and imaginative interpretation of scripts. We then conductDisentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent scripts and visually consistent keyframe images.

Abstract:
Large Multimodal Models (LMMs) exhibit shortfalls when interpreting images and, by some measures, have poorer spatial cognition than young children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by surging model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench—a lightweight visual reasoning benchmark curated using adversarial filtering to be “impossible” for frontier LMMs at release time, with initial SotA scores of 0% pass@1 and pass∧5. We track progress on ZeroBench over the subsequent year, observing SotA reaching 6% pass∧5 and 19% pass@5, indicating the potential longevity of our benchmark. Overall, we evaluate 46 LMMs on ZeroBench, compare performance to a human baseline, analyse strengths and weaknesses, and chart performance over a year of advancement in visual capabilities.

Abstract:
While recent multimodal large language models (MLLMs) have made impressive strides, they mostly employ a conventional autoregressive architecture as their backbone, leaving significant room for exploring effective and efficient alternatives in architectural design. Meanwhile, recent studies have successfully applied discrete diffusion models to natural language processing, revealing their considerable potential as a promising new approach in this domain. Drawing inspiration from these pioneering researches, we introduce AnyDiffusion, the first any-to-any multimodal language model built purely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Any-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach enables support for not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models.

Abstract:
VisionLanguage Models (VLMs) are typically deterministic in nature and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model’s lack of knowledge or ignorance of its own representations. We theoretically motivate negative log-density of an embedding as a proxy for the epistemic uncertainty, where low-density regions signify model ignorance. The proposed method REPVLM computes the probability density on the hyperspherical manifold of the VLM embeddings using Riemannian Flow Matching. We empirically demonstrate that REPVLM achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Beyond classification, we also demonstrate that the model also provides a scalable metric for out-of-distribution detection and automated data curation.

Abstract:
Masked diffusion language models (MDLMs) promise fast, nonautoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasked them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP), general‐knowledge (BBH, MMLU-Pro), and instruction following (IFEval) benchmarks, DUS outperforms confidence‐based planners, without modifying the underlying denoiser, and reveals the true speed-quality frontier of MDLMs.

Abstract:
Graph neural networks (GNNs) are widely used in graph learning and most architectures propagate information by passing messages between vertices. In this work, we shift our attention to GNNs that perform message passing onedgesand introduce EB1WL, an edge-based color-refinement test, and a corresponding architecture, EB-GNN. Our EB-GNN architecture is inspired by the classic triangle-counting algorithm of Chiba and Nishizeki and passes messages along edges and triangles. Our contributions are as follows: 1. Theoretically, we show that EB-1WL is significantly more expressive than 1WL. We provide a complete logical characterization of EB-1WL in first-order logic, along with distinguishability results via homomorphism counting. To the best of our knowledge, EB-GNN has the strongest theoretical expressivity guarantees among edge-based message-passing GNNs in the literature. 2. Unlike many GNN architectures that are more expressive than 1WL, we prove that EB-1WL and EB-GNN admit near-linear time and memory usage on practical graph learning workloads. 3. We show in experiments that EB-GNN is a highly efficient general-purpose architecture: it substantially outperforms simple MPNNs and remains competitive with task-specialized state-of-the-art GNNs at substantially lower computational cost.

Abstract:
Blockwise decoding effectively improves the inference speed and quality in diffusion language models (DLMs) by combining inter-block sequential denoising and intra-block parallel unmasking. However, existing block-wise decoding methods typically partition blocks in a rigid and fixed manner, which inevitably fragments complete semantic or syntactic constituents, leading to suboptimal performance. Inspired by the entropy reduction hypothesis (ERH), we recognize that constituent boundaries offer greater opportunities for uncertainty reduction, which motivates us to employ entropy analysis for identifying constituent boundaries. Therefore, we propose Swordsman, an entropy-driven adaptive block-wise decoding framework for DLMs. Swordsman adaptively partitions blocks by identifying entropy shifts between adjacent tokens to better align with semantic or syntactic constituent boundaries. In addition, Swordsman dynamically adjusts unmasking thresholds conditioned on the real-time unmasking status within a block, further improving both efficiency and stability. As a training-free framework, supported by KV Cache, Swordsman demonstrates state-of-the-art performance across extensive evaluations. Our code is now available.

Abstract:
In longvideo understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize–clip–answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking mechanism that encourages exploration while preventing noise. For reinforcement learning, we introduce dedicated rewards to mitigate reward hacking. Besides, from the data perspective, we develop an effective pipeline to construct high-quality long video grounded QA data, along with a corresponding benchmark for systematic evaluation across various video durations. Experimental results demonstrate that our method achieves remarkable performance on both long video understanding and grounding.

Abstract:
Attribution methods reveal which input features a neural network uses for a prediction, adding transparency to their decisions. A common problem is that these attributions seem unspecific, highlighting both important and irrelevant features. We revisit the common attribution pipeline and observe that using logits as attribution target is a main cause of this phenomenon. We show that the solution is in plain sight: considering distributions of attributions over multiple classes using existing attribution methods yields specific and finegrained attributions. On common benchmarks, including the grid-pointing game and randomization-based sanity checks, this improves the ability of 18 attribution methods across 7 architectures up to 2×, agnostic to model architecture.

Abstract:
A reliable reward model is essential for aligning large language models (LLMs) with human preferences through reinforcement learning from human feedback (RLHF). However, standard reward models are susceptible to spurious features that are not causally related to human labels. This can lead toreward hacking, where high predicted reward does not translate into better behavior. In this work, we address this problem from a causal perspective by proposing a factored representation learning framework that decomposes the model’s contextual embedding into (1) causal factors that are sufficient for reward prediction and (2) noncausal factors that capture reward-irrelevant attributes such as length or sycophantic bias. The reward head is then constrained to depend only on the causal component. In addition, we introduce an adversarial head trained to predict reward from the non-causal factors, while applying gradient reversal to discourage them from encoding reward-relevant information. Experiments on both mathematical and dialogue tasks demonstrate that our method learns more robust reward models and consistently improves downstream RLHF performance over state-of-the-art baselines. Analyses on length and sycophantic bias further validate the effectiveness of our method in mitigating reward hacking behaviors.

Abstract:
JointEmbedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse preventions. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected \ell_p norms and induces \ell_0 sparsity through rectifications, while preserving maximum entropy up to rescaling under sparsity constraints. Equipping JEPAs with RDMReg yields Rectified LpJEPA, which strictly generalizes prior Gaussian-based JEPAs. Empirically, Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity–performance trade-offs and competitive downstream performance on image classification benchmarks, demonstrating that RDMReg effectively enforces sparsity while preserving task-relevant information.

Abstract:
New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft lowlevel kernels - a time-consuming, laborious, and error-prone process that cannot scale across diverse hardware targets. This prevents emerging hardware platforms from reaching the market efficiently. While prior LLM-based code generation has shown promise in mature GPU ecosystems, it remains unclear whether agentic LLM systems can quickly produce valid and efficient kernels for emerging hardware with new ISAs. We present KernelCraft: the first benchmark to evaluate an LLM agent’s ability to generate and optimize low-level kernels for customized accelerators via a function-calling, feedback-driven workflow. Within KernelCraft, the agent refines kernels under ISA and hardware constraints using automated feedback derived from compilation checks, simulation, and correctness validation against ground truth. In our experiments, we assess agent performance across three emerging accelerator platforms on more than 20 ML tasks, each with 5 diverse task configurations, with special evaluation of task configuration complexity. Across four leading reasoning models, top agents produce functionally valid kernels for previously unseen ISAs within a few refinement steps, with optimized kernels that match or outperform template-based compiler baselines. With that, we demonstrate the potential for reducing the cost of kernel development for accelerator designers and kernel developers.

Abstract:
Since their introduction, Transformer architectures have dominated Natural Language Processing (NLP). However, recent research has highlighted an inherent anisotropy phenomenon in these models, presenting a significant challenge to their geometric interpretation. Previous theoretical studies on this phenomenon are rarely based on the underlying representation geometry. In this paper, we extend them by providing such theoretical arguments assessing the problematic nature of this phenomenon. Furthermore, to observe geometric internal model dynamics, we apply mechanistic interpretability (MI) techniques during the model's training checkpoints rather than posthoc, as it is commonly done in the literature. By analyzing multiple models and their checkpoints -including EuroBERT, the Pythia suite, and SmolLM2- we investigate the structure of embedding representations and their correlation with the on manifold entropy of their underlying distribution.

Abstract:
We study a setting where the goal is to learn a target function f(x) with respect to a target distribution D(x), but training is done on i.i.d. samples from a different training distribution D’(x), labeled by the true target f(x). Such a distribution shift (here in the form of covariate shift) is usually viewed negatively, as hurting or making learning harder, and the traditional distribution shift literature is mostly concerned with limiting or avoiding this negative effect. In contrast, we argue that with a wellchosen D'(x), the shift can be positive and make learning easier -- a perspective called Positive Distribution Shift (PDS). Such a perspective is central to contemporary machine learning, where much of the innovation is in finding good training distributions D’(x), rather than changing the training algorithm. We further argue that the benefit is often computational rather than statistical, and that PDS allows computationally hard problems to become tractable even using standard gradient-based training. We formalize different variants of PDS, show how certain hard classes are easily learnable under PDS, and make connections with membership query learning.

Abstract:
Selfevolving memory serves as the trainable parameters for Large Language Models (LLMs)-based agents, where extraction (distilling insights from experience) and management (updating the memory bank) must be tightly coordinated. Existing methods predominately optimize memory management while treating memory extraction as a static process, resulting in poor generalization, where agents accumulate instance-specific noise rather than robust memories. To address this, we propose Unified Memory Extraction and Management (UMEM), a self-evolving agent framework that jointly optimizes a Large Language Model to simultaneous extract and manage memories. To mitigate overfitting to specific instances, we introduce Semantic Neighborhood Modeling and optimize the model with a neighborhood-level marginal utility reward via GRPO. This approach ensures memory generalizability by evaluating memory utility across clusters of semantically related queries. Extensive experiments across five benchmarks demonstrate that UMEM significantly outperforms highly competitive baselines, achieving up to a 10.67% improvement in multi-turn interactive tasks. Futhermore, UMEM maintains a monotonic growth curve during continuous evolution. Codes and models will be publicly released.

Abstract:
Highquality kernel is critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data, a robust environment, and the process is often vulnerable toreward hackingandlazy optimization. In these cases, models may hack training rewards or prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first designKernelGYM, a robust distributed GPU environment that supports reward hacking check, data collection from multi-turn interactions and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome the issue. The trained model, Dr. Kernel-14B, reach performance competitive with Claude-4.5-Sonnet in Kernelbench. Finally, we study sequential test-time scaling for Dr. Kernel-14B, which evenoutperformsGPT-5 and Claude-4.5-Sonnet in the Kernelbench level-2 subset.

Abstract:
The quadratic complexity of attention remains the central bottleneck in longcontext inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely on irreversible early decisions despite the layer-/head-wise dynamics of token importance. In this paper, we propose Token Sparse Attention, a lightweight and dynamic token-level sparsification mechanism that compresses per-head Q, K, V to a reduced token set during attention and then decompresses the output back to the original sequence, enabling token information to be reconsidered in subsequent layers. Furthermore, Token Sparse Attention exposes a new design point at the intersection of token selection and sparse attention. Our approach is fully compatible with dense attention implementations, including Flash Attention, and can be seamlessly composed with existing sparse attention kernels. Experimental results show that Token Sparse Attention consistently improves accuracy–latency trade-off, achieving up to ×3.23 attention speedup at 128K context with less than 1% accuracy degradation. These results demonstrate that dynamic and interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference.

Abstract:
We present SemanticNVS, a cameraconditioned multi-view diffusion model for novel view synthesis (NVS), which improves generation quality and consistency by integrating pre-trained semantic feature extractors. Existing NVS methods perform well for views near the input view, however, they tend to generate semantically implausible and distorted images under long-range camera motion, revealing severe degradation. We speculate that this degradation is due to current models failing to fully understand their conditioning or intermediate generated scene content. Here, we propose to integrate pre-trained semantic feature extractors to incorporate stronger scene semantics as conditioning to achieve high-quality generation even at distant viewpoints. We investigate two different strategies, (1) warped semantic features and (2) an alternating scheme of understanding and generation at each denoising step. Experimental results on multiple datasets demonstrate the clear qualitative and quantitative (4.69%-15.26% in FID) improvement over state-of-the-art alternatives. Our codebase and trained models will be released upon acceptance of the paper.

Abstract:
Multimodal agents demonstrate impressive problemsolving capabilities but typically operate in isolated episodes without leveraging past experiences. Recent methods address this through dynamic retrieval of textual insights or predefined skill documents, yet face critical challenges: visual modalities are neglected during knowledge extraction, stored insights lack executable structure, and manually crafted skills fail to scale. We propose \textscExSkill, a framework combining task-level Skills (structured workflows and tool templates) with action-level Experiences (context-specific tactical insights) through automated accumulation from agent trajectories. Our approach employs visually-grounded summarization to extract knowledge integrating visual observations and textual reasoning, hierarchical consolidation to maintain quality and diversity, and context-aware adaptation to tailor knowledge to current visual contexts. Evaluated on five diverse benchmarks spanning visual tool use and multimodal search, \textscExSkill achieves average gains of 4.1-6.5 points over strong baselines across different backbone models, with superior zero-shot transferability and strategic improvements in tool selection and execution accuracy. These results demonstrate that our framework enables transferable continual learning for multimodal agents in real-world scenarios without parametric training, offering broad applicability for practical deployment.

Abstract:
As Large Reasoning Models (LRMs) are increasingly deployed, auditing their chainof-thought (CoT) traces for safety becomes critical. Recent work has reported that monitorability—the degree to which CoT faithfully and informatively reflects internal computation—can appear as a "free gift" during the early stages of Reinforcement Learning with Verifiable Rewards (RLVR). We make this observation concrete through a systematic evaluation across model families and training domains. Our results show that this effect is not universal: monitorability improvements are strongly data-dependent. In particular, we demonstrate the critical role of data diversity and instruction-following data during RLVR training. We further show that monitorability is orthogonal to capability—improvements in reasoning performance do not imply increased transparency. Through mechanistic analysis, we attribute monitorability gains primarily to response distribution sharpening (entropy reduction) and increased attention to the prompt, rather than stronger causal reliance on reasoning traces. We also reveal how monitorability dynamics vary with controlled training and evaluation difficulty. Together, these findings provide a holistic view of how monitorability emerges under RLVR, clarifying when gains are likely to occur and when they are not.

Abstract:
Linear attention mitigates the quadratic complexity of softmax attention but suffers from a critical loss of expressiveness. We identify two primary causes: (1) The normalization operation cancels the query norm, which breaks the correlation between a query's norm and the spikiness (entropy) of the attention distribution as in softmax attention. (2) Standard techniques for enforcing nonnegativity cause destructive information loss by nullifying valid inner-product interactions. To address these challenges, we introduce NaLaFormer, a novel linear attention mechanism built upon a norm×direction (ND) decomposition of the query and key vectors. We leverage each component to solve a distinct problem: The query norm is injected into our kernel to create a query-norm-aware map that restores the attention distribution's spikiness. The direction vectors are processed by a geometric, cosine-based similarity metric that guarantees non-negativity while preserving the rich, fine-grained information of the inner product. We validate NaLaFormer through a comprehensive multi-modal evaluation, where it sets new state-of-the-art benchmarks for linear attention. Our model achieves up to a 7.5% accuracy gain on ImageNet-1K and a 4.7% mIoU improvement on ADE20K over comparable baselines. It demonstrates profound efficiency, reducing peak memory by a transformative 92.3% in token-intensive super-resolution tasks (70K+ tokens). NaLaFormer's versatility is further confirmed as it surpasses strong baselines like Mamba on common-sense reasoning and sets a new state-of-the-art on the Long Range Arena (LRA) benchmark. Source code can be found in the supplementary materials.

Abstract:
Contrastive visionlanguage models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings. To this end, we propose Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method for intrinsic gradient suppression. By applying a sequential probabilistic normalization, DSPT induces a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while maintaining informative updates. We also provide both theoretical analysis and empirical evidence about how this mechanism achieves adaptive suppression. This design transforms ``gradient vanishing'', traditionally a training bottleneck, into a principled noise-filtering shield for label-noise prompt tuning. Extensive experiments confirm that this simple, drop-in design achieves state-of-the-art robustness across various noisy benchmarks, outperforming methods with complex architectures and handcrafted hyperparameters.

Abstract:
Transformers excel in natural language processing and computer vision tasks. However, they still face challenges in generalizing to Outof-Distribution (OOD) datasets, i.e. data whose distribution differs from that seen during training. OOD detection aims to distinguish outliers while preserving in-distribution (ID) data performance. This paper introduces the OOD detection Probably Approximately Correct (PAC) Theory for transformers, which establishes the conditions for data distribution and model configurations for the OOD detection learnability of transformers. It shows that outliers can be accurately represented and distinguished with sufficient data under conditions. The theoretical implications highlight the trade-off between theoretical principles and practical training paradigms. By examining this trade-off, we naturally derived the rationale for leveraging auxiliary outliers to enhance OOD detection. Our theory suggests that by penalizing the misclassification of outliers within the loss function and strategically generating soft synthetic outliers, one can robustly bolster the reliability of transformer networks. This approach yields a novel algorithm that ensures learnability and refines the decision boundaries between inliers and outliers. In practice, the algorithm consistently achieves state-of-the-art (SOTA) performance across various data formats.

Abstract:
Current evaluation for Large Language Model (LLM) code agents predominantly focus on generating functional code in singleturn scenarios, which fails to evaluate the agent's capability for continuous code optimization and multi-turn iterative development. To bridge this gap, we introduce CATArena, a framework designed to evaluate the evolutionary capabilities of code agents via iterative tournaments. Agents engage in multi-turn tournaments and continuously refine their code through self-reflection and peer-learning based on comprehensive execution feedback. For evaluation, we propose a dual-metric system to decouple static generation proficiency from evolutionary potential. Extensive experiments reveal that an agent's evolutionary potential is not strictly correlated with its initial proficiency. Our analysis further reveals that current agents struggle to concurrently leverage both peer-learning and self-reflection for effective performance gains. Furthermore, the results validate CATArena's high extensibility and resistance to variance tasks, establishing it as a continuous and reliable standard for assessing the evolutionary capability of LLM code agents.

Abstract:
Large Language Models (LLMs) with agentic web search capabilities show strong potential for tasks requiring realtime information access and complex fact retrieval, yet evaluating such systems remains challenging. We introduce LiveNewsBench, a rigorous and regularly updated benchmark designed to assess the agentic web search abilities of LLMs. LiveNewsBench automatically generates fresh question-answer pairs from recent news articles, ensuring that questions require information beyond an LLM's training data and enabling clear separation between internal knowledge and search capability. The benchmark features intentionally difficult questions requiring multi-step search queries, page visits, and reasoning, making it well-suited for evaluating agentic search behavior. Our automated data curation and question generation pipeline enables frequent benchmark updates and supports construction of a large-scale training dataset for agentic web search models, addressing the scarcity of such data in the research community. To ensure reliable evaluation, we include a subset of human-verified samples in the test set. We evaluate a broad range of systems using LiveNewsBench, including commercial and open-weight LLMs as well as LLM-based web search APIs. The leaderboard, datasets, and code are publicly available at \urllivenewsbench.com.

Abstract:
We propose an anytime online algorithm for learning a sequence of convex cost functions while approximately satisfying a sequence of convex constraints, without prior knowledge of the time horizon. Both the cost and constraint functions may be chosen adversarially over time. While this problem has recently been resolved in the setting where the time horizon is known, extending these guarantees to the anytime setting, without resorting to inefficient doubling tricks, has remained technically challenging. Our main contribution is the introduction of a timevarying yet horizon-oblivious Lyapunov function to track constraint violations. The use of such a time-varying Lyapunov function introduces new technical difficulties, as a key monotonicity property underlying prior analyses no longer holds. By developing a novel analytical technique, we show that our algorithm achieves O(\sqrtt) \regret~ and \tildeO(\sqrtt) cumulative constraint violation (\CCV) for all t \geq 1. We further extend our framework to the dynamic regret setting, obtaining bounds that adapt to the unknown path length of the comparator sequence. Finally, we present an adaptive algorithm for the optimistic setting, whose performance scales gracefully with the cumulative prediction error. We validate the practical effectiveness of our approach through numerical experiments on the online shortest path problem.

Abstract:
The online construction of vectorized highdefinition (HD) maps is a cornerstone of modern autonomous driving systems. State-of-the-art approaches, particularly those based on the DETR framework, formulate this as an instance detection problem. However, their reliance on independent, learnable object queries results in a predominantly local query perspective, neglecting the inherent global representation within HD maps. In this work, we propose MapGR (Global Representation learning for HD Map construction), an architecture designed to learn and utilize a global representations from queries. Our method introduces two synergistic modules: a Global Representation Learning (GRL) module, which encourages the distribution of all queries to better align with the global map through a carefully designed holistic segmentation task, and a Global Representation Guidance (GRG) module, which endows each individual query with explicit, global-level contextual information to facilitate its optimization. Evaluations on the nuScenes and Argoverse2 datasets validate the efficacy of our approach, demonstrating substantial improvements in mean Average Precision (mAP) compared to leading baselines.

Abstract:
Aligning Large Language Models (LLMs) with specific personas typically relies on Supervised FineTuning (SFT) or Reinforcement Learning from Human Feedback (RLHF); however, these methods are resource-intensive, requiring expensive data collection and distinct model training for each target personality. In this work, we propose a parameter-efficient framework for continuous, multi-dimensional personality control via inference-time activation steering. Our approach addresses the challenge of combining multiple interventions by iteratively retraining probes on the residual stream modified by prior traits, ensuring compatibility. Once established, these steering vectors function as modular, reusable primitives; users can instantly synthesize novel, complex personality profiles by simply adjusting steering coefficients (α) without any additional training. To support this, we introduce an automated pipeline that identifies optimal intervention layers via activation separation analysis and calibrates coefficients via hyperparameter optimization to maximize alignment while constraining perplexity. Empirical evaluations validate individual trait shifts using an LLM-as-a-judge framework and demonstrate, via the Big Five inventory, that our method effectively modulates the model's holistic personality profile without updating base model parameters.

Abstract:
Visionlanguage models demonstrate strong reasoning and planning abilities, yet grounding these predictions into precise robot actions remains a central challenge. Existing Vision-Language-Action methods typically entangle reasoning and action generation, leading to limited generalization and costly adaptation. We propose to learn a Generalizable Action Expert (GAE), a task-agnostic model that converts sparse geometric plans into dense robot actions. Our approach introduces a sparse geometric interface: the VLM predicts sparse 3D waypoints representing high-level intention, while GAE maps these waypoints together with real-time point cloud observations to continuous action trajectories. GAE is pretrained on a large-scale pointcloud–trajectory dataset comprising 150k trajectories from both simulation and real-world robots. To further improve efficiency and generalization, we introduce an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme that decouples learning action dynamics from geometry grounding. After pretraining, GAE is frozen and reused across downstream tasks, requiring only lightweight fine-tuning of the VLM to produce the sparse interface. Extensive experiments show that our method achieves strong performance and generalization across diverse visual domains, camera viewpoints, and natural language instructions.

Abstract:
Numerous works have noted similarities in how machine learning models represent the world, even across modalities. Although much effort has been devoted to uncovering properties and metrics on which these models align, surprisingly little work has explored causes of this similarity. To advance this line of inquiry, this work explores how two factors—dataset overlap and task overlap—influence downstream model similarity. We evaluate the effects of both factors through experiments across model sizes and modalities, from small classifiers to large language models. We find that generally, both task and dataset overlap cause higher representational similarity. Finally, we consider downstream consequences of representational similarity, demonstrating how greater similarity increases vulnerability to transferable adversarial attacks.

Abstract:
Flowbased generative models provide strong unconditional priors for inverse problems, but guiding their dynamics for conditional generation remains challenging. Recent work casts training-free conditional generation in flow models as an optimal control problem; however, solving the resulting trajectory optimisation is computationally and memory intensive, requiring differentiation through the flow dynamics or adjoint solves. We propose MPC-Flow, a model predictive control framework that formulates inverse problem solving with flow-based generative models as a sequence of control sub-problems, enabling practical optimal control-based guidance at inference time. We provide theoretical guarantees linking MPC-Flow to the underlying optimal control objective and show how different algorithmic choices yield a spectrum of guidance algorithms, including regimes that avoid backpropagation through the generative model trajectory. We evaluate MPC-Flow on benchmark image restoration tasks, spanning linear and non-linear settings such as in-painting, deblurring, and super-resolution, and demonstrate strong performance and scalability to massive state-of-the-art architectures via training-free guidance of FLUX.2 (32B) in a quantised setting on consumer hardware.

Abstract:
Specialized attention heads dubbed induction heads (IHs) have been argued to underlie the remarkable incontext learning capabilities of modern language models; yet, a precise characterization of their emergence, especially in the context of language modeling, remains wanting. In this study, we investigate the relationship between statistical properties of the training data and IH formation in both natural and synthetic training data settings. We show that: (1) A simple equation combining batch size and context size predicts the point at which IHs form; (2) Surface bigram repetition frequency and reliability strongly affect the formation of IHs, and we find an effective Pareto frontier in terms of these two values; (3) local dependency with high bigram repetition frequency and reliability is sufficient for IH formation, but when the frequency and reliability are low, categoriality and the shape of the marginal distribution matter.

Abstract:
In this work, we proposeNatural Hypergradient Descent(NHGD), a new method for solving bilevel optimization problems. To address the computational bottleneck in hypergradient estimation—namely, the need to compute or approximate Hessian inverses—we exploit the statistical structure of the inner optimization problem and use the empirical Fisher information matrix as an asymptotically consistent surrogate for the Hessian. This design enables a paralleloptimizeand-approximateframework in which the Hessian-inverse approximation is updatedsynchronouslywith the stochastic inner optimization, reusing gradient information at negligible additional cost. Our main theoretical contribution establishes high-probability error bounds and sample complexity guarantees for NHGD that match those of state-of-the-art optimize-then-approximate methods, while significantly reducing computational time overhead. Empirical evaluations on representative bilevel learning tasks further demonstrate the practical advantages of NHGD, highlighting its scalability and effectiveness in large-scale machine learning settings.

Abstract:
It is increasingly important that LLM agents interact effectively and safely with other goalpursuing agents, yet, according to recent works, the opposite trend appears to be the case: LLMs with stronger reasoning capabilities behavelesscooperatively in mixed-motive games such as the prisoner's dilemma and in public goods settings. Indeed, our experiments show that recent models---with or without reasoning enabled---consistently defect on the other players in single-shot social dilemmas. To tackle this safety concern, we study game-theoretic mechanisms that are designed to enable cooperative outcomes between rational agentsin equilibrium. Across four social dilemmas testing distinct components of robust cooperation, we evaluate under the following mechanisms: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to delegate decision making to, and (4) contract agreements for outcome-conditional payments between players. Among our findings, we establish that contracting and mediation are most effective in achieving cooperative outcomes between capable LLM models, and that repetition-induced cooperation deteriorates drastically when co-players vary. Moreover, we demonstrate that these cooperation mechanisms becomemore effectivewith higher pressures to optimize for one own's utility.

Abstract:
Although upper bound guarantees for bilevel optimization have been widely studied, progress on lower bounds has been limited due to the complexity of the bilevel structure. In this work, we focus on the smooth nonconvexstrongly-convex setting and develop new hard instances that yield nontrivial lower bounds under deterministic and stochastic first-order oracle models. In the deterministic case, we prove that any first-order zero-respecting algorithm requires at least \Omega(\kappa^3/2\epsilon^-2) oracle calls to find an \epsilon-accurate stationary point, improving the optimal lower bounds known for single-level nonconvex optimization and for nonconvex-strongly-convex min-max problems. In the stochastic case, we show that at least \Omega(\kappa^5/2\epsilon^-4) stochastic oracle calls are necessary, again strengthening the best known bounds in related settings. Our results expose substantial gaps between current upper and lower bounds for bilevel optimization and suggest that even simplified regimes, such as those with quadratic lower-level objectives, warrant further investigation toward understanding the optimal complexity of bilevel optimization under standard first-order oracles.

Abstract:
Differentiating through the solution of a quadratic program (QP) is a central problem in differentiable optimization. Most existing approaches differentiate through the KarushKuhn--Tucker (KKT) system, but their computational cost and numerical robustness can degrade at scale. To address these limitations, we propose dXPP, a penalty-based differentiation framework that decouples QP solving from differentiation. In the solving step (forward pass), dXPP is solver-agnostic and can leverage any black-box QP solver. In the differentiation step (backward pass), we map the solution to a smooth approximate penalty problem and implicitly differentiate through it, requiring only the solution of a much smaller linear system in the primal variables. This approach bypasses the difficulties inherent in explicit KKT differentiation and significantly improves computational efficiency and robustness. We evaluate dXPP on various tasks, including randomly generated QPs, large-scale sparse projection problems, and a real-world multi-period portfolio optimization task. Empirical results demonstrate that dXPP is competitive with KKT-based differentiation methods and achieves substantial speedups on large-scale problems.

Abstract:
Large Audio–Language Models (LALMs) are becoming essential as a powerful multimodal backbone for realworld applications. However, recent studies show that audio inputs can more easily elicit harmful responses than text, exposing new risks toward deployment. While safety alignment has made initial advances in LLMs and Large Vision–Language Models (LVLMs), we find that vanilla adaptation of these approaches to LALMs faces two key limitations: 1) LLM-based steering fails under audio input due to the large distributional gap between activations, and 2) prompt-based defenses induce over-refusals on benign-speech queries. To address these challenges, we proposeSafe-AblatedRefusalSteering (SARSteer), an effective inference-time defense framework for LALMs. Specifically, SARSteer leverages text-derived refusal steering to enforce rejection without manipulating audio inputs and introduces decomposed safe-space ablation to mitigate over-refusal. Extensive experiments demonstrate that SARSteer significantly improves harmful-query refusal while preserving benign responses, establishing a principled step toward safety alignment in LALMs.

Abstract:
With the ongoing integration of machine learning systems into the everyday social life of millions the notion of fairness becomes an ever increasing priority in their development. Fairness notions commonly rely on protected attributes to assess potential biases. Here, the majority of literature focuses on discrete setups regarding both target and protected attributes. The literature on continuous attributes especially in conjunction with regression - we refer to this ascontinuous fairness- is scarce. A common strategy is iterative null-space projection which as of now has only been explored for linear models or embeddings such as obtained by a non-linear encoder. We improve on this by extending this to kernel induced feature spaces by means of the ``empirical feature space''. We theoretically derive this as a direct transformation of the kernel matrix yielding a model and fairness-score agnostic method applicable to continuous protected attributes. We demonstrate that our novel approach in conjunction with Support Vector Regression (SVR) provides competitive or improved performance across multiple datasets in comparisons to other contemporary methods.

Abstract:
Current autoregressive models can generate high-quality, topologically precise meshes; however, they necessitate thousands—or even tens of thousands—of next-token predictions during inference, resulting in substantial latency. We introduce XSpecMesh, a quality-preserving acceleration method for auto-regressive mesh generation models. XSpecMesh employs a lightweight, multi-head speculative decoding scheme to predict multiple tokens in parallel within a single forward pass, thereby accelerating inference. We further propose a verification and resampling strategy: the backbone model verifies each predicted token and resamples any tokens that do not meet the quality criteria. In addition, we propose a distillation strategy that trains the lightweight decoding heads by distilling from the backbone model, encouraging their prediction distributions to align and improving the success rate of speculative predictions. Extensive experiments demonstrate that our method achieves a 1.7× speedup without sacrificing generation quality. Our code will be released.

Abstract:
Biological learning proceeds from easy to difficult tasks, gradually reinforcing perception and robustness. Inspired by this principle, we address Context‑Entangled Content Segmentation (CECS)—a challenging setting where objects share intrinsic visual patterns with their surroundings, as in camouflaged object detection. Conventional segmentation networks predominantly rely on architectural enhancements but often ignore the learning dynamics that govern robustness under entangled data distributions. We introduce CurriSeg, a dual‑phase learning framework that unifies curriculum and anti‑curriculum principles to improve representation reliability. In the Curriculum Selection phase, CurriSeg dynamically selects training data based on the temporal statistics of sample losses, distinguishing hard‑but‑informative samples from noisy or ambiguous ones, thus enabling stable capability enhancement. In the Anti‑Curriculum Promotion phase, we design Spectral‑Blindness Fine‑Tuning, which suppresses high‑frequency components to enforce dependence on low‑frequency structural and contextual cues. Combined with Sharpness‑Aware Minimization, this promotes convergence to flatter minima and strengthens generalization. Extensive experiments demonstrate that CurriSeg achieves consistent improvements across diverse CECS benchmarks without adding parameters or increasing total training time, offering a principled view of how progression and challenge interplay to foster robust and context‑aware segmentation. Code will be released.

Abstract:
Aligning large language models (LLMs) to serve users with heterogeneous and potentially conflicting preferences is a central challenge for personalized and trustworthy AI. We formalize an ideal notion of universal alignment through testtime scaling: for each prompt, the model produces k\ge 1 candidate responses and a user selects their preferred one. We introduce (k,f(k))-robust alignment, which requires the k-output model to have win rate f(k) against any other single-output model, and asymptotic universal alignment (U-alignment), which requires f(k)\to 1 as k\to\infty. Our main result characterizes the optimal convergence rate: there exists a family of single-output policies whose k-sample product policies achieve U-alignment at rate f(k)=\frackk+1, and no method can achieve a faster rate in general. We show that popular post-training methods, including Nash learning from human feedback (NLHF), can fundamentally underutilize the benefits of test-time scaling. Even though NLHF is optimal for k=1, sampling from the resulting (often deterministic) policy cannot guarantee win rates above \tfrac12 except for an arbitrarily small slack. This stems from a lack of output diversity: existing alignment methods can collapse to a single majority-preferred response, making additional samples redundant. In contrast, our approach preserves output diversity and achieves the optimal test-time scaling rate. In particular, we propose a family of symmetric multi-player alignment games and prove that any symmetric Nash equilibrium policy of the (k+1)-player alignment game achieves the optimal (k,\frackk+1)-robust alignment. Finally, we provide theoretical convergence guarantees for self-play learning dynamics in these games and extend the framework to opponents that also generate multiple responses.

Abstract:
Reinforcement learning (RL) has advanced LLM agents on verifiable tasks but remains challenging for openended tasks with vast solution spaces (e.g., complex travel planning). Lacking objective ground truth, current RL algorithms rely on reward models assigning scalar scores to individual responses. We contend such pointwise scoring induces discrimination collapse: reward model fails to distinguish subtle advantages among trajectories, compressing intra-group rewards into a narrow range. This drowns effective reward signals in reward model noise, causing optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm shifting from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation with multi-level rubrics for fine-grained relative scoring. Meanwhile, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. ArenaRL achieves high-precision advantage estimation with only O(N) computational complexity, striking a favourable balance between efficiency and accuracy. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we introduce two high-quality benchmarks: Open-Travel and Open-DeepResearch, encompassing full training and multi-dimensional evaluation pipelines. Extensive experiments across three open-ended tasks validate the effectiveness of ArenaRL.

Abstract:
Large language models have recently enabled textto-CAD systems that synthesize parametric CAD programs (e.g., CadQuery) from natural-language prompts. In practice, however, geometric descriptions can be under-specified or internally inconsistent: critical dimensions may be missing and constraints may conflict. However, existing fine-tuned models tend to reactively follow the user’s instructions and hallucinate dimensions when the text is ambiguous. To address this, we propose a proactive agentic framework for text-to-CadQuery generation, named as ProCAD, that resolves specification issues before code synthesis. Our framework pairs a proactive clarifying agent, which audits the prompt and asks targeted clarification questions only when necessary to produce a self-consistent specification, with a CAD coding agent that translates the specification into an executable CadQuery program. We fine-tune the coding agent based on a curated high-quality text-to-CadQuery dataset and train the clarifying agent via agentic SFT on clarification trajectories. Experiments show that proactive clarification significantly improves robustness to ambiguous prompts while keeping interaction overhead low. ProCAD outperforms frontier closed-source models, including Claude Sonnet 4.5, reducing the mean Chamfer distance by 79.9% and lowering the invalidity ratio from 4.8% to 0.9%. Our code and datasets will be made publicly available.

Abstract:
Spatiotemporal neural dynamics and oscillatory synchronization are widely implicated in biological information processing and have been hypothesized to support flexible coordination such as feature binding. By contrast, most deep learning architectures represent and propagate information through activation values, neglecting the joint dynamics of rate and phase. In this work, we introduce Kuramoto oscillatory Phase Encoding (KoPE) as an additional, evolving phase state to Vision Transformers, incorporating a neuroinspired synchronization mechanism to advance learning efficiency. We show that KoPE can improve training, parameter, and data efficiency of vision models through synchronization-enhanced structure learning. Moreover, KoPE benefits tasks requiring structured understanding, including semantic and panoptic segmentation, representation alignment with language, and few-shot abstract visual reasoning (ARC-AGI). Theoretical analysis and empirical verification further suggest that KoPE can accelerate attention concentration for learning efficiency. These results indicate that synchronization can serve as a scalable, neuro-inspired mechanism for advancing state-of-the-art neural network models.

Abstract:
Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inferencetime steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from model internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that it consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), averaging 14% accuracy improvements and 49% ECE improvements. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.

Abstract:
Existing backdoor attacks on Reinforcement Learning (RL) typically rely on unrealistic whitebox access to victim parameters, rewards, or observations. Inspired by real world behaviors, we introduce the Supply-Chain Backdoor (SCAB) attack to demonstrate that such assumptions are unnecessary. SCAB targets the common practice of training with third-party policies, poisoning the dataset solely through a black-box of legitimate agent-environment interactions. With only 3% data corruption, SCAB achieves a 90% attack success rate and reduces victim returns by 80%. These findings expose a critical vulnerability in the modern RL supply chain, highlighting that reliance on untrusted external agents constitutes a severe and practical security risk.

Abstract:
We consider minimizing an objective function subject to constraints defined by the intersection of lowerlevel sets of convex functions. We study two cases: (i) strongly convex and Lipschitz-smooth objective function and (ii) convex but possibly nonsmooth objective function. To deal with the constraints that are not easy to project on, we use a randomized feasibility algorithm with Polyak steps and a random number of sampled constraints per iteration, while taking (sub)gradient steps to minimize the objective function. For case (i), we prove linear convergence in expectation of the objective function values to any prescribed tolerance using an adaptive stepsize. For case (ii), we develop a fully problem parameter-free and adaptive stepsize scheme that yields an O(1/\sqrtT) worst-case rate in expectation. The infeasibility of the iterates decreases geometrically with the number of feasibility updates almost surely, while for the averaged iterates, we establish an expected lower bound on the function values relative to the optimal value that depends on the distribution for the random number of sampled constraints. For certain choices of sample-size growth, optimal rates are achieved. Finally, simulations on a Quadratically Constrained Quadratic Programming (QCQP) problem and Support Vector Machines (SVM) demonstrate the computational efficiency of our algorithm compared to other state-of-the-art methods.

Abstract:
Speculative decoding can significantly accelerate LLM serving, but its realworld benefits often erode due to training–serving mismatch and non-stationary traffic. Unlike previous systems that decouple speculator training from inference, we present a unified training–serving system, Aurora, that closes this loop by continuously learning a speculator model directly from live inference traces. Our design integrates an SGLang-based inference server with an asynchronous training server connected via efficient GPU-to-GPU RPC, enabling hot-swapped speculator updates without service interruption. Crucially, our system supports day-0 deployment: a speculator can be served immediately and quickly adapted on live traffic, improving overall system throughput. This paradigm shift enables us to frame the training–serving loop as an asynchronous reinforcement learning process and allows us to leverage rejected tokens from the speculator to improve sampling efficiency. Our experiments show that this unified system achieves a 1.33× speedup in the mixed-data scenario when starting from a scratch speculator, and a 1.48× speedup compared to a static speculator. We also find that the system adapts more effectively to distribution shifts in user traffic, delivering a 1.25× speedup over a well-trained but static speculative decoding.

Abstract:
The hallucination of code generation models hinders their applicability to systems requiring higher safety standards. One critical bottleneck in addressing code hallucination is the difficulty of identifying the functional correctness of generated code, due to its unnatural form. We address this core bottleneck by automatically generating unit tests using dynamic code analysis tools, leveraging the \emphexecutable nature of code. Accordingly, we propose \emphselective code generator that abstains from uncertain generations based on the functional correctness evaluated by generated unit tests -- to theoretically control the correctness among non-abstained answers, \ie the false discovery rate. Finally, we propose to use generated unit tests in evaluation as well as in learning for precise code evaluation, calling this paradigm \emphFuzzEval. We demonstrate the efficacy of our method along with the controllability of code hallucination and reasonable selection efficiency.

Abstract:
Foundation models must handle multiple generative processes, yet mechanistic interpretability largely studies capabilities in isolation; it remains unclear how a single transformer organizes multiple, potentially conflicting "world models". Previous experiments on Othello playing neuralnetworks test world-model learning but focus on a single game with a single set of rules. We introduceMetaOthello, a controlled suite of Othello variants with shared syntax but different rules or tokenizations, and train small GPTs on mixed-variant data to study how multiple world models are organized in a shared representation space. We find that transformers trained on mixed-game data do not partition their capacity into isolated sub-models; instead, they converge on a mostly shared board-state representation that transfers causally across variants. Linear probes trained on one variant can intervene on another's internal state with effectiveness approaching that of matched probes. For isomorphic games with token remapping, representations are equivalent up to a single orthogonal rotation that generalizes across layers. When rules partially overlap, early layers maintain game-agnostic representations while a middle layer identifies game identity, and later layers specialize.MetaOthellooffers a path toward understanding not just whether transformers learn world models, but how they organize many at once.

Abstract:
Repositorylevel code editing requires models to understand complex dependencies and execute precise multi-file modifications across a large codebase. While recent gains on SWE-bench rely heavily on complex agent scaffolding, it remains unclear how much of this capability can be internalised via high-quality training signals. To address this, we propose Clean Pull Request (Clean-PR), a mid-training paradigm that leverages real-world GitHub pull requests as a training signal for repository-level editing. We introduce a scalable pipeline that converts noisy pull request diffs into Search/Replace edit blocks through reconstruction and validation, resulting in the largest publicly available corpus of 2 million pull requests spanning 12 programming languages. Using this training signal, we perform a mid-training stage followed by an agentless-aligned supervised fine-tuning process with error-driven data augmentation. On SWE-bench, our model significantly outperforms the instruction-tuned baseline, achieving absolute improvements of 13.6% on SWE-bench Lite and 12.3% on SWE-bench Verified. These results demonstrate that repository-level code understanding and editing capabilities can be effectively internalised into model weights under a simplified, agentless protocol, without relying on heavy inference-time scaffolding.

Abstract:
Optimizing the performance of largescale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather than how to fix code. We introduce SWE-fficiency, a benchmark for evaluating repository-level performance optimization on real workloads. Our suite contains 498 tasks across nine widely used data-science, machine-learning, and HPC repositories (e.g., numpy, pandas, scipy): given a complete codebase and a slow workload, an agent must investigate code semantics, localize bottlenecks and relevant tests, and produce a patch that matches or exceeds expert speedup while passing the same unit tests. To enable this how-to-fix evaluation, our automated pipeline scrapes GitHub pull requests for performance-improving edits, combining keyword filtering, static analysis, coverage tooling, and execution validation to both confirm expert speedup baselines and identify relevant repository unit tests. Empirical evaluation of state-of-the-art agents reveals significant underperformance. On average, top agents achieve less than 0.23x the expert speedup: agents struggle in localizing optimization opportunities, reasoning about execution across functions, and maintaining correctness in proposed edits. We release the benchmark and accompanying data pipeline to facilitate research on automated performance engineering and long-horizon software reasoning.

Abstract:
There has been a growing interest in using neural networks, especially messagepassing neural networks (MPNNs), to solve hard combinatorial optimization problems heuristically. However, existing learning-based approaches for hard combinatorial optimization tasks often rely on supervised training data, reinforcement learning, or gradient estimators, leading to significant computational overhead, unstable training, or a lack of provable performance guarantees. In contrast, classical approximation algorithms offer such performance guarantees under worst-case inputs but are non-differentiable and unable to adaptively exploit structural regularities in natural input distributions. We address this dichotomy with the fundamental example of Uniform Facility Location (UniFL), a variant of the combinatorial facility location problem with applications in clustering, data summarization, logistics and supply chain design. We develop a fully differentiable MPNN model that embeds approximation-algorithmic principles while avoiding the need for solver supervision or discrete relaxations. Our approach admits provable approximation and size generalization guarantees to much larger instances than seen during training. Empirically, we show that our approach outperforms standard non-learned approximation algorithms in terms of solution quality, closing the gap with computationally intensive integer linear programming approaches. Overall, this work provides a step toward bridging learning-based methods and approximation algorithms for discrete optimization.

Abstract:
Testtime scaling strategies have effectively leveraged inference-time compute to enhance the reasoning abilities of Autoregressive Large Language Models. In this work, we demonstrate that Masked Diffusion Language Models (MDLMs) are inherently amenable to advanced search strategies, owing to their iterative and non-autoregressive generation process. To leverage this, we proposeUnMaskFork(UMF), a framework that formulates the unmasking trajectory as a search tree and employs Monte Carlo Tree Search to optimize the generation path. In contrast to standard scaling methods relying on stochastic sampling, UMF explores the search space through deterministic partial unmasking actions performed by multiple MDLMs. Our empirical evaluation demonstrates that UMF consistently outperforms existing test-time scaling baselines on complex coding benchmarks, while also exhibiting strong scalability on mathematical reasoning tasks.

Abstract:
We propose CLoVE (Clustering of Loss Vector Embeddings), a novel algorithm for Clustered Federated Learning (CFL). In CFL, clients are naturally grouped into clusters based on their data distribution. However, identifying these clusters is challenging, as client assignments are unknown. CLoVE utilizes client embeddings derived from model losses on client data, and leverages the insight that clients in the same cluster share similar loss values, while those in different clusters exhibit distinct loss patterns. Based on these embeddings, CLoVE is able to iteratively identify and separate clients from different clusters and optimize clusterspecific models through federated aggregation. Key advantages of CLoVE over existing CFL algorithms are (1) its simplicity, (2) its applicability to both supervised and unsupervised settings, and (3) the fact that it eliminates the need for near-optimal model initialization, which makes it more robust and better suited for real-world applications. We establish theoretical convergence bounds, showing that CLoVE can recover clusters accurately with high probability in a single round and converges exponentially fast to optimal models in a linear setting. Our comprehensive experiments comparing with a variety of both CFL and generic Personalized Federated Learning (PFL) algorithms on different types of datasets and an extensive array of non-IID settings demonstrate that CLoVE achieves highly accurate cluster recovery in just a few rounds of training, along with state-of-the-art model accuracy, across a variety of both supervised and unsupervised PFL tasks.

Abstract:
Maximum entropy has become a mainstream offpolicy reinforcement learning (RL) framework for balancing exploitation and exploration. However, two bottlenecks still limit further performance gains: (1) non-stationary Q-value estimation stemming from the joint injection of entropy and the concurrent updating of its temperature parameter; and (2) short-sighted local entropy tuning, which adjusts temperature solely based on current single-step entropy without accounting for cumulative entropy over time. In this paper, we broaden the maximum entropy framework by proposing a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these limitations. We begin by introducing reward-entropy separation (RES) to decouple the value targets, ensuring they remain stable and unaffected by temperature fluctuations. Subsequently, the resulting entropy Q-function is leveraged to explicitly quantify expected cumulative entropy, allowing for the enforcement of a trajectory entropy constraint (TEC) to govern long-term stochasticity. We instantiate this framework as DSAC-E, a practical off-policy algorithm that builds upon the latest distributional soft actor-critic. Extensive evaluations across 10 challenging tasks in locomotion, robotic manipulation, and vision-based driving domains demonstrate that DSAC-E consistently outperforms baselines in both cumulative returns and training stability.

Abstract:
Graph neural networks are widely used for node classification, but they remain vulnerable to outof-distribution (OOD) shifts in node features and graph structure. Prior work established that methods trained with standard supervised learning (SL) objectives tend to capture spurious signals from either features and/or structure, leaving the model fragile under distributional changes. To address this, we propose \textscTide, a novel and effective \underlineTri-Component \underlineInformation \underlineDecomposition framework that explicitly decomposes information into feature-specific, structure-specific and joint components. \textscTide aims to preserve only the label-relevant part of the joint information while filtering out spurious feature- and structure-specific information, thereby enhancing the separation between in-distribution (ID) and OOD nodes. Beyond the framework, we provide theoretical and empirical analyses showing that an information bottleneck objective is preferable to standard SL for graph OOD detection, with higher ID confidence and a greater entropy gap between ID and OOD data. Extensive experiments across seven datasets confirm the efficacy of \textscTide, achieving up to a 34% improvement in FPR95 over strong baselines while maintaining competitive ID accuracy. Code will be released upon acceptance.

Abstract:
Despite recent progress, LLM agents still struggle with reasoning over long interaction histories. While current memoryaugmented agents rely on a static ``retrieve-then-reason'' paradigm, this rigid pipeline design prevents them from dynamically adapting memory access to intermediate evidence discovered during inference. To bridge this gap, we propose MRAgent, a framework that combines an associative memory graph with an active reconstruction mechanism. We represent memory as a Cue–Tag–Content graph, where associative tags serve as semantic bridges connecting fine-grained cues to memory contents. Operating on this structure, our active reconstruction mechanism integrates LLM reasoning directly into memory access, allowing the agent to iteratively explore and prune retrieval paths based on accumulated evidence. This ensures that memory retrieval is dynamically adapted to the reasoning context while avoiding combinatorial explosion caused by unconstrained expansion. Experiments on the LoCoMo benchmark and LongMemEval benchmark demonstrate significant improvements over strong baselines (up to 23%), while substantially reducing retrieval cost, highlighting the effectiveness of active and associative reconstruction for long-horizon memory reasoning.

Abstract:
Learning causal relationships between variables from data is a fundamental research area with many applications across disciplines. Most of the existing causal discovery algorithms rely on the assumptions that (i) the underlying system is acyclic, (ii) the exogenous noise variables are Gaussian, and (iii) that the intervention targets for the data generating experiments are known. While these assumptions simplify the analysis, they are violated in reallife systems. Most existing methods that address these issues either assume the underlying model is linear or are constrained to operate in limited interventional settings. To that end, we propose SCOUT, a novel causal discovery framework to learn nonlinear causal cyclic relationships from soft interventional data with unknown targets. Our main approach maximizes the data log-likelihood to recover the graph structure, using two normalizing-flow architectures—contractive residual flows and neural spline flows. By conducting experiments on synthetic and real-world data, we show that SCOUT outperforms state-of-the-art methods in both causal graph and unknown target recovery across various interventional and noise settings.

Abstract:
The distortion–perception (D–P) tradeoff is a fundamental phenomenon of Bayesian inverse problems, which characterizes the inherent tension between distortion performance and perceptual quality. Enabling flexible traversal of the DP tradeoff at inference time is crucial for practical applications. Despite the recent success of diffusion models in zero-shot inverse problem solving, efficient and principled strategies for D-P traversal in diffusion-based inverse algorithms remain inadequately characterized. In this paper, we propose a stage-wise framework for realizing D-P traversal using a single diffusion model in zero-shot inverse problems. Our proposed method, termed MAP-RPS, starts with an MAP estimation stage that approximates the MMSE solution and provides a low-distortion initialization, followed by a re-noised posterior sampling stage that progressively improves perceptual quality. We provide theoretical analyses for both stages, establishing the validity and effectiveness of the proposed design. Furthermore, we extend MAP-RPS to the latent space, yielding LMAP-RPS, which enjoys broader applicability by leveraging large-scale pre-trained latent diffusion backbones. Extensive experiments demonstrate that MAP-RPS and LMAP-RPS enable more effective D-P traversal on various tasks, while also exhibiting strong performance as efficient solvers for real-world inverse problems.

Abstract:
We introduce Spectral Guidance, a framework for controlling diffusion models by leveraging the intrinsic geometry of the generative process. As data is progressively corrupted by noise, only a small number of features remain informative for control. We characterize them as the singular functions of a conditional expectation operator and show that they can be learned via a selfsupervised objective. Once recovered, this basis enables the projection of arbitrary guidance signals, such as labels, CLIP embeddings, or masks, directly onto the sampling trajectory. This approach allows for stable, high-fidelity control without retraining or denoiser backpropagation during sampling. Empirically, we improve conditional accuracy on CIFAR-10 by 37 percentage points over the strongest training-free baseline while offering 4× faster sampling. Moreover, the same representations that support label and CLIP guidance also enable spatial control, such as mask-based guidance, without auxiliary models. Finally, our framework reveals a phase transition in the generative process, pinpointing the optimal time window for effective guidance.

Abstract:
In the era of Large VideoLanguage Models (LVLMs), the computational necessity of sparse frame sampling creates a fundamental ``temporal gap'', rendering models blind to critical causal transitions. Existing solutions relying on generative hallucination (e.g., latent diffusion) or autoregressive extrapolation often fail to maintain semantic consistency over long horizons, suffering from object vanishing and energetic instability. We propose a paradigm shift from probabilistic generation to variational mechanics with the Semantic Least Action Principle (SLAP). Drawing a rigorous isomorphism between classical mechanics and semantic dynamics, we model the latent video trajectory as a path on a Riemannian manifold governed by a Semantic Lagrangian. By formulating the interpolation task as a Boundary Value Problem (BVP) solved via the discrete Euler-Lagrange equations, SLAP naturally enforces object persistence without pixel-level rendering. Extensive experiments on multiple challenging datasets show the effectiveness of our proposed SLAP.

Abstract:
Visual agents employ tools such as zoomin cropping within visual chains of thought to access fine-grained details. Prior work has primarily demonstrated the effectiveness of these tools on visual search tasks, leaving their applicability to more diverse and complex visual problems underexplored. In this paper, we move beyond visual search and study challenging visual tasks that require advanced spatial understanding and reasoning, such as 3D spatial reasoning, where agents must not only crop or zoom in on relevant regions but also understand how these local details relate to the global context. We identify a tool-use collapse phenomenon: models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that vanilla training and rewarding tool-use encouragement reduce rollout diversity during training, explaining why higher tool-use does not yield stronger reasoning performance. Motivated by these findings, we encourage diverse rollout exploration by adding an entropy-regularization term to the reinforcement learning objective, which results in the best performance despite tool usage gradually declining during training. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration in both text and vision shapes representations that improve despite tool-use collapse.

Abstract:
Retrievalaugmented generation (RAG) systems are vulnerable to attacks that inject poisoned passages into the retrieved context, even at low corruption rates. We show that existing attacks are not designed to be stealthy, allowing reliable detection and mitigation. We formalize a distinguishability-based security game to quantify stealth for such attacks. If a few poisoned passages control the response, they must bias the inference process more than the benign ones, inherently compromising stealth. This motivates analyzing intermediate signals of LLMs, such as attention weights, to approximate the influence of different passages on the response. Leveraging attention weights, we introduce the Normalized Passage Attention Score (NPAS) and a lightweight Attention-Variance Filter (AV Filter) that flags anomalous passages. Our method improves robustness, yielding up to ~20% higher accuracy than baseline defenses. We also develop adaptive attacks that attempt to conceal such anomalies, achieving up to 35% success rate and underscoring the challenges of achieving true stealth in poisoning RAG systems.

Abstract:
Achieving robust generalization from limited data is a central challenge in embodied intelligence. Prevailing methods fail by regressing absolute coordinates, which violates the principle of general covariance. Theoretically, this conflates the intrinsic task geometry with rigid execution patterns, binding policies to specific motion styles and fixed speeds. To resolve this, we propose the Generalized Action Manifold (GAM) framework that enforces general covariance through structural disentanglement. Specifically, GAM constructs the manifold by enforcing invariance across two orthogonal dimensions: (1) Temporal Invariance, utilizing an ArcLength Parameterizer to orthogonalize the spatial path geometry from temporal dynamics, ensuring robustness to velocity variations; (2) Geometric Invariance, where a Schema-Affine-Factorization mechanism maps trajectories to canonical “world lines” in the Lie-algebraic tangent space. This distinguishes invariant topological schemas from affine modulations, ensuring spatial generalizability. By integrating GAM within a structured Vision-Language-Action (VLA) architecture, we expand sparse training data into a continuous, valid action manifold. Empirical results demonstrate that GAM enables superior transfer and robustness capabilities, significantly outperforming geometry-agnostic baselines.

Abstract:
Rotationbased Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing activation rotations into attention and FFN blocks, but suffer from limited expressivity as they are constrained to use a single learnable rotation matrix across all layers. To tackle this, layer-wise transformation methods emerged, achieving superior accuracy through localized adaptation. However, layer-wise methods cannot fuse activation rotation matrices into weights, requiring online computations and causing significant overhead. In this paper, we proposeReSpinQuant, a quantization framework that resolves such overhead by leveraging offline activation rotation fusion and matching basis using efficient residual subspace rotation. This design reconciles the high expressivity of layer-wise adaptation with only negligible inference overhead. Extensive experiments on W4A4 and W3A3 quantization demonstrate that ReSpinQuant achieves state-of-the-art performance, outperforming global rotation methods and matching the accuracy of computationally expensive layer-wise methods with minimal overhead.

Abstract:
Riemannian diffusion models generalize scorebased generative modeling to manifold-supported data via stochastic diffusion equations on the manifold. However, training requires sampling from and differentiating the manifold heat kernel, which is rarely available in closed form beyond a few highly symmetric manifolds. We propose a general approach that approximates the heat kernel by directly solving the manifold heat equation with a physics-informed neural network (PINN). Given an explicit manifold specification, we choose a coordinate system, derive the corresponding heat (Fokker--Planck) equation and a short-time asymptotic approximation, and then train a PINN to learn the log heat kernel. The resulting surrogate enables both forward noising (heat-kernel sampling) and conditional-score evaluation for denoising score matching. We demonstrate the method on diverse manifolds including S^2, SO(3), \mathrmSPD(n), and permutation-quotiented point clouds.

Abstract:
Multivector retrieval (MVR) models, exemplified by ColBERT, have established new benchmarks in retrieval accuracy by preserving fine-grained token-level interactions. However, this granularity imposes prohibitive storage and retrieval efficiency bottlenecks: to manage the immense memory footprint and computational overhead of billion-scale token vectors, state-of-the-art systems are forced to rely on aggressive dimension reduction and complex clustering (e.g., K-means). This compromise introduces two critical limitations: excessive indexing latency of clustering large-scale corpora and semantic information loss inherent to compression. In this paper, we propose Single-Stage Sparse Retrieval (SSR), a paradigm shift that replaces expensive clustering with efficient sparse coding. Instead of compressing features into low-dimensional dense vectors, we utilize Sparse Autoencoder (SAE) to project token embeddings into a high-dimensional but highly sparse representation. This transformation enables us to bypass vector clustering entirely and leverage inverted indexing for precise, high-throughput retrieval. Extensive experiments on the BEIR benchmark demonstrate that SSR achieves a "trifecta" of improvements: it reduces indexing time by 15× compared to ColBERTv2, halves retrieval latency, and simultaneously improves retrieval performance over leading baselines.

Abstract:
Recent diffusionbased approaches have made substantial progress in image layer decomposition. However, accurately decomposing complex natural images remains challenging due to difficulties in occlusion completion, robust layer disentanglement, and precise foreground boundaries. Moreover, the scarcity of high-quality multi-layer natural image datasets limits advancement. To address these challenges, we proposeRevealLayer, a diffusion-based framework that decomposes an RGB image into multiple RGBA layers, enabling precise layer separation and reliable recovery of occluded content in natural images. RevealLayer incorporates three key components: (1) aRegion-Aware Attentionmodule to disentangle hidden and visible layers; (2) anOcclusion-Guided Adapterto leverage contextual information to enhance overlapping regions; and (3) acomposite lossto enforce sharp alpha boundaries and suppress residual artifacts. To support training and evaluation, we introduceRevealLayer-100K, a high-quality multi-layer natural image constructed through a collaboration between automated algorithms and human annotation, and further establishRevealLayerBenchfor benchmarking layer decomposition in general natural scenes. Extensive experiments demonstrate that RevealLayer consistently outperforms existing approaches in layer decomposition.

Abstract:
Designing genetic circuits, which are biological systems capable of programmed behaviors within living cells, remains a laborious, expertdriven process despite decades of progress in synthetic biology. We introduce GenCircuit-RL, a reinforcement learning framework that trains language models to reason about genetic circuit design through code generation, where models produce Python code using PySBOL to construct circuits in the standardized Synthetic Biology Open Language (SBOL) format. Our approach addresses the challenge of sparse feedback in biological design through hierarchical verification rewards that decompose correctness into five levels, from code execution through structural validity to functional behavior, providing dense learning signal while multiplicative dependencies prevent reward hacking. We contribute SynBio-Reason, a benchmark of approximately 4,753 circuits spanning six canonical circuit types and nine tasks from code repair to de novo design, with held-out biological parts enabling rigorous out-of-distribution evaluation. A four-stage curriculum progressively shifts optimization pressure from basic code generation toward functional correctness, enabling models to acquire compositional reasoning capabilities incrementally. Our framework demonstrates that hierarchical verification combined with curriculum learning enables compact language models to generate functionally correct genetic circuits, including generalization to novel biological parts and rediscovery of canonical designs from synthetic biology literature.

Abstract:
Token mixing layers play a key role in how language models can learn and generate longrange dependencies. Their efficiency relies on the necessary trade-off between decoding speed and the memory requirements, along with the cache size. Considering causal generation, this paper explores new trade-offs thanks to a unified framework which separates two crucial features: (i) the direct influence of inputs on outputs in one generation step; (ii) the recurrent propagation of information through past outputs. This framework encompasses major architectures such as attention and state-space models, but also generalizes the recurrence equations by allowing each state to depend on multiple past states rather than only the immediate predecessor. By introducing structure, we design new recurrence patterns that provably achieve the desired complexity, while providing theoretical insights on their expressivity -- trading runtime for expressivity in a principled way. Empirical validation is performed on synthetic tasks, along with language modeling. Together, these results provide a unified toolkit for the understanding and design of efficient and expressive token mixers across model families.

Abstract:
Latent state space systems are ubiquitous in statistical modelling, arising naturally when a timeseries is observed through a noisy measurement function, however training deep state space models (DSSM) at scale remains difficult. Two largely distinct strategies and literatures have developed around the training of DSSMs. Firstly, auto-encoding DSSMs train generative DSSMs by optimising a variational lower bound. Secondly, DSSMs trained by back-propagating the outputs of a classical sequential Monte Carlo algorithm (SMC). Such approaches can train DSSMs for discriminative as well as generative tasks, however, due to the sequentiality of their forward pass, scale poorly on modern hardware. We propose a new training method _Parallel Variational Monte Carlo_ (PVMC) that bridges the gap between the paradigms, and can be used robustly to train DSSMs for both discriminative and generative tasks. Our method achieves state-of-the-art or better results on a set of baseline experiments and trains 10× faster than the fastest competing SMC approach.

Abstract:
Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of largescale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5–20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.

Abstract:
Traditional time series models are often taskspecific and rely heavily on manual feature engineering. While Transformer-based architectures have revolutionized sequence modeling in language and vision, their potential for general-purpose time series representation learning remains underexplored, particularly for heterogeneous sensor data. We introduce CHARM (Channel-Aware Representation Model), a model designed to improve representations for multivariate time series by incorporating channel-level textual descriptions into its architecture. This allows the model to leverage contextual information associated with individual sensors while remaining invariant to channel order. CHARM is trained using a Joint Embedding Predictive Architecture (JEPA) with a novel loss that promotes informative and temporally stable embeddings. We find that CHARM’s latent-space prediction encourages robustness to sensor-level noise and supports learning underlying temporal structure. In addition, the description-aware gating mechanism provides a degree of interpretability through learned inter-channel relationships. Across a range of downstream tasks—including univariate and multivariate anomaly detection, classification, and short- and long-term forecasting—the learned embeddings achieve strong performance using only a lightweight linear probe.

Abstract:
Largescale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is subsequently re-ranked by a late-stage ranker (LSR). While there are many reinforcement learning (RL) methods for training the LSR, end-to-end training of the ESR has proven challenging. In particular, naive application of "vanilla" policy gradient (V-PG) is not scalable for candidate-set sizes relevant for practical use due to exploding variance. This issue arises because V-PG propagates the gradient to the joint probability of the candidate sets, ignoring the contribution of each specific item in the candidate set to the reward. To mitigate this issue, we propose a novel"credit-assigned" PG (CA-PG), which computes gradients with respect to the probability that the target item is chosen in any candidate set, i.e. marginalizing over all candidate sets that contain it. Our theoretical analysis reveals that CA-PG significantly reduces the variance of V-PG by marginalizing over the specific composition of the candidate set, while preserving the ability to learn the correct ranking of actions under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability for ESRs utilizing the canonical Plackett-Luce model, especially when the candidate-set size is large.

Abstract:
AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the endto-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberCycle, a large-scale and realistic end-to-end cybersecurity benchmark that comprehensively evaluates AI agents' abilities across the full lifecycle of vulnerability discovery, PoC generation, and patch generation. CyberCycle is comprehensive and scalable, as we build an automated, agent-enhanced pipeline for transforming open-source vulnerability data into realistic evaluation environments. Currently, the benchmark consists of 615 real-world vulnerabilities across 120 different open-source projects.

Abstract:
Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern architectures due to nonsmooth components like ReLU activations and quantization operators. In such non-smooth regimes, adaptive optimizers such as Adam suffer from gradient chattering—violent oscillations caused by conflicting signals within the Clarke subdifferential—leading to poor convergence and suboptimal generalization. To address this, we introduce Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes training by dynamically modulating step sizes based on local geometric instability. Our key contribution is the Local Geometric Instability (LGI) metric, a computationally efficient estimator of the Clarke subdifferential diameter derived from the variance of randomized directional derivatives. S-Adam incorporates an adaptive damping mechanism \exp(-\lambda \rho_t) that decelerates updates in high-instability regions while preserving fast convergence in smooth basins. We provide a rigorous convergence analysis using differential inclusions, proving that S-Adam converges almost surely to (\delta, \epsilon)-Clarke stationary points at the optimal \mathcalO(1/\sqrtT) rate. Empirical evaluations on Quantization-Aware Training (QAT) and high-volatility transfer learning demonstrate that S-Adam consistently outperforms AdamW and Prox-SGD, achieving accuracy gains of up to +6% on CIFAR-100 and +3% on TinyImageNet while effectively mitigating gradient oscillations.

Abstract:
ANNto-SNN conversion offers a practical, training-free route to spiking large language models. However, current pipelines primarily focus on spike-driven realizations for Transformer linear-algebra operations, while providing limited support for key nonlinear operators. This gap limits compatibility with neuromorphic-style execution constraints, where such nonlinearities typically require division, exponentiation, or norm computations that are not naturally supported by standard leaky integrate-and-fire dynamics. To solve this problem, we propose a plug-and-play framework that implements spike-friendly approximations for Transformer nonlinearities and integrates into existing ANN-to-SNN pipelines. Our method decomposes these nonlinear computations into three recurring primitives---division, exponentiation, and \ell_2 norms---and realizes them via population computation using LIF neuron groups, combined with lightweight bit-shift scaling to avoid floating-point arithmetic. By composing these primitives as modular operator blocks, our framework supports common Transformer nonlinearities (e.g., Softmax, SiLU, and normalization) without any fine-tuning. Experiments on a range of LLMs Transformers show that selectively replacing the targeted nonlinear operators incurs less than a 1% accuracy drop across all evaluated tasks.

Abstract:
Contrastive learning is often trained with the InfoNCE loss, which uses a softmax over similarities to make the positive pair score higher than many negatives. Beyond its connection to mutual information, this softmax link has a precise probabilistic meaning: it is the maximum likelihood objective of a discrete choice model Plackett Luce with Gumbel noise. We show that this implicit noise model can be systematically wrong in modern settings where similarities are bounded, such as cosine normalized embeddings. In the bounded regime, the most competitive negatives pile up near the score ceiling, and extreme value theory predicts Weibull rather than Gumbel behavior for these extremes. We confirm this prediction empirically by measuring Weibull style tail behavior in the hardest negatives throughout InfoNCE training. Motivated by this mismatch, we propose WEINCE, a simple modification of InfoNCE that targets the extreme score regime directly. Across standard benchmarks and backbone architectures, WEINCE improves downstream linear evaluation over InfoNCE with minimal changes to existing training pipelines, showing that modeling the geometry of extremes can yield stronger contrastive representations.

Abstract:
Developing effective representations for register transfer level (RTL) designs is crucial for accelerating the hardware design workflow. Existing approaches, however, typically rely on a single data modality, either the RTL code or its associated graphbased representation, limiting the expressiveness and generalization ability of the learned representations. For RTL, the control data flow graph (CDFG) offers a comprehensive structural representation that preserves complete information, while the code modality explicitly encodes semantic and functional information. We argue that integrating these complementary modalities is essential for a thorough understanding of RTL designs. To this end, we propose UniRTL, a multimodal pretraining framework that learns unified RTL representations by jointly leveraging code and CDFG. UniRTL achieves fine-grained alignment between code and graph through mutual masked modeling and employs a hierarchical training strategy that incorporates a pretrained graph-aware tokenizer and staged alignment of text (i.e., functional summary) and code prior to graph integration. We evaluate UniRTL on two downstream tasks, performance prediction and code retrieval, under multiple settings. Experimental results show that UniRTL consistently outperforms prior methods, establishing it as a more robust and powerful foundation for advancing hardware design automation.

Abstract:
Sparse Mixtureof-Experts architectures are essential for scaling model capacity efficiently, yet the standard Top-k routing imposes a rigid sparsity pattern that ignores the intrinsic variance in token difficulty and layer-specific computational needs. While Top-p routing offers a flexible alternative, we demonstrate that existing naive Top-p implementations with fixed global probability thresholds provide only marginal gains over Top-k, suffer from hyperparameter sensitivity, and result in uncontrolled computational costs. In this paper, we propose \textttDTop-p, a sparsity-controllable dynamic routing mechanism. To overcome the non-differentiability of the MoE sparsity level - the Top-p threshold, we utilize a Proportional-Integral controller that dynamically learns the Top-p probability threshold to align the running sparsity with a user-defined budget. Furthermore, we introduce dynamic routing normalization to adaptively rescale logits, enabling distinct expert selection patterns across layers under a global sparsity constraint. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that \textttDTop-p consistently outperforms both Top-k and fixed Top-p baselines while matching the average FLOPs of Top-k MoE. Our analysis confirms that \textttDTop-p exhibits strong scaling properties across expert granularity, total expert capacity, model size, and dataset size, offering a robust and efficient MoE framework for foundation model pre-training.

Abstract:
In modern machine learning pipelines, abundant pretrained representations act as noisy proxy covariates while taskspecific labels remain scarce. We study semi-supervised regression in this noisy-covariate setting and propose a simple two-stage estimator. We derive finite-sample generalization bounds showing that sufficiently many unlabeled proxy covariates can yield fast labeled-sample rates for both well-specified and misspecified regression settings. We further show that distribution regression is a special case of our framework, where each covariate is a latent distribution observed through a finite bag of samples, and the same guarantees hold when the bag size is large enough. Numerical experiments demonstrate consistent improvements over competitive supervised and semi-supervised baselines, especially in low-label regimes.

Abstract:
Communication costs are a major bottleneck in distributed learning and firstorder optimization. A common approach to alleviate this issue is to compress the gradient information exchanged between agents. However, such compression typically degrades the convergence guarantees of gradient-based methods. Error feedback mechanisms provide a simple and computationally cheap remedy for this issue, but numerous variants have been proposed, and their relative performance remains poorly understood. In this work, we provide tight convergence analyses for two of the main error feedback algorithms in the literature, namely error feedback and EF21. Our results hold independently of the number of participating agents and rely on the construction of novel Lyapunov functions that recover the known best guarantees in the single-agent regime.

Abstract:
Testtime adaptation (TTA) aims to adapt models to maintain reliable performance on non-stationary test streams without requiring labeled data. Despite its empirical success, the learnability of TTA under distributional non-stationarity remains unexplored. A key challenge is lacking of a principled theoretical framework that simultaneously aligns with the TTA objective and captures both continuously evolving distribution shifts and intrinsic information constraints. To address this gap, we propose the first theoretical framework for characterizing the learnability of TTA, introducing the notions of (\epsilon,\delta)-Recovery Complexity and (\epsilon,\rho)-TTA Learnability. Recovery complexity quantifies the minimal time required for a TTA algorithm to recover to a target excess risk following a distribution shift, and is further generalized to define (\epsilon,\rho)-TTA Learnability, which measures the long-term reliability of TTA algorithms over non-stationary data streams. Within this framework, we introduce a novel temporally dependent discrete surrogate method that models complex non-stationary test streams, enabling a unified and tractable analysis of both gradual and abrupt distribution shifts. We derive order-wise matching lower and upper bounds on recovery complexity through information-theoretic and optimization-based analysis. Our results uncover fundamental limits of TTA, highlight the intrinsic adaptivity-information trade-off of TTA algorithms, and provide the first unified learnability guarantees that go beyond regret-based perspectives.

Abstract:
Existing calibration methods for Large Language Models (LLMs) often overlook a critical dimension of trustworthiness: a model's \em behavioral robustness to irrelevant or misleading information. In this paper, we argue that a model's true confidence should reflect its stability under cognitive pressure. We introduce \textscCaliDist, a novel, posthoc calibration approach that directly measures and penalizes a model's susceptibility to distraction. \textscCaliDist quantifies how an LLM's predictions and uncertainty change when its input prompt is perturbed with semantic distractors. This stability (or lack thereof) signal is then used to adaptively scale the model's initial confidence score. Our extensive experiments on seven Natural Language Understanding (NLU) classification benchmarks using six distinct LLMs show that \textscCaliDist consistently achieves lower Expected Calibration Error (ECE) and Brier Score compared with strong baselines. Remarkably, our method reduces the ECE from 19% to 11% on average—a relative improvement of 47%—demonstrating that behavioral stability is a powerful and practical signal for calibration.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of the GRPObased objective, we reveal that the rigid clipping decision inherent to the hard-clipping mechanism is the primary bottleneck. Specifically, we find that many high-value signals lie in thenear-boundaryregion just beyond the clipping threshold, and are thus discarded. Motivated by this diagnosis, we proposeNear-boundary Stochastic Rescue (NSR), a minimal, plug-and-play modification that stochastically retains these slightly out-of-bound tokens to recover lost signals. While NSR, via stochastic sampling, can be interpreted as inducing an implicit gradient decay in expectation, our ablations reveal that its stochastic, boundary-local rescue mechanism is consistently more effective than deterministic gradient decay. Validated by extensive experiments across model sizes from 7B to 30B and both dense and MoE architectures, as a plug-and-play solution, NSR substantially improves training stability and delivers consistent gains over strong baselines such as DAPO and GSPO.

Abstract:
Chainof-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) on text-attributed graphs (TAGs). This work reframes CoT-based graph learning through the principle of clustering as reasoning, offering a k-means interpretation of how iterative reasoning operates over graph-structured data. We observe that existing graph CoT methods rely on disjoint architectures and fixed graph representations, limiting step-by-step semantic-topological interaction and interpretability. To overcome this limitation, we propose a unified framework named KCoT that integrates CoT reasoning with graph representation learning. Our key theoretical result reveals a formal mathematical correspondence between a Transformer block and the k-means algorithm, allowing reasoning to be interpreted as iterative assignment and update steps. Based on this insight, we introduce a Semantic Discriminating Prompt that explicitly formulates these steps as structured CoT reasoning, together with a structure-grounded alignment strategy to fuse topological priors with evolving thought-conditioned representations. Experiments on standard benchmarks demonstrate consistent improvements over state-of-the-art methods, validating clustering as a principled mechanism for CoT-based graph learning.

Abstract:
Most valuebased and actor-critic reinforcement learning methods rely on Bellman-style recursions, yet these recursions collapse under non-exponential discounting common in human preferences and survival processes. We show the breakdown is structural: exponential discounting sits at a fragile intersection of multiplicativity and time homogeneity, and violating either property breaks standard dynamic programming. To overcome this, we proposePontryagin-Guided Direct Policy Optimization (PG-DPO), a variational framework that abandons recursion and couples the Pontryagin Maximum Principle with Monte Carlo rollouts via anAdjoint-MC projectionenforcing pointwise Hamiltonian maximization. Across multidimensional hyperbolic and survival-discount benchmarks, PG-DPO improves accuracy and stability where equation-driven solvers and critic-based baselines diverge.

Abstract:
Continual Model Merging (CMM) enables rapid customization of foundation models across sequentially arriving tasks, offering a scalable alternative to repeated retraining. However, existing merging rules lack explicit controllability over the allocation of learning capacity between previously learned capabilities and newly merged models. Consequently, as tasks are merged sequentially, this deficiency accumulates into severe forgetting, particularly in scenarios with heterogeneous task importance, where performance allocation becomes highly inconsistent. The key reason can be attributed to the fact that previous methods treat each task model as an isolated parameter point and apply fixed algebraic combinations, rather than explicitly constructing a transition that respects how independently trained models can be connected in parameter space. Motivated by mode connectivity, we assume that desirable merged models lie on low loss connecting paths, and that continual merging should follow such paths without crossing loss barriers that induce forgetting. Grounded in these insights, we propose a novel ODEdriven Merging (ODE-M) tailored for CMM that traces such a path by integrating a time-dependent velocity field and enforcing barrier constraints to prevent loss-increasing steps. Extensive experiments demonstrate that ODE-M achieves state-of-the-art performance compared to its competitors across mainstream CMM benchmarks.

Abstract:
Navigating under nonstationary environment shifts poses a critical challenge for a Vision-and-Language Navigation (VLN) agent deployed in the wild. Yet, existing Test-Time Adaptation (TTA) methods for VLN largely treat online adaptation as transient, isolated updates, leading to catastrophic forgetting and negative transfer. To overcome these issues, we proposeInter-Domain BridgEwith HistoricalAssets (IDEA), a novel TTA framework that transforms adaptation into the accumulation and composition of assets. Specifically, IDEA introduces soft prompts optimized via a Fisher-guided weighting scheme to capture the transferable knowledge. These optimized prompts are then augmented with domain coordinates to form a dynamic asset library. Leveraging this library, IDEA constructs a cross-domain bridge by projecting the target domain onto the convex hull of historical knowledge. These designs form a complementary loop: the evolving library underpins bridge construction, while the bridge provides superior initialization to accelerate asset optimization. Extensive experiments across REVERIE, R2R, and R2R-CE benchmarks demonstrate the consistent superiority of IDEA over existing methods, showcasing its ability to enable training-free adaptation via asset sharing. The code will be publicly available.

Abstract:
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for posttraining large language models. While most existing approaches rely on the reverse KL-regularization, recent empirical studies have begun exploring alternative divergences (e.g., forward KL, chi-squared) as regularizers in RLHF. However, a unified theoretical understanding of general f-divergence regularization remains under-explored. To fill this gap, this work develops a comprehensive theoretical framework for online RLHF with an f-divergence regularized objective. Rather than treating each divergence in isolation, we adopt a holistic perspective across the entire class and propose two algorithms based on distinct sampling principles. The first extends the classical optimism principle with a carefully designed exploration bonus, while the second introduces a new method that exploits the sensitivity of the optimal policy to reward perturbations under f-divergence regularization. Theoretical analysis shows that O(\log T) regret and O(1/T) sub-optimality gap are achievable, establishing provable efficiency of both algorithms and, to the best of our knowledge, the first performance bounds for online RLHF under general f-divergence regularization.

Abstract:
In many practical reinforcement learning (RL) environments, observations are far higherdimensional than the variables that matter for control. In this work, we ask: can we learn representations that capture only control-relevant features of the environment? We study this question through the \emphempowerment objective, which maximizes an agent’s influence over the environment and is widely used for unsupervised skill learning. We show that empowerment agents induce two distinct representations --- forward and backward --- that capture complementary aspects of the state, and both of which are invariant to control-irrelevant features. Thus, empowerment maximization leads agents to learn an implicit, \emphcontrol-centric model of the world. Our analysis highlights the importance of learning representations through interaction rather than from passive datasets: interaction aimed at maximizing control is essential for learning useful invariance properties, a perspective that aligns closely with the causal learning literature.

Abstract:
Accurate weather forecast reporting enables individuals and communities to better plan daily activities, agricultural operations, and transportation. However, the current reporting process primarily relies on manual analysis of multisource data, which often leads to information overload and reduced efficiency. With the rapid advancement of multimodal large language models (MLLMs), leveraging data-driven models to analyze and generate reports in the weather forecasting domain remains largely underexplored. In this work, we propose the Weather Forecasting Report (WFR) task and construct the first instruction-tuning dataset for this task, named WSInstruct, which covers 31 cities in America and 8 weather aspects. Based on this corpus, we develop the first model, WeatherSyn, specialized in generating weather forecast reports. Evaluation across multiple metrics on our dataset shows that WeatherSyn consistently outperforms leading closed-source MLLMs, particularly on structurally complex weather aspects. We further analyze its performance across diverse geographic regions and weather aspects. WeatherSyn demonstrates strong transferability across different regions, highlighting its zero-shot generalization capability. WeatherSyn offers valuable insight for developing MLLMs specialized in weather report generation.

Abstract:
Diffusionbased large language models (dLLMs) support parallel text generation via iterative denoising, yet inference remains latency-heavy because many steps are spent on redundant refinement and repeated remasking of tokens whose final values are already determined. Prior acceleration methods mainly depend on step-local confidence heuristics or fixed schedules, which are sensitive to prompt and task variation and ignore strong positional effects within a sequence. We cast diffusion decoding as a dynamic control problem and show that token-wise denoising trajectories provide the key signal for reliable control. We propose a trace-aware decoding framework with two components. First, Temporal-Spatial Parallel Decoding (TSPD) uses a lightweight temporal-spatial correctness sensor that consumes per-token trajectory features, including confidence, entropy, and momentum, together with token position, to decide when a token has converged and can be safely fixed. Second, we introduce ]Confidence Extrapolation (CE)], a training-free state-space module that forecasts future logit trends with uncertainty to support proactive decisions, including safe look-ahead and targeted stabilization when trajectories are oscillatory or underconfident. Together, TSPD and CE reduce unnecessary denoising iterations while preserving output quality, and they compose cleanly with system optimizations such as KV caching.

Abstract:
Red teaming is critical for uncovering vulnerabilities in Large Language Models (LLMs). While automated methods have improved scalability, existing approaches often rely on static heuristics or stochastic search, rendering them brittle against advanced safety alignment. To address this, we introduce Metis, a framework that reformulates jailbreaking as inferencetime policy optimization within an adversarial Partially Observable Markov Decision Process (POMDP). Metis employs a self-evolving metacognitive loop to perform causal diagnosis of a target's defense logic and leverages structured feedback as a semantic gradient to refine its policy, offering enhanced interpretability through transparent reasoning traces. Extensive evaluations across 10 diverse models demonstrate that Metis establishes a new state-of-the-art with an average Attack Success Rate (ASR) of 89.2%, maintaining high efficacy on resilient frontier models (e.g., 76.0% on O1 and 78.0% on GPT-5-chat) where traditional baselines exhibit substantial performance degradation. By replacing redundant exploration with directed optimization, Metis reduces token costs by an average of 8.2× (and up to 11.4×). Our analysis reveals that current defenses remain systematically vulnerable to internally-steered, closed-loop reasoning trajectories, highlighting a critical need for next-generation defenses capable of reasoning about safety dynamically during inference.

Abstract:
Recent progress in LLM reasoning has increasingly shifted from singlepass generation to explicit search over intermediate reasoning states. Tree-of-Thoughts (ToT) organizes inference to tree-structured search with branching and backtracking, but it substantially amplifies the key--value (KV) cache: retaining KV states for a frontier of partial trajectories quickly becomes a memory bottleneck that limits throughput and constrains search depth and width under fixed hardware budgets. We address this challenge by observing that KV reuse in ToT-style inference is governed by search dynamics: near-term decoding depends primarily on the active branch and its ancestors, whereas inactive subtrees have low short-term reuse probability yet must remain recoverable for backtracking. Motivated by this, we propose ArborKV, a structure-aware eviction framework that couples a lightweight value estimator with a tree-aware allocation policy, and performs purely token-extractive eviction with lazy rehydration to support revisits. Experiments on ToT-style reasoning benchmarks show that ArborKV achieves up to ～4× peak KV-memory reduction while preserving near-full-retention accuracy, enabling larger search configurations under fixed device budgets that would otherwise run out of memory.

Abstract:
Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B parametered model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs through sliding window subsampling training timesteps, but fundamentally compromise optimization, exhibiting severe instability and failing to reach full trajectory performance. We present FlashGRPO, a single-step training framework that outperforms full trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.

Abstract:
While visually grounded Chainof-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains under-scrutinized. In this work, we empirically find that mandating the explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT---which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding imposes unnecessary task interference, which detracts from the model's primary focus on answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality (visually) grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments on Qwen2.5-VL and Qwen3-VL demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.

Abstract:
Generative Recommendation (GR) has emerged as a promising paradigm by formulating item recommendation as a sequenceto-sequence generation task over item identifiers. Recent studies have incorporated multimodal signals to provide richer token-level evidence for generation. However, existing approaches largely rely on alignment- centric fusion and underexplore synergistic information across modalities. In practice, synergistic information plays a critical role in capturing emergent item properties that cannot be inferred from any single modality alone. Such properties encode intrinsic item semantics and guide user preferences, enabling models to move beyond surface-level feature matching. To address this limitation, we propose SynGR, a synergistic generative recommendation framework that explicitly encourages the exploitation of cross-modal dependencies during generation. By constraining overreliance on dominant modalities, SynGR enables the model to capture emergent item semantics beyond shared or modality-specific signals. Extensive experiments across three benchmark datasets demonstrate that SynGR achieves superior recommendation performance, with an average improvement of 9.01%.

Abstract:
Discovering atomlevel phenomena requires molecular dynamics (MD) simulations with ab initio accuracy. Machine learning interatomic potentials (MLIPs) enable stable, high-accuracy MD simulations, and their models exhibit scaling-law trends similar to large language models. However, the lack of scalable and efficient distributed training systems for conservative MLIPs makes them difficult to scale. This is because conservative MLIPs inherently follow a double-backward execution pattern, which involves computing gradients during the forward pass. This pattern creates a mismatch with existing distributed training systems, especially for pipeline parallelism. Therefore, we present JanusPipe, an efficient 3D-parallel (PP/DP/GP) training system tailored for conservative MLIPs. It integrates SymFold to enable memory-efficient pipeline parallelism for conservative MLIPs, and WaveK to reduce pipeline bubbles by balancing the four-phase compute time. Experimental results on 32 GPUs show that JanusPipe improves throughput by 1.51× and 1.45× on average over 1F1B and Hanayo, respectively.

Abstract:
Continual learning enables neural networks to learn tasks sequentially without forgetting previously acquired knowledge. However, catastrophic forgetting, where performance on earlier tasks degrades sharply when learning new ones, remains a fundamental challenge. We address this problem with Shapley Neuron Valuation (SNV), a principled framework grounded in cooperative game theory that quantifies Neuron importance in continual learning. By selectively freezing important Neurons while keeping others plastic, SNV enables memoryfree continual learning without architectural expansion. Extensive experiments show that SNV delivers substantial gains over memory-free baselines, achieving +19.50% accuracy on CIFAR-100 and +17.20% on TinyImageNet in the Class-IL setting. In Task-IL scenarios, SNV consistently surpasses existing memory-free approaches by large margins, reaching up to 9.08% higher accuracy on CIFAR-100 compared to the second-best memory-free method, while remaining competitive in comparison with memory-based methods.

Abstract:
A hallmark of intelligence is the ability to adapt in nonstationary environments, yet deep Reinforcement Learning (RL) agents often struggle in such settings. Most prior studies introduce non-stationarity through abrupt shifts in features or dynamics, whereas real-world changes might be more gradual, reflecting naturalistic continual drift in the underlying dynamics. This may have important implications for studies on the "stability versus plasticity dilemma" in RL, since abrupt changes in the task may necessitate more plasticity than real-world situations actually would demand. To address these concerns, we modify existing 3D Miniworld and MuJoCo environments to incorporate naturalistic, continual non-stationary changes, and use them to identify whether poor performance in RL systems arises from a loss of plasticity or stability. We find that in these settings, methods that preserve stability, such as synaptic consolidation, achieve better performance than those focused on plasticity, such as resetting a subset of the parameters. Motivated by this finding, and prior evidence that successor features (SFs) reduce interference in non-stationary settings, we investigate whether SFs provide a better target than Q-values for consolidation. Across both environments, we find that applying a neuro-inspired synaptic consolidation mechanism to SFs rather than Q-values yields superior performance on the naturalistic, continual changing MuJoCo tasks. Furthermore, we find that consolidation is most effective when SFs are stabilized across multiple timescales, as different timescales capture complementary aspects of the gradually changing environment. Together, these results show that stability may be more important in continual learning settings when abrupt changes in tasks do not occur. Moreover, to enhance stability, multi-timescale consolidation of predictive representations is an effective approach.

Abstract:
Large Language Model (LLM) based agents have demonstrated proficiency in multistep interactions with graphical user interfaces (GUIs). While most research focuses on improving single-task performance, practical scenarios often involve repetitive GUI tasks for which invoking LLM reasoning repeatedly, i.e., the ReAct paradigm, is inefficient. Prior to LLMs, traditional Robotic Process Automation (RPA) offers runtime efficiency but demands significant manual effort to develop and maintain. To bridge this gap, we propose AutoRPA, a framework that automatically distills the decision logic of ReAct-style agents into robust RPA functions. AutoRPA introduces two core innovations: (1) A translator-builder pipeline where a translator agent converts hard-coded ReAct actions into soft-coded procedures, and a builder agent synthesizes robust RPA functions via retrieval-augmented generation over multiple trajectories; (2) A hybrid repair strategy during code verification, combining RPA execution with ReAct-based fallback for iterative refinement. Experiments across multiple GUI environments demonstrate that RPA functions generated by AutoRPA successfully solve similar tasks while reducing token usage by 82%\textasciitilde96%, significantly improving runtime efficiency and reusability.

Abstract:
Recovering surface normals from singleview polarization images in complex media remains challenging. This paper focuses on ice as a representative complex medium, where intricate light–matter interactions lead to a nonlinear mapping between polarization observations and surface normals. To address this, a structure-aware polarization prior based on autocorrelation functions is proposed to capture the local spatial consistency of AoLP. Building on this, a dual-branch network (IceSfP) is designed to integrate raw polarization features with priors via cross-modal attention and multi-scale feature fusion, enabling accurate surface normal estimation under complex media conditions. To evaluate the method, the first real-world ice SfP dataset is constructed. Experimental results show that the method outperforms existing approaches across all metrics, achieving a MAE of 16.01^\circ, which is 2.74^\circ lower than the second-best method. The framework provides a generalizable solution for high-precision geometric perception in complex media.

Abstract:
We design a class of additive noise mechanisms that satisfy (\varepsilon, \delta)differential privacy (DP) for scalar, real-valued query functions with known sensitivities, with a particular focus on moderate and low-privacy regimes. These mechanisms, which we call mixture mechanisms, are constructed by mixing multiple Gaussian distributions that share the same variance but differ in their means and mixture weights. The resulting distributions can be interpreted as convex combinations of a zero-mean Gaussian (as used in the analytic Gaussian mechanism) and additional Gaussians whose means depend on the sensitivity of the query function. We derive tight conditions on the variances required for (\varepsilon, \delta)-DP and provide efficient algorithms to compute them. Compared to the analytic Gaussian mechanism, our mechanisms yield substantially lower expected noise amplitudes (l_1-loss) and variances (l_2-loss for zero-mean distributions). In the low-privacy regime that motivates our design, our mechanisms approach optimality, mitigating nearly all of the optimality gap of the analytic Gaussian mechanism.

Abstract:
Textto-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt–image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.

Abstract:
We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more tokenlevel randomness, which may introduce step-wise noise and leads to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as sample counts increase. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We thus propose S2L-PO (Small-to-Large Policy Optimization), a framework that leverages fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner’s own sampling. This shift elegantly avoids mid-training performance drops caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (eg., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute. The code will be made available.

Abstract:
In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over M samples (M \in \mathbbN), while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policygradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count M to a continuous parameter m > 0, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration—without any explicit exploration bonuses—on the MinAtar and Craftax benchmarks.

Abstract:
Mixtureof-Experts (MoE) language models organize knowledge into explicitly routed expert modules, making expert-level representations traceable and analyzable. By analyzing expert activation patterns in MoE language large models (LLMs), we find that a subset of experts is consistently activated across diverse knowledge domains. These common experts encode cross-domain, generalizable knowledge that is closely related to model generalization, naturally raising the question of how such identifiable expert knowledge can be practically reused. Motivated by this observation, we propose XPERT, a training-free framework that extracts, consolidates, and reuses expert knowledge from pre-trained MoE LLMs to support effective training of language models across different model scales. XPERT identifies cross-domain experts via inference-only analysis, refines their representations through tensor decomposition, and adapts the extracted knowledge to be reused in downstream models. Experiments on language understanding and dialogue generation benchmarks show that models benefiting from reused expert knowledge achieve consistently stronger performance and faster convergence compared to strong baselines. These results highlight MoE LLMs as structured and reusable knowledge sources, and demonstrate the value of expert-level knowledge reuse for improving model training.

Abstract:
Parity functions are fundamental Boolean operations with critical applications across machine learning, cryptography, and error correction. Yet, learning highdimensional parity functions poses significant challenges: in a general setting, standard neural network architectures typically require exponential sample complexity, making gradient-based optimization intractable for large number of inputs N. We demonstrate that compact product-based neural architectures combined with stochastic data sparsity (Bernoulli inputs with p_e \leq 1/N) and appropriate hyperparameter choice enable efficient parity learning, with theoretical guarantees of convergence. Experiments validate our theory across dimensions up to N = 10,000, with empirical evidence showing optimal hyperparameter choices for p_e and learning rate \alpha, as well as polynomial complexity scaling laws. This work establishes fundamental connections between architectural inductive bias and data sparsity, opening new possibilities for neural arithmetic, structured reasoning, binary neural networks, and machine learning applied to automated protocol discovery.

Abstract:
Tree ensembles are machine learning models with strong predictive performance and interpretability, and remain widely used for tabular data. Standard pruning methods for tree ensembles typically optimize an accuracy–compression tradeoff and may change a subset of predictions, potentially compromising decision consistency. Faithful pruning methods address this issue by preserving prediction equivalence over the entire input space, but this requirement leads to lower compression ratios. We propose PINE, a pruning method that provides strong guarantees within an in-distribution region. PINE preserves prediction equivalence within this region and controls the region size using a single parameter \\alpha via conformal calibration. Experiments on 12 public tabular datasets show that PINE improves the compression ratio by up to 30% while maintaining a comparable rate of prediction equivalence to existing faithful pruning methods. As a result, PINE achieves an improved equivalence–compression trade-off.

Abstract:
The high cost and data scarcity in scientific exploration have motivated the use of large language models (LLMs) as knowledgedriven components in Bayesian optimization (BO). However, existing approaches typically embed LLMs directly into the sampling or surrogate modeling pipeline, without fully leveraging their significantly lower evaluation cost compared to real-world experiments. To address this limitation, we propose LLM-Accelerated Bayesian Optimization (LABO), a framework that combines LLM predictions with experimental observations within a single BO loop. LABO employs a gating mechanism to dynamically balance reliance on LLM predictions versus actual experiments. By leveraging inexpensive LLM evaluations to broadly explore the search space and reserving costly real experiments only for regions with high uncertainty, LABO achieves more sample-efficient optimization. We provide a theoretical analysis with a cumulative regret bound that formalizes this efficiency gain. Empirical results across diverse scientific tasks demonstrate that LABO consistently outperforms existing methods under identical experimental budgets. Our results suggest that LABO offers a practical and theoretically grounded approach for integrating LLMs into scientific discovery workflows.

Abstract:
In point clouds, noise directly perturbs point coordinates that encode both spatial location and geometry, making oneto-one correspondence construction more challenging than in images. Existing methods impose statistical mappings across noisy variants via noise or optimal transport, but suffer from correspondence ambiguity. In this work, we propose Self-Induced Mirror-Point Consistency (SIMPC) to learn deterministic correspondences between points and the underlying surface in an unsupervised manner. For each noisy point, SIMPC generates a mirror-point on the opposite side of the underlying surface, guided by geometric priors during the denoising process. By encouraging consistency between the denoising targets of the original point and its mirror counterpart, SIMPC effectively localizes the position of underlying surface. Extensive experiments on synthetic and real-world datasets demonstrate that SIMPC significantly outperforms state-of-the-art unsupervised methods and surpasses several strong supervised counterparts.

Abstract:
Behavior cloning with highcapacity generative policies achieves strong imitation performance, but performance is often constrained by limited demonstration coverage and sensitivity to distribution shift. While reinforcement learning can improve task performance, directly fine-tuning large action decoders is often unstable and sample inefficient. We proposeLagrangian Perturbation Diffusion Steering (LP-DS), a lightweight adaptation method that improves a frozen generative policy while preserving its multimodal structure. LP-DS learns a compact noise-space perturbation module that shifts Gaussian noise inputs before decoding, enabling policy improvement without modifying the action decoder. To prevent off-manifold latent queries and unstable denoising dynamics, we optimize this module with a Lagrangian trust-region objective that maximizes downstream value while constraining perturbation magnitude, yielding stable and sample-efficient learning. Across RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, LP-DS improves sample efficiency, success, and return while maintaining diverse behavior, as quantified by higher action-space entropy using the Kozachenko--Leonenko k-nearest neighbor estimator, with return improvements of up to 25% over prior baselines. Anonymous project page: https://sites.google.com/view/lp-ds/home.

Abstract:
Efficient Distillation (EDistill) compresses large language models (LLMs) by structured pruning parameters and tuning lightweight modules with high training efficiency. Although these EDistilled LLMs achieve stateof-the-art (SOTA) performance on general ability benchmarks relative to similarly sized LLMs, we identify a severe degradation in their multi-step reasoning ability, which we term reasoning collapse. We systematically analyze the geometric origins of reasoning collapse and show that the SOTA EDistill method based on width-reducing projection matrices suffers from eRank collapse, in which the effective rank (eRank) of hidden representations drops. We theoretically explain how singular values of randomly initialized projection matrices become unevenly distributed, leading to eRank collapse and thus token indistinguishability. To address this issue, we propose RED (Reasoning-preserved Efficient Distillation) for LLMs, which introduces activation-aware initialization to initialize projection matrices as channel-selection matrices, thus theoretically mitigating eRank collapse. Experiments on Llama and Qwen series demonstrate that RED substantially recovers reasoning while maintaining high training efficiency and SOTA general ability.

Abstract:
Controllable generation with discrete diffusion models is often hindered by high computational overhead or the need for retraining. In this paper, we present GradientInformed Logit Correction (GILC), a plug-and-play framework that efficiently estimates guidance signals by repurposing the pretrained denoising network as a variational proxy. To circumvent the gradient instability inherent in high-dimensional discrete spaces, we introduce a Jacobian-free mechanism that directly corrects the clean prediction logits, facilitating stable and effective guidance. Our method accommodates both differentiable and non-differentiable reward functions. Extensive experiments across DNA, protein sequence, and molecular generation tasks demonstrate that GILC achieves state-of-the-art performance without additional training, frequently outperforming fine-tuning approaches.

Abstract:
A central goal of large language model (LLM) research is to build agentic systems that can plan, act, and adapt through sustained interaction with dynamic environments. While recent LLMbased agents exhibit impressive contextual reasoning, their long-horizon decision-making remains fragile, often suffering from objective drift, where goals and plans drift over extended interactions. We introduce \textttMulti^2, a hierarchical multi-agent decision-making framework that explicitly decomposes agent behavior into complementary roles. A high-level agent (\textttSystem 1) focuses on context-aware sub-goal generation using supervised fine-tuning (SFT), while a low-level agent (\textttSystem 2) executes atomic actions through offline-to-online reinforcement learning (RL) in interactive environments. This separation enables stable long-horizon control, mitigates objective drift, and allows efficient adaptation. Across diverse interactive environments, \textttMulti^2 consistently outperforms strong agentic baselines, demonstrating improved robustness and coordination in multi-turn interaction. Beyond performance, we introduce and release three hierarchical benchmark datasets, filling a long-standing gap in training and evaluating hierarchical decision-making for LLM-based agents.

Abstract:
InContext Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT)—where coordination with unknown partners is required—remains unexplored. To rigorously evaluate this, we introduce a large-scale benchmarkICRL4AHT, built upon a high-throughput JAX implementation of Overcooked-V2. Our benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, enabling controlled train-test shifts, and provides an end-to-end pipeline to generate learning histories, serialize them into reproducible datasets, and perform online multi-episode evaluation. We evaluate state-of-the-art ICRL algorithms, including Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), across millions of transitions. Results reveal startling limitations: contrary to their success in single-agent domains, current ICRL architectures fail to exhibit test-time adaptation in multi-agent settings. Specifically, these methods frequently underperform random baselines across both unseen teammate and unseen layout tracks, with no observable in-context improvement over long horizons. These findings highlight the fundamental challenges of strategic inference under partial observability, establishing our benchmark as a critical testbed for next-generation coordination algorithms. Our repository is available at https://anonymous.4open.science/r/ICRL4AHT.

Abstract:
Quantum error correction (QEC) is essential for enabling quantum advantages, with decoding as a central algorithmic primitive. Owing to its importance and intrinsic difficulty, substantial effort has been made to QEC decoder design, among which neural decoders have recently emerged as a promising datadriven paradigm. Despite this progress, practical deployment remains hindered by a fundamental accuracy–latency tradeoff, often on the microsecond timescale. To address this challenge, here we revisit neural decoders for surface-code decoding under explicit accuracy–latency constraints, considering code distances up to d=9 (161 physical qubits). We unify and redesign representative neural decoders into five architectural paradigms and develop an end-to-end compression pipeline to evaluate their deployability and performance on FPGA hardware. Through systematic experiments, we reveal several previously underexplored insights: (i) near-term decoding performance is driven more by data scale than architectural complexity; (ii) appropriate inductive bias is essential for achieving high decoding accuracy; and (iii) INT4 quantization is a prerequisite for meeting microsecond-scale latency requirements on FPGAs. Together, these findings provide concrete guidance toward scalable and real-time neural QEC decoding.

Abstract:
We propose Intrinsic Quality (IQ), a validationfree metric designed to estimate the inherent potential of face recognition (FR) datasets to produce high-performance models without the need for full-scale training. IQ integrates two components: (i) a Neighbor-Consistency Score that quantifies local identity label agreement via nearest neighbors, and (ii) Global Representation Subspace Complexity (Effective Rank, ER), which captures the underlying embedding geometry and dataset diversity. IQ allows for rapid evaluation using lightweight proxy models or data subsets, facilitating dataset diagnosis and curation prior to resource-intensive full-scale training. We describe an experimental protocol tailored to clean, noisy, and mixed‑quality FR datasets, and outline evaluation methodologies to validate IQ’s predictive power for downstream performance.

Abstract:
Normalization Equivariance (NE), equivariance to global contrast and brightness transforms, improves robustness to distribution shift in imageto-image prediction. Existing methods enforce this prior by constraining internal layers to NE-compatible families, limiting compatibility with standard components (e.g., attention, LayerNorm) and adding runtime cost. We prove that a function is NE if and only if it admits a normalize-process-denormalize factorization. Using this characterization, we construct a parameter-free wrapper (WNE) that enforces input-output NE around any backbone, including transformers. On blind denoising, wrapping CNN and transformer architectures improves robustness under noise-level mismatch with no measurable overhead on GPU, while architectural NE baselines incur up to a 1.6× slowdown.

Abstract:
The Linear Assignment Problem (LAP) is a fundamental combinatorial optimization task with applications ranging from computer vision to logistics. Classical exact solvers such as the Hungarian and JonkerVolgenant (LAPJV) algorithms guarantee optimality, but their cubic time complexity \mathcalO(N^3) becomes a bottleneck for large-scale instances. Recent learning-based approaches aim to replace these solvers with neural models, often sacrificing exactness or failing to scale due to memory constraints. We propose a learning-augmented framework that accelerates exact assignment solvers while maintaining optimality and worst-case guarantees. Our method predicts dual variables to warm-start a classical solver, with a fallback that prevents asymptotic runtime degradation when the learned advice is unreliable. We introduce RowDualNet, a lightweight row-independent architecture that avoids the \mathcalO(N^2) memory bottleneck of graph-based models, enabling neural warm-starting at large scale (N=16,384). Feasibility is ensured via a constructive mechanism based on LP duality (namely, the Min-Trick), eliminating costly iterative projection. Empirically, our approach reduces the search effort of LAPJV and achieves over 2× speedups on challenging synthetic distributions, in addition to improving over 1.25× and 1.5× on real-world tracking (MOT) and transportation (LPT) datasets, respectively, while strictly maintaining full optimality, effectively yielding a robust zero-shot generalization to real-world tasks.

Abstract:
Learning semantics is essential for deep learning models to be interpretable and better aligned with human reasoning. Conceptbased models approach this by representing classes through meaningful semantic abstractions, but typically treat all concepts as a flat, unstructured set learned at a single neural network layer. This overlooks a fundamental property of human semantic understanding: concepts being organized hierarchically, from general to specific. While deep networks do learn a hierarchy of visual features, this structure is rarely aligned with explicit semantic hierarchies. Drawing on Formal Concept Analysis, we demonstrate that formal concept lattices provide principled semantic scaffolds to guide neural network learning. These lattices naturally identify where in the network concepts should be learned based on their level of generality. This allows the model to develop staged, semantically grounded representations throughout its depth. Empirical results on real-world datasets show that our models produce more interpretable embeddings, support more effective interventions, and learn concept representations that are both meaningful and hierarchically structured.

Abstract:
Graph Foundation Models (GFMs), built upon thePretraining and Adaptationparadigm, have emerged as a research hotspot in graph learning. For GNN-based GFMs, graph prompt tuning has become the prevailing adaptation method for downstream tasks. Although recent methods explain why graph prompt tuning works, how to rigorously measure its adaptation capacity remains an open problem. Addressing this problem is critical for understanding the capability limits of graph prompt tuning and for developing more powerful adaptation methods. In this paper, we propose Prismatic Space Theory (PS-Theory), a novel mathematical framework to quantify the capacity of adaptation methods, while focusing on establishing the upper bound for the adaptation capacity of graph prompt tuning. Building upon the proposed PS-Theory, we further introduceMessageTuning forGFMs (MTG), a lightweight approach that injects a small set of learnable message prototypes into each layer of the GNN backbone to adaptively guide message fusion without updating pre-trained weights. Through our PS-Theory, we prove that the adaptation capacity of MTG can exceed the theoretical upper bound of graph prompt tuning. Extensive experiments demonstrate that MTG consistently outperforms graph prompt baselines across diverse benchmark datasets, providing strong empirical support for our theoretical findings. Our code is available at https://anonymous.4open.science/r/MTG.

Abstract:
Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches operate primarily during training, such as through finetuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals, which is impractical, and not suitable for third party stakeholders who do not have access to the models. In this work, we propose a model-independent, black-box framework for safety alignment that does not require retraining or access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off between generating safe but uninformative answers versus helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black-box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource-constrained settings, to enforce safety across rapidly evolving LLM ecosystems.

Abstract:
The remarkable success of the Adam in training neural networks has naturally led to the widespread use of its descentascent counterpart, Adam-DA, for solving zero-sum games. Despite its popularity in practice, a rigorous theoretical understanding of Adam-DA still lags behind. In this paper, we derive ordinary differential equations (ODEs) that serve as continuous-time limits of the Adam-DA. These ODEs closely approximate the discrete-time dynamics of Adam-DA, providing a tractable analytical framework for understanding its behavior in zero-sum games. Using this ODE approach, we investigate two fundamental aspects of Adam-DA: local convergence and implicit gradient regularization. Our analysis reveals that the roles of the first- and second-order momentum parameters in zero-sum games are exactly the opposite of their well-documented effects in minimization problems. We validate these predictions through GAN experiments across multiple architectures and datasets, demonstrating the practical implications of this reversed momentum effect.

Abstract:
Graph Neural Networks (GNNs) have revolutionized Neural Force Fields for atomistic simulations, achieving nearquantum accuracy at reduced cost, yet adapting these models to new chemical systems requires expensive retraining of foundation models. Inspired by model merging in vision and language processing, we introduce GFFMERGE, the first principled framework for closed-form model merging in GNNs. We exploit the linear structure of message-passing layers and formulate merging as a convex embedding-alignment problem with an analytical solution. Through the first systematic benchmarking of model merging for GNNs, we show that existing methods designed for vision and language catastrophically fail on force field regression, while GFFMERGE recovers performance approaching gold standard joint training. Across molecular (MD17, MD22), solid-state (LiPS20), and large-scale graph benchmarks, GFFMERGE and GNNMERGE (its generic GNN counterpart) achieve 5-27× speedups while enabling modular composition of specialized models. Remarkably, our closed-form solution alone outperforms all baseline methods before fine-tuning and provides superior initialization for faster, data-efficient convergence.

Abstract:
Crossresolution person re-identification (CR-ReID) remains challenging in practical surveillance, where camera quality and capture distance lead to substantial resolution gaps between low-resolution (LR) queries and high-resolution (HR) gallery images. Prior approaches commonly rely on super-resolution (SR) or resolution-invariant representation learning, which often increases system complexity and may not directly address the feature mismatch induced by resolution degradation. In this work, we report a new empirical finding: after averaging out identity-specific variation, the HR--LR feature discrepancy produced by standard ReID backbones exhibits a consistent, resolution-related semantic direction in the embedding space. We further support this observation with statistical analyses based on Canonical Correlation Analysis (CCA) and Pearson correlation analysis. Motivated by this finding, we propose Vector Panning Feature Alignment (VPFA), a lightweight post-hoc module that learns to pan LR features along the learned resolution direction to obtain pseudo-HR representations. VPFA operates after feature extraction and can be integrated into existing ReID systems with negligible overhead. Extensive experiments on multiple CR-ReID benchmarks show that VPFA achieves state-of-the-art performance while improving efficiency compared to SR-based or jointly trained alternatives.

Abstract:
Shapley additive explanations (SHAP) are widely recognised as computationally intractable for neural networks, since they induce an exponential search space over the input features. In this work, we take a first step towards scaling exact SHAP computation to larger search spaces by introducing an algorithm that leverages recent advances in neural network verification to compute arbitrarily tight exact lower and upper bounds on SHAP values for neural networks, ultimately recovering the exact SHAP values. We demonstrate that our approach scales to orders of magnitude larger search spaces than stateof-the-art exact methods. This provides an important first step towards exact SHAP computation and establishes a principled cornerstone for evaluating statistical approximation methods on larger search spaces.

Abstract:
Toolcalling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: effectiveness, i.e., how this capability is measured, and efficiency, i.e., how it is learned. On effectiveness, we systematically analyze tool-calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices including the random seed, system prompt, multi-turn template construction, and how prior interaction/reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi-turn settings where without rigorous standardization, leaderboard rankings are unreliable. On efficiency, we examine standard reinforcement learning (RL) for tool-calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL-based tool-calling training, achieving substantial wall-clock speedup without degrading performance.

Abstract:
Task vector composition has emerged as a promising paradigm for editing pretrained models, enabling model merging via addition and task removal via subtraction. Fine-tuning in the tangent space of a pre-trained model (linearized fine-tuning) has proven particularly effective in this setting, as it yields task vectors that are naturally disentangled and less prone to interference. However, linearized models suffer from reduced expressivity during training and increased computational cost at inference time, limiting their practical applicability. In this work, we propose to bridge linearized and standard non-linear fine-tuning through knowledge distillation. Specifically, we distill hidden representations from a linearized, curvature-regularized teacher into a non-linear student trained with conventional fine-tuning. By doing so, the goal is to transfer the disentanglement properties of the linearized regime, biasing optimization toward solutions that are composable by design. We show that the resulting task vectors can be composed using naïve Task Arithmetic, achieving strong results across vision and language benchmarks without incurring any inference-time overhead.

Abstract:
Data attribution has become a core primitive for pricing, auditing, and governing machine learning pipelines, yet current attribution methods implicitly assume that attribution value faithfully reflects participants' contributions. We show that this assumption can fail: a single participant in a standard distributed training workflow can substantially inflate its measured attribution value while keeping global utility intact. Our attributionfirst attack uses a latent optimization procedure that injects small, utility-preserving synthetic batches to exploit non-IID label coverage and evaluator sensitivities. Across datasets, models, and multiple marginal-utility evaluators, the attack consistently raises the adversary’s attribution value and reshapes the relative attribution structure among benign clients without degrading accuracy or triggering geometry-based defenses. These results demonstrate that attribution itself constitutes a new attack surface and motivate the development of attribution-robust and incentive-compatible scoring mechanisms.

Abstract:
Automated laboratories hold the promise of accelerating scientific discovery, yet their deployment is bottlenecked by the difficulty of designing safe and executable environments. While simulatorbased design offers scalability, existing 3D scene generation methods are primarily tailored for household settings, optimizing for visual plausibility while neglecting the rigorous functional semantics and safety constraints essential for scientific experimentation. We present LabBuilder, an end-to-end system that generates and verifies 3D laboratory layouts from concise textual specifications. It operates through three tightly coupled components: LabForge first curates a meta-dataset of annotated assets and chemical knowledge, translating natural language specifications into structured protocols; building on these protocols, LabGen synthesizes laboratory layouts via an iterative, constraint-aware optimization strategy; finally, LabTouchstone evaluates the resulting layouts as a unified benchmark. Extensive experiments demonstrate that LabBuilder significantly outperforms existing state-of-the-art methods, producing laboratory environments that are not only realistic but also functionally valid and safe for complex experimental workflows.

Abstract:
Posttraining with explicit reasoning traces is common to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained, but the challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs overly rely on linguistic priors while neglecting the visual cues. To this end, we propose VISTA, a VIsion-aware Self-improvement Training framework for enhancing the multimodal Reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy to reuse the partial correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model’s focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances the multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.

Abstract:
While VisionLanguage-Action (VLA) models offer broad general capabilities, deploying them on specific hardware requires real-world adaptation to bridge the embodiment gap. Since robot demonstrations are costly, this adaptation must often occur under a strict data budget. In this work, we identify a criticaldiversity trap: the standard heuristic of ``maximizing coverage" by collecting diverse, single-shot demonstrations can be self-defeating due to non-vanishing estimation noise. We formalize this phenomenon as aCoverage--Density Trade-off. By decomposing the policy error into estimation (density) and extrapolation (coverage) terms, we characterize an interior optimal allocation of unique conditions for a fixed budget. Guided by this analysis, we proposeAnchor-Centric Adaptation (ACA), a two-stage framework that first stabilizes a policy skeleton through repeated demonstrations at core anchors, then selectively expands coverage to high-risk boundaries via teacher-forced error mining and constrained residual updates. Real-robot experiments validate our trade-off framework and demonstrate that ACA significantly improves task reliability and success rates over standard diverse sampling strategies under the same budget.

Abstract:
Noisebased certified machine unlearning currently faces a hard ceiling: the noise magnitude required to certify unlearning typically destroys model utility, particularly for large-scale deletion requests. While leveraging public data is a standard technique in differential privacy to relax this tension, its role in unlearning remains unexplored. We address this gap by introducing Asymmetric Langevin Unlearning (ALU), a framework that uses public data to mitigate privacy costs. We prove that public data injection suppresses the unlearning cost by a factor of O(1/n_\mathrmpub^2), guaranteeing a strict computational advantage over retraining. This establishes a new control mechanism: practitioners can mitigate the need for high noise—and the associated utility loss—by increasing the volume of public data. Crucially, we analyze the realistic setting of distribution mismatch, explicitly characterizing how shifts between public and private sources impact utility. We show that ALU enables "mass unlearning'' of constant dataset fractions -- a regime where standard symmetric methods become impractical -- while maintaining high utility. Empirical evaluations using variational Rényi divergence and membership inference attacks confirm that ALU effectively thwarts privacy attacks while preserving utility under reasonable distribution shifts.

Abstract:
Opendomain multimodal document retrieval aims to retrieve specific components (paragraphs, tables, or images) from large and interconnected document corpora. Existing graph-based retrieval approaches typically rely on a uniform similarity metric that overlooks hop-specific semantics, and their rigid pre-defined plans hinder dynamic error correction. These limitations suggest that a retriever should adapt its reasoning to the evolving context and recover intelligently from dead ends. To address these needs, we propose Failure is Feedback (FiF), which casts subgraph retrieval as a sequential decision process and introduces two key innovations. (i) We introduce a history-aware backtracking mechanism; unlike standard backtracking that simply reverts the state, our approach piggybacks on the context of failed traversals, leveraging insights from previous failures. (ii) We implement an economically-rational agentic workflow. Unlike conventional agents with static strategies, our orchestrator employs a cost-aware traversal method to dynamically manage the trade-off between retrieval accuracy and inference costs, escalating to intensive LLM-based reasoning only when the prior failure justifies the additional computational investment. Extensive experiments show that FiF achieves state-of-the-art retrieval on the benchmarks of MultimodalQA, MMCoQA and WebQA.

Abstract:
Reinforcement learning (RL) has powered many of the recent breakthroughs in large language models (LLMs), especially for tasks where rewards can be computed automatically, such as code generation. However, these methods deteriorate in openended domains like medical consultation, where feedback is inherently ambiguous, highly context-dependent, and cannot be reduced to a reliable scalar signal. In such settings, RL must either rely on supervision-intensive reward models that often fail to generalize, or it falls into pathological behaviors such as reward hacking—an especially troubling risk for high-stakes medical dialogue. To address these limitations, we introduce ORBIT, an open-ended rubric-based incremental training framework for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with dynamically constructed rubrics that serve as adaptive guides for incremental RL. Instead of relying on external medical knowledge bases or handcrafted rule sets, ORBIT uses rubric-driven feedback to steer the learning process. Its judge component can be instantiated with general-purpose instruction-following LLMs, removing the need for any task-specific fine-tuning. Applied to the Qwen3-4B-Instruct model, ORBIT raises the HealthBench-Hard score from 7.0 to 27.5 using only 2k training samples, achieving SOTA performance for models at this scale. With larger rubric datasets, ORBIT-trained models further compete with the strongest open-source baselines on HealthBench-Hard. Our analysis shows that rubric-guided RL consistently improves consultation quality across diverse medical scenarios. We also apply such rubric generation and training pipeline to InfoBench, where ORBIT enhances instruction-following performance, highlighting the generality of rubric-based feedback.

Abstract:
Group Relative Policy Optimization (GRPO) was introduced recently and used to train DeepSeek\textendash R1 for promoting reasoning in LLMs under verifiable (binary) rewards. We show that the mean+variance calibration of these rewards induces a contrastive loss in which the contrastive samples are synthetic data drawn from the previous policy. While GRPO was originally paired with clipping to keep updates near the old policy, we analyze variants that differ in reward normalization (meanonly vs.\ mean+variance) and in how they regularize updates using KL divergence: either penalizing divergence from the previous model (\emphmirror), penalizing divergence from a fixed reference model \pi_\mathrmref, or combining both forms of regularization. For each, the optimal policy \pi_n admits an explicit form in terms of the binary reward and the first and second order statistics of the reward under \pi_n-1, as well as the policies \pi_n-1 and \pi_\mathrmref. Iterating results in a sequence \\pi_n\ whose \emphprobability of success (PoS) obeys a simple recurrence that converges to a fixed point determined by the reference PoS and the regularization strength. We further show that this fixed point exceeds the reference, demonstrating that GRPO amplifies the policy's probability of success.

Abstract:
Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4bit activations and 8-bit gradients, which would easily cause slow convergence or accuracy loss. To address this, we introduce AGoQ, incorporating two new techniques: 1) a layer-aware activation quantization algorithm that allocates appropriate bit-widths for activations of various layers based on their types and pipeline stages to achieve near 4-bit activation storage, and 2) a gradient quantization algorithm that reduces memory usage and shortens communication time by employing 8-bit gradient storage and precision-preserving 8-bit All-Reduce communication. We conduct extensive experiments using different sizes of LLMs on two GPU clusters (up to 64 GPUs), and the experimental results show that our AGoQ reduces the memory by up to 52% and achieves up to 1.34× improvement of training speed compared to state-of-the-art training systems Megatron-LM (w/ or w/o ZeRO), COAT and DeepSpeed with 8B to 32B LLaMA models, while achieving convergence loss on pretraining and comparable accuracy on downstream tasks with LLaMA architectures.

Abstract:
Diffusion language models (dLMs) have emerged as a promising paradigm enabling parallel generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study ARto-dLM conversion, which transforms pretrained AR models into efficient dLMs that excel in speed while preserving AR models’ task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing methodologies and actionable insights for scalable AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is key to effective AR-to-dLM conversion. Accordingly, we introduce a continuous pretraining scheme with a block-wise attention pattern. We find that, in addition to block-wise attention’s known benefit of enabling KV caching, its block-wise causality better preserves pretrained AR models’ weight distributions, leading to a win–win in accuracy and efficiency. Second, to mitigate the training–test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. These studies lead to the Efficient-DLM model family, which outperforms state-of-the-art AR models and dLMs in accuracy–throughput trade-offs; for example, our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5×/2.7× higher throughput compared to Dream 7B and Qwen3 4B, respectively.

Abstract:
Determining where to execute the manipulation policy is a fundamental challenge in mobile manipulation. Most approaches have formulated this as a geometric search problem, prioritizing physical reachability. However, given the high sensitivity of modern learningbased manipulation policies, geometric criteria alone are insufficient. Optimal performance requires base positioning that is aware of the policy's preference. While recent works have attempted to address this, they remain limited in practicality due to reliance on pre-built scene reconstruction and slow inference. In this work, we introduce N2M that systematically reformulates the approach to base positioning problem, naturally overcoming limitations of previous methods. Our key insight is that policy preferences are inherent to the local scene structure and can be effectively learned from the policy rollouts. Technically, we propose a novelviewpoint augmentationstrategy that enables the model to learn robust, viewpoint-invariant pose preferences with remarkable data efficiency. Extensive experiments demonstrate that N2M achieves state-of-the-art performance, outperforming both non-policy-aware baselines and recent policy-aware alternatives. Furthermore, we provide a comprehensive analysis highlighting N2M’s broad applicability, generalization capabilities, and data efficiency. Anonymized project website: https://nav2manip.github.io

Abstract:
We study the role of batch size in stochastic conditional gradient methods under a \muKurdyka–Łojasiewicz (\mu-KL) condition. Focusing on momentum-based stochastic Frank–Wolfe–type conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy, but beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Preliminary experiments are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch-size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.

Abstract:
We present CoarseBind, a foundation model for proteinligand structure and binding affinity prediction that achieves 26× faster inference than state-of-the-art methods while improving affinity prediction accuracy by up to 20%. Current deep learning approaches to structure-based drug design rely on expensive all-atom diffusion to generate 3D coordinates, creating inference bottlenecks that render large-scale compound screening computationally intractable. We challenge this paradigm with the hypothesis: full all-atom resolution is unnecessary for accurate small molecule pose and binding affinity prediction. CoarseBind tests this hypothesis through a coarse pocket-level representation (protein C_\beta atoms and ligand heavy atoms only) within a multimodal architecture combining pretrained molecular encoders and ESM-2 protein embeddings that learns rich structural representations, which are used in a diffusion-free optimization module for pose generation and a binding affinity likelihood prediction module. On structure prediction benchmarks, CoarseBind matches diffusion-based baselines in ligand pose accuracy. For binding affinity, CoarseBind outperforms Boltz-2 by 16-20% in Pearson correlation on both a public benchmark (CASP16) and a diverse private dataset (18 assays). The affinity module also provides well-calibrated uncertainty estimates, addressing a critical gap in compound prioritization for drug discovery. Furthermore, this module enables a continual learning framework and a hedged batch selection strategy that, in simulated drug discovery cycles, achieves 6× greater affinity improvement over greedy approaches.

Abstract:
Temporal causal representation learning methods assume that causal mechanisms switch instantaneously between discrete domains, yet realworld systems often exhibitcontinuousmechanism transitions. For example, a vehicle's dynamics evolve gradually through a turning maneuver, and human gait shifts smoothly from walking to running. We formalize this setting by modeling transitional mechanisms as convex combinations of finitely manyatomic mechanisms, governed by time-varying mixing coefficients. Our theoretical contributions establish that both the latent causal variables and the continuous mixing trajectory are jointly identifiable. We further proposeTRACE, a Mixture-of-Experts framework where each expert learns one atomic mechanism during training, enabling recovery of mechanism trajectories at test time. This formulation generalizes to intermediate mechanism states never observed during training. Experiments on synthetic and real-world data demonstrate that TRACE recovers mixing trajectories with up to 0.99 correlation, substantially outperforming discrete-switching baselines.

Abstract:
Transformers enable incontext learning (ICL) for rapid, gradient-free adaptation in time series forecasting, yet most ICL-style approaches rely on tabularized, hand-crafted features, while end-to-end sequence models lack inference-time adaptation. We bridge this gap with a unified framework, Baguan-TS, which integrates the raw-sequence representation learning with ICL, instantiated by a 3D Transformer that attends jointly over temporal, variable, and context axes. To make this high-capacity model practical, we tackle two key hurdles: (i) calibration and training stability, improved with a feature-agnostic, target-space retrieval-based local calibration; and (ii) output oversmoothing, mitigated via context-overfitting strategy. On public benchmark with covariates, Baguan-TS consistently outperforms established baselines, achieving the highest win rate and significant reductions in both point and probabilistic forecasting metrics. Further evaluations across diverse real-world energy datasets demonstrate its robustness, yielding substantial improvements.

Abstract:
Templatefree retrosynthesis methods treat the task as black-box sequence generation, limiting learning efficiency, while semi-template approaches rely on rigid reaction libraries that constrain generalization. We address this gap with a key insight: atom ordering in neural representations matters. Building on this insight, we propose a structure-aware template-free framework that encodes the two-stage nature of chemical reactions as a positional inductive bias. By placing reaction center atoms at the sequence head, our method transforms implicit chemical knowledge into explicit positional patterns that the model can readily capture. The proposed RetroDiT backbone, a graph transformer with rotary position embeddings, exploits this ordering to prioritize chemically critical regions. Combined with discrete flow matching, our approach decouples training from sampling and enables generation in 20--50 steps versus 500 for prior diffusion methods. Our method achieves state-of-the-art performance on both USPTO-50k (61.2% top-1) and the large-scale USPTO-Full (51.3% top-1) with predicted reaction centers. With oracle centers, performance reaches 71.1% and 63.4% respectively, surpassing foundation models trained on 10 billion reactions while using orders of magnitude less data. Ablation studies further reveal that structural priors outperform brute-force scaling: a 280K-parameter model with proper ordering matches a 65M-parameter model without it.

Abstract:
While largescale text-to-image diffusion models continue to improve in visual quality, their increasing scale has widened the gap between state-of-the-art models and on-device solutions. To address this gap, we introduce NanoFLUX, a 2.4B text-to-image flow-matching model distilled from 17B FLUX.1-Schnell using a progressive compression pipeline designed to preserve generation quality. Our contributions include: (1) A model compression strategy driven by pruning redundant components in the diffusion transformer, reducing its size from 12B to 2B; (2) A ResNet-based token downsampling mechanism that reduces latency by allowing intermediate blocks to operate on lower-resolution tokens while preserving high-resolution processing elsewhere; (3) A novel text encoder distillation approach that leverages visual signals from early layers of the denoiser during sampling. Empirically, NanoFLUX generates 512 × 512 images in approximately 2.5 seconds on mobile devices, demonstrating the feasibility of high-quality on-device text-to-image generation.

Abstract:
Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to crossmodal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model’s LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.

Abstract:
Personalized federated learning has emerged as a popular approach to training on devices holding statistically heterogeneous data, known as clients. However, most existing approaches require a client to have labeled data for training or finetuning in order to obtain their own personalized model. In this paper we address this by proposing FLowDUP, a novel method that is able to generate a personalized model using only a forward pass with unlabeled data. The generated model parameters reside in a lowdimensional subspace, enabling efficient communication and computation. FLowDUP's learning objective is theoretically motivated by our new transductive multi-task PAC-Bayesian generalization bound, that provides performance guarantees for unlabeled clients. The objective is structured in such a way that it allows both clients with labeled data and clients with only unlabeled data to contribute to the training process. To supplement our theoretical results we carry out a thorough experimental evaluation of FLowDUP, demonstrating strong empirical performance on a range of datasets with differing sorts of statistically heterogeneous clients. Through numerous ablation studies, we test the efficacy of the individual components of the method.

Abstract:
Probabilistic superresolution of high-dimensional spatial fields using diffusion models is often computationally prohibitive due to the cost of operating directly in pixel space. We propose PODiff, a structured conditional generative framework that performs diffusion in a fixed, variance-ordered Proper Orthogonal Decomposition (POD) coefficient space, exploiting the orthogonality of POD modes to impose an interpretable, variance-ordered latent geometry. This design enables efficient ensemble generation, preserves dominant spatial structure, and yields spatially interpretable, well-calibrated uncertainty at substantially lower computational cost. We evaluate PODiff on sea surface temperature downscaling over the West Australian coast and on a controlled advection--diffusion benchmark. PODiff achieves reconstruction accuracy comparable to pixel-space diffusion while requiring significantly less memory and producing more reliable uncertainty estimates than deterministic and Monte Carlo Dropout baselines.

Abstract:
Modern AI systems are typically developed through multiple stagespretraining, fine-tuning rounds, and subsequent adaptation or alignment, where each stage builds on the previous ones and updates the model in distinct ways. This raises a critical question of accountability: when a deployed model succeeds or fails, which stage is responsible, and to what extent? We pose the accountability attribution problem for tracing model behavior back to specific stages of the model development process. To address this challenge, we propose a general framework that answers counterfactual questions about stage effects: how would the model's behavior have changed if the updates from a particular stage had not occurred? Within this framework, we introduce estimators that efficiently quantify stage effects without retraining the model, accounting for both the data and key aspects of model optimization dynamics, including learning rate schedules, momentum, and weight decay. We demonstrate that our approach successfully quantifies the accountability of each stage to the model's behavior. Based on the attribution results, our method can identify and remove spurious correlations learned during image classification and text toxicity detection tasks that were developed across multiple stages. Our approach provides a practical tool for model analysis and represents a significant step toward more accountable AI development.

Abstract:
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising approach to improve correctness in LLMs, however, in many scientific problems, the objective is not necessarily to produce the correct answer, but instead to produce a diverse array of candidates which satisfy a set of constraints. We study this challenge in the context of materials generation. To this end, we introduce PLaID++, an LLM posttrained for stable and property-guided crystal generation. We find that applying naive preference optimization to a coordinate-based crystal representation leads to mode collapse. Hence, we introduce a compact, symmetry-informed Wyckoff text representation which improves computational efficiency and encourages generalization from physical priors. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a >50% greater rate than prior methods. We further demonstrate that unified training across conditional and unconditional tasks are mutually beneficial in data-sparse regimes. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.

Abstract:
Inference on the conditional mean function (CMF) is central to tasks from adaptive experimentation to optimal treatment assignment and algorithmic fairness auditing. In this work, we provide a novel asymptotic anytimevalid test for a CMF global null (e.g., that all conditional means are zero) and contrasts between CMFs, enabling experimenters to make high confidence decisions at any time during the experiment beyond a minimum sample size. We provide mild conditions under which our tests achieve (i) asymptotic type-I error guarantees, (i) power one, and, unlike past tests, (iii) optimal sample complexity relative to a Gaussian location testing. By inverting our tests, we show how to construct function-valued asymptotic confidence sequences for the CMF and contrasts thereof. Experiments on both synthetic and real-world data show our method is well-powered across various distributions while preserving the nominal error rate under continuous monitoring.

Abstract:
Protein sequences are abundant in repeating segments, both as exact copies and as approximate segments with mutations. These repeats are important for protein structure and function, motivating decades of algorithmic work on repeat identification. Recent work has shown that protein language models (PLMs) identify repeats, by examining their behavior in maskedtoken prediction. To elucidate their internal mechanisms, we investigate how PLMs detect both exact and approximate repeats. We find that the mechanism for approximate repeats functionally subsumes that of exact repeats. We then characterize this mechanism, revealing two main stages: PLMs first build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino-acid similarity. Then, induction heads attend to aligned tokens across repeated segments, promoting the correct answer. Our results reveal how PLMs solve this biological task by combining language-based pattern matching with specialized biological knowledge, thereby establishing a basis for studying more complex evolutionary processes in PLMs.

Abstract:
Semisupervised learning with manifold regularization is a classical framework for jointly learning from both labeled and unlabeled data, where the key requirement is that the support of the unknown marginal distribution has the geometric structure of a Riemannian manifold. Typically, the Laplace-Beltrami operator-based manifold regularization can be approximated empirically by the Laplacian regularization associated with the entire training data and its corresponding graph Laplacian matrix. However, the graph Laplacian matrix depends heavily on the prespecified similarity metric and may lead to inappropriate penalties when dealing with redundant or noisy input variables. To address the above issues, this paper proposes a new Semi-Supervised Meta Additive Model (S^2MAM) based on a bilevel optimization scheme that automatically identifies informative variables, updates the similarity matrix, and simultaneously achieves interpretable predictions. Theoretical guarantees are provided for S^2MAM, including the computing convergence and the statistical generalization bound. Experimental assessments across 4 synthetic and 12 real-world datasets, with varying levels and categories of corruption, validate the robustness and interpretability of the proposed approach. The implementation is available at https://anonymous.4open.science/r/S2MAM-2731/.

Abstract:
Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking highorder interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework that explicitly models these synergistic HOIs through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments demonstrate that HyperPotter surpasses its baseline by an average relative gain of 22.15% across 11 datasets and outperforms state-of-the-art methods by 13.96% on 4 challenging cross-domain datasets, demonstrating superior generalization to diverse attacks and speakers.

Abstract:
LowRank Adaptation (LoRA) is a standard tool for parameter-efficient finetuning of large models. While it induces a small memory footprint, its training dynamics can be surprisingly complex as they depend on several hyperparameters such as initialization, adapter rank, and learning rate. In particular, it is unclear how the optimal learning rate scales with adapter rank, which forces practitioners to re-tune the learning rate whenever the rank is changed. In this paper, we introduce Maximal-Update Adaptation (\muA), a theoretical framework that characterizes how the "optimal" learning rate should scale with model width and adapter rank to produce stable, non-vanishing feature updates under standard configurations. Our analysis leverages techniques from hyperparameter transfer and reveals that the optimal learning rate exhibits different scaling patterns depending on initialization and LoRA scaling factor. Specifically, we identify two regimes: one where the optimal learning rate remains roughly invariant across ranks, and another where it scales inversely with rank. We further identify a configuration that allows learning rate transfer from LoRA to full finetuning, drastically reducing the cost of learning rate tuning for full finetuning. Experiments across language, vision, vision–language, image generation, and reinforcement-learning tasks validate our scaling rules and show that learning rates tuned on LoRA transfer reliably to full finetuning.

Abstract:
Reinforcement Learning with Verifiable Reward (RLVR) is a powerful method for enhancing the reasoning abilities of Large Language Models, but its full potential is limited by a lack of exploration in two key areas: Depth (the difficulty of problems) and Breadth (the number of training instances). Our analysis of the popular GRPO algorithm reveals a bias that downweights difficult, low-accuracy problems, which are crucial for improving reasoning skills. To address this, we introduce Difficulty Adaptive Rollout Sampling (DARS), a method that re-weights difficult problems by using targeted, multi-stage rollouts. This approach increases the number of rollout outcomes for these harder problems according to our proposed re-balancing schedules and leads to consistent gains in Pass@K. We also found that simply enlarging the rollout size isn't effective and can even harm performance. We also investigated the role of breadth by scaling the batch size and using full-batch updates. This significantly improved Pass@1 performance by maintaining high token-level entropy, which indicates continued exploration and reduced gradient noise. Finally, we present DARS-Breadth, a combined approach that uses DARS with a large breadth of training data. This method demonstrates simultaneous gains in both Pass@K and Pass@1, confirming that depth (adaptive exploration) and breadth (scaling the training data) are orthogonal and essential dimensions for unlocking the full reasoning power of RLVR.

Abstract:
Large visionlanguage models (LVLMs) excel at vision-language tasks but remain vulnerable to backdoor attacks. Most existing backdoor attacks on LVLMs force the model to generate predefined target patterns. However, these fixed-pattern attacks are easy to detect, as the model tends to memorize frequent patterns and exhibits overconfidence on targets given poisoned inputs. To address these limitations, we introduce TokenSwap, a more evasive and stealthy backdoor attack that focuses on the \emphcompositional understanding capabilities of LVLMs. Instead of enforcing a fixed targeted content, TokenSwap subtly disrupts the understanding of object relationships in text. Specifically, it causes the backdoored model to generate outputs that mention the correct objects in the image but misrepresent their relationships (i.e., bags-of-words behavior). During training, TokenSwap injects a visual trigger into selected samples while swapping the grammatical roles of key tokens in the textual answers. Since the poisoned samples differ only subtly from clean ones, an adaptive token-weighted loss is employed to emphasize learning on swapped tokens, strengthening the association between visual triggers and the bags-of-words behavior. Extensive experiments demonstrate that TokenSwap achieves high attack success rates while maintaining evasiveness and stealthiness across multiple benchmarks and LVLM architectures.

Abstract:
Large Language Models (LLMs) show remarkable capabilities, yet their stochastic nexttoken prediction creates logical inconsistencies and reward hacking that formal symbolic systems avoid. To bridge this gap, we introduce a formal logic verification-guided framework that dynamically interleaves formal symbolic verification with the natural language generation process, providing real-time feedback to detect and rectify errors as they occur. Distinguished from previous neuro-symbolic methods limited by passive post-hoc validation, our approach actively penalizes intermediate fallacies during the reasoning chain. We operationalize this framework via a novel two-stage training pipeline that synergizes formal logic verification-guided supervised fine-tuning and policy optimization. Extensive evaluation on six benchmarks spanning mathematical, logical, and general reasoning demonstrates that our 7B and 14B models outperform state-of-the-art baselines by average margins of 10.4% and 14.2%, respectively. These results validate that formal verification can serve as a scalable mechanism to significantly push the performance boundaries of advanced LLM reasoning.

Abstract:
Selfattention dominates the computational and memory cost of long-context LLM inference across both prefill and decode phases. To address this challenge, we introduce Sketch\&Walk Attention, a training-free sparse attention method that determines sparsity with lightweight sketches and deterministic walk. Sketch\&Walk applies Hadamard sketching to get inexpensive approximations of attention scores, then aggregates these estimates across layers via a walk mechanism that captures attention influence beyond direct interactions between tokens. The accumulated walk scores are used to select top-k attention blocks, enabling dynamic sparsity with a single training-free algorithm that applies uniformly to both the prefill and decode phases, together with custom sparse attention kernels. Across a wide range of models and tasks, Sketch\&Walk maintains near-lossless accuracy at 20% attention density and can slightly outperform dense attention in some settings, while achieving up to 6× inference speedup.

Abstract:
Multimodal Large Language Model (MLLM) agents facilitate Graphical User Interface (GUI) automation but struggle with longhorizon, cross-application tasks due to limited context windows. While memory systems provide a viable solution, existing paradigms struggle to adapt to dynamic GUI environments, suffering from a granularity mismatch between high-level intent and low-level execution, and context pollution where the static accumulation of outdated experiences drives agents into hallucination. To address these bottlenecks, we propose the Darwinian Memory System (DMS), a self-evolving architecture that constructs memory as a dynamic ecosystem governed by the law of "survival of the fittest." DMS decomposes complex trajectories into independent, reusable units for compositional flexibility, and implements Utility-driven Natural Selection to track survival value, actively pruning suboptimal paths and inhibiting high-risk plans. This evolutionary pressure compels the agent to derive superior strategies. Extensive experiments on real-world multi-app benchmarks validate that DMS boosts general-purpose MLLMs without training costs or architectural overhead, achieving average gains of 18.0% in success rate and 33.9% in execution stability, while reducing task latency, establishing it as an effective self-evolving memory system for GUI tasks. The implementation is available at https://anonymous.4open.science/r/DMS-C48C.

Abstract:
Large Language Models (LLMs) often provide chainof-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call theseunverbalized biases. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs. We evaluate our pipeline across six LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic task-specific bias discovery.

Abstract:
We provide the first study of the problem of finding differentially private (DP) secondorder stationary points (SOSP) in stochastic (non-convex) minimax optimization. Existing literature either focuses only on first-order stationary points for minimax problems or on SOSP for classical stochastic minimization problems. This work provides, for the first time, a unified and detailed treatment of both empirical and population risks. Specifically, we propose a purely first-order method that combines a nested gradient descent--ascent scheme with SPIDER-style variance reduction and Gaussian perturbations to ensure privacy. A key technical device is a block-wise (q-period) analysis that controls the accumulation of stochastic variance and privacy noise without summing over the full iteration horizon, yielding a unified treatment of both empirical-risk and population formulations. Under standard smoothness, Hessian-Lipschitzness, and strong concavity assumptions, we establish high-probability guarantees for reaching an (\alpha,\sqrt\rho_\Phi \alpha)-approximate second-order stationary point with \alpha = \mathcalO( (\frac\sqrtdn\varepsilon)^2/3) for empirical risk objectives and \mathcalO(\frac1n^1/3 + (\frac\sqrtdn\varepsilon)^1/2) for population objectives, matching the best known rates for private first-order stationarity.

Abstract:
Offpolicy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assuming that better estimators inherently yield superior policies. Although theoretically justified, this estimator-centric approach neglects a critical practical obstacle: challenging optimization landscapes. In this paper, we provide theoretical insights and empirical evidence showing that current OPL methods encounter severe optimization issues, particularly as the action space grows. We show that estimator-aware policy parametrization can mitigate, but not fully resolve, optimization challenges. Building on this, we explore simpler weighted log-likelihood objectives and demonstrate that they enjoy substantially better optimization properties and still recover competitive, often superior, learned policies. Our findings emphasize the necessity of explicitly addressing optimization considerations in the development of OPL algorithms for large action spaces.

Abstract:
Large Language Models (LLMs) have shown strong potential in complex medical reasoning yet face diminishing gains under inference scaling laws. While existing studies augment LLMs with various knowledge types, it remains unclear how effectively the additional costs translate into accuracy. In this paper, we explore how metacognition of LLMs, i.e., their self-awareness of their own knowledge states, can regulate the reasoning process. Specifically, we propose MedCoG, a Medical Meta-Cognition Agent with Knowledge Graph, where the meta-cognitive assessments of task complexity, familiarity, and knowledge density dynamically regulate utilization of procedural, episodic, and factual knowledge. The LLM-centric on-demand reasoning aims to mitigate scaling laws by (1) reducing costs via avoiding indiscriminate scaling, (2) improving accuracy via filtering out distractive knowledge. To validate this, we empirically characterize the scaling curve and introduce inference density to quantify inference efficiency, defined as the ratio of theoretically effective cost to actual cost. Experiments demonstrate the effectiveness and efficiency of MedCoG on five hard sets of medical benchmarks, yielding 5.5× inference density. Furthermore, the Oracle study highlights the significant potential of meta-cognitive regulation.

Abstract:
Onpolicy deep reinforcement learning remains a dominant paradigm for continuous control, yet standard implementations rely on Gaussian actors and relatively shallow MLP policies, often leading to brittle optimization when gradients are noisy and policy updates must be conservative. In this paper, we revisit policy representation as a first-class design choice for on-policy optimization. We study discretized categorical actors that represent each action dimension with a distribution over bins, yielding a policy objective that resembles a cross-entropy loss. Building on architectural advances from supervised learning, we further propose regularized actor networks, while keeping critic design fixed. Our results show that simply replacing the standard actor network with our discretized regularized actor yields consistent gains and achieve the state-of-the-art performance across diverse continuous-control benchmarks.

Abstract:
Graph Neural Networks (GNNs) have shown remarkable capabilities in learning from graphstructured data with various applications such as social analysis and bioinformatics. However, the presence of label noise in real scenarios poses a significant challenge in learning robust GNNs, and their effectiveness can be severely impacted when dealing with noisy labels on graphs, often stemming from annotation errors or inconsistencies. To address this, in this paper we propose a novel approach called ICGNN that harnesses the structure information of the graph to effectively alleviate the challenges posed by noisy labels. Specifically, we first design a novel noise indicator that measures the influence contradiction score (ICS) based on the graph diffusion matrix to quantify the credibility of nodes with clean labels, such that nodes with higher ICS values are more likely to be detected as having noisy labels. Then we leverage the Gaussian mixture model to precisely detect whether the label of a node is noisy or not. Additionally, we develop a soft strategy to combine the predictions from neighboring nodes on the graph to correct the detected noisy labels. At last, pseudo-labeling for abundant unlabeled nodes is incorporated to provide auxiliary supervision signals and guide the model optimization. Experiments on benchmark datasets show the superiority of our approach over competitive baselines in noisy label scenarios.

Abstract:
Large Reasoning Models (LRMs) suffer from sycophantic behavior, where models tend to agree with users' incorrect beliefs and follow misinformation rather than maintain independent reasoning. This behavior undermines model reliability and poses societal risks. Mitigating LRM sycophancy requires monitoring how this sycophancy emerges during the reasoning trajectory; however, current methods mainly focus on judging based on final answers and correcting them, without understanding how sycophancy develops during reasoning processes. To address this limitation, we propose MONICA, a novel Monitorguided Calibration framework that monitors and mitigates sycophancy during model inference at the level of reasoning steps, without requiring the model to finish generating its complete answer. MONICA integrates a sycophantic monitor that provides real-time monitoring of sycophantic drift scores during response generation with a calibrator that dynamically suppresses sycophantic behavior when scores exceed predefined thresholds. Extensive experiments across 12 datasets and 3 LRMs demonstrate that our method effectively reduces sycophantic behavior in both intermediate reasoning steps and final answers, yielding robust performance improvements.

Abstract:
Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without finetuning. While effectively controlling target traits (e.g., persona), coherency degradation remains a major obstacle to safety and practical deployment. We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise. In this work, we identify a sparse subset of attention heads (only three heads) that independently govern persona and style formation, which we termStyle Modulation Heads. Specifically, these heads can be localized via geometric analysis of internal representations, combining layer-wise cosine similarity and head-wise contribution scores. We demonstrate that intervention targeting only these specific heads achieves robust behavioral control while significantly mitigating the coherency degradation observed in residual stream steering. More broadly, our findings show that precise, component-level localization enables safer and more precise model control.

Abstract:
Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarifyto-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our codes and datasets are publicly available at https://anonymous.4open.science/r/Persona2Web-73E8

Abstract:
Accurate nuclear instance segmentation is a pivotal task in computational pathology, supporting datadriven clinical insights and facilitating downstream translational applications. While large vision foundation models have shown promise for zero-shot biomedical segmentation, most existing approaches still depend on dense supervision and computationally expensive fine-tuning. Consequently, training-free methods present a compelling research direction, yet remain largely unexplored. In this work, we introduce SPROUT, a fully training- and annotation-free prompting framework for nuclear instance segmentation. SPROUT leverages histology-informed priors to construct slide-specific reference prototypes that mitigate domain gaps. These prototypes progressively guide feature alignment through a partial optimal transport scheme. The resulting foreground and background features are transformed into positive and negative point prompts, enabling the Segment Anything Model (SAM) to produce precise nuclear delineations without any parameter updates. Extensive experiments across multiple histopathology benchmarks demonstrate that SPROUT achieves competitive performance without supervision or retraining, establishing a novel paradigm for scalable, training-free nuclear instance segmentation in pathology.

Abstract:
Current audiovisual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints and datasets will be released upon acceptance.

Abstract:
While Large Language Models (LLMs) have demonstrated strong zeroshot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, embodied agents must decompose high-level intent into actionable sub-goals while strictly adhering to the logic of a dynamic, observed environment. Standard LLM planners frequently fail to maintain strategy coherence over extended horizons due to context window limitation or hallucinate transitions that violate constraints. We propose GiG, a novel planning framework that structures embodied agents' memory using a \underlineGraph-\underlinein-\underlineGraph architecture. Our approach employs a Graph Neural Network (GNN) to encode environmental states into embeddings, organizing these embeddings into action-connected execution trace graphs within a latent memory bank. By clustering these graph embeddings, the framework enables retrieval of structure-aware priors, allowing agents to ground current decisions in relevant past structural patterns. Furthermore, we introduce a novel bounded lookahead module that leverages symbolic transition logic to enhance the agents' planning capabilities through the grounded action projection. We evaluate our framework on three embodied planning benchmarks—Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, achieving Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld with comparable or lower computational cost.

Abstract:
Autonomous code agents built on large language models are reshaping software and AI development through tool use, longhorizon reasoning, and self-directed interaction. However, this autonomy introduces a previously unrecognized security risk: agentic interaction fundamentally expands the LLM attack surface, enabling systematic probing and recovery of hidden system prompts that guide model behavior. We identify system prompt extraction as an emergent vulnerability intrinsic to code agents and present \justask, a self-evolving framework that autonomously discovers effective extraction strategies through interaction alone. Unlike prior prompt-engineering or dataset-based attacks, \justask requires no handcrafted prompts, labeled supervision, or privileged access beyond standard user interaction. It formulates extraction as an online exploration problem, using Upper Confidence Bound--based strategy selection and a hierarchical skill space spanning atomic probes and high-level orchestration. These skills exploit imperfect system-instruction generalization and inherent tensions between helpfulness and safety. Evaluated on 41 black-box commercial models across multiple providers, \justask consistently achieves full or near-complete system prompt recovery, revealing recurring design- and architecture-level vulnerabilities. Our results expose system prompts as a critical yet largely unprotected attack surface in modern agent systems.

Abstract:
Given that Large Language Models (LLMs) are increasingly applied to automate software development, comprehensive software assurance spans three distinct goals: regression prevention, reactive reproduction, and proactive discovery. Current evaluations systematically overlook the third goal. Specifically, they either constrain models to a compliance trap by treating existing code as the ground truth for regression prevention, or rely on postfailure artifacts (e.g., issue reports) for reactive bug reproduction, failing to expose defects before they manifest as failures. To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments. Comprising 2,389 tasks across 482 repositories, TestExplora conceals all defect-related information, forcing models to uncover bugs by identifying discrepancies between implementation and documentation-derived intent—utilizing documentation as the reference oracle. Furthermore, to ensure sustainable evaluation and mitigate risks of data leakage in static datasets, we propose a continuous, time-aware data collection framework. Our evaluation reveals a significant capability gap: state-of-the-art models achieve a maximum Fail-to-Pass (F2P) rate of only 16.06%. Further analysis indicates that navigating complex cross-module interactions and leveraging agentic exploration are critical to advancing LLMs toward autonomous software quality assurance. Consistent with this, SWEAgent instantiated with GPT-5-mini achieves an F2P of 17.27% and an F2P@5 of 29.7%, highlighting the effectiveness and promise of agentic exploration in proactive bug discovery tasks.

Abstract:
Epidemiological forecasting from surveillance data is a hard problem and hybridizing mechanistic compartmental models with neural models is a natural direction. The mechanistic structure helps keep trajectories epidemiologically plausible, while neural components can capture nonstationary, data-adaptive effects. In practice, however, many seemingly straightforward couplings fail under partial observability and continually shifting transmission dynamics driven by behavior, waning immunity, seasonality, and interventions. We catalog these failure modes and show that robust performance requires making non-stationarity explicit: we extract multi-scale structure from the observed infection series and use it as an interpretable control signal for a controlled neural ODE coupled to an epidemiological model. Concretely, we decompose infections into trend, seasonal, and residual components and use these signals to drive continuous-time latent dynamics while jointly forecasting and inferring time-varying transmission, recovery, and immunity-loss rates. Across seasonal and non-seasonal settings—including early outbreaks and multi-wave regimes—our approach reduces long-horizon RMSE by 15–35%, improves peak timing error by 1–3 weeks, and lowers peak magnitude bias by up to 30% relative strong time-series, neural ODE, and hybrid baselines, without relying on auxiliary covariates.

Abstract:
Multiagent systems (MAS) built on large language models (LLMs) have demonstrated remarkable performance across diverse tasks. Existing approaches optimize communication topology, role assignment, or LLM routing in isolation, while treating each agent as a monolithic unit—failing to exploit internal LLM mixtures that can enhance individual role capabilities. We proposeHieraMAS, a hierarchical agent collaboration framework with intra-node LLM mixtures and inter-node communication topology. HieraAgent introducessupernodes, where each functional role comprises multiple heterogeneous LLMs in a propose-synthesis structure. The optimization ofHieraMASposes unique credit assignment challenges, as final task performance heavily depends on LLM capabilities, potentially causing erroneous reinforcement of suboptimal configurations. We address this via a two-stage algorithm: (1) multi-level reward attribution providing fine-grained feedback at both node and system levels; and (2) graph classification treating topology selection as a holistic task rather than per-edge optimization. Experiments on reasoning and coding benchmarks demonstrate thatHieraMASsignificantly outperforms existing methods while achieving better cost-performance trade-offs.

Abstract:
How do neural networks trained over sequences acquire the ability to perform structured operations, such as arithmetic, geometric, and algorithmic computation? To gain insight into this question, we introduce the sequential group composition task. In this task, networks receive a sequence of elements from a finite group encoded in a real vector space and must predict their cumulative product. The task can be ordersensitive and requires a nonlinear architecture to be learned. Our analysis isolates the roles of the group structure, encoding statistics, and sequence length in shaping learning. We prove that two-layer networks learn this task one irreducible representation of the group at a time in an order determined by the Fourier statistics of the encoding. These networks can perfectly learn the task, but doing so requires a hidden width exponential in the sequence length k. In contrast, we show how deeper models exploit the associativity of the task to dramatically improve this scaling: recurrent neural networks compose elements sequentially in k steps, while multilayer networks compose adjacent pairs in parallel in \log k layers. Overall, the sequential group composition task offers a tractable window into the mechanics of deep learning.

Abstract:
Complex problems, whether in math, logic, or planning, are solved by humans through a sequence of steps where the result of one step informs the next. In this work, we adopt the perspective that the reasoning power of Transformers is fundamentally limited by a fixed maximum number of steps along any latent path of computation. To address this, we introduce Turbo Connection (TurboConn), a novel architecture that overcomes the fixeddepth constraint by routing multiple residual connections from the higher-layer hidden states of each token t to the lower layers of token t+1. Fine-tuning pre-trained LLMs with our method not only yields accuracy gains of 0.9% to over 10% on benchmarks like GSM8K, Parity, and multi-step arithmetic, but also demonstrates that the density of these backward connections is critical; our dense interaction significantly outperforms "sparse" alternatives that only pass a single hidden state or vector. Notably, TurboConn can be integrated into pre-trained LLMs to overcome task-specific plateaus: while a fine-tuned Qwen-3-1.7B achieves only 53.78% on Parity, adding our architectural modification enables the model to reach 100% accuracy, all without the necessity to retrain the full model from scratch or sophisticated curriculum learning. Our results provide strong empirical evidence that the depth of the computational path is a key factor in reasoning ability, also offering a new mechanism to enhance LLMs without significantly affecting generation latency.

Abstract:
Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits scaling of both data volume and model size, despite the availability of vast unlabeled music, as in vision and language. To address this gap, we introduce Pianist Transformer, with three key contributions: 1) introducing largescale self-supervised learning into expressive piano performance rendering through a unified Musical Instrument Digital Interface (MIDI) representation, enabling pre-training on 10B tokens of unlabeled MIDI data; 2) an efficient asymmetric Transformer with note-level compression, substantially improving training efficiency, memory usage, and inference speed for long-context music modeling; 3) a state-of-the-art rendering model with an editable workflow, achieving strong objective and subjective results and enabling integration into real-world music production workflows. Overall, Pianist Transformer outlines a scalable path toward human-like performance synthesis in the music domain. An anonymous demo with audio examples is available at: https://anonymous.4open.science/r/JSKJDHKIOWBBCGFBDKS/.

Abstract:
Modern enterprises generate vast streams of time series metrics when monitoring complex systems, known as observability data. Unlike conventional time series from domains such as climate, observability data are zeroinflated, highly stochastic, and exhibit minimal temporal structure. Despite their importance, observability datasets are underrepresented in public benchmarks due to proprietary restrictions. Existing datasets are often anonymized and normalized, removing scale information and limiting their use for tasks such as anomaly detection, root-cause analysis, and multi-modal reasoning. To address this gap, we introduce TelecomTS, a large-scale observability dataset derived from a 5G telecommunications network. TelecomTS features heterogeneous, de-anonymized covariates with explicit scale information and provides a suite of downstream tasks, including anomaly detection, root-cause analysis, and multi-modal question-answering. Benchmarking state-of-the-art time series, language, reasoning, and multi-modal models reveals that existing approaches struggle with the abrupt, noisy, and high-variance dynamics of observability data. Our experiments also underscore the importance of preserving covariates’ absolute scale, emphasizing the need for foundation time series models that natively leverage scale information for practical observability applications. The code is available at: \urlhttps://anonymous.4open.science/r/TelecomTS_Benchmark-72AF.

Abstract:
In many machine learning applications acquiring reliable groundtruth labels is costly, or unfeasible, leading practitioners to rely on crowdsourcing and aggregation of noisy human annotations. When labels are subjective, however, aggregation may amplify individual biases, particularly with respect to sensitive attributes, raising fairness concerns. Despite this, fairness in crowdsourced aggregation remains largely unexplored, with no existing convergence guarantees and only limited post-processing approaches for enforcing \varepsilon-fairness under demographic parity. We address this gap by analyzing fairness properties of crowdsourcing aggregation methods within the \varepsilon-fairness framework, focusing on Majority Voting and Optimal Bayesian aggregation. In the small-crowd regime, we derive an upper bound on the fairness gap of Majority Voting in terms of the individual annotators’ fairness gaps. We further show that the fairness gap of the aggregated consensus converges exponentially fast to that of the ground truth under interpretable conditions. Since the ground truth itself may still be unfair, we generalize a state-of-the-art multiclass fairness post-processing algorithm from the continuous to the discrete setting, enabling the enforcement of strict demographic parity constraints for any aggregation rule. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of our approach and corroborate the theoretical insights.

Abstract:
Existing multimodal retrieval systems excel at semantic matching but implicitly assume that queryimage relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.

Abstract:
Current stateof-the-art segmentation models encode entire images before focusing on specific objects. This wastes computational resources. We introduce FLIP (Fovea-Like Input Patching), a parameter-efficient vision model that realizes object segmentation through biologically-inspired top-down attention. FLIP selectively samples multi-resolution patches centered on objects of interest from the input. As a result, it allocates high-resolution processing to object centers while maintaining coarser peripheral context. This off-grid, scale-invariant design enables FLIP to outperform META's Segment Anything models (SAM, SAM2 and fast variants) by large margins: With more than 440× fewer parameters, FLIP-Tiny (0.51M parameters) reaches a mean IoU of 78.24% while SAM2-L reaches 75.87% IoU (224.45M parameters). FLIP-Large even achieves 80.33% mean IoU (96.6M parameters), still running about 2.3× faster than SAM2-L. We evaluate on six benchmarks in total. In five established benchmarks (Hypersim, KITTI-360, OpenImages, COCO, LVIS) FLIP consistently outperforms SAM and various variants of it. In our novel ObjaScale dataset, which stress-tests scale invariance with objects ranging from 0.0001% up to 25% of the image area, we show that FLIP segments even very small objects accurately, where existing models fail severely. FLIP opens new possibilities for real-time, object-centric vision applications and offers much higher energy efficiency. We believe that FLIP can act as a powerful foundation model, as it is very well-suited to track objects over time, for example, when being integrated into slot-based scene segmentation architectures.

Abstract:
Generalpurpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim, a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings. VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds. By restricting to single-source audio, we isolate content representation from the confound of source separation. We evaluate embeddings using Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation. To calibrate GSR, we report lift over an empirical permutation baseline. Across diverse foundation models, a simple pipeline, frozen Whisper encoder features, time–frequency pooling, and label-free PCA, yields strong zero-shot performance. However, VocSim also uncovers a consistent generalization gap. On blind, low-resource speech, local retrieval drops sharply. While performance remains statistically distinguishable from chance, the absolute geometric structure collapses, indicating a failure to generalize to unseen phonotactics. As external validation, our top embeddings predict avian perceptual similarity, improve bioacoustic classification, and achieve state-of-the-art results on the HEAR benchmark. We posit that the intrinsic geometric quality measured here proxies utility in unlisted downstream applications. We release data, code, and a public leaderboard to standardize the evaluation of intrinsic audio geometry.

Abstract:
Agentic reasoning enables large reasoning models (LRMs) to dynamically acquire external knowledge, but yet optimizing the retrieval process remains challenging due to the lack of dense, principled reward signals. In this paper, we introduceInfoReasoner, a unified framework that incentivizes effective information seeking via asynthetic semantic information gain reward. Theoretically, we redefine information gain as uncertainty reduction over the model's belief states, establishing guarantees, including nonnegativity, telescoping additivity, and channel monotonicity. Practically, to enable scalable optimization without manual retrieval annotations, we propose an output-aware intrinsic estimator that computes information gain directly from the model's output distributions usingsemantic clustering via bidirectional textual entailment. This intrinsic reward guides the policy to maximize epistemic progress, enabling efficient training via Group Relative Policy Optimxization (GRPO). Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines, achieving up to 5.4% average accuracy improvement. Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval.

Abstract:
Diffusion policy sampling enables reinforcement learning (RL) to represent multimodal action distributions beyond suboptimal unimodal Gaussian policies. However, existing diffusionbased RL methods primarily focus on offline setting for reward maximization, with limited consideration of safety in online settings. To address this gap, we propose Augmented Lagrangian-Guided Diffusion (ALGD), a novel algorithm for off-policy safe RL. By revisiting optimization theory and energy-based modeling, we show that the instability of primal–dual methods arises from the non-convex Lagrangian landscape. In diffusion-based safe RL, the Lagrangian can be interpreted as an energy function guiding the denoising dynamics; counter-intuitively, direct usage destabilizes both policy generation and training. ALGD resolves this issue by introducing an augmented Lagrangian that locally convexifies the energy landscape, yielding a stabilized policy generation and training, without altering the distribution of optimal policy. Theoretical analysis and extensive experiments demonstrate that ALGD is both theoretically grounded and empirically effective, achieving strong and stable performance across diverse environments.

Abstract:
Constantstepsize stochastic approximation is widely used in learning for computational efficiency. For a fixed stepsize, the iterates typically admit a stationary distribution that is rarely tractable. Prior work shows that as the stepsize \alpha \downarrow 0, the centered-and-scaled steady state converges weakly to a Gaussian limit. However, for fixed \alpha, this weak convergence offers no usable error bound for approximating the steady-state by its Gaussian limit. This paper provides explicit, non-asymptotic error bounds for fixed \alpha. We study (i) stochastic gradient descent on smooth strongly convex objectives, (ii) linear SA, and (iii) contractive nonlinear SA, and we treat both i.i.d. and Markovian noise models to ensure broad applicability. Our main results first give dimension- and stepsize-dependent, explicit bounds in Wasserstein distance between the centered-scaled steady state and its Gaussian limit, with errors that vanish as \alpha \downarrow 0. We further derive sharp tail control, comparing the steady-state tail probability to Gaussian tails with an explicit error term that decays in both the deviation level and \alpha. Our analysis combines steady-state Stein's method with moment bounds on the SA iterations, and uses Poisson equation techniques to manage temporal dependence in the Markovian noise setting. We adapt the same toolkit to SGD with general convex objectives and suggest non-Gaussian limiting behavior, which is validated in numerical experiments.

Abstract:
Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a welldefined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting \alpha-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution (\alpha > 1) to intensify logical reasoning, or flattening it (\alpha < 1) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.

Abstract:
Frontier language model agents can exhibit misaligned behaviors, including deception, exploiting reward hacks, and pursuing hidden objectives. To control such agents, we can use LLMs themselves tomonitorfor misbehavior. In this paper, we study howinformation accessaffects LLM monitor performance and find that contemporary systems often perform better with less information, a phenomenon we call thelessis-more effectfor automated oversight. We demonstrate this phenomenon, and analyze the conditions under which it occurs, in three evaluation environments where agents must conduct sabotage while evading monitors. Leveraging this effect, we introduceextract-and-evaluate(EaE) monitoring—a new hierarchical approach where one monitor isolates relevant excerpts from the monitored agent's trajectory and another monitor scores these excerpts. In BigCodeBench-Sabotage with a GPT-4.1-mini monitor, EaE improves sabotage detection rates by 16.8 percentage points over the next-best approach. In other settings, EaE either outperforms or is competitive with baselines. Additionally, we find that agents unaware of being monitored can be caught much more easily, that monitors scoring longer reasoning traces gain more from information filtering, and that monitor performance scales linearly with cost.

Abstract:
Large language models (LLMs) are deployed on increasingly complex tasks that require multistep decision-making. Understanding their algorithmic reasoning abilities is therefore crucial. However, we lack a diagnostic benchmark for evaluating this capability. We propose data structures as a principled lens: as fundamental building blocks of algorithms, they naturally probe structural reasoning—the ability to understand and manipulate relationships such as order, hierarchy, and connectivity that underpin algorithmic reasoning. We introduce DSR-Bench, spanning 20 data structures, 35 operations, and 4,140 problem instances. DSR-Bench features hierarchical task organization, fully automated generation and evaluation, and fine-grained diagnostics. Evaluating 13 state-of-the-art LLMs reveals critical limitations: the top-performing model achieves only 0.46/1 on challenging instances. Three auxiliary probes targeting more realistic usages expose further weaknesses: models perform poorly on spatial data and context-rich scenarios, and they struggle to reason over their own code.

Abstract:
Metalearning methods perform well on new within-distribution tasks but often fail when adapting to out-of-distribution target tasks, where transfer from source tasks can induce negative transfer. We propose a causally-aware Bayesian meta-learning method, by conditioning task-specific priors on precomputed latent causal task embeddings, enabling transfer based on mechanistic similarity rather than spurious correlations. Our approach explicitly considers realistic deployment settings where access to target-task data is limited, and adaptation relies on noisy (expert-provided) pairwise judgments of causal similarity between source and target tasks. We provide a theoretical analysis showing that conditioning on causal embeddings controls prior mismatch and mitigates negative transfer under task shift. Empirically, we demonstrate reductions in negative transfer and improved out-of-distribution adaptation in both controlled simulations and a large-scale real-world clinical prediction setting for cross-disease transfer, where causal embeddings align with underlying clinical mechanisms.

Abstract:
Selfspeculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while navigating the shifting bottlenecks of real-world hardware. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47× wall-clock speedup across various benchmarks. Our plug-and-play approach ensures high-speed inference for long sequences without requiring additional training or compromising the target model's output distribution.

Abstract:
Large visionlanguage models can produce object hallucinations in image descriptions, highlighting the need for effective detection and mitigation strategies. Prior work commonly relies on the model's attention weights on visual tokens as a detection signal. We reveal that coarse-grained attention-based analysis is unreliable due to hidden confounders, specifically token position and object repetition in a description. This leads to Simpson’s paradox: the attention trends reverse when statistics are aggregated. Based on this observation, we introduce HaloProbe, a Bayesian framework that factorizes external description statistics and internal decoding signals to estimate token-level hallucination probabilities. HaloProbe uses balanced training to isolate internal evidence and combines it with learned prior over external features to recover the true posterior. While intervention-based mitigation methods often degrade utility or fluency by modifying models’ internals, we use HaloProbe as an external scoring signal for non-invasive mitigation. Our experiments show that HaloProbe guided decoding reduces hallucinations more effectively than state-of-the-art intervention-based methods while preserving utility.

Abstract:
Plugand-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we proposeDual-Coupled PnP Diffusion, which restores the classical dual variable to provide integral feedback, theoretically guaranteeing asymptotic convergence to the exact data manifold. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver's rigorous optimization trajectory with the denoiser's valid statistical manifold. Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence.

Abstract:
Gradual Domain Adaption (GDA) aims to mitigate domain shift by progressively adapting models from the source domain to the target domain via intermediate domains. However, real intermediate domains are often unavailable or ineffective, necessitating the synthesis of intermediate samples. Flowbased models are recently used for this purpose by interpolating between source and target distributions, but their training typically resorts to sample-based log-likelihood estimation, which can discard useful information and thus degrade GDA performance. The key to addressing this limitation is constructing the intermediate domains via samples directly. To this end, we propose an \underline\textEntropy-regularized \underline\textSemi-dual \underline\textUnbalanced \underline\textOptimal \underline\textTransport (E-SUOT) framework to construct intermediate domains. Specifically, we reformulate flow-based GDA as a Lagrangian dual problem and derive an equivalent objective that circumvents the needs for likelihood estimation. However, the dual problem results in the unstable min–max training procedure. To alleviate this issue, we further introduce entropy regularization to convert it into a more stable alternative optimization procedure. Based on this, we propose a novel GDA training framework and provide theoretical analysis in terms of stability and generalization. Finally, extensive experiments are conducted to demonstrate the efficacy of the E-SUOT framework.

Abstract:
Recently, visionlanguage models have demonstrated increasing influence in morally sensitive domains such as autonomous driving and medical analysis, owing to their powerful multimodal reasoning capabilities. As these models are deployed in high-stakes real-world applications, it is of paramount importance to ensure that their outputs align with human moral values and remain within moral boundaries. However, existing work on moral alignment either focuses solely on textual modalities or relies heavily on AI-generated images, leading to distributional biases and reduced realism. To overcome these limitations, we introduce MORALISE, a comprehensive benchmark for evaluating the \underlinemoral \underlinealignment of v\underlineision-languag\underlinee models (VLMs) using diverse, expert-verified real-world data. We begin by proposing a comprehensive taxonomy of 13 moral topics grounded in Turiel's Domain Theory, spanning the personal, interpersonal, and societal moral domains encountered in everyday life. Built on this framework, we manually curate 2,481 high-quality image-text pairs, each annotated with two fine-grained labels: (1) topic annotation, identifying the violated moral topic(s), and (2) modality annotation, indicating whether the violation arises from the image or the text. For evaluation, we encompass two tasks, moral judgment and moral norm attribution, to assess models' awareness of moral violations and their reasoning ability on morally salient content. Extensive experiments on 19 popular open- and closed-source VLMs show that MORALISE poses a significant challenge, revealing persistent moral limitations in current state-of-the-art models.

Abstract:
Multiagent systems built on large language models (LLMs) are expected to enhance decision-making by pooling distributed information, yet systematically evaluating this capability has remained challenging. We introduce HiddenBench, a 65-task benchmark grounded in the Hidden Profile paradigm, which isolates collective reasoning under distributed information from individual reasoning ability. Evaluating 15 frontier LLMs, we find that multi-agent LLMs achieve only 30.1% accuracy under distributed information, compared to 80.7% accuracy for single agents given complete information. We trace this gap to a systematic failure mode: agents cannot recognize or act under latent information asymmetry—they fail to reason about what others might know but have not yet expressed, leading to premature convergence on shared evidence while critical distributed facts remain unexplored. These failures persist across prompting strategies, communication depths, and group sizes—and worsen as groups scale. While some models (e.g., Gemini-2.5-Flash/Pro) outperform others, neither model scale nor individual reasoning accuracy reliably predicts collective performance. Our results identify failures in collective information exploration in decision-making as a key limitation of multi-agent LLMs, and provide a theory-grounded, reproducible framework for diagnosing collective reasoning failures.

Abstract:
Mixtureof-Experts (MoE) architectures are often considered a natural fit for continual learning because sparse routing should localize updates and reduce interference, yet MoE Transformers still forget substantially even with sparse, well-balanced expert utilization. We attribute this gap to a pre-routing bottleneck: multi-head attention concatenates head-specific signals into a single post-attention router input, forcing routing to act on co-occurring feature compositions rather than separable head channels. We show that this router input simultaneously encodes multiple separately decodable semantic and structural factors with uneven head support, and that different feature compositions induce weakly aligned parameter-gradient directions; as a result, routing maps many distinct compositions to the same route. We quantify this collision effect via a route-wise effective composition number N_\mathrmeff and find that higher N_\mathrmeff is associated with larger old-task loss increases after continual training. Motivated by these findings, we propose MH-MoE, which performs head-wise routing over sub-representations to increase routing granularity and reduce composition collisions. On TRACE with Qwen3-0.6B/8B, MH-MoE effectively mitigates forgetting, improving BWT on Qwen3-0.6B from -11.2% (LoRAMoE) to -4.5%.

Abstract:
The rise of bot accounts on social media poses significant risks to public discourse. To address this threat, modern bot detectors increasingly rely on Graph Neural Networks (GNNs). However, the effectiveness of these GNNbased detectors in real-world settings remains poorly understood. In practice, attackers continuously adapt their strategies as well as must operate under domain-specific and temporal constraints, which can fundamentally limit the applicability of existing attack methods. As a result, there is a critical need for robust GNN-based bot detection methods under realistic, constraint-aware attack scenarios. To address this gap, we introduce BOCLOAK to systematically evaluate the robustness of GNN-based social bot detection via both edge editing and node injection adversarial attacks under realistic constraints. BOCLOAK constructs a probability measure over spatio-temporal neighbor features and learns an optimal transport geometry that separates human and bot behaviors. It then decodes transport plans into sparse, plausible edge edits that evade detection while obeying real-world constraints. We evaluate BOCLOAK across three social bot datasets, five state-of-the-art bot detectors, three adversarial defenses, and compare it against four leading graph adversarial attack baselines. BOCLOAK achieves up to 80.13% higher attack success rates while using 99.80% less GPU memory under realistic real-world constraints. Most importantly, BOCLOAK shows that optimal transport provides a lightweight, principled framework for bridging adversarial attacks and real-world bot detection

Abstract:
Largescale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically necessitate costly retraining or impose rigid external layout constraints, often compromising aesthetic quality and flexibility. We proposeFreeText, a training-free, plug-and-play framework that improves text rendering by leveraging intrinsic mechanisms ofDiffusion Transformer (DiT)models.FreeTextdecomposes the problem intowhere to writeandwhat to write. For the former, we localize writing regions by extracting token-wise spatial attribution from image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For the latter, we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and mitigate semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while maintaining semantic alignment and aesthetic quality, with modest inference overhead.

Abstract:
Decision trees and diffusion models are ostensibly disparate model classes, one discrete and hierarchical, the other continuous and dynamic. This work unifies the two by establishing a crisp mathematical correspondence between hierarchical decision trees and diffusion processes in appropriate limiting regimes. Our unification reveals a shared optimization principle: \emphGlobal Trajectory Score Matching (GTSM), for which gradient boosting (in an idealized version) is asymptotically optimal. We underscore the conceptual value of our work through two key practical instantiations: \treeflow, which achieves competitive generation quality on tabular data with higher fidelity and a 2× computational speedup, and \dsmtree, a novel distillation method that transfers hierarchical decision logic into neural networks, matching teacher performance within 2% on many benchmarks.

Abstract:
Posttraining of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We propose PEAR (Policy Evaluation–inspired Algorithm for Offline Learning Loss Reweighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks on Qwen2.5/3 and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains up to a 14.6% on AIME-2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.

Abstract:
Reinforcement learning has become a cornerstone for enhancing the reasoning capabilities of Large Language Models, where groupbased approaches such as GRPO have emerged as efficient paradigms that optimize policies by leveraging intra-group performance differences. However, these methods typically rely on absolute numerical rewards, introducing intrinsic limitations. In verifiable tasks, identical group evaluations often result in sparse supervision, while in open-ended scenarios, the score range instability of reward models undermines advantage estimation based on group means. To address these limitations, we proposeReinforcement Learning with Relative Rewards (RLRR), a framework that shifts reward shaping from absolute scoring to relative ranking. Complementing this framework, we introduce theRanking Reward Model, a listwise preference model tailored for group-based optimization to directly generate relative rankings. By transforming raw evaluations into robust relative signals, RLRR effectively mitigates signal sparsity and reward instability. Experimental results demonstrate that RLRR yields consistent performance improvements over standard group-based baselines across reasoning benchmarks and open-ended generation tasks. Our source code is available at https://anonymous.4open.science/r/RLRR-55AF.

Abstract:
Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large perlayer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key--value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.

Abstract:
Masked Diffusion Language Models (MDLMs) have recently emerged as a promising alternative to Autoregressive Language Models (ARLMs), leveraging a denoising objective that, in principle, should enable more uniform context utilisation. In this work, we examine the context comprehension abilities of MDLMs and uncover two key limitations. First, despite their more global training objective and bidirectional attention mechanism, similarly to ARLMS,MDLMs exhibit a strong locality bias: performance is highly sensitive to the position of relevant information within the input, favouring local over distant context. Second, appending a large number ofmask tokens—required for generation—can significantly degrade context comprehensionin models trained from scratch. Through systematic ablations, we find that these masksact as distractors, reducing the model's ability to process relevant information. To address and further study this undesirable behaviour, we introduce the maskagnostic loss function that encourages predictions to remain invariant to the number of appended masks. Fine-tuning with this objective substantially mitigates the distracting effect of masks, improving robustness of MDLMs. Overall, our findings reveal critical limitations of the current MDLM training paradigm, with implications for training, evaluation and deployment.

Abstract:
Mixedinteger linear programming (MILP) is a foundational framework for combinatorial optimization across science and engineering, but remains hard to solve at scale due to NP-hardness.Recent learning-based methods typically model MILP instances as variable–constraint bipartite graphs and use Graph Neural Networks (GNNs) for representation learning, yet their locality limits representation power.We propose an attention-driven neural backbone that adopts an element-centric view of variables and constraints, with dual attention performing parallel intra-type self-attention and inter-type cross-attention.Across three representative tasks at the instance, element, and solving-state levels, our model consistently outperforms conventional GNN-based architectures, highlighting attention-based, element-centric modeling as a powerful foundation for learning-enhanced combinatorial optimization.

Abstract:
Large language models (LLMs) with extended context windows enable powerful applications but impose significant memory overhead, as caching all key–value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layerwise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key–value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache's output-aware scores consistently improves long-context accuracy.

Abstract:
The SWEBench Verified leaderboard is approaching saturation, with the top system achieving 78.80%. However, we reveal that this performance is inflated: our re-evaluation demonstrates that one in five "solved" patches from the top-30 agents are semantically incorrect, passing only because weak test suites fail to expose their errors. We present SWE-ABS, an adversarial framework that strengthens test suites through a two-stage pipeline: (1) coverage-driven augmentation utilizing program slicing to target untested code regions, and (2) mutation-driven adversarial testing that synthesizes plausible-but-incorrect patches to expose semantic blind spots. On SWE-Bench Verified (500 instances), SWE-ABS strengthens 48.4% of instances (a 24.2× improvement over prior work) and rejects 21.4% of previously passing patches. Consequently, the top agent's score decreases from 78.80% to 61.80%, causing significant leaderboard reshuffling (e.g., the top-ranked agent drops to 4th place).

Abstract:
While the shift toward unified foundation models has revolutionized many deep learning domains, sleep medicine remains largely restricted to taskspecific models that focus on localized micro-structure features. These approaches often neglect the rich, multi-modal context of Polysomnography (PSG) and fail to capture the global macro-structure of a full night's sleep. To address this, we introduce SleepMaMi, a Sleep Foundation Model engineered to master both hour-long sleep architectures and fine-grained signal morphologies. Our framework utilizes a hierarchical dual-encoder design: a Macro-Encoder to model full-night temporal dependencies and a Micro-Encoder to capture short-term characteristics from biosignals. Macro-Encoder is trained via Demographic-Guided Contrastive Learning, which aligns overnight sleep patterns with objective subject metadata, such as age and sex, to refine global representations. Micro-Encoder is optimized via a hybrid Masked Autoencoder (MAE) and multi-modal contrastive objective. Pre-trained on a massive corpus of >20,000 PSG recordings (15.8K hours), SleepMaMi outperforms existing foundation models across a diverse suite of downstream tasks, demonstrating superior generalizability and label-efficient adaptation for clinical sleep analysis.

Abstract:
Realworld graph datasets often arise from mixtures of populations, where graphs are generated by multiple distinct underlying distributions. In this work, we propose a unified framework that explicitly models graph data as a mixture of probabilistic graph generative models represented by graphons. To characterize and estimate these graphons, we leverage graph moments (motif densities) to cluster graphs generated from the same underlying model. We establish a novel theoretical guarantee, deriving a tighter bound showing that graphs sampled from structurally similar graphons exhibit similar motif densities with high probability. This result enables principled estimation of graphon mixture components. We show how incorporating estimated graphon mixture components enhances two widely used downstream paradigms: graph data augmentation via mixup and graph contrastive learning. By conditioning these methods on the underlying generative models, we develop graphon-mixture-aware mixup (GMAM) and model-aware graph contrastive learning (MGCL). Extensive experiments on both simulated and real-world datasets demonstrate strong empirical performance. In supervised learning, GMAM outperforms existing augmentation strategies, achieving new state-of-the-art accuracy on 6 out of 7 datasets. In unsupervised learning, MGCL performs competitively across seven benchmark datasets and achieves the lowest average rank overall.

Abstract:
Testtime scaling improves LLM performance by generating multiple candidate solutions, yet token-level sampling requires temperature tuning that trades off diversity against stability. Fine-grained MoE, featuring hundreds of well-trained experts per layer and multi-expert activation per token, offers an unexplored alternative through its rich routing space. We empirically characterize fine-grained MoE routing and uncover an informative pattern: router scores exhibit a certain head of high-confidence experts followed by an uncertain tail of low-confidence candidates. While single-run greedy accuracy remains stable when fewer experts are activated, multi-sample pass@n degrades significantly—suggesting that the certain head governs core reasoning capability while the uncertain tail correlates with reasoning diversity. Motivated by these findings, we propose Expert-Sample, a training-free method that preserves high-confidence selections while injecting controlled stochasticity into the uncertain tail, enabling diverse generation without destabilizing outputs. Evaluated on multiple fine-grained MoE models across math, knowledge reasoning, and code tasks, Expert-Sample consistently improves pass@n and verification-based accuracy. On Qwen3-30B-A3B-Instruct evaluated on GPQA-Diamond with 32 parallel samples, pass@32 rises from 85.4% to 91.9%, and accuracy improves from 59.1% to 62.6% with Best-of-N verification.

Abstract:
We study online configuration selection with admission control problem, which arises in LLM serving, GPU scheduling, and revenue management. In a planning horizon with T periods, we consider a twolayer framework for the decisions made within each time period. In the first layer, the decision maker selects one of the K configurations (ex. quantization, parallelism, fare class) which induces distribution over the reward-resource pair of the incoming request. In the second layer, the decision maker observes the request and then decides whether to accept it or not. Benchmarking this framework requires care. We introduce a switching-aware fluid oracle that accounts for the value of mixing configurations over time, provably upper-bounding any online policy. We derive a min-max formulation for evaluating the benchmark, and we characterize saddle points of the max-min problem via primal-dual optimality conditions linking equilibrium, feasibility, and complementarity. This guides the design of SP-UCB--OLP algorithm, which solves an optimistic saddle point problem and achieves \tildeO(\sqrtKT) regret.

Abstract:
Large language models have achieved remarkable success in recent years, primarily due to selfattention. However, traditional Softmax attention suffers from numerical instability and reduced performance as the number of inference tokens increases. This work addresses these issues by proposing a new design principle for attention, viewing it as a two-stage process. The first stage (normalisation) refines standard attention by replacing Softmax with the more numerically stable Softplus followed by l_1-normalisation. Furthermore, we introduce a dynamic scale factor based on invariance entropy. We show that this novel attention mechanism outperforms conventional Softmax attention, and state-of-the-art Softmax-free alternatives. Our second proposal is to introduce a second processing stage (sharpening) which consists of a re-weighting mechanism that amplifies significant attentional weights while diminishing weaker ones. This enables the model to concentrate more effectively on relevant tokens, mitigating the attention sink phenomenon, and fundamentally improving length extrapolation. This novel, two-stage, replacement for self-attention is shown to ensure numerical stability and dramatically improve length extrapolation, maintaining a nearly constant validation loss at 16× the training length while achieving superior results on challenging long-context retrieval tasks and standard downstream benchmarks. Furthermore, symbolic regression experiments demonstrate that our method enables models to recover Newton's gravitational law from orbital trajectory sequences, providing evidence that appropriate attention mechanisms are crucial in order for foundation models to develop genuine physical world models.

Abstract:
Unraveling how macroscopic cognitive phenotypes emerge from microscopic neuronal connectivity remains one of the core pursuits of neuroscience. To this end, researchers typically leverage multimodal information from structural connectivity (SC) and functional connectivity (FC) to complete downstream tasks. Recent methodologies explore the intricate coupling mechanisms between SC and FC, attempting to fuse their representations at the regional level. However, lacking fundamental neuroscientific insight, these approaches fail to uncover the latent interactions between neural regions underlying these connectomes, and thus cannot explain why SC and FC exhibit dynamic states of both coupling and heterogeneity. In this paper, we formulate multi-modal fusion through the lens of neural communication dynamics and propose the Adaptive Flow Routing Network (AFR-Net), a physics-informed framework that models how structural constraints (SC) give rise to functional communication patterns (FC), enabling interpretable discovery of critical neural pathways. Extensive experiments demonstrate that AFR-Net significantly outperforms state-of-the-art baselines. The code is available at \urlhttps://anonymous.4open.science/r/DIAL-F0D1.

Abstract:
Bridging highlevel semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and optimization "Loss Collapse". In this work, we propose ResVLA, a novel architecture that shifts the paradigm to "Refinement-from-Intent". Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive evaluations on LIBERO and the challenging LIBERO-Plus benchmarks demonstrate that ResVLA achieves state-of-the-art performance. Notably, our approach exhibits exceptional robustness against semantic drift and kinematic perturbations while achieving significantly faster convergence than standard generative baselines.

Abstract:
While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to aHard Completion (HC)paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million realworld interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user's subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we proposeAdaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation that filling placeholders incurs lower cost than correcting errors, we prove the existence of a critical entropy threshold above which APC achieves strictly lower expected cost than HC. We instantiate this framework by constructing training data from filtered real-world edit logs and design a cost-based reward function for reinforcement learning. Extensive evaluations across 1.5B--14B parameter models demonstrate that APC reduces expected editing costs from 19% to 50% while preserving standard HC performance. Our work provides both a theoretical foundation and a practical training framework for uncertainty-aware code completion, demonstrating that adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.

Abstract:
Augmented large language models (LLMs) that invoke external calls are increasingly prevalent in inference serving. However, such augmentations pose significant challenges to inference efficiency under strict ServiceLevel Objectives (SLOs). Existing inference systems are agnostic to the dynamic execution behaviors induced by external calls and rely on fixed batch-level token budget, which leads to severe Head-of-Line (HoL) blocking and substantially reduced effective throughput. We present AugServe, an efficient augmented LLM inference serving framework that mitigates request queuing latency and improves effective throughput under external-call-augmented workloads. AugServe integrates state-aware request scheduling with dynamic batch-level token budgets to adapt to heterogeneous requests and their dynamically changing execution states. Experimental results show that AugServe achieves 6.5× and 4.7× higher effective throughput than vLLM and INFERCEPT, respectively.

Abstract:
Despite recent successes, testtime scaling -i.e., dynamically expanding the token budget during inference as needed- remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing total visual tokens count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the V^ VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task despite requiring a 200× lower token budget.

Abstract:
An increasing number of LLMs employ Mixtureof-Experts (MoE) architectures where the feed-forward layer is replaced by a pool of experts and each token only activates a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even for moderate batch sizes because the average expert load grows more slowly than in an equivalent dense feedforward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing token-to-expert mapping to lower this number (and thus, the decode latency) while preserving a comparable quality. Our best results use a batch-aware routing that works by having tokens piggyback experts that have already been loaded into memory due to being crucial to other tokens within the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of 16. Without any statistically significant loss in accuracy, our approach achieves latency reductions of 39% and 15% in the MoE layer decode latency, respectively.

Abstract:
Advancing complex reasoning in large language models relies on highquality, verifiable datasets, yet human annotation remains cost-prohibitive and difficult to scale. Current synthesis paradigms often face a recurring trade-off: maintaining structural validity typically restricts problem complexity, while relaxing constraints to increase difficulty frequently leads to inconsistent or unsolvable instances. To address this, we propose Agentic Proposing, a framework that models problem synthesis as a goal-driven sequential decision process where a specialized agent dynamically selects and composes modular reasoning skills. Through an iterative workflow of internal reflection and tool-use, we develop the Agentic-Proposer-4B using Multi-Granularity Policy Optimization (MGPO) to generate high-precision, verifiable training trajectories across mathematics, coding, and science. Empirical results demonstrate that downstream solvers trained on agent-synthesized data significantly outperform leading baselines and exhibit robust cross-domain generalization. Notably, a 30B solver trained on only 11,000 synthesized trajectories achieves a state-of-the-art 91.6% accuracy on AIME25, rivaling frontier-scale proprietary models such as GPT-5 and proving that a small volume of high-quality synthetic signals can effectively substitute for massive human-curated datasets.

Abstract:
We propose a novel benchmarking methodology for graph neural networks (GNNs) based on the graph alignment problem, a combinatorial optimization task that generalizes graph isomorphism by aligning two unlabeled graphs to maximize overlapping edges. We frame this problem as a selfsupervised learning task and present several methods to generate graph alignment datasets using synthetic random graphs and real-world graph datasets from multiple domains. For a given graph dataset, we generate a family of graph alignment datasets with increasing difficulty, allowing us to rank the performance of various architectures. Our experiments prove that there is an optimal task difficulty for having a statistically relevant ranking of different models and that, even on a structure-only task, anisotropic models perform better compared to isotropic ones. To further prove that our synthetic task capture meaningful information, we show its effectiveness for self-supervised GNN pre-training: the learned node embeddings can be leveraged as positional encodings by transformers for graph regression or can be used to reconstruct the full structure of the graph with 98% accuracy. To support reproducibility and further research, we provide an open-source Python package to generate graph alignment datasets and benchmark new GNN architectures.

Abstract:
Allin-One Image Restoration (AiOIR) faces the fundamental challenge in reconciling conflicting optimization objectives across heterogeneous degradations. Existing methods are often constrained by coarse-grained control mechanisms or fixed mapping schedules, yielding suboptimal adaptation. To address this, we propose an Uncertainty-Aware Diffusion Bridge Model (UDBM), which innovatively reformulates AiOIR as a stochastic transport problem steered by pixel-wise uncertainty. By introducing a relaxed diffusion bridge formulation which replaces the strict terminal constraint with a relaxed constraint, we model the uncertainty of degradations while theoretically resolving the drift singularity inherent in standard diffusion bridges. Furthermore, we devise a dual modulation strategy: the noise schedule aligns diverse degradations into a shared high-entropy latent space, while the path schedule adaptively regulates the transport trajectory motivated by the viscous dynamics of entropy regularization. By effectively rectifying the transport geometry and dynamics, UDBM achieves state-of-the-art performance across diverse restoration tasks within a single inference step.

Abstract:
Diffusion Bridge and Flow Matching have both demonstrated compelling empirical performance in transformation between arbitrary distributions. However, there remains confusion about which approach is generally preferable, and the substantial discrepancies in their modeling assumptions and practical implementations have hindered a unified theoretical account of their relative merits. We have, for the first time, provided a unified theoretical and experimental validation of these two models. We recast their frameworks through the lens of Stochastic Optimal Control and prove that the cost function of the Diffusion Bridge is lower, guiding the system toward more stable and natural trajectories. Simultaneously, from the perspective of Optimal Transport, interpolation coefficients t and 1t of Flow Matching become increasingly ineffective when the training data size is reduced. To corroborate these theoretical claims, we propose a novel, powerful architecture for Diffusion Bridge built on a latent Transformer, and implement a Flow Matching model with the same structure to enable a fair performance comparison in various experiments. Comprehensive experiments are conducted across Image Restoration, Translation, and Style Transfer tasks, systematically varying both the distributional discrepancy (different difficulty) and the training data size. Extensive empirical results align perfectly with our theoretical predictions and allow us to delineate the respective advantages and disadvantages of these two models. Our code is available at \urlhttps://anonymous.4open.science/r/DBFM-3E8E/.

Abstract:
The de facto approach in video objectcentric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots initialize from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D, MOVi-E, and YouTube-VIS. Project page: https://magenta-sherbet-85b101.netlify.app/

Abstract:
Large language models (LLMs) are increasingly deployed as autonomous agents for multiturn decision-making tasks. However, current agents typically rely on fixed cognitive patterns: non-thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long-horizon tasks, where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution. In this paper, we introduce CogRouter, a framework that trains agents to dynamically adapt cognitive depth at each step. Grounded in ACT-R theory, we design four hierarchical cognitive levels ranging from instinctive responses to strategic planning. Our two-stage training approach includes Cognition-aware Supervised Fine-tuning (CogSFT) to instill stable level-specific patterns, and Cognition-aware Policy Optimization (CoPO) for step-level credit assignment via confidence-aware advantage reweighting. The key insight is that appropriate cognitive depth should maximize the confidence of the resulting action. Experiments on ALFWorld and ScienceWorld demonstrate that CogRouter achieves state-of-the-art performance with superior efficiency.

Abstract:
Reinforcement learning (RL) trained language model agents with tool access are increasingly deployed in coding assistants, research tools, and autonomous systems. We introduce the Reward Hacking Benchmark (RHB), a suite of multistep tasks requiring sequential tool operations with naturalistic shortcut opportunities such as skipping verification steps, inferring answers from task-adjacent metadata, or tampering with evaluation-relevant functions. RHB supports independent and chained task regimes, where chain length acts as a proxy for longer-horizon agent behavior. We evaluate 13 frontier models from OpenAI, Anthropic, Google, and DeepSeek. Exploit rates range from 0% (Claude Sonnet 4.5) to 13.9% (DeepSeek-R1-Zero), varying sharply by post-training style. A controlled sibling comparison (DeepSeek-V3 vs. DeepSeek-R1-Zero) shows RL post-training is associated with substantially higher reward hacking (0.6% vs. 13.9%), with consistent gaps across all four task families. We identify six exploit categories and find that 72% of reward hacking episodes include explicit chain-of-thought rationale, suggesting models often frame exploits as legitimate problem-solving. Simple environmental hardening reduces exploit rates by 88% without degrading task success. Models with near-zero exploit rates on standard tasks show elevated rates on harder variants, suggesting production post-training suppresses reward hacking only below a complexity threshold where honest solutions remain tractable.

Abstract:
While conformal prediction provides robust marginal coverage guarantees, achieving reliable conditional coverage for specific inputs remains challenging. Although exact distributionfree conditional coverage is impossible with finite samples, recent work has focused on improving the conditional coverage of standard conformal procedures. Distinct from approaches that target relaxed notions of conditional coverage, we directly minimize the mean squared error of conditional coverage by refining the quantile regression components that underpin many conformal methods. Leveraging a Taylor expansion, we derive a sharp surrogate objective for quantile regression: a density-weighted pinball loss, where the weights are given by the conditional density of the conformity score evaluated at the true quantile. We propose a three-headed quantile network that estimates these weights via finite differences using auxiliary quantile levels at 1-\alpha \pm \delta, subsequently fine-tuning the central quantile by optimizing the weighted loss. We provide a theoretical analysis with exact non-asymptotic guarantees characterizing the resulting excess risk. Extensive experiments on diverse high-dimensional real-world datasets demonstrate remarkable improvements in conditional coverage performance.

Abstract:
Repairing system crashes discovered by kernel fuzzers like Syzkaller is a critical yet underexplored challenge in software engineering. While recent works have introduced Large Language Model (LLM) based agents for Linux kernel crashresolution, their evaluation benchmarks are usually static and thus, do not capture the evolving nature of the Linux kernel, and suffer from potential data contamination due to LLM knowledge cutoffs. To address the above problem, we present (i) Live-kBench, an evaluation framework for self-evolving benchmarks that continuously scrapes and evaluates agents on freshly discovered kernel bugs, and (ii) kEnv, an agent-agnostic standardized crash-resolution environment for kernel compilation, execution, and feedback. This design decouples agent workflows from heavy-weight execution, enabling fair and scalable comparison across diverse agent frameworks under identical conditions. To this end, we curate an inaugural dataset of 534 Linux kernel bugs and empirically demonstrate a significant performance gap, with agents achieving up to 25% higher equivalent patch rate on bugs fixed before the LLM knowledge cutoff. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt (plausible patches); however only ~20% of generated patches closely match developer fixes. Additionally, exposing crash resolution feedback improves crash resolution rate by 29%. Live-kBench provides the community with an evaluation infrastructure for self-evolving benchmarks that is both time and attribute sensitive; complete with a public dashboard to track agent progress on Linux kernel bugs.

Abstract:
Binary classification based on predicted probabilities (scores) is a fundamental task in supervised machine learning. While thresholding scores is Bayesoptimal in the unconstrained setting, using a single threshold generally violates statistical group fairness constraints. Under independence (statistical parity) and separation (equalized odds), such thresholding suffices when the scores already satisfy the corresponding criterion. However, this does not extend to sufficiency: even perfectly group-calibrated scores---including true class probabilities---violate predictive parity after thresholding. In this work, we present an exact solution for optimal binary (randomized) classification under sufficiency, assuming finite sets of group-calibrated scores. We provide a geometric characterization of the feasible pairs of positive predictive value (PPV) and false omission rate (FOR) achievable by such classifiers, and use it to derive a simple post-processing algorithm that attains the optimal classifier using only group-calibrated scores and group membership. Finally, since sufficiency and separation are generally incompatible, we identify the classifier that minimizes deviation from separation subject to sufficiency, and show that it can also be obtained by our algorithm, often achieving performance comparable to the optimum.

Abstract:
As large language models (LLMs) are increasingly deployed in realworld systems, they must support post-hoc removal of specific content to meet privacy and governance requirements. This motivates selective unlearning, which suppresses information about a particular entity or topic while preserving the LLM’s general utility. However, most existing LLM unlearning methods require access to the original training corpus and rely on output-level refusal tuning or broad gradient updates, creating a tension among unlearning strength, non-target preservation, and data availability. We propose Geometric Unlearning (GU), an approach that operates directly on the model’s prompt-time planning states without access to the original training corpus. GU distills a compact, low-rank geometry of desired safe behavior from a small set of safe reference prompts, and uses lightweight anchor-in-context synthetic prompts to trigger localized, projection-based alignment of hidden planning representations to this safe geometry. A teacher-distillation regularizer on synthetic non-target anchors further reduces collateral drift. Across privacy-oriented unlearning benchmarks (ToFU and UnlearnPII), GU achieves strong target suppression with minimal impact on non-target performance, demonstrating that effective unlearning can be achieved with minimal synthetic data.

Abstract:
As frontier AI systems become increasingly capable, concerns about deceptive behaviors have intensified. Unlike hallucinations, which stem from capability limitations, deception involves strategically misleading responses despite correct internal representations. While prior work has primarily studied deception in textonly settings, little is known about how such behaviors manifest in multimodal large language models. In this work, we systematically investigate multimodal deception and introduce MM-DeceptionBench, the first benchmark designed to evaluate deceptive behaviors in vision–language models across six realistic categories. We find that existing text-centric monitoring approaches are insufficient in multimodal settings due to the complexity of cross-modal reasoning. To address this gap, we propose debate with images, a multi-agent evaluation framework that enforces visual grounding through adversarial debate. Experiments show that this approach achieves substantially higher agreement with human judgments than MLLM-as-a-judge baselines, improving Cohen’s kappa by up to 1.5× and accuracy by up to 1.25× on GPT-4o.

Abstract:
Building generalist robot policies that can handle diverse tasks in openended environments is a central challenge in robotics. To leverage knowledge from large-scale pretraining, prior work (VLA) has typically built generalist policies either on top of vision-language models (VLMs) or generative models. However, both semantic understanding from vision-language pretraining and visual dynamics modeling from visual-generation pretraining are crucial for embodied robots. Recent unified models of generation and understanding have demonstrated strong capabilities in both comprehension and generation through large-scale pretraining. We posit that robotic policy learning can likewise benefit from the combined strengths of understanding, planning and continuous future representation learning. Building on this insight, we introduce UniCoD, which acquires the ability to dynamically model high-dimensional visual features through pretraining on over 1M internet-scale instructional manipulation videos. Subsequently, UniCoD is fine-tuned on data collected from the robot embodiment, enabling the learning of mappings from predictive representations to action tokens. Extensive experiments show our approach consistently outperforms baseline methods in terms of 9% and 12% across simulation environments and real-world out-of-distribution tasks.

Abstract:
Finetuning LLMs is necessary for various dedicated downstream tasks, but classic backpropagation-based fine-tuning methods require substantial GPU memory. To this end, a recent work, MeZO, which relies solely on forward passes to fine-tune LLMs, significantly reduces GPU requirements at the cost of slower convergence due to its indifference to loss landscapes. Standard solutions, such as Adam, explore loss landscapes by estimating the first- and second-order moments and storing them in memory to guide the model's movement through dimensions with lower curvature and vice versa. However, directly applying Adam negates MeZO's advantage as it will triple the memory requirement. In light of this, we propose AdaMeZO, a zeroth-order optimizer that leverages Adam-style first- and second-moment estimates without maintaining them in memory. We present a theoretical analysis of AdaMeZO, corroborated by extensive experiments demonstrating its performance, showing that it can outperform MeZO while requiring up to 70% fewer forward passes. Trajectory visualizations affirm AdaMeZO's ability to adapt to diverse loss landscapes.

Abstract:
While Large Language Models (LLMs) excel in languagebased agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial-and-error, which is computationally unsustainable for parameter-heavy LLMs operating in a high dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight "scouts" (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding LLMs. The collected trajectories are utilized to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while saving about 60% GPU hours consumption.

Abstract:
Offline Reinforcement Learning (RL) relies on static datasets and often enforces conservative constraints to mitigate outof-distribution errors, but this inevitably gives rise to learning dataset biases and limited behavioral generalization. Recent Data Augmentation (DA) methods leverage generative models to enrich offline data, yet they mainly operate within a single rollout paradigm and tend to preserve the original trajectory-level connectivity of the dataset. As a result, such methods often introduce local variations and fail to recover connections between distinct behavior patterns. In this paper, we propose Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework that explicitly addresses this limitation. BiTrajDiff decomposes trajectory synthesis into two independent diffusion processes that generate forward-future and backward-history segments conditioned on shared intermediate anchor states. By stitching the generated segments at these anchors, BiTrajDiff can synthesize trajectories that bridge disconnected behavior patterns and recover global trajectory-level connectivity absent from the original data. Extensive experiments on the D4RL benchmark demonstrate that BiTrajDiff consistently outperforms advanced DA methods across a range of offline RL backbones.

Abstract:
Large reasoning models (LRMs) spend substantial testtime compute on long chain-of-thought (CoT) traces, but whatcharacterizesan effective CoT remains unclear. While prior work reports gains from lengthening CoTs and increasing review via appendedwaittokens, recent studies suggest that shorter thinking can outperform longer traces. We therefore conduct a systematic evaluation across ten LRMs on math and scientific reasoning. Contrary to the “longer-is-better” narrative, we find that both naively using longer CoTs and more review behaviors are associated withloweraccuracy. As CoT unfolds step by step, token-level metrics can conflate verbosity with process quality. We introduce a graph view of CoT to extract structure and identify a single statistic—theFailed-Step Fraction(FSF), the fraction of steps in abandoned branches—that consistently outpredicts length and review ratio for correctness across models. To probe causality, we design two interventions. First, we rank candidate CoTs by each metric at test time, where FSF yields the largest pass@1 gains; second, we edit CoTs to remove failed branches, which significantly improves accuracy, indicating that failed branches bias subsequent reasoning. Taken together, these results characterize effective CoTs as those thatfail lessand supportstructure-awaretest-time scaling over indiscriminately generating long CoTs.

Abstract:
Code Large Language Models (CodeLLMs) have been widely adopted for Natural Language to Programming Language code generation, powering applications with large user bases. Their performance, however, varies sharply across programming languages (PLs) and is particularly suboptimal for lowresource PLs due to data scarcity, limiting their overall usability. In this work, we introduce CodeChemist, a simple yet effective, training-free test-time scaling framework that transfers the model's functional knowledge from high-resource to low-resource PLs via synthesized test cases, without relying on external models. Specifically, CodeChemist first applies multi-temperature hedged sampling to generate a pool of candidate solutions in the low-resource PL and synthesizes a set of test inputs. It then estimates uncertainty: when uncertainty is low, it selects the output via in-language majority voting; otherwise, it constructs cross-lingual I/O test oracles by executing high-resource reference programs and selects the candidate with the highest pass rate. Extensive experiments demonstrate that CodeChemist significantly outperforms existing test-time scaling methods, improving code generation for both low-resource PLs (e.g., Lua) and complex-syntax PLs (e.g., C++, Java) without retraining.

Abstract:
Pretrained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their outputs often deviate from the characteristics of user-specific target data. Such mismatches are especially problematic in domain adaptation tasks, where only a few reference examples are available and retraining the diffusion model is infeasible. Existing inference-time guidance methods can adjust sampling trajectories, but they typically optimize surrogate objectives such as classifier likelihoods rather than directly aligning with the target distribution. We propose MMD Guidance, a training-free mechanism that augments the reverse diffusion process with gradients of the Maximum Mean Discrepancy (MMD) between generated samples and a reference dataset. MMD provides reliable distributional estimates from limited data, exhibits low variance in practice, and is efficiently differentiable, which makes it particularly well-suited for the guidance task. Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels. Also, it can be applied with computational efficiency in latent diffusion models (LDMs), since guidance is applied in the latent space of the LDM. Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity.

Abstract:
We present a mechanism to steer the sampling diversity of denoising diffusion and flow matching models, allowing users to sample from a sharper or broader distribution than the training distribution. We build on the observation that these models leverage (learned) score functions of noisy data distributions for sampling and show that rescaling these allows one to effectively control a 'local' sampling temperature. Notably, this approach does not require any finetuning or alterations to training strategy, and can be applied to any offthe-shelf model and is compatible with both deterministic and stochastic samplers. We first validate our framework on toy 2D data, and then demonstrate its application for diffusion models trained across five disparate tasks -- image generation, pose estimation, depth prediction, robot manipulation, and protein design. We find that across these tasks, our approach allows sampling from sharper (or flatter) distributions, yielding performance gains e.g., depth prediction models benefit from sampling more likely depth estimates, whereas image generation models perform better when sampling a slightly flatter distribution.

Abstract:
Traditional value models V^\pi in LLM reinforcement learning face a coupling dilemma: they require synchronous training alongside the updating policy \pi, causing inefficiency and overfitting. In this paper, we propose V_0, a generalist value model that decouples value estimation from specific policy parameters by reframing the task as incontext learning to predict performance for unseen policies. We utilize the policy's historical query-performance pairs as a capability representation, transforming from V^\pi(s_0) to V(C_\pi, s_0), where C_\pi serves as an in-context input. This architecture enables us to scale the diversity of policies within the training set. Consequently, V_0 achieves scaling in learning to rapidly identify the capability boundaries of any policy without updating its parameters. Technically, we employ a Residual Query Adapter to compress the high-dimensional policy representation and the target query into structured features, which are then processed by a pre-trained TabPFN head. Empirical results show that V_0 outperforms coupled value models in tracking policy evolution during GRPO training, optimizes cold-start budget allocation, and approaches the performance-cost Pareto frontier in inference routing.

Abstract:
Growing renewable penetration introduces substantial uncertainty into power system operations, necessitating frequent adaptation of dispatch objectives and constraints and challenging expertiseintensive, near-real-time modeling workflows. Large Language Models (LLMs) provide a promising avenue for automating this process by translating natural-language (NL) operational requirements into executable optimization models via semantic reasoning and code synthesis. Yet existing LLM datasets and benchmarks for optimization modeling primarily target coarse-grained cross-domain generalization, offering limited, rigorous evaluation in powersystem settings, particularly for Optimal Power Flow (OPF). We therefore introduce ProOPFD and ProOPF-B, a dataset and benchmark for professional-grade OPF modeling: ProOPF-D contains 12K instances pairing NL requests with parameter adjustments and structural extensions to a canonical OPF, together with executable implementations; ProOPF-B provides 121 expertannotated test cases with ground-truth code, enabling end-to-end evaluation under both concrete and abstract OPF modeling regimes.

Abstract:
Existing forgery detection methods are often limited to unimodal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, we proposeOmniVL-Guard, a unified framework for omni vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose significant optimization challenges. Through extensive investigations, we identify a critical difficulty bias in this multi-task optimization: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding. To address this imbalance, we first develop a Self-Evolving CoT Generation pipeline to synthesize high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, we proposeAdaptiveRewardScalingPolicyOptimization (ARSPO). By dynamically modulating reward scales and task weights, ARSPO ensures a balanced joint optimization that prioritizes challenging grounding objectives. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios.

Abstract:
ML attacks on Learning with Errors (LWE) with binary or small secrets only succeed on LWE settings with very simple secrets. For example, they can recover secrets with up to three nonzero bits when models are trained on not-reduced LWE data, and three non-zero bits in the ''cruel region'' [9] when BKZ pre-processing is applied. We show that larger training sets and the use of repeated examples in the training data allow the recovery of denser secrets. We empirically observe a power-law relationship between model based attempts to recover the secrets, dataset size and repeated examples. We introduce a stepwise regression technique to recover the ``cool bits'' of the secret. Overall, these techniques allow for the recovery of denser binary secrets: up to Hamming weight 70 (and 8 cruel bits) for dimension 256 \log_2 q=20 and 75 (and 7 cruel bits) for dimension 512 \log_2 q=41 (vs 33 and 63 Hamming weight and 3 cruel bits in previous works). We also demonstrate our methods' effectiveness on denser ternary secrets, showing a substantial improvement over prior work.

Abstract:
In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak HeavyBall and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Our finite-time bounds reveal a sharp decomposition of tracking error into transient, noise-induced, and drift-induced components. This decomposition exposes a fundamental trade-off: while momentum is often used as a gradient-smoothing heuristic, under distribution shift it incurs an explicit drift-amplification penalty that diverges as the momentum parameter \beta approaches 1, yielding systematic tracking lag. We complement these upper bounds with minimax lower bounds under gradient-variation constraints, proving this momentum-induced tracking penalty is not an analytical artifact but an information-theoretic barrier: in drift-dominated regimes, momentum is unavoidably worse because stale-gradient averaging forces systematic lag. Our results provide theoretical grounding for the empirical instability of momentum in dynamic settings and precisely delineate regime boundaries where vanilla SGD provably outperforms its accelerated counterparts.

Abstract:
VisionLanguage Models (VLMs) have achieved remarkable success in tasks such as image captioning and visual question answering (VQA). However, as their applications become increasingly widespread, recent studies have revealed that VLMs are vulnerable to backdoor attacks. Existing backdoor attacks on VLMs primarily rely on data poisoning by adding visual triggers and modifying text labels, where the induced image–text mismatch makes poisoned samples easy to detect. To address this limitation, we propose the Clean-Label Backdoor Attack on VLMs via Diffusion Models (CBV), which leverages diffusion models to generate natural poisoned examples via score matching. Specifically, CBV modifies the score during the reverse generation process of the diffusion model to guide the generation of poisoned samples that contain triggered image features. To further enhance the effectiveness of the attack, we incorporate the textual information of the triggered images as multimodal guidance during generation. Moreover, to enhance stealthiness, we introduce a GradCAM-guided Mask (GM) that restricts modifications to only the most semantically important regions, rather than the entire image. We evaluate our method on MSCOCO and VQA v2 with four representative VLMs, achieving over 80% ASR while preserving normal functionality.

Abstract:
Pretrained vision-language models (VLMs) such as CLIP have demonstrated strong zero-shot capabilities across diverse domains, yet remain highly vulnerable to adversarial perturbations that disrupt image-text alignment and compromise reliability. Existing defenses typically rely on adversarial fine-tuning with labeled data, limiting their applicability in zero-shot settings. In this work, we identify two key weaknesses of current CLIP adversarial attacks—lack of semantic guidance and vulnerability to view variations—collectively termed semantic and viewpoint fragility. To address these challenges, we propose Self-Calibrated Consistency (SCC), an effective test-time defense. SCC consists of two complementary modules: Semantic consistency, which leverages soft pseudo-labels from counterattack warm-up and multi-view predictions to regularize cross-modal alignment and separate the target embedding from confusable negatives; and Spatial consistency, aligning perturbed visual predictions via augmented views to stabilize inference under adversarial perturbations. Together, these modules form a plug-and-play inference strategy. Extensive experiments on 22 benchmarks under diverse attack settings show that SCC consistently improves the zero-shot robustness of CLIP while maintaining accuracy, and can be seamlessly integrated with other VLMs for further gains. These findings highlight the great potential of establishing an adversarially robust paradigm from CLIP, with implications extending to broader VLMs such as BioMedCLIP.

Abstract:
Learning PDE dynamics for fluids increasingly relies on neural operators and Transformerbased models, yet these approaches often lack interpretability and struggle with localized, high-frequency structures while incurring quadratic cost in spatial samples. We propose to represent fields with a \emphGaussian basis, where learned atoms carry explicit geometry (centers, anisotropic scales, weights) and form a compact, mesh-agnostic, directly visualizable state. Building on this representation, we introduce a \emphGaussian Particle Operator that acts \emphin modal space: learned \emphGaussian modal windows perform a Petrov--Galerkin measurement, a \emphPG Gaussian Attention effects global cross-scale coupling. This basis-to-basis design is resolution-agnostic and achieves near-linear complexity in N for fixed modal budget, supporting irregular geometries and seamless 2D\to3D extension. On standard PDE benchmarks and real datasets, our method attains state-of-the-art–competitive accuracy while providing intrinsic interpretability.

Abstract:
Recent advances in multimodal language models (MLLMs) have made thinking with images a dominant paradigm for multimodal reasoning. However, existing methods still fail to ensure evidence–answer consistency, where correct answers must be supported by correct visual evidence. To address this issue, we propose DeFacto, a counterfactual reasoning framework that explicitly aligns visual evidence with final answers by jointly optimizing for task correctness and evidence–answer consistency. Our approach integrates three complementary training paradigms: (i) positive, (ii) counterfactual, and (iii) randommasking. We further develop a language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants, resulting in DeFacto-100K, a dataset of about 100k images. Building on this dataset, we train MLLMs with GRPO-based reinforcement learning and design three complementary rewards to promote correct answering, structured reasoning, and consistent evidence selection. Moreover, we introduce DeFacto-1.5K, a human-annotated benchmark for systematically evaluating evidence-grounded consistency beyond answer accuracy. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and evidence–answer consistency over strong baselines. The code and datasets will be released upon acceptance.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a prevailing paradigm for enhancing reasoning in Multimodal Large Language Models (MLLMs). However, relying solely on outcome supervision risks reward hacking, where models learn spurious reasoning patterns to satisfy final answer checks. While recent rubricbased approaches offer fine-grained supervision signals, they suffer from high computational costs of instance-level generation and inefficient training dynamics caused by treating all rubrics as equally learnable. In this paper, we proposeStratified Rubric-based Curriculum Learning (RuCL), a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model's competence. By dynamically adjusting rubric weights during training, RuCL guides the model from mastering foundational perception to tackling advanced logical reasoning. Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable+7.83%average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of60.06%.

Abstract:
We propose \emphScoreMatchingRiesz, a family of Riesz representer estimators based on score matching. The Riesz representer is a key nuisance component in debiased machine learning, enabling \sqrtnconsistent and asymptotically efficient estimation of causal and structural targets via Neyman-orthogonal scores. We formulate Riesz representer estimation as a score estimation problem. This perspective stabilizes representer estimation by allowing us to leverage denoising score matching and telescoping density ratio estimation. We also introduce the policy path, a parameter that captures how policy effects evolve under continuous treatments. We show that the policy path can be estimated via score matching by smoothly connecting average marginal effect (AME) and average policy effect (APE) estimation, which improves the interpretability of policy effects.

Abstract:
Entropy serves as a critical metric for measuring the diversity of outputs generated by large language models (LLMs), providing valuable insights into their exploration capabilities. While recent studies increasingly focus on monitoring and adjusting entropy to better balance exploration and exploitation in reinforcement finetuning (RFT), a principled understanding of entropy dynamics during this process is yet to be thoroughly investigated. In this paper, we establish a theoretical framework for analyzing the entropy dynamics during the RFT process, which begins with a discriminant expression that quantifies entropy change under a single logit update. This foundation enables the derivation of a first-order expression for entropy change, which can be further extended to the update formula of Group Relative Policy Optimization (GRPO). The corollaries and insights drawn from the theoretical analysis inspire the design of entropy control methods, and also offer a unified lens for interpreting various entropy-based methods in existing studies. We provide empirical evidence to support the main conclusions of our analysis and demonstrate the effectiveness of the derived entropy-discriminator clipping methods. This study yields novel insights into RFT training dynamics, providing theoretical support and practical strategies for optimizing the exploration-exploitation balance during LLM fine-tuning.

Abstract:
Large Language Models(LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either lowresolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.

Abstract:
Deep learning for human sensing on edge systems presents significant potential for smart applications. However, its training and development are hindered by the limited availability of sensor data and resource constraints of edge systems. While transferring pretrained models to different sensing applications is promising, existing methods often require extensive sensor data and computational resources, resulting in high costs and limited transferability. In this paper, we propose XTransfer, a first-of-its-kind method enabling modality-agnostic, few-shot model transfer with resource-efficient design. XTransfer flexibly uses pre-trained models and transfers knowledge across different modalities by (i) model repairing that safely mitigates modality shift by adapting pre-trained layers with only few sensor data, and (ii) layer recombining that efficiently searches and recombines layers of interest from source models in a layer-wise manner to restructure models. We benchmark various baselines across diverse human sensing datasets spanning different modalities. The results show that XTransfer achieves state-of-the-art performance while significantly reducing the costs of sensor data collection, model training, and edge deployment.

Abstract:
Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective posttraining paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we proposeDiscountedBeta-Bernoulli (DBB) reward estimation, which leverages historical reward statistics for the non-stationary distribution. Although biased, the resulting estimator exhibits reduced and stable variance, theoretically avoids estimated variance collapse, and achieves lower mean squared error than standard point estimation. Extensive experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate that GRPO with DBB consistently outperforms naive GRPO, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on the 1.7B and 8B models, respectively, without additional computational cost or memory usage.

Abstract:
PrefillDecode (PD) disaggregation has become the standard architecture for modern LLM inference engines, which alleviates the interference of two distinctive workloads. With the growing demand for multi-turn interactions in chatbots and agentic systems, we re-examined PD in this case and found two fundamental inefficiencies: (1) every turn requires prefilling the new prompt and response from the last turn, and (2) repeated KV transfers between prefill and decode nodes saturate the bandwidth, leading to high latency and even service degradation. Our key insight is that not all prefill operations are equally disruptive: append-prefill---processing only the new input tokens while reusing cached KV states---incurs substantially less decoding slowdown than full prefill. This motivates routing append-prefill to decode nodes locally. However, through comprehensive analysis, we show that no single fixed routing strategy satisfies all Service Level Objectives (SLOs) simultaneously. Based on this insight, we propose Prefill Prefill-capable Decode (PPD) disaggregation, a dynamic routing system that decides when to process Turn 2+ requests locally on decode nodes using cached KV states. PPD adapts to varying SLOs via configurable weights and seamlessly integrates with traditional PD deployments. With extensive evaluations, we show that PPD reduces Turn 2+ time-to-first-token (TTFT) by 68% while maintaining competitive time-per-output-token (TPOT), effectively alleviating KV transfer congestion under high load. We believe PPD represents a flexible and efficient paradigm for multi-turn LLM serving.

Abstract:
Learning natural body motion remains challenging due to the strong coupling between spatial geometry and temporal dynamics. Embedding motion in phase manifolds, latent spaces that capture local periodicity, has proven effective for motion prediction; however, existing approaches lack scalability and remain confined to specific settings. We introduce FunPhase, a functional periodic autoencoder that learns a phase manifold for motion and replaces discrete temporal decoding with a functionspace formulation, enabling smooth trajectories that can be sampled at arbitrary temporal resolutions. FunPhase unifies motion prediction and generation within a single interpretable phase manifold, enabling motion generation via latent diffusion, generalizes across skeletons and datasets, and supports downstream tasks such as motion super-resolution and partial-body completion. Our model achieves substantially lower reconstruction error than prior periodic autoencoder baselines, achieving uniform improvements of at least 45% across all metrics, while enabling a broader range of applications and performing on par with state-of-the-art motion generation methods.

Abstract:
Early and accurate detection of anomalies in timeseries data is critical due to the substantial risks associated with false or missed detections. While MLP-based mixer models have shown promise in time-series analysis, they do not maintain temporal causality during data processing. Moreover, real-world multivariate time series often contain numerous channels with diverse inter-channel correlations. Spurious correlations in the reconstructed time series lead to noisy representations, resulting in inaccurate anomaly detection. In addition, anomaly scoring methods that ignore temporal continuity can mislead sequential detection. To address these challenges, we propose a cluster-aware causal mixer for multivariate time-series anomaly detection. Channels are grouped into clusters based on their correlations, and each cluster is embedded through a dedicated embedding layer. A causal mixer is introduced to integrate information while maintaining temporal causality. We further develop a sequential anomaly-scoring method that accumulates evidence over time and refines anomaly boundaries. Our proposed model operates in an online fashion, making it suitable for real-time time-series anomaly detection. Experimental evaluations across six public benchmark datasets demonstrate that the proposed approach consistently achieves superior performance.

Abstract:
Analog inmemory computing (AIMC) performs computation directly within resistive crossbar arrays, offering an energy-efficient platform to scale large vision and language models. However, non-ideal analog device properties make the training on AIMC devices challenging. In particular, its update asymmetry can induce a systematic drift of weight updates towards a device-specific symmetric point (SP), which typically does not align with the optimum of the training objective. To mitigate this bias, most existing works assume the SP is known and pre-calibrate it to zero before training by setting the reference point as the SP. Nevertheless, calibrating AIMC devices requires costly pulse updates, and residual calibration error can directly degrade training accuracy. In this work, we present the first theoretical characterization of the pulse complexity of SP calibration and the resulting estimation error. We further propose a dynamic SP estimation method that tracks the SP during model training, and establishes its convergence guarantees. In addition, we develop an enhanced variant based on chopping and filtering techiques from digital signal processing. Numerical experiments demonstrate both the efficiency and effectiveness of the proposed method.

Abstract:
Significant progress has been made in the field of Instructionbased Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.

Abstract:
In this work, we present the first theoretical analysis of multiagent imitation learning (MAIL) in linear Markov games where both the transition dynamics and each agent's reward function are linear in some given features. We demonstrate that by leveraging this structure, it is possible to replace the state-action level \emphall policy deviation concentrability coefficient \citepfreihaut2025rate with a concentrability coefficient defined at the feature level which can be much smaller than the state-action analog when the features are informative about \emphstates' similarity. Furthermore, to circumvent the need for any concentrability coefficient, we turn to the interactive setting. We provide the first, computationally efficient, interactive MAIL algorithm for linear Markov games and show that its sample complexity depends only on the dimension of the feature map d. Building on these theoretical findings, we propose a deep MAIL interactive algorithm which clearly outperforms BC on games such as Tic-Tac-Toe and Connect4.

Abstract:
Spiking Neural Networks (SNNs) offer energyefficient, biologically plausible computation but suffer from non-differentiable spike generation, necessitating reliance on heuristic surrogate gradients. This paper introduces UltraLIF, a principled framework that replaces surrogate gradients with ultradiscretization, a mathematical formalism from tropical geometry providing continuous relaxations of discrete dynamics. The central insight is that the max-plus semiring underlying ultradiscretization naturally models neural threshold dynamics: the log-sum-exp function serves as a differentiable soft-maximum that converges to hard thresholding as a learnable temperature parameter \eps \to 0. Two neuron models are derived from distinct dynamical systems: UltraLIF from the LIF ordinary differential equation (temporal dynamics) and UltraDLIF from the diffusion equation modeling gap junction coupling across neuronal populations (spatial dynamics). Both yield fully differentiable SNNs trainable via standard backpropagation with no forward-backward mismatch. Theoretical analysis establishes pointwise convergence to classical LIF dynamics with quantitative error bounds and bounded non-vanishing gradients. Experiments on six benchmarks spanning static images, neuromorphic vision, and audio demonstrate improvements over surrogate gradient baselines, with gains most pronounced in single-timestep (T=1) settings on neuromorphic and temporal datasets. An optional sparsity penalty enables significant energy reduction while maintaining competitive accuracy.

Abstract:
Diffusion models have achieved tremendous success in image generation, but they also raise significant concerns regarding privacy and copyright issues. Membership Inference Attacks (MIAs) are designed to ascertain whether specific data were utilized during a model's training phase. As current MIAs for diffusion models typically exploit the model's image prediction ability, we formalize them into a unified general paradigm which computes the membership score for membership identification. Under this paradigm, we empirically find that existing attacks overlook the inherent deficiency in how diffusion models process highfrequency information. Consequently, this deficiency leads to member data with more high-frequency content being misclassified as hold-out data, and hold-out data with less high-frequency content tend to be misclassified as member data. Moreover, we theoretically demonstrate that this deficiency reduces the membership advantage of attacks, thereby interfering with the effective discrimination of member data and hold-out data. Based on this insight, we propose a plug-and-play high-frequency filter module to mitigate the adverse effects of the deficiency, which can be seamlessly integrated into any attacks within the general paradigm without additional time costs. Extensive experiments corroborate that this module significantly improves the performance of baseline attacks across different datasets and models.

Abstract:
Machine learning models are increasingly trained or finetuned on synthetic data. Recursively training on such data has been observed to significantly degrade performance in a wide range of tasks, often characterized by a progressive drift away from the target distribution. In this work, we theoretically analyze this phenomenon in the setting of score-based diffusion models. For a realistic pipeline where each training round uses a combination of synthetic data and fresh samples from the target distribution, we obtain upper and lower bounds on the accumulated divergence between the generated and target distributions. This allows us to characterize different regimes of drift, depending on the score estimation error and the proportion of fresh data used in each generation. We also provide empirical results on synthetic data and images to illustrate the theory.

Abstract:
While Mixtureof-Experts (MoE) architectures substantially bolster the expressive power of large-language models, their prohibitive memory footprint severely impedes the practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with provable performance guarantee. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation reveals that ZipMoE achieves up to 72.77% inference latency reduction and up to 6.76× higher throughput than the state-of-the-art systems. Our code is available at: https://anonymous.4open.science/r/s3fg2i1dn/.

Abstract:
While RetrievalAugmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context, proposing to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous study overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or including the correct answer), the retrieved text suppresses the visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.

Abstract:
LLM agents rely on prompts to implement taskspecific capabilities based on foundation LLMs, making agent prompts valuable intellectual property. However, in untrusted deployments, adversaries can copy and reuse these prompts with other proprietary LLMs, causing economic losses. To protect these prompts, we identify four key challenges: proactivity, runtime protection, usability, and non-portability that existing approaches fail to address. We present PragLocker, a prompt protection scheme that satisfies these requirements. PragLocker constructs function-preserving obfuscated prompts by anchoring semantics with code symbols and then using target-model feedback to inject noise, yielding prompts that only work on the target LLM. Experiments across multiple agent systems, datasets, and foundation LLMs show that PragLocker substantially reduces cross-LLM portability, maintains target performance, and remains robust against adaptive attackers.

Abstract:
Driven by recent advancements in toolaugmented Large Language Model (LLM) agents, comprehensive benchmark datasets for evaluating these tool-augmented agents are being actively developed. Although these benchmarks incorporate increasingly complex user requests and a diverse array of tools, the evaluation methods for most of them remain limited to answer matching. However, as the number of steps required to resolve a user request increases, a proper evaluation of an agent's performance must go beyond the final answer to also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucinations, and adaptivity. The most straightforward method for evaluating these aspects is to compare the trajectory of the agent with a ground-truth trajectory, but this approach is fundamentally limited since annotating all possible ground-truth trajectories is prohibitively expensive. To address these significant gaps, we introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. By incorporating evidence store, TRACE enables a multi-faceted analysis and evaluation of an agent's reasoning trajectory, eliminating the need for a predefined ground-truth trajectory. To validate our framework, we develop a new meta-evaluation dataset by augmenting existing benchmarks with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner, even with small open-source LLMs.

Abstract:
LoRA adapts large language models (LLMs) by restricting updates to lowrank subspaces of pre-trained weights. While this substantially reduces training cost, the effectiveness of adaptation critically depends on which subspace is chosen at initialization: a poor initialization that allocates capacity to task-irrelevant directions can severely hinder downstream performance. Existing initialization strategies primarily rely on the intrinsic properties of pre-trained weights, implicitly assuming that weight geometry alone reflects task relevance. However, such criteria overlook how the model interacts with the downstream data distribution. In this work, we formulate LoRA initialization as the problem of identifying directions in parameter space that are the most impactful under the target data distribution. We argue that data-aware sensitivity, rather than weight-only magnitude, should govern the choice of adaptation subspaces. Building on this perspective, we propose a Fisher-guided framework that leverages curvature information induced by downstream data to characterize how parameter perturbations influence model predictions. This perspective yields a principled, task-dependent criterion for selecting LoRA directions that better align adaptation with the target objective. Empirical results across diverse tasks and modalities demonstrate that data-aware initialization consistently and significantly improves downstream performance over existing approaches.

Abstract:
Rotary Position Embedding (RoPE) is the de facto positional encoding in large language models due to its ability to encode relative positions and support length extrapolation. When adapted to vision transformers, the standard axial formulation decomposes twodimensional spatial positions into horizontal and vertical components, implicitly restricting positional encoding to axis-aligned directions. We identify this directional constraint as a fundamental limitation of the standard axial 2D RoPE, which hinders the modeling of oblique spatial relationships that naturally exist in natural images. To overcome this limitation, we propose Spiral RoPE, a simple yet effective extension that enables multi-directional positional encoding by partitioning embedding channels into multiple groups associated with uniformly distributed directions. Each group is rotated according to the projection of the patch position onto its corresponding direction, allowing spatial relationships to be encoded beyond the horizontal and vertical axes. Across a wide range of vision tasks including classification, segmentation, and generation, Spiral RoPE consistently improves performance. Qualitative analyzes of attention maps further show that Spiral RoPE exhibits more concentrated activations on semantically relevant objects and better respects local object boundaries, highlighting the importance of multi-directional positional encoding in vision transformers.

Abstract:
Adapting language models (LMs) to new tasks via posttraining carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.

Abstract:
To define a steepest descent method over a neural network, we need to choose a norm for each layer, a way to aggregate these norms across layers, and whether to use normalization. We systematically explore different alternatives for aggregating norms across layers, both formalizing existing combinations of Adam and the recently proposed Muon as a type of nonEuclidean gradient descent, and deriving new variants of the Muon optimizer. Through a comprehensive experimental evaluation of the optimizers within our framework, we find that Muon is sensitive to the choice of learning rate, whereas a new variant we call MuonMax is significantly more robust. We then show how to combine any non-Euclidean gradient method with model based momentum (known as Momo). The new Momo variants of Muon are significantly more robust to hyperparameter tuning, and often achieve a better validation score. Thus for new tasks, where the optimal hyperparameters are not known, we advocate for using Momo in combination with MuonMax to save on costly hyperparameter tuning.

Abstract:
Large language models have achieved remarkable success on finalanswer mathematical problems, largely due to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities. We begin by analyzing multiple evaluation setups and show that focusing on a single benchmark can lead to brittle or misleading conclusions. To address this, we evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. We then scale two major generative verification methods (GenSelect and LLM-as-a-Judge) to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly affects the model's performance, but reinforcement learning can reduce this sensitivity. However, despite improving proof-level metrics, reinforcement learning does not enhance final-answer precision, indicating that current models often reward stylistic or procedural correctness rather than mathematical validity. Our results establish practical guidelines for designing and evaluating scalable proof-verification and selection systems.

Abstract:
Training deep learning models on singlecell datasets with hundreds of millions of cells requires loading data from disk, as these datasets exceed available memory. While random sampling provides the data diversity needed for effective training, it is prohibitively slow due to the random access pattern overhead, whereas sequential streaming achieves high throughput but introduces biases that degrade model performance. We present scDataset, a PyTorch data loader that enables efficient training from on-disk data with seamless integration across diverse storage formats. Our approach combines block sampling and batched fetching to achieve quasi-random sampling that balances I/O efficiency with minibatch diversity. On Tahoe-100M, a dataset of 100 million cells, scDataset achieves more than two orders of magnitude speedup compared to true random sampling while working directly with AnnData files. We provide theoretical bounds on minibatch diversity and empirically show that scDataset matches the performance of true random sampling across multiple classification tasks.

Abstract:
Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measurebased framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable infinite-prompt dynamics to analyze training at finite prompt length. Our results allow optimization analyses developed for linear attention to transfer directly to softmax attention when prompts are sufficiently long, showing that large-prompt softmax attention inherits the analytical structure of its linear counterpart. This, in turn, provides a principled and broadly applicable toolkit for studying the training dynamics and statistical behavior of softmax attention layers in large prompt regimes.

Abstract:
LongText Understanding (LTU) at million-token scale requires balancing reasoning fidelity with computational efficiency. Frontier long-context LLMs can process millions of token contexts end-to-end, but they suffer from high token consumption and attention dilution. In parallel, specialized LTU agents often sacrifice fidelity through task-agnostic abstractions like graph construction or indexing. We identify a key insight for LTU: query-relevant information is typically sparse relative to the full document, so effective reasoning should rely on a query-sufficient subset rather than the entire context. To address this, we propose SCOUT, a new paradigm for LTU thatshifts from passive processing to active information foraging. It treats the document as an explorable environment and answers from a compact, provenance-grounded epistemic state. Guided by state-level gap diagnosis, SCOUT adaptively alternates between coarse-to-fine exploration and anchored state updates that progressivelycontractits epistemic state toward query sufficiency. Experiments show that SCOUT matches state-of-the-art proprietary models while reducing token consumption by up to 8 times. Moreover, SCOUT remains stable as context length scales, substantially alleviating the practical cost--capability trade-off in long-context reasoning.Code.An anonymous repository link is provided in the Supplementary Material.

Abstract:
Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the BradleyTerry (BT) loss, which learns from pairwise data consisting of a pair of chosen and rejected responses. In this work, we analyze the per-sample gradient of BT-loss and shows spurious learning signals due to representation distance. In particular, BT gradient norm scales with two distinct components: (1) the difference in predicted rewards between chosen and rejected responses, which reflects the prediction error, and critically, (2) representation distance between the pair measured in the output space of the final layer. While the first term captures the intended training signal, the second term can significantly impact the update magnitude and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak updates, even when misranked, while pairs with large distance receive disproportionately strong updates. This leads to gradients from large-distance pairs to overshadow those from small-distance pairs, where fine-grained distinctions are especially important. To overcome this limitation, we propose NormBT, an adaptive pair-wise normalization scheme that rescales update to balance representation-driven effects and focuses learning signals on prediction error. NormBT is a lightweight, drop-in modification to BT loss with negligible overhead. Across various LLM backbones and datasets, NormBT improves reward model performance consistently, with notable gains of over 5% on the Reasoning category of RewardBench, which contains numerous fine-grained pairs.

Abstract:
Large VisionLanguage Models (LVLMs) typically process visual inputs as a prefix to the language decoder. As the model autoregressively generates text, this initial visual information inevitably undergoes ``dilution'', leading the model to over-rely on language priors and hallucinate objects. Existing interventions attempt to correct this by contrasting logits or iteratively refining outputs, but they incur prohibitive latency costs. We propose Residual-Update Directed DEcoding Regulation (RUDDER), a framework that counters visual dilution by creating a persistent visual anchor. We extract a robust evidence direction (CARD) directly from the model's prefill residual updates, and inject it into the decoding process. This injection is modulated by an adaptive gate, the Beta Gate, which acts as a trust mechanism and ensures the visual reminder is applied only when necessary. Experiments on LLaVA-1.5 (7B/13B), Idefics2, InstructBLIP, and Qwen2.5-VL demonstrate that RUDDER consistently mitigates hallucination (with greedy decoding, RUDDER reduces CHAIR_S by an average of 24.4% and CHAIR_i by 23.6% relative) and scales effectively across architectures, all while maintaining \>96.0% throughput. The code is available at https://anonymous.4open.science/r/RUDDER-Residual-Update-Directed-DEcoding-Regulation--D5FC.

Abstract:
Understanding climate dynamics requires going beyond correlations in observational data to uncover their underlying causal process. Latent drivers, such as atmospheric processes, play a critical role in temporal dynamics, while direct causal influences also exist among geographically proximate observed variables. Traditional Causal Representation Learning (CRL) typically focuses on latent factors but overlooks such observableto-observable causal relations, limiting its applicability to climate analysis. In this paper, we introduce a unified framework that jointly uncovers (i) causal relations among observed variables and (ii) latent driving forces together with their interactions. We establish conditions under which both the hidden dynamic processes and the causal structure among observed variables are simultaneously identifiable from time-series data. Remarkably, our guarantees hold even in the nonparametric setting, leveraging contextual information to recover latent variables and observable relations. Building on these insights, we propose CaDRe (Causal Discovery and Representation learning), a time-series generative model with structural constraints that integrates CRL and causal discovery. Experiments on synthetic datasets validate our theoretical results. On real-world climate datasets, CaDRe not only delivers competitive forecasting accuracy but also recovers visualized causal graphs aligned with domain expertise, thereby offering interpretable insights into climate systems.

Abstract:
The Earth's subsurface is a cornerstone of modern society, providing essential energy resources like hydrocarbons, geothermal, and minerals while serving as the primary reservoir for CO_2 sequestration. However, full physics numerical simulations of these systems are notoriously computationally expensive due to geological heterogeneity, high resolution requirements, and the tight coupling of physical processes with distinct propagation time scales. Here we propose the Adaptive Physics Transformer (APT), a geometry, mesh-, and physics-agnostic neural operator that explicitly addresses these challenges. APT fuses a graph-based encoder to extract high-resolution local heterogeneous features with a global attention mechanism to resolve long-range physical impacts. Our results demonstrate that APT outperforms state-of-the-art architectures in subsurface tasks across both regular and irregular grids with robust super-resolution capabilities. Notably, APT is the first architecture that directly learns from adaptive mesh refinement simulations. We also demonstrate APT's capability for cross-dataset learning, positioning it as a robust and scalable backbone for large-scale subsurface foundation model development.

Abstract:
Language models are pretrained on sequences that blend statistical regularities (structures making text fluent) with factual associations between specific tokens (corresponding to knowledge of facts). While recent work suggests that the variability of their interaction, such as paraphrases of factual associations, critically determines generalization ability, we lack a systematic analysis of these impacts. This paper introduces a flexible synthetic testbed that combines a statistical stream of generic tokens with an abstract factual stream of sourcetarget token pairs, enabling fine-grained control over their interaction. Specifically, the design enables the independent control of diversity nature by manipulating stream composition (contextual structure) and the level of diversity by varying which statistical streams each fact appears in. Through controlled experiments, we find that while higher contextual diversity delays in-distribution (ID) factual accuracy, its effect on out-of-distribution (OOD) generalization depends critically on contextual structure. In some cases, OOD performance follows the same trend as ID, but in others, diversity becomes essential for non-trivial factual learning. Even when low diversity prohibits factual recall, optimal diversity levels depend on training duration. Beyond factual recall failures, we identify structures where statistical generalization fails independently, and others where both capabilities degrade simultaneously. This demonstrates how the interplay between contextual design and diversity level impacts different aspects of generalization. Furthermore, through a series of controlled interventions on the model components, we trace the generalization failures to distinct optimization bottlenecks, highlighting the importance of the learned embedding and unembedding layers. Overall, our synthetic framework allows us to isolate effects that would be confounded in large-scale studies, thus offering a controlled testbed for future investigations.

Abstract:
Trainingfree diffusion guidance offers a flexible framework for leveraging off-the-shelf classifiers without additional training. Yet, current approaches hinge on posterior approximations via Tweedie’s formula, which often yield unreliable guidance, particularly in low-density regions. Stochastic optimal control (SOC), in contrast, enables principled posterior sampling but remains computationally prohibitive for efficient inference. In this work, we reconcile the strengths of these paradigms by introducing Stein Diffusion Guidance (SDG), a novel training-free framework grounded in a surrogate SOC objective. We establish a new theoretical bound on the SOC value function, revealing the necessity of correcting approximate posteriors to reflect true diffusion dynamics. Building on Stein variational inference, SDG computes the steepest descent direction that minimizes the Kullback-Leibler divergence between approximate and true posteriors. By integrating a principled Stein correction mechanism along with a novel running cost functional, SDG enables effective guidance in low-density regions. Our experiments on diverse image-guidance tasks and on challenging small-ligand sampling for protein docking suggest that SDG consistently outperforms standard training-free guidance methods and highlights its potential for broader posterior sampling problems beyond high-density regimes.

Abstract:
Redteaming Vision-Language Models is essential for identifying vulnerabilities where adversarial image-text inputs trigger toxic outputs. Existing approaches treat image generation as a black box, providing only terminal toxicity scores while remaining temporally opaque regarding when and how toxic semantics emerge during multi-step synthesis. We introduce STARE, a hierarchical reinforcement learning framework that treats the denoising trajectory as an exploitable attack surface. By synergizing a high-level prompt editor with low-level T2I fine-tuning via Group Relative Policy Optimization (GRPO), STARE achieves a 68% improvement in Attack Success Rate over state-of-the-art baselines including black box and white-box variants. More importantly, we reveal the Optimization-Induced Phase Alignment phenomenon: while vanilla models exhibit diffuse toxicity, adversarial optimization systematically concentrates conceptual harms into early semantic phases and detail-oriented harms into late refinement. This discovery transforms toxicity formation from a chaotic process into a series of predictable vulnerability windows. This temporal alignment transforms red-teaming from a trial-and-error process into a targeted structural analysis. Our work provides both a potent attack engine and a diagnostic foundation for developing next-generation, phase-aware safety mechanisms. Content warning: This paper contains examples of toxic content that may be offensive or disturbing.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for posttraining large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean for each prompt. Statistically, this centering acts as a control variate (or baseline), reducing the variance of the policy-gradient estimator. Typically, the mean reward is estimated using per-prompt empirical averages for each prompt in a batch. Drawing inspiration from Stein’s paradox, we propose using \emphshrinkage estimators that combine \emphper-prompt and \emphacross-prompt means to improve the overall per-prompt mean estimation accuracy---particularly in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our proposed baseline serves as a drop-in replacement for existing per-prompt mean baselines, requiring no additional hyper-parameters or computation. Empirically, shrinkage baselines consistently outperform standard empirical-mean baselines, leading to lower-variance gradient updates and improved training stability.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the longhorizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RL for transformers on compositional reasoning tasks. Our theory characterizes how the effectiveness of RLVR is governed by the smoothness of the difficulty spectrum. When data contains abrupt discontinuities in difficulty, learning undergoes grokking-type phase transitions, giving rise to prolonged plateaus before progress recurs. In contrast, a smooth difficulty spectrum leads to arelay effect: stable gradient signals on easier problems elevate the model's capabilities to the point where harder ones become tractable, resulting in steady and continuous learning. Our theory explains how RLVR can improve performance at the edge of competence, and suggests that appropriately designed data mixtures can yield scalable gains. As a technical contribution, our analysis develops and adapts tools from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.

Abstract:
Targeted data selection has emerged as a crucial paradigm for efficient instruction tuning, aiming to identify a small yet influential subset of training examples for a specific target task. In practice, influence is often measured through the effect of an example on parameter updates. To make selection scalable, many approaches leverage optimizer statistics (e.g., Adam states) as an axisaligned surrogate for update geometry (i.e., diagonal precondition), implicitly treating parameters as coordinate-wise independent. We show that this assumption breaks down in parameter-efficient fine-tuning (PEFT) methods such as LoRA. In this setting, the induced optimization geometry exhibits strong cross-parameter coupling with non-trivial off-diagonal interactions, while the task-relevant update directions are confined to a low-dimensional subspace. Motivated by this mismatch, we proposeGIST(Gradient Isometric Subspace Transformation), a simple yet principled alternative that replaces axis-aligned scaling with robust subspace alignment.GISTrecovers a task-specific subspace from validation gradients via spectral filtering (SVD), projects training gradients into this coupled subspace, and scores examples by their alignment with target directions. Extensive experiments have demonstrated thatGISTmatches or outperforms the state-of-the-art baseline with only 0.29% of the storage and 25% of the computational time under the same selection budget. Our code is available at https://anonymous.4open.science/r/GIST-1464.

Abstract:
Multitask learning and related frameworks have achieved tremendous success in modern applications. In multitask learning problem, we are given a set of heterogeneous datasets collected from related source tasks and hope to enhance the performance above what we could hope to achieve by solving each of them individually. The recent work of Hanneke & Kpotufe (2022) has showed that, without access to distributional information, no algorithm based on aggregating samples alone can guarantee optimal risk as long as the sample size per task is bounded. In this paper, we focus on understanding the statistical limits of multitask learning. We go beyond the nofree-lunch theorem in Hanneke & Kpotufe (2022) by establishing a stronger impossibility result of adaptation that holds for arbitrarily large sample size per task. This improvement conveys an important message that the hardness of multitask learning cannot be overcame by having abundant data per task. We also discuss the notion of optimal adaptivity that may be of future interests.

Abstract:
Reward learning typically relies on a single feedback type or combines multiple feedback types using manually weighted loss terms. Currently, it remains unclear how to jointly learn reward functions from heterogeneous feedback types such as demonstrations, comparisons, ratings, rankings, and stops that provide qualitatively different signals. We address this challenge by formulating reward learning from multiple feedback types as Bayesian inference over a shared latent reward function, where each feedback type contributes information through an explicit likelihood. We introduce a scalable amortized variational inference approach that learns a shared reward encoder and feedbackspecific likelihood decoders and is trained by optimizing a single evidence lower bound. Our approach avoids reducing feedback to a common intermediate representation and eliminates the need for manual loss balancing. Across discrete and continuous-control benchmarks, we show that jointly inferred reward posteriors outperform single-type baselines, exploit complementary information across feedback types, and yield policies that are more robust to environment perturbations. The inferred reward uncertainty further provides interpretable signals for analyzing model confidence and consistency across feedback types.

Abstract:
Openworld object detection (OWOD) requires incrementally detecting known categories while reliably identifying unknown objects. Existing methods primarily focus on improving unknown recall, yet overlook interpretability, often leading to known–unknown confusion and reduced prediction reliability. This paper aims to make the entire OWOD framework interpretable, enabling the detector to truly “knowing the unknown.” To this end, we propose a concept-driven InterPretable OWOD framework(IPOW) by introducing a Concept Decomposition Model (CDM) for OWOD, which explicitly decomposes the coupled RoI features in Faster R-CNN into discriminative, shared, and background concepts. Discriminative concepts identify the most discriminative features to enlarge the distances between known categories, while shared and background concepts, due to their strong generalization ability, can be readily transferred to detect unknown categories. Leveraging the interpretable framework, we identify that known–unknown confusion arises when unknown objects fall into the discriminative space of known classes. To address this, we propose Concept-Guided Rectification (CGR) to further resolve such confusion. Extensive experiments show that IPOW significantly improves unknown recall while mitigating confusion, and provides concept-level interpretability for both known and unknown predictions.

Abstract:
Large language models (LLMs) are strong passive responders, but learning to proactively elicit information—asking the right questions and stopping at the right time—remains difficult. Existing approaches, such as optimizing turnlevel attributes or relying on user simulators to generate training trajectories, often struggle with a persistent reality gap. We propose \textttLearn-to-Ask, a simulator-free framework that learns proactive questioning policies directly from offline expert conversations. Our key insight is to leverage the observed future of each expert trajectory to derive dense, turn-level rewards that reflect expert long-horizon strategy, reducing policy learning to a sequence of supervised learning tasks that jointly enable LLMs to know what to ask and when to stop. To ensure the LLM-generated contents, such as reward fidelity and sampling quality, align with expectations, we further introduce an automated pipeline that calibrates the prompts with minimal human supervision. Across multiple datasets and model scales, \textttLearn-to-Ask consistently improves proactive information-seeking behavior. We also report a large-scale real-world deployment where the trained agent surpasses an internal expert baseline under professional audit, which demonstrates the effectiveness of our framework and our rewards as a reality-validated proxy metric for LLM proactivity.

Abstract:
Innovation is a key driving force of human civilization. As the body of knowledge has grown considerably, bridging knowledge across different disciplines, where significant innovation often emerges, has become increasingly challenging. The recent advancements in machine learning models, particularly Large Language Models (LLMs), have provided effective access to extensive knowledge sources and shown impressive abilities in reasoning, rendering significant opportunities for interdisciplinary discovery. Our research aims to understand the capabilities of stateof-the-art LLMs in integrating knowledge from different fields for interdisciplinary research (IDR). To address this fundamental problem, we introduce IDRBench, a pioneering framework that includes both datasets and evaluation tasks: (1) IDR Paper Identification, (2) IDR Idea Integration, and (3) IDR Idea Recommendation.

Abstract:
Synthetic Nearest Neighbors (SNN) provides a principled solution to causal matrix completion under missingnot-at-random (MNAR) by exploiting local low-rank structure through fully observed anchor submatrices. However, its effectiveness critically relies on sufficient data availability within each treatment level, a condition that often fails in settings with multiple or complex treatments. In this work, we propose Mixed Synthetic Nearest Neighbors (MSNN), a new entry-wise causal identification estimator that integrates information across treatment levels. We show that MSNN retains the finite-sample error bounds and asymptotic normality guarantees of SNN, while enlarging the effective sample size available for estimation. Empirical results on synthetic and real-world datasets illustrate the efficacy of the proposed approach, especially under data-scarce treatment levels.

Abstract:
Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and nonconservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We presentUnbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that effectively integrates both stochastic and unbalanced effects which also models the discrete, jump-like birth–death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.

Abstract:
Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning (inferring shared properties between entities based on known similarities) and analyze its emergence in transformers. We theoretically prove three key results: (1) Joint training on similarity and attribution premises enables analogical reasoning through aligned representations; (2) Sequential training succeeds only when similarity structure is learned before specific attributes, revealing a necessary curriculum; (3) Twohop reasoning (a \to b, b \to c \implies a \to c) reduces to analogical reasoning with identity bridges (b = b), which must appear explicitly in training data. These results reveal a unified mechanism: transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments with architectures up to 1.5B parameters validate our theory and demonstrate how representational geometry shapes inductive reasoning capabilities.

Abstract:
Quantum hardware suffers from intrinsic device heterogeneity and environmental drift, forcing practitioners to choose between suboptimal nonadaptive controllers or costly per-device recalibration. We derive a scaling law lower bound for meta-learning showing that the adaptation gain (expected fidelity improvement from task-specific gradient steps) saturates exponentially with gradient steps and scales linearly with task variance, providing a quantitative criterion for when adaptation justifies its overhead. Validation on quantum gate calibration shows negligible benefits for low-variance tasks but >40% fidelity gains on two-qubit gates under extreme out-of-distribution conditions (10× the training noise), with implications for reducing per-device calibration time on cloud quantum processors. Further validation on classical linear-quadratic control confirms these laws emerge from general optimization geometry rather than quantum-specific physics. Together, these results offer a transferable framework for decision-making in adaptive control.

Abstract:
Operations Research practitioners routinely debug infeasible models through an iterative process: analyzing Irreducible Infeasible Subsystems (\IIS), identifying constraint conflicts, and systematically repairing formulations until feasibility is achieved. Yet existing LLM benchmarks evaluate OR as oneshot translation---given a problem description, generate solver code---ignoring this diagnostic loop entirely. We introduce two benchmarks that place the solver in the evaluation loop. \ORDebug evaluates iterative self-correction through 5,000+ problems spanning 9 error types; each repair action triggers solver re-execution and \IIS recomputation, providing deterministic, verifiable feedback. \ORBias evaluates behavioral rationality through 2,000 newsvendor instances (1,000 ID + 1,000 OOD), measuring systematic deviations from closed-form optimal policies. Across 26 models and 12,000+ samples, we find that domain-specific RLVR training enables an 8B model to surpass frontier APIs: 95.3% vs 86.2% recovery rate (+9.1%), 62.4% vs 47.8% diagnostic accuracy (+14.6%), and 2.25 vs 3.78 steps to resolution (1.7× faster). On \ORBias, curriculum training achieves the only negative ID\rightarrowOOD bias drift among models evaluated (-9.6%), reducing systematic bias by 48% (from 20.0% to 10.4%). These results demonstrate that process-level evaluation with verifiable oracles enables targeted training that outperforms scale.

Abstract:
The twosample testing problem, a fundamental task in statistics and machine learning, seeks to determine whether two sets of samples, drawn from underlying distributions p and q, are in fact identically distributed (i.e.~whether p=q). A popular and intuitive approach is the classifier two-sample test (C2ST), where a classifier is trained to distinguish between samples from p and q. Yet despite simplicity of the C2ST, its reliability hinges on access to a near-Bayes-optimal classifier, a requirement that is rarely met and difficult to verify. This raises a major open question: can a weak classifier still be useful for two-sample testing? We show that the answer is a definitive yes. Building on the work of Hu & Lei (2024), we analyze two conformal variants of the C2ST that convert the scores from any trained classifier---even if weak, biased, or overfit---into exact, finite-sample p-values. We establish two key theoretical properties of the conformal C2ST: (i) finite-sample Type-I error control, and (ii) non-trivial power that degrades gently in tandem with the error of the trained classifier. The upshot is that even poorly performing classifiers can yield powerful and reliable two-sample tests. This general framework finds a powerful application in Bayesian inference, particularly for validating Neural Posterior Estimation (NPE) models, where the task of comparing a learned posterior approximation q(\theta \mid y) to the true posterior p(\theta \mid y) can be framed as a two-sample test. Empirically, the Conformal C2ST outperforms classical discriminative tests across a wide range of benchmarks for this task. Our results establish the conformal C2ST as a practical, theoretically grounded diagnostic tool.

Abstract:
Decentralized learning often involves a weighted global loss with heterogeneous node weights \lambda. We revisit two natural strategies for incorporating these weights: (i) embedding them into the local losses to retain a uniform weight (and thus a doubly stochastic matrix), and (ii) keeping the original losses while employing a \lambdainduced row-stochastic matrix. Although prior work shows that both strategies yield the same expected descent direction for the global loss, it remains unclear whether the Euclidean-space guarantees are tight and what fundamentally differentiates their behaviors. To clarify this, we develop a weighted Hilbert-space framework L^2(\lambda;\mathbbR^d) and obtain convergence rates that are strictly tighter than those from Euclidean analysis. In this geometry, the row-stochastic matrix becomes self-adjoint whereas the doubly stochastic one does not, creating additional penalty terms that amplify consensus error, thereby slowing convergence. Consequently, the difference in convergence arises not only from spectral gaps but also from these penalty terms. We then derive sufficient conditions under which the row-stochastic design converges faster even with a smaller spectral gap. Finally, by using a Rayleigh-quotient and Loewner-order eigenvalue comparison, we further obtain topology conditions that guarantee this advantage and yield practical topology-design guidelines.

Abstract:
Reinforcement Learning (RL) is pivotal for enhancing Large Language Model (LLM) reasoning, yet mainstream algorithms such as GRPO and DAPO remain constrained by a coarsegrained credit assignment paradigm, where all tokens within the same response receive the identical reward. In this paper, we propose Dynamic Entropy Weighting, systematically define entropy-based weight ratios \fracH_i,t\sum_k=1^n H_k,t and similar variants to redistribute rewards and get fine-grained rewards through two new algorithms: Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token and synthesizes token-specific advantage function to drive the model toward optimal path, and the analogous algorithm Sequence-Level GRPO (GRPO-S), which admits a completely similar design at the sequence level. Unlike methods using entropy as mere regularization, GTPO and GRPO-S establish a new state-of-the-art on AIME and MATH 500, outperforming prior entropy-guided baselines and validating our weighting mechanism.

Abstract:
Inferencetime scaling via chain-of-thought (CoT) reasoning is a major driver of state-of-the-art LLM performance, but it comes with substantial latency and compute costs. We address a fundamental theoretical question: how many reasoning tokens are required to solve a problem as input size grows? By extending the bounded attention prefix oracle (BAPO) model--an abstraction of LLMs that quantifies the information flow required to solve a task--we prove lower bounds on the CoT tokens required for three canonical BAPO-hard tasks: binary majority, triplet matching, and graph reachability. We show that each requires \Omega(n) reasoning tokens when the input size is n. We complement these results with matching or near-matching upper bounds via explicit constructions. Finally, our experiments with frontier reasoning models show approximately linear reasoning token scaling on these tasks and failures when constrained to smaller reasoning budgets, consistent with our theoretical lower bounds. Together, our results identify fundamental bottlenecks in inference-time compute through CoT and offer a principled tool for analyzing optimal reasoning length.

Abstract:
SWEbench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supervised Fine-Tuning (SFT), there remains a critical deficit in metrics capable of guiding mid-training effectively. Standard metrics such as Perplexity (PPL) are compromised by the "Long-Context Tax" and exhibit weak correlation with downstream SWE performance. In this paper, we bridge this gap by first introducing a rigorous data filtering strategy. Crucially, we propose the Entropy Compression Hypothesis, redefining intelligence not by scalar Top-1 compression, but by the capacity to structure uncertainty into Entropy-Compressed States of low orders ("reasonable hesitation"). Grounded in this fine-grained entropy analysis, we formulate a novel metric, HE-SNR (High-Entropy Signal-to-Noise Ratio). Validated on industrial-scale Mixture-of-Experts (MoE) models across varying context windows (32K/128K), our approach demonstrates superior robustness and predictive power. This work provides both the theoretical foundation and practical tools for optimizing the latent potential of LLMs in complex engineering domains.

Abstract:
Efficiently enhancing the reasoning capabilities of VisionLanguage Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that reasoning capabilities are encoded in distinct subspaces, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with a dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities without compromising the model's original visual capabilities by consistently achieving state-of-the-art performance across diverse visual reasoning benchmarks.

Abstract:
A widely used technique for improving policies is success conditioning, in which one collects trajectories, identifies those that achieve a desired outcome, and updates the policy to imitate the actions taken along successful trajectories. This principle appears under many namesrejection sampling with SFT, goal-conditioned RL, Decision Transformers---yet what optimization problem it solves, if any, has remained unclear. We prove that success conditioning exactly solves a trust-region optimization problem, maximizing policy improvement subject to a \chi^2 divergence constraint whose radius is determined automatically by the data. This yields an identity: relative policy improvement, the magnitude of policy change, and a quantity we call action-influence---measuring how random variation in action choices affects success rates---are exactly equal at every state. Success conditioning thus emerges as a conservative improvement operator. Exact success conditioning cannot degrade performance or induce dangerous distribution shift, but when it fails, it does so observably, by hardly changing the policy at all. We apply our theory to the common practice of return thresholding, showing this can amplify improvement, but at the cost of potential misalignment with the true objective.

Abstract:
Largescale web-crawled datasets contain noise, bias, and irrelevant information, necessitating data selection techniques. Existing methods depend on hand-crafted heuristics, downstream datasets, or require expensive influence-based computations---all of which limit scalability and introduce unwanted data dependencies. To address this, we introduce the Mimic Score, a simple and geometry-based data-quality metric that evaluates utility by measuring alignment between a sample’s gradients and a target direction induced by a pre-trained reference model. This leverages readily available model weights, avoids needing validation datasets, and incurs minimal computational overheads. Building on this metric, we propose Grad-Mimic, a two-stage framework that re-weights samples online to accelerate training and aggregates sample utilities offline to construct effective data filters. Empirically, we show that using mimic scores to guide training improves data efficiency, accelerates convergence, yields consistent performance gains across six image datasets, and enhances CLIP models with 20.7% fewer training steps. Additionally, mimic score-based filters augment existing filtering techniques, enabling improved CLIP models trained with 4.7 million fewer samples.

Abstract:
Finetuning large models on edge devices is severely hindered by the memory-intensive backpropagation (BP) in standard frameworks like federated learning and split learning. While substituting BP with zeroth-order optimization can significantly reduce memory footprints, it typically suffers from prohibitively degraded convergence speed. To resolve this dilemma, we proposeHybrid-Order Split Federated Learning(HO-SFL). By reformulating the split learning process within a Lagrangian framework, HO-SFL decouples the optimization landscape: The server performs precise first-order updates (i.e., BP), whereas clients conduct memory-efficient zeroth-order optimization. This hybrid design not only eliminates the need for client-side BP but also enables dimension-free model aggregation, drastically lowering communication costs. Crucially, we provide a theoretical convergence analysis, demonstrating that HO-SFL mitigates the dimension-dependent convergence slowdown of zeroth-order optimization, achieving a convergence rate comparable to first-order methods. Extensive experiments on tasks across vision and language modalities validate that HO-SFL achieves convergence speeds comparable to first-order baselines while significantly reducing communication costs and client memory footprints.

Abstract:
Handling missing node features is a key challenge for deploying Graph Neural Networks (GNNs) in realworld domains such as healthcare and sensor networks. Existing studies mostly address relatively benign scenarios, namely benchmark datasets with (a) high-dimensional but sparse node features and (b) incomplete data generated under Missing Completely At Random (MCAR) mechanisms. For (a), we theoretically prove that high sparsity substantially limits the information loss caused by missingness, making all models appear robust and preventing a meaningful comparison of their performance. To overcome this limitation, we introduce one synthetic and three real-world datasets with dense, semantically meaningful features. For (b), we move beyond MCAR and design evaluation protocols with more realistic missingness mechanisms. Moreover, we provide a theoretical background to state explicit assumptions on the missingness process and analyze their implications for different methods. Building on this analysis, we propose GNNmim, a simple yet effective baseline for node classification with incomplete feature data. Experiments show that GNNmim is competitive with respect to specialized architectures across diverse datasets and missingness regimes.

Abstract:
Large Multimodal Models encode extensive factual knowledge in their pretrained weights. However, its knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address these challenges, we propose KORE, a synergistic method centered around KnOwledge-oRientEd controls. These controls are implemented through a two-stage optimization process: (1) KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. (2) KORE stores previous knowledge in the covariance matrix of LMM's linear layer activations and initializes the adapter by projecting the original weights into the matrix's null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5 (7B), LLaVA-v1.5 (13B), and Qwen2.5-VL (7B), show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.

Abstract:
Time series reasoning demands both the perception of complex dynamics and logical depth. However, existing LLMbased approaches exhibit two limitations: they often treat time series merely as text or images, failing to capture the patterns like trends and seasonalities needed to answer specific questions; and when trained on a mix of simple and complex tasks, simpler objectives often dominate the learning process, hindering the development of deep reasoning capabilities. To address these limitations, we propose the Pattern-Aware Alignment and Balanced Reasoning model (PATRA), introducing a pattern-aware mechanism that extracts trend and seasonality patterns from time series to achieve deep alignment. Furthermore, we design a task-aware balanced reward to harmonize learning across tasks of varying difficulty, incentivizing the generation of coherent Chains of Thought. Extensive experiments show that PATRA outperforms strong baselines across diverse Time Series Question Answering (TSQA) tasks, demonstrating superior cross-modal understanding and reasoning capability.

Abstract:
Competitive programming problems are increasingly used to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as lack of exceptionally challenging problems, insufficient test case coverage, reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a largescale competitive programming benchmark featuring 403 expert-curated problems, averaging 60 official test cases each, drawn from 72 contests across 14 Informatics Olympiads held between 2023 and 2025. LiveOIBench has four key features: (1) expert-designed tasks with detailed subtask rubrics and extensive test cases; (2) direct comparison to elite human contestants; (3) continuous updates to reduce contamination risk; and (4) a fully offline, reproducible evaluation system. Benchmarking 34 popular general-purpose and reasoning LLMs, we find that GPT-5 achieves an 81.76th percentile, still falling short of top human contestants, while among the open-weight models, GPT-OSS-120B reaches only the 60th percentile. Reasoning-trace analyses indicate that robust reasoning models prioritize precise problem analysis over excessive exploration. Finally, analyses across released time, task familiarity, and code similarity find minimal evidence of data contamination in our benchmark. Our code and data are available at: https://liveoibenchanon.github.io/.

Abstract:
Speculative decoding mitigates the memorybound nature of LLM decoding by using a lightweight draft model to propose multiple tokens for parallel verification. However, its adoption has been limited by the lack of high-quality draft models and scalable training infrastructure. We introduce SpecForge, an open-source and efficient framework for training speculative decoding models with full support for EAGLE-3. SpecForge incorporates target–draft decoupling, hybrid parallelism, optimized training kernels, and tight integration with production-grade inference engines, enabling up to 9.9x faster EAGLE-3 training for Qwen3-235B-A22B compared to the baseline. We further release SpecBundle, a suite of production-grade EAGLE-3 draft models trained with SpecForge for mainstream open-source LLMs, achieving up to 4.48x end-to-end inference speedup on SGLang and addressing the scarcity of high-quality drafts. Finally, we distill a systematic study of speculative decoding training into practical and actionable recipes to guide real-world adoption.

Abstract:
The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budgetaware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperform other baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verify InteractCS-RL robustness across diverse domains.

Abstract:
Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a bruteforce lookup of co-occurring entities. We identify a dramatically different form of storage of atomic facts that we term as geometric memory. Here, the model has synthesized embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an \ell-fold composition into an easy-to-learn 1-step navigation task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, as against a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that---in contrast to prevailing theories---indeed arises naturally despite the lack of various pressures. This analysis also points out to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.

Abstract:
Existing offline feedforward methods for joint scene understanding and reconstruction on long image streams often repeatedly perform global computation over an ever-growing set of past observations, causing runtime and GPU memory to increase rapidly with sequence length and limiting scalability. We propose Streaming Semantic Gaussian Splatting (S2GS), a strictly causal, incremental 3D Gaussian semantic field framework: it does not leverage future frames and continuously updates scene geometry, appearance, and instance-level semantics without reprocessing historical frames, enabling scalable online joint reconstruction and understanding. S2GS adopts a geometry–semantic decoupled dual-backbone design: the geometry branch performs causal modeling to drive incremental Gaussian updates, while the semantic branch leverages a 2D foundation vision model and a query-driven decoder to predict segmentation masks and identity embeddings, further stabilized by query-level contrastive alignment and lightweight online association with an instance memory. Experiments show that S2GS matches or outperforms strong offline baselines on joint reconstruction-and-understanding benchmarks, while significantly improving long-horizon scalability: it processes 1,000+ frames with much slower growth in runtime and GPU memory, whereas offline global-processing baselines typically run out of memory at around 80 frames under the same setting.

Abstract:
Semantic caching has emerged as a pivotal technique for scaling LLM applications, widely adopted by major providers including AWS and Microsoft. By utilizing semantic embedding vectors as cache keys, this mechanism effectively minimizes latency and redundant computation for semantically similar queries. In this work, we conceptualize semantic cache keys as a form of fuzzy hashes. We demonstrate that the locality required to maximize cache hit rates fundamentally conflicts with the cryptographic avalanche effect necessary for collision resistance. Our conceptual analysis formalizes this inherent tradeoff between performance (locality) and security (collision resilience), revealing that semantic caching is naturally vulnerable to key collision attacks. While prior research has focused on side-channel and privacy risks, we present the first systematic study of integrity risks arising from cache collisions. We introduce CacheAttack, an automated framework for launching black-box collision attacks. We evaluate CacheAttack in security-critical tasks and agentic workflows. It achieves a hit rate of 86% in LLM response hijacking and can induce malicious behaviors in LLM agent, while preserving strong transferability across different embedding models. A case study on a financial agent further illustrates the real-world impact of these vulnerabilities. Finally, we discuss mitigation strategies, highlighting a persistent trade-off between cache efficiency and robustness.

Abstract:
Jailbreak attacks on audio language models (ALMs) optimize audio perturbations to elicit unsafe generations, and they typically update the entire waveform densely throughout optimization. In this work, we investigate the necessity of such dense optimization by analyzing the structure of tokenaligned gradients in ALMs. We find that gradient energy is highly non-uniform across audio tokens, indicating that only a small subset of token-aligned audio regions dominates the optimization signal. Motivated by this observation, we propose Token-Aware Gradient Optimization (TAGO), which enables sparse jailbreak optimization by retaining only waveform gradients aligned with audio tokens that have high gradient energy, while masking the remaining gradients at each iteration. Across three ALMs, TAGO outperforms baselines, and substantial sparsification preserves strong attack success rates (e.g., on Qwen3-Omni, \mathrmASR_l remains at 86% with a token retention ratio of 0.25, compared to 87% with full token retention). These results demonstrate that dense waveform updates are largely redundant, and we advocate that future audio jailbreak and safety alignment research should further leverage this heterogeneous token-level gradient structure.

Abstract:
This paper develops a finitesample statistical theory for in-context learning (ICL), analyzed within a meta-learning framework that accommodates mixtures of diverse task types. We leverage a Bayes risk identity that separates the total ICL risk into two orthogonal components: Bayes Gap and Posterior Variance. The Bayes Gap quantifies how well the trained model approximates the Bayes-optimal in-context predictor. For a uniform-attention Transformer, we derive a non-asymptotic upper bound on this gap, which explicitly clarifies the dependence on the number of pretraining prompts and their context length. The Posterior Variance is a model-independent risk representing the intrinsic task uncertainty. Our key finding is that this term is determined solely by the difficulty of the true underlying task, while the uncertainty arising from the task mixture vanishes exponentially fast with only a few in-context examples. Together, these results provide a unified view of ICL: the uniform-attention Transformer selects the optimal meta-algorithm during pretraining and rapidly converges to the optimal algorithm for the true task at test time.

Abstract:
Frontier language models are deployed as blackbox services, where model weights cannot be modified and customization is limited to prompting. We introduce Advisor Models, a method to train small open-weight models to generate dynamic, per-instance natural language advice that improves the capabilities of black-box frontier models. Advisor Models improve GPT-5's performance on RuleArena (Taxes) by 71%, reduce Gemini 3 Pro's steps taken in SWE agent tasks by 24.6%, and outperform static prompt optimizers in personalizing GPT-5 to user preferences (85-100% vs. 40-60%). We also find that advisors are transferable: an advisor trained with a low-cost student model still transfers improvements to a frontier model. Moreover, Advisor Models are robust: we observe no degradation on other benchmarks than the pipeline is trained on. Our method shows how to perform parametric optimization for black-box frontier models in a practical and cost-effective way.

Abstract:
Feature learning strength (FLS), i.e., the inverse of the effective output scaling of a model, plays a critical role in shaping the optimization dynamics of neural nets. While its impact has been extensively studied under the asymptotic regimesboth in training time and FLS---existing theory offers limited insight into how FLS affects generalization in practical settings, such as when training is stopped upon reaching a target training risk. In this work, we investigate the impact of FLS on generalization in deep networks under such practical conditions. Through empirical studies, we first uncover the emergence of anoptimal FLS---neither too small nor too large---that yields substantial generalization gains. This finding runs counter to the prevailing intuition that stronger feature learning universally improves generalization. To explain this phenomenon, we develop a theoretical analysis of gradient flow dynamics in two-layer ReLU nets trained with logistic loss, where FLS is controlled via initialization scale. Our main theoretical result establishes the existence of an optimal FLS arising from a trade-off between two competing effects: An excessively large FLS induces anover-alignmentphenomenon that degrades generalization, while an overly small FLS leads toover-fitting.

Abstract:
Predictive coding (PC) is a biologically plausible alternative to standard backpropagation (BP) that minimises an energy function with respect to network activities before updating weights. Recent work has improved the training stability of deep PC networks (PCNs) by leveraging some BPinspired reparameterisations. However, the full scalability and theoretical basis of these approaches remains unclear. To address this, we study the infinite width and depth limits of PCNs. For linear residual networks, we show that the set of width- and depth-stable feature-learning parameterisations for PC is exactly the same as for BP. Moreover, under any of these parameterisations, the PC energy with equilibrated activities converges to the BP loss in a regime where the model width is much larger than the depth, resulting in PC computing the same gradients as BP. Experiments show that these results hold in practice for deep nonlinear networks as long as an activity equilibrium is reached. Overall, this work unifies various previous theoretical and empirical results and has important practical implications for the scaling of PCNs.

Abstract:
Discrete flow matching, a recent framework for modeling categorical data, has shown competitive performance with autoregressive models. However, unlike continuous flow matching, the rectification strategy cannot be applied due to the stochasticity of discrete paths, necessitating alternative methods to minimize state transitions. We propose a dynamicoptimal-transport-like minimization objective and derive its Kantorovich formulation for discrete flows with convex interpolants, where transport cost depends solely on inter-state similarity and can be optimized via minibatch strategies. We show that such methods can reduce the number of transitions up to 8 times (1024 to 128) to reach the same generative perplexity without compromising diversity. Additionally, path nondeterminism in discrete flows precludes an instantaneous change-of-variables analogue, preventing precise probability estimation available to continuous flows. We therefore propose two upper bounds on perplexity, enabling principled training, evaluation and model comparison. Finally, we introduce Multimask Flows which outperform masked flows in generative perplexity without compromising diversity, particularly when utilizing minibatch Optimal Transport.

Abstract:
Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited, due to contextindependent rollout allocation across prompts (e.g., generating 16 rollouts per prompt) and relying heavily on sparse rewards. This paper presents XRPO (eXplore–eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration–exploitation. To enhance exploration, XRPO introduces a mathematically grounded rollout allocator that adaptively prioritizes prompts with higher potential for uncertainty reduction. It further addresses stagnation on zero-reward prompts through an in-context seeding strategy that injects curated exemplars, steering the model into more difficult reasoning trajectories. To strengthen exploitation, XRPO develops a group-relative, novelty-aware advantage sharpening mechanism that leverages sequence likelihoods to amplify low-probability yet correct responses, thereby extending the policy’s reach beyond sparse rewards. Experiments across diverse math and coding benchmarks on both reasoning and non-reasoning models demonstrate that XRPO outperforms existing advances (e.g., GRPO and GSPO) up to 4% pass@1 and 6% cons@32, while accelerating training convergence by up to 2.7x.

Abstract:
Personalized Federated Learning enables a collection of agents to collaboratively learn individual models without sharing raw data. We propose a new approach in which each agent optimizes a weighted combination of all agents' empirical risks, with the weights learned from data rather than specified a priori. The novelty of our method lies in formulating the estimation of these collaborative weights as a kernel mean embedding estimation problem with multiple data sources, leveraging tools from multitask averaging to capture statistical relationships between agents. This perspective yields a fully adaptive procedure that requires no prior knowledge of data heterogeneity and automatically transitions between global and local learning regimes. By recasting the objective as a high-dimensional mean estimation problem, we derive finite-sample guarantees on local excess risks for a broad class of distributions, explicitly quantifying the statistical gains of collaboration. To address communication constraints inherent to federated settings, we also propose a practical implementation based on random Fourier features, which allows one to trade communication cost for statistical efficiency. Numerical experiments validate our theoretical results.

Abstract:
Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule, unmasking the most confident positions, yet this local choice can lock the model into a suboptimal unmasking order, especially on reasoningheavy prompts. We present Search Or AcceleRate (SOAR), a training-free decoding algorithm that adapts its behavior to the model’s uncertainty. When confidence is low, SOAR briefly widens the search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses the search and decodes many positions in parallel to reduce the number of denoising iterations. Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and efficiency in DLM decoding.

Abstract:
REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in tokenconditioned audio Flow Matching critically depends on the choice of supervised layers, which is typically made heuristically based on the depth. In this work, we introduceAttribution-GuidedREPresentationAlignment(AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching. Firstly, we find that layers that best store semantic/acoustic information (high teacher-space similarity) are not necessarily the layers that contribute most to the velocity field that drives generation, and we call itStore-ContributeDissociation(SCD). To turn this insight into an actionable training guidance, we propose a forward-only gate ablation (FoG-A) that quantifies each layer's causal contribution via the induced change in the predicted velocity field, enabling sparse layer selection and adaptive weighting for alignment. Across unified speech and general-audio training (LibriSpeech + AudioSet) under different token-conditioning topologies, AG-REPA consistently outperforms REPA baselines. Overall, our results show that alignment is most effective when applied to the causally dominant layers that drive the velocity field, rather than to layers that are representationally rich but functionally passive.

Abstract:
Large language models (LLMs) can be influenced by harmful or irrelevant context, which can significantly harm model performance on downstream tasks. This motivates principled designs in which LLM systems include builtin mechanisms to guard against such "garbage in, garbage out" scenarios. We propose a novel approach to limit the degree to which harmful context can degrade model performance. First, we define a baseline "safe" behavior for the model -- the model's performance given no context at all (zero-shot). Next, we apply distribution-free risk control (DFRC) to control the extent to which the user-provided context can decay performance below this safe zero-shot baseline. We achieve this by leveraging dynamic early exit prediction, ignoring later attention heads that attend the most to the unsafe inputs. Finally, we propose modifications to DFRC that allow it to both control risk for harmful inputs and leverage performance and efficiency gains on helpful inputs. We present both theoretical and empirical results across 9 tasks spanning in-context learning and open-ended question answering, showing that our approach can effectively control risk for harmful context and simultaneously achieve substantial computational efficiency gains with helpful context.

Abstract:
The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multicriteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduceInnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

Abstract:
We revisit the finitearmed linear bandit model by Nelson et al. [2022], where contexts and rewards are governed by a finite hidden Markov chain. Nelson et al. [2022] approach this model by a reduction to linear bandits, but relies on a simplification in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts, rather than functions of the hidden states themselves, and assumes knowledge of the HMM parameters. We instead study the more natural model incorporating direct dependencies in the hidden states (on top of dependencies on the observed contexts, as is natural for contextual bandits) and also target stronger, high-probability, regret bounds for a fully adaptive strategy that estimates HMM parameters online.

Abstract:
As black box models and pretrained models gain traction in time series applications, understanding and explaining their predictions becomes increasingly vital, especially in highstakes domains where interpretability and trust are essential. However, most of the existing methods involve only in-distribution explanation, and do not generalize outside the training support, which requires the learning capability of generalization. In this work, we aim to provide a framework to explain black-box models for time series data through the dual lenses of Sparse Autoencoders (SAEs) and causality. We show that many current explanation methods are sensitive to distributional shifts, limiting their effectiveness in real-world scenarios. Building on the concept of Sparse Autoencoder, we introduce TimeSAE, a framework for black-box model explanation. We conduct extensive evaluations of TimeSAE on both synthetic and real-world time series datasets, comparing it to leading baselines. The results, supported by both quantitative metrics and qualitative insights, show that TimeSAE provides more faithful and robust explanations. Our code is available in an easy-to-use library TimeSAE-Lib: https://anonymous.4open.science/w/TimeSAE-571D/

Abstract:
Benchmark Design in BlackBox Optimization (BBO) is a fundamental yet open-ended topic. Early BBO benchmarks are predominantly human-crafted, introducing expert bias and constraining diversity. Automating this design process can relieve the human-in-the-loop burden while enhancing diversity and objectivity. We propose Evolution of Benchmark (EoB), an automated BBO benchmark designer empowered by the large language model (LLM) and its program evolution capability. Specifically, we formulate benchmark design as a bi-objective optimization problem towards maximizing (i) landscape diversity and (ii) algorithm-differentiation ability across a portfolio of BBO solvers. Under this paradigm, EoB iteratively prompts LLM to evolve a population of benchmark programs and employs a reflection-based scheme to co-evolve the landscape and its corresponding program. Comprehensive experiments validate our EoB is a competitive candidate in multi-dimensional usages: 1) Benchmarking BBO algorithms; 2) Training and testing learning-assisted BBO algorithms; 3) Extending proxy for expensive real-world problems.

Abstract:
We study machine unlearning in large generative models by framing the task as density ratio estimation to a target distribution rather than supervised finetuning. While classifier guidance is a standard approach for approximating this ratio and can succeed in general, we show it can fail to faithfully unlearn with finite samples when the forget set represents a sharp, concentrated data distribution. To address this, we introduceTemper-Then-Tilt Unlearning (T3-Unlearning), which freezes the base model and applies a two-step inference procedure: (i)temperingthe base distribution to flatten high-confidence spikes, and (ii)tiltingthe tempered distribution using a lightweight classifier trained to distinguish retain from forget samples. Our theoretical analysis provides finite-sample guarantees linking the surrogate classifier's risk to unlearning quality, proving that tempering is necessary to successfully unlearn for concentrated distributions. Empirical evaluations on the TOFU benchmark demonstrate that T3-Unlearning improves forget quality and generative utility over existing baselines, while training only a fraction of the parameters with a minimal runtime.

Abstract:
LLMs excel at code generation, yet ensuring the functional correctness of their outputs remains a persistent challenge. While recent studies have applied TestDriven Development (TDD) to refine code, these methods are often undermined by poor feedback quality, stemming from the scarcity of high-quality test cases and noisy signals from auto-generated ones. In this work, we shift the focus from test quantity to feedback quality. We introduce the Property-Generated Solver (PGS), a novel paradigm designed to generate highly effective feedback via two principles: it must be property-oriented, to provide semantic guidance beyond simple I/O mismatches, and structurally minimal, to reduce cognitive load and isolate root causes. PGS operates by checking high-level program properties (e.g., a sorting function must produce a non-decreasing sequence) then providing the simplest failing counterexample to the LLM. This property-driven, minimal feedback steers LLMs toward correct and generalizable solutions. Across diverse benchmarks, PGS demonstrates superior performance, achieving a bug fix rate 1.4x-1.6x higher than the strongest debugging-based approaches and establishing a new state-of-the-art in automated code refinement. Source code and data are available in the supplementary.

Abstract:
Time series foundation models (TSFMs) are revolutionizing the forecasting landscape from specific dataset modeling to generalizable task evaluation. However, we contend that existing benchmarks exhibit common limitations in four dimensions: constrained data composition dominated by reused legacy sources, compromised data integrity lacking rigorous quality assurance, misaligned task formulations detached from realworld contexts, and rigid analysis perspectives that obscure generalizable insights. To bridge these gaps, we introduceTIME, a next-generation task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks, tailored for strict zero-shot TSFM evaluation free from data leakage. Integrating large language models and human expertise, we establish a rigorous human-in-the-loop benchmark construction pipeline to ensure high data integrity and redefine task formulation by aligning forecasting configurations with real-world operational requirements and variate predictability. Furthermore, we propose a novel pattern-level evaluation perspective that moves beyond traditional dataset-level evaluations based on static meta labels. By leveraging structural time series features to characterize intrinsic temporal properties, this approach offers generalizable insights into model capabilities across diverse patterns. We evaluate 12 representative TSFMs and establish a multi-granular leaderboard to facilitate in-depth analysis and visualized inspection.

Abstract:
Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current RM evaluation methods focus solely on preference perception accuracies in given specific scenarios, obscuring the critical vulnerabilities of RMs in realworld scenarios. We identify the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework specifically designed for RM suitability inference. Rather than answering “How accurate is the RM's preference perception for given samples?”, it employs scientific auditing to answer: “Can we infer RMs exhibit systematic vulnerabilities in specific real-world scenarios?". Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing distribution degradation of RM preference perception confidence. This enables inference of both the certainty and severity of RM vulnerabilities across diverse real-world scenarios, thereby laying a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.

Abstract:
The remarkable generalization properties of overparameterized networks are often attributed to implicit biases, such as norm minimization at small learning rates and low sharpness in the Edgeof-Stability regime. In this work, we argue that a comprehensive understanding of the generalization performance of gradient descent requires analyzing the interaction between these various forms of implicit regularization. We empirically demonstrate that the learning rate interpolates between low parameter norm and low sharpness of the trained model. We furthermore prove that neither implicit bias alone minimizes the generalization error for diagonal linear networks trained on a simple regression task. These findings demonstrate that focusing on a single implicit bias is insufficient to explain good generalization, and they motivate a broader view of implicit regularization that captures the dynamic trade-off between norm and sharpness induced by non-negligible learning rates.

Abstract:
Estimating density ratios between pairs of intractable data distributions is a core problem in probabilistic modeling, enabling principled comparisons of sample likelihoods under different datagenerating processes across conditions and covariates. While exact-likelihood models such as normalizing flows offer a promising approach to density ratio estimation, naive flow-based evaluations are computationally expensive, as they require simulating costly likelihood integrals for each distribution separately. In this work, we leverage condition-aware flow matching to derive a single dynamical formulation for tracking density ratios along generative trajectories. We demonstrate competitive performance on simulated benchmarks for closed-form ratio estimation, and show that our method supports versatile tasks in single-cell genomics data analysis, where likelihood-based comparisons of cellular states across experimental conditions enable treatment effect estimation and batch correction evaluation.

Abstract:
Trainingdata attribution for vision generative models aims to identify which training data influenced a given output. While most methods score individual examples, practitioners often need group-level answers (e.g., artistic styles or object classes). Group-wise attribution is counterfactual: how would a model's behavior on a generated sample change if a group were absent from training? A natural realization of this counterfactual is Leave-One-Group-Out (LOGO) retraining, which retrains the model with each group removed; however, it becomes computationally prohibitive as the number of groups grows. We propose GUDA (Group Unlearning-based Data Attribution) for diffusion models, which approximates each counterfactual model by applying machine unlearning to a shared full-data model instead of training from scratch. GUDA quantifies group influence using differences in a likelihood-based scoring rule (ELBO) between the full model and each unlearned counterfactual. Experiments on CIFAR-10 and artistic style attribution with Stable Diffusion show that GUDA identifies primary contributing groups more reliably than semantic similarity, gradient-based attribution, and instance-level unlearning approaches, while achieving ～100× speedup on CIFAR-10 over LOGO retraining.

Abstract:
LLMbased multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreak and adversarial collaboration. Existing defenses fall into two lines: (i) self-verification that asks each agent to pre-filter unsafe instructions before execution, and (ii) external guard modules that police behaviors. The former often underperforms because a standalone agent lacks sufficient capacity to detect cross-agent unsafe chains and delegation-induced risks; the latter increases system overhead and creates a single-point-of-failure—once compromised, system-wide safety collapses, and adding more guards worsens cost and complexity. To solve these challenges, we propose AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents. Rather than relying on external guards, AdvEvo-MARL jointly optimizes attackers (which synthesize evolving jailbreak prompts) and defenders (task agents trained to both accomplish their duties and resist attacks) in adversarial learning environments. To stabilize learning and foster cooperation, we introduce a public baseline for advantage estimation: agents within the same functional group share a group-level mean-return baseline, enabling lower-variance updates and stronger intra-group coordination. Across representative attack scenarios, AdvEvo-MARL consistently keeps attack-success rate (ASR) below 20%, whereas baselines reach up to 38.33%, while preserving or even improving task accuracy (up to +3.67%). These results show that safety and utility can be jointly improved without relying on extra guard agents or added system overhead.

Abstract:
Textto-image diffusion models remain computationally intensive: generating a single image typically requires dozens of passes through large transformer backbones (e.g., SDXL uses ~50 evaluations of a 2.6B-parameter model). Few-step variants reduce the step count to 2–8, but still rely on large, full-precision backbones, making inference impractical on resource-constrained platforms, both on-device (latency/energy) and in data centers with multi-instance GPU (MIG) style GPU partitioning (limited memory/throughput per slice). Existing post-training quantization (PTQ) methods are further hampered by dependence on full-precision calibration. We introduce Q-Sched, a scheduler-level PTQ approach that adapts the diffusion sampler while keeping the quantized weights fixed. By adjusting the few-step sampling trajectory with quantization-aware preconditioning coefficients, Q-Sched matches or surpasses full-precision quality while delivering a 4× reduction in model size and preserving a single reusable checkpoint across bit-widths. To learn these coefficients, we propose a reference-free Joint Alignment–Quality (JAQ) loss, which combines text–image compatibility with an image-quality objective for fine-grained control; JAQ requires only a handful of calibration prompts and avoids any full-precision inference during calibration. Empirically, Q-Sched yields substantial gains: a 15.5% FID improvement over the FP16 4-step Latent Consistency Model and a 16.6% improvement over the FP16 8-step Phased Consistency Model, demonstrating that quantization and few-step distillation are complementary for high-fidelity generation. A large-scale user study with 80,000 annotations further validates these results on both FLUX.1[schnell] and SDXL-Turbo. Code will be released.

Abstract:
Recent work increasingly synthesizes agentic tasks for posttraining tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized training tasks. Scaling diversity is difficult because training requires tasks to remain executable and verifiable, while generalization demands diverse tool types, toolset combinations, and heterogeneous tool-use patterns. We propose DIVE, an evidence-driven recipe that inverts synthesis order, executing diverse real-world tools first and reverse-deriving tasks strictly entailed by the resulting traces, providing grounding by construction. DIVE scales structural diversity along two controllable axes, tool-pool coverage and per-task toolset variety, synthesizing 48k trajectories over 374 tools across five domains that cover 46,398 unique toolsets and 39,810 unique tool-call graphs. Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improves by +22 average points across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68%. Under a fixed budget, controlled scaling shows diversity scaling consistently outperforms quantity scaling, even with 4× less data.

Abstract:
Reinforcement learning usually assumes a given or sometimes even fixed environment in which an agent seeks an optimal policy to maximize its longterm discounted reward. In contrast, we consider agents that are not limited to passive adaptations: they instead have model-changing actions that actively modify the RL model of world dynamics itself. Reconfiguring the underlying transition processes can potentially increase the agents' rewards. Motivated by this setting, we introduce the multi-layer configurable time-varying Markov decision process (MCTVMDP). In an MCTVMDP, the lower-level MDP has a non-stationary transition function that is configurable through upper-level model-changing actions. The agent's objective consists of two parts: Optimize the configuration policies in the upper-level MDP and optimize the primitive action policies in the lower-level MDP to jointly improve its expected long-term reward.

Abstract:
Activation sparsity offers a compelling route to accelerate large language model (LLM) inference by selectively suppressing hidden activations, yet existing approaches exhibit severe accuracy degradation at high sparsity. We show that this failure stems from representational instability:activation sparsity disrupts inputdependent activation learned during pretraining, inducing distribution shifts in hidden states. We address this issue by reframing activation sparsity as a representational alignment problem and introducingSpontaneous Neurons (SPON), a lightweight mechanism inspired by spontaneous neural activity in biological systems. SPON injects a small set of learnable, input-independent activation vectors that act as persistent representational anchors for sparse computation. These vectors are trained via distribution matching to the dense model and can be absorbed into bias terms after training, incurring negligible inference overhead. Across multiple LLM backbones, SPON consistently restores performance, stabilizes latent representations, and preserves generalization. Our results establish SPON as an effective and principled solution for reliable activation-sparse inference, and offer new insights into knowledge retention in LLMs.

Abstract:
Endto-end prediction of high-order crystal tensor properties from atomic structures remains challenging: while spherical-harmonic equivariant models are expressive, their Clebsch-Gordan tensor products incur substantial compute and memory costs for higher-order targets. We propose the Cartesian Environment Interaction Tensor Network (CEITNet), an approach that constructs a multi-channel Cartesian local environment tensor for each atom and performs flexible many-body mixing via a learnable channel-space interaction. By performing learning in channel space and using Cartesian tensor bases to assemble equivariant outputs, CEITNet enables efficient construction of high-order tensor. Across benchmark datasets for order-2 dielectric, order-3 piezoelectric, and order-4 elastic tensor prediction, CEITNet surpasses prior high-order prediction methods on key accuracy criteria while offering high computational efficiency. Code is provided in supplementary materials.

Abstract:
GraphRAG is increasingly adopted for converting unstructured corpora into graph structures to enable multihop reasoning. However, standard graph algorithms rely heavily on static connectivity and explicit edges, often failing in real-world scenarios where knowledge graphs (KGs) are noisy, sparse, or incomplete. To address this limitation, we introduce INSES (Intelligent Navigation and Similarity Enhanced Search), a dynamic framework designed to reason beyond explicit edges. INSES couples LLM-guided navigation, which prunes noise and steers exploration, with embedding-based similarity expansion to recover hidden links and bridge semantic gaps. Recognizing the computational cost of graph reasoning, we complement INSES with a lightweight router that delegates simple queries to Naïve RAG and escalates complex cases to INSES, balancing efficiency with reasoning depth. INSES consistently outperforms SOTA RAG and GraphRAG baselines across multiple benchmarks. Notably, on the MINE benchmark, it demonstrates superior robustness across KGs constructed by varying methods (KGGEN, GraphRAG, OpenIE), improving accuracy by 5%, 10%, and 27%, respectively.

Abstract:
Neighborhood search operators are critical to the performance of MultiObjective Evolutionary Algorithms (MOEAs) and rely heavily on expert design. Although recent LLM-based Automated Heuristic Design (AHD) methods have made notable progress, they primarily optimize individual heuristics or components independently, lacking explicit exploration and exploitation of dynamic coupling relationships between operators. In this paper, multi-operator optimization in MOEAs is formulated as a Markov decision process, enabling the improvement of interdependent operators through sequential decision-making. To address this, we propose the Evolution of Operator Combination (E2OC) framework for MOEAs, which achieves the co-evolution of design strategies and executable codes. E2OC employs Monte Carlo Tree Search to progressively search combinations of operator design strategies and adopts an operator rotation mechanism to identify effective operator configurations while supporting the integration of mainstream AHD methods as the underlying designer. Experimental results across AHD tasks with varying objectives and problem scales show that E2OC consistently outperforms state-of-the-art AHD and other multi-heuristic co-design frameworks, demonstrating strong generalization and sustained optimization capability.

Abstract:
Point forecasting for graphstructured multivariate time series is a fundamental problem, but rigorous uncertainty quantification for such predictions is still underexplored. Conformal prediction (CP) offers uncertainty estimation with a solid coverage guarantee under the exchangeability assumption, which requires the joint data distribution to be unchanged under permutation. However, in graph-structured time series, inherent cross-node coupling can violate the exchangeability condition, making direct application of CP unreliable. Inspired by the spectral graph theory, such coupling resides in global trends and can be characterized by the low-frequency components, while high-frequency components are nearly exchangeable. Therefore, we propose a novel concept namedSpectralGraphConditionalExchangeability (SGCE), which conditions exchangeable high-frequency components on low-frequency ones to preserve global trends and enable effective CP in the spectral domain. Based on SGCE, we further proposeSpectralConformal prediction via wAveLEt transform (SCALE). SCALE uses graph wavelets to decompose low/high-frequency components and conformalizes high-frequency residuals via adaptive gating over a low-frequency embedding. Experimental results on real-world traffic datasets show that SCALE not only achieves valid coverage but also consistently improves the coverage-efficiency trade-off over the state-of-the-art CP methods.

Abstract:
Even for the gradient descent (GD) method applied to neural network training, understanding its optimization dynamics, including convergence rate, iterate trajectories, function value oscillations, and especially its implicit acceleration, remains a challenging problem. We analyze nonlinear models with the logistic loss and show that the steps of GD reduce to those of generalized perceptron algorithms (Rosenblatt, 1958), providing a new perspective on the dynamics. This reduction yields significantly simpler algorithmic steps, which we analyze using classical linear algebra tools. Using these tools, we demonstrate on a minimalistic example that the nonlinearity in a twolayer model can provably yield a faster iteration complexity \tilde\mathcalO(\sqrtd) compared to \Omega(d) achieved by linear models, where d is the number of features. This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks. The theoretical results are supported by extensive numerical experiments. We believe that this alternative view will further advance research on the optimization of neural networks.

Abstract:
Direct Preference Optimization (DPO) is often tuned as if increasing alignment pressure (controlled by \beta) yields progressively “better” behavior. We instead treat \beta as a control parameter and densely sweep it for three 7B openweight families under a fixed DPO recipe. In Mistral, capability is sharply non-monotonic: aggregated logic-probe margins become positive only in a narrow band near \beta \approx 10^-2 and revert outside it, with boundary points that are seed-sensitive. Across architectures under the same sweep, we observe qualitatively different response modes: sharp reorganization in Mistral, selective changes in Llama, and smooth trade-offs in Qwen. Critically, the DPO preference margin can anticorrelate with reasoning capability (Pearson r=-0.91 for Llama logic), so margin-based selection can prefer capability-impaired models. Training path also matters: exposure to high \beta induces capability losses that persist even after \beta is reduced (hysteresis). These findings motivate capability-resolved evaluation across the \beta landscape rather than reliance on margins or aggregate benchmarks.

Abstract:
Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in nonverifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.

Abstract:
In this paper, we provide a comprehensive convergence analysis for the Lion optimizer. First, we establish that the original Lion achieves a convergence rate of \mathcalO(d^1/2T^1/4), where d denotes the problem dimension and T is the iteration number. To improve this rate, we propose a variance reduction variant of Lion, which attains an enhanced rate of \mathcalO(d^1/2T^-1/3) with the average smoothness assumption. Then, we extend our analysis to distributed settings. We demonstrate that the distributed Lion optimizer and its variance reduction counterpart achieve linear speedup with respect to the number of nodes n, yielding convergence rates of \mathcalO(d^1/2(nT)^-1/4) and \mathcalO(d^1/2(nT)^-1/3), respectively. Additionally, we investigate a communication-efficient distributed Lion variant that utilizes sign compression for bidirectional communication. By employing unbiased sign operations, this variant achieves a convergence rate of \mathcalO \left( \max \ \fracd^1/4T^1/4, \fracd^1/10n^1/5T^1/5 \ \right), and its variance-reduced counterpart can further improves the rate to \mathcalO\left( \fracd^1/4T^1/4 \right). Finally, we conduct numerical experiments to validate the effectiveness of the proposed methods.

Abstract:
State Space Models (SSMs) have emerged as efficient alternatives to attention for vision tasks, offering lineartime sequence processing with competitive accuracy. Vision SSMs, however, require serializing 2D images into 1D token sequences along a predefined scan order, a factor often overlooked. We show that scan order critically affects performance by altering spatial adjacency, fracturing object continuity, and amplifying degradation under geometric transformations such as rotation. We present \bf Partial RIng Scan Mamba (PRISMamba), a rotation-robust traversal that partitions an image into concentric rings, performs order-agnostic aggregation within each ring, and propagates context across rings through a set of short radial SSMs. Efficiency is further improved via partial channel filtering, which routes only the most informative channels through the recurrent ring pathway while keeping the rest on a lightweight residual branch. On ImageNet-1K, PRISMamba achieves 84.5% Top-1 with 3.9G FLOPs and 3,054 img/s on A100, outperforming VMamba in both accuracy and throughput while requiring fewer FLOPs. It also maintains performance under rotation, whereas fixed-path scans drop by 1–2%. These results highlight scan-order design, together with channel filtering, as a crucial, underexplored factor for accuracy, efficiency, and rotation robustness in Vision SSMs. Code will be released upon acceptance.

Abstract:
We present Multipole Semantic Attention (MuSe), an efficient approximation of softmax attention for longcontext transformers. MuSe clusters queries and keys separately in their learned representation spaces, computing query-specific cluster summaries that capture how each query cluster attends to each key cluster. This is combined with retrieval of high-attention clusters for exact computation. Unlike prior work that clusters only keys, our separate query clustering provides a ~9× effective cluster count advantage, enabling high approximation quality at extreme sparsity. For causal attention, we introduce a block-sparse structure with causal accumulation of cluster summaries across spatial blocks, followed by two-level retrieval. At 64k context, MuSe achieves 64× sparsity in the far-field attention with <1% relative squared error and 2× speedup over CUDNN Flash Attention on isolated attention layers. We pretrain language models up to 1B parameters at 64k context, achieving 36% wallclock speedup with <1% loss degradation.

Abstract:
Mixtureof-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only overwhichexperts are activated, but also overhow muchof each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including alightweight test-time training mechanismthat learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT models trained on OpenWebText demonstrate that MoSE matches or improves upon standard MoE at full width andconsistently shifts the Pareto frontierfor accuracy vs. cost, achieving comparable performance with significantly fewer FLOPs.

Abstract:
Diffusion models often yield highly curved trajectories and noisy score targets due to an uninformative, memoryless forward process that induces independent datanoise coupling. We propose Adjoint Schrödinger Bridge Matching (ASBM), a generative modeling framework that recovers optimal trajectories in high dimensions via two stages. First, we view the Schrödinger Bridge (SB) forward dynamic as a coupling construction problem and learn it through a data-to-energy sampling perspective that transports data to an energy-defined prior. Then, we learn the backward generative dynamic with a simple matching loss supervised by the induced optimal coupling. By operating in a non-memoryless regime, ASBM produces significantly straighter and more efficient sampling paths. Compared to prior works, ASBM scales to high-dimensional data with notably improved stability and efficiency. Extensive experiments on image generation show that ASBM improves fidelity with fewer sampling steps. We further showcase the effectiveness of our optimal trajectory via distillation to a one-step generator.

Abstract:
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing lowprecision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, INT consistently surpasses it as the quantization block size shrinks. Our comprehensive comparison demonstrates that for popular fine-grained formats like MX (block size 32), MXINT8 and MXINT4 are superior to their FP counterparts in both algorithmic accuracy and hardware efficiency. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory and advocate for prioritizing fine-grained INT formats in future AI accelerators to achieve a better balance of accuracy, power, and efficiency.

Abstract:
Spatial intelligence is crucial for visionlanguage models (VLMs) in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, built from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints. SSI-Bench contains 1,000 ranking questions spanning geometric and topological reasoning and requiring a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It is created via a fully human-centered pipeline: ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. Evaluating 31 widely used VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Encouraging models to think yields only marginal gains, and error analysis points to failures in structural grounding and constraint-consistent 3D reasoning.

Abstract:
Interpretable time series deep learning systems are often assessed by checking temporal consistency on explanations, implicitly treating this as evidence of robustness. We show that this assumption can fail: Predictions and explanations can be adversarially decoupled, enabling targeted misclassification while the explanation remains plausible and consistent with a chosen reference rationale. We propose TSEF (Time Series Explanation Fooler), a dualtarget attack that jointly manipulates the classifier and explainer outputs. In contrast to single-objective misclassification attacks that disrupt explanation structure and spread attribution mass broadly, TSEF achieves targeted prediction changes while keeping explanations consistent with the reference. Across multiple datasets and explainer backbones, our results consistently reveal that explanation stability is a misleading proxy for decision robustness and motivate coupling-aware robustness evaluations for trustworthy time series deployments.

Abstract:
Learning conditional distributions \pi^\star(\cdot|x) is a central problem in machine learning, which is typically approached via supervised methods with paired data (x,y) ～ \pi^\star. However, acquiring paired data samples is often challenging, especially in problems such as domain translation. This necessitates the development of semisupervised models that utilize both limited paired data and additional unpaired i.i.d. samples x ～ \pi^\star_x and y ～ \pi^\star_y from the marginal distributions. The usage of such combined data is complex and often relies on heuristic approaches. To tackle this issue, we propose a new learning paradigm that integrates both paired and unpaired data seamlessly using data likelihood maximization techniques. We demonstrate that our approach also connects intriguingly with inverse entropic optimal transport (OT). This finding allows us to apply recent advances in computational OT to establish an end-to-end learning algorithm to get \pi^\star(\cdot|x). In addition, we derive the universal approximation property, demonstrating that our approach can theoretically recover true conditional distributions with arbitrarily small error. Finally, we demonstrate through empirical tests that our method effectively learns conditional distributions using paired and unpaired data simultaneously.

Abstract:
Neural acoustic fields often model timedomain impulse responses, which struggle to capture the frequency-selective wave behaviors that dominate confined, resonant environments. To address this, we propose INFER (Implicit Neural Frequency Response fields), a framework that directly learns continuous, complex-valued frequency response fields. Unlike prior time-domain methods, our frequency-first approach enables three key innovations: (1) end-to-end learning of frequency-specific attenuation and phase delay in 3D space; (2) a physics-based Kramers–Kronig consistency constraint that causally regularizes attenuation and phase delay; and (3) perceptual and hardware-aware spectral supervision that prioritizes critical auditory bands. We evaluate INFER across diverse settings, ranging from standard room-scale benchmarks (MeshRIR, RAF) to challenging, highly reverberant environments like real car cabins. Our approach significantly outperforms time- and hybrid-domain baselines, reducing average magnitude and phase reconstruction errors by over 39% and 51%, respectively, demonstrating state-of-the-art accuracy in modeling complex acoustic spaces.

Abstract:
FineTuning-as-a-Service (FTaaS) facilitates the customization of Multimodal Large Language Models (MLLMs) but introduces critical backdoor risks via poisoned data. Existing defenses either rely on supervised signals or fail to generalize across diverse trigger types and modalities. In this work, we uncover a universal backdoor fingerprint—attention allocation divergence—where poisoned samples disrupt the balanced attention distribution across three functional components: system instructions, vision inputs, and user textual queries, regardless of trigger morphology. Motivated by this insight, we proposeTri-Component Attention Profiling (TCAP), an unsupervised defense framework to filter backdoor samples. TCAP decomposes cross-modal attention maps into the three components, identifies trigger-responsive attention heads via Gaussian Mixture Model (GMM) statistical profiling, and isolates poisoned samples through EM-based vote aggregation. Extensive experiments across diverse MLLM architectures and attack methods demonstrate that TCAP achieves consistently strong performance, establishing it as a robust and practical backdoor defense in MLLMs.

Abstract:
Recent advances in diffusion models show promising potential to accelerate nonconvex problem solving by leveraging their multimodality. However, most existing diffusionbased optimization approaches rely on supervised learning and lack a mechanism to enforce constraint satisfaction, which is required in real-world applications. In that case, we investigate and theoretically analyze the inherent problem of supervised diffusion solvers and identify the distributional misalignment problem, i.e., the generated solution distribution often exhibits low probability mass on the feasible region. To resolve this issue, we propose DiOpt, a new diffusion-based learning framework for constrained nonconvex optimization, which effectively learns the mapping from noise to the constraint region. Specifically, this framework operates in two distinct phases: an initial warm-start phase, implemented via supervised learning, followed by a bootstrapping training phase. This dual-phase architecture is designed to iteratively refine solutions, thereby improving the objective function with high constraint satisfaction. Finally, we also employ a solution selection technique in inference for better optimality. Notably, DiOpt is the first successful integration of the diffusion solver in constrained nonconvex optimization. Evaluations on diverse nonconvex tasks demonstrate the superiority of DiOpt in both optimality and constraint satisfaction.

Abstract:
Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as kmers or BPE, which impose arbitrary sequence boundaries and may obscure biologically relevant structure. Although recent work has demonstrated the feasibility of adaptive hierarchical tokenization in autoregressive settings, its extension to masked language modeling and systematic downstream evaluation remain underexplored. We present LDARNet, a 120M-parameter hierarchical genomic foundation model that integrates learnable sequence compression into the masked language modeling paradigm by combining BiMamba-2 state-space layers with selective attention and ratio-based regularization to induce stable token boundaries without supervision. We evaluate LDARNet by fine-tuning on 27 tasks from the Genomics Benchmarks and Nucleotide Transformer suites, covering regulatory, epigenetic, and sequence-level prediction problems, and compare it against state-of-the-art models ranging from 8M to 2.5B parameters. LDARNet achieves 11 out of 18 wins among models under 300M parameters and attains state-of-the-art performance on 5 histone modification tasks, outperforming even substantially larger models on several long-range epigenetic benchmarks. These results indicate that adaptive hierarchical tokenization under masked language modeling can capture long-range genomic dependencies relevant to regulatory biology and highlight learnable compression as a promising direction for efficient and scalable genomic foundation models.

Abstract:
Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating stepby-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a natural diagnostic signal, yet existing methods are restricted to final answers or require internal model access. We introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.

Abstract:
Large language models (LLMs) have shown strong empirical gains as selfevolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. However, how planning decisions attribute and combine heterogeneous feedback signals remains opaque. Standard end-to-end ablations fail to resolve this question, as iterative planning amplifies early perturbations and conflates feedback effects with trajectory-dependent drift. We introduce CUDAnalyst, a unified analysis layer for controlled, generation-level attribution of planning decisions to feedback components via trajectory freezing and selective feedback injection. CUDAnalyst enables stable generation-level evaluation and principled coalitional-style attribution of feedback effects and interactions. Our results show that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones. These trends hold across representative workloads and reference induction regimes, indicating that the identified feedback-to-plan structure is robust within the controlled axes studied.

Abstract:
Multimodal InContext Learning (ICL) has emerged as a practical inference paradigm for Multimodal Large Language Models, where a small set of interleaved image-text In-Context Demonstrations (ICDs) conditions the model to solve new tasks. Despite its flexibility, multimodal ICL incurs high inference latency and suffers from instability due to sensitivity to demonstration formatting, ordering, and content. To address these limitations, we propose Hyper-ICL, a lightweight, training-based framework for demonstration-free multimodal ICL that reconstructs demonstration effects directly without requiring ICDs at inference time. Hyper-ICL learns a parameter-efficient low-rank logit-level adapter that calibrates attention distributions to better match demonstration-induced attention redistribution. To capture how demonstration influence varies across queries, we introduce query-adaptive modulation mechanism that adaptively controls intervention strength at token level across layers and heads based on the current query. Finally, we propose a layer-wise hyperbolic anchor distillation loss that aligns intermediate student features to a demonstration-conditioned teacher via Lorentz geodesic distance. This loss encourages the student to reconstruct the demonstration–query relationships induced by ICDs. Extensive experiments across six different multimodal benchmarks (including VQAv2, OK-VQA, and COCO Caption) demonstrate that Hyper-ICL consistently improves accuracy and stability over vanilla ICL and existing state-of-the-art methods.

Abstract:
This work investigates theoretically the interplay between interpolation and aggregation in regression. We establish that the \gammagraph dimension characterizes learnability for a broad class of natural aggregation procedures. Furthermore, we prove that an extremely simple aggregation procedure, combining three interpolating hypotheses via the median, is optimal among all these aggregation procedures, and is strictly more powerful than proper learning. Finally, we show that some hypothesis classes are learnable only by aggregating infinitely many hypotheses or by using non-interpolating aggregation rules (which may predict outside the range of their inputs), and any finite interpolating aggregation fails to achieve even trivial performance.

Abstract:
Reinforcement Learning (RL) has become pivotal for improving model capabilities yet suffers from rollout efficiency bottlenecks due to the longtail response length distribution. While existing works mitigate the impact of long tails via prompt-level tail scheduling, we focus on the root source of inefficiency: the distribution itself. Specifically, we characterize the long-tail distribution at a finer granularity, identifying intra-prompt long tails, and revealing that they frequently consist of ineffective verbosity. To address this, we propose a novel paradigm of active distribution shaping to shape the rollout distribution towards conciseness and certainty, thereby fundamentally resolving tail-induced overheads. We achieve this through a distribution-aware trajectory sampling mechanism, which selects trajectories from a redundant exploration space for each prompt, and an adaptive redundancy allocation scheme to maximize both shaping effectiveness and system efficiency. Experiments demonstrate significant acceleration over state-of-the-art systems by up to 1.77× without compromising model performance.

Abstract:
Generative modeling provides a powerful framework for learning data distributions. These models initially relied on probabilistic methods such as Gaussian Processes (GP) for uncertaintyaware predictions and shifted towards larger trainable models to learn more complex distributions. In this work, we introduceRandom Process (RP) Flow, a Flow Matching-based framework that represents the vector field as a neural implicit function. Unlike modern generative methods, our setting involves a single observed field, from which only sparse measurements are available. RP Flow uses Random Fourier Features to learn an implicit signal representation that can be queried at any arbitrary location from a limited set of observations, while encoding uncertainty through ensemble sampling. We propose constructing a Bayesian posterior by GP regression in the source space to generate high-quality samples. Our empirical results demonstrate that this framework generates realistic samples along with calibrated uncertainty estimates, even under challenging conditions such as high frequency, high sparsity, or high dimensionality. These findings position RP Flow as a milestone towards generative models for reconstruction tasks where data is scarce and uncertainty must remain traceable.

Abstract:
Proteinprotein interactions (PPIs) are fundamental to cellular function, disease mechanisms, and drug discovery. Current learning-based PPI predictors focus on learning powerful protein representations but neglect designing specialized classification heads. They mainly rely on generic aggregating methods like concatenation or dot products, which lack biological insight. Motivated by the biological "L3 rule", where multiple length-3 paths between a pair of proteins indicate their interaction likelihood, our study addresses this gap by designing a biologically informed PPI classifier. In this paper, we provide empirical evidence that popular PPI datasets strongly support the L3 rule. We propose an L3-path-regularized graph prompt learning method called L3-PPI, which can generate a prompt graph with virtual L3 paths based on protein representations and controls the number of paths. L3-PPI reformulates the classification of protein embedding pairs into a graph-level classification task over the generated prompt graph. This lightweight module seamlessly integrates with leading PPI predictors as a plug-and-play component, injecting the interaction prior of complementarity to enhance performance. Extensive experiments show that L3-PPI achieves superior performance enhancements over state-of-the-art competitors.

Abstract:
Recent work describes what transformers can and cannot compute through connections to boolean circuits, but existing results lack exact characterizations and are sensitive to modeling choices. Padded transformerswhose input is padded with filler symbols such as ``...''---emerge as a useful gadget for establishing equivalences to circuit classes by providing polynomial space for adaptive parallel computation. However, only a limited set of padded transformer idealizations has been studied, leaving open how robust these equivalences are to choices such as attention type, model width, and uniformity. We find that, under practical assumptions, padded transformers are surprisingly robust to all of these, and identify numeric precision and model depth as the main factors affecting expressivity. Concretely, we prove that polynomially-padded L-uniform constant-precision transformers with growing width equal L-uniform \textAC^0, while growing-precision ones achieve L-uniform \textTC^0 regardless of width. Furthermore, looping enables sequential processing analogous to circuits: \log^d N-looped constant-precision transformers reach FO-uniform \textAC^d, and growing-precision ones reach FO-uniform \textTC^d. Interestingly, growing width or precision beyond logarithmic does not increase expressivity, and all our results hold for both softmax and average hard attention transformers.

Abstract:
We explore the expressive power of Transformers by establishing precise approximation error upper and lower bounds for Hölder class. Specifically, a new approximation upper bound is derived for the standard Transformer architecture equipped with Softmax operators, ReLU activation functions, and residual connections. We prove that a Transformer network composed of at most \mathcalO(\varepsilon^d_0/\alpha) blocks can approximate any bounded Hölder function with d_0-dimensional input and smoothness \alpha\in(0,1] under any accuracy \varepsilon>0. In the case of approximation lower bounds, leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers demand for at least \mathcalO(\varepsilon^-d_0/(4\alpha)) blocks to achieve the \varepsilon approximation accuracy. As a final step, we extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates demonstrating Transformers' empirical effectiveness in real-world settings.

Abstract:
Categorylevel 6D object pose estimation is typically formulated as a multi-category joint learning problem with fully shared model parameters. However, pronounced geometric heterogeneity across categories entangles incompatible optimization signals in shared modules, resulting in gradient conflicts and negative transfer during training. To address this challenge, we first introduce gradient-based diagnostics to quantify module-level cross-category contention. Building on results of diagnostics, we propose DecomPose, a difficulty-aware decomposition framework that mitigates optimization contention via: (1) difficulty-aware gradient decoupling, which groups categories using a data-driven difficulty proxy and routes each instance to a group-specific correspondence branch to isolate incompatible updates; and (2) stability-driven asymmetric branching, which assigns higher-capacity branches to structurally simple categories as stable optimization anchors while constraining complex categories with lightweight branches to suppress noisy updates and alleviate negative transfer. Extensive experiments on REAL275, CAMERA25, and HouseCat6D demonstrate that DecomPose effectively reduces cross-category optimization contention and achieves state-of-the-art performance.

Abstract:
Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, limiting their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A^2RBench, encompassing generation, expansion, evaluation, and analysis. Specifically, in the generation stage, LLMs create diverse tasks demanding genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand new input spaces to generate task variations, achieving scaling. However, such a process may cause hallucinations. To eliminate it, we further establish a theoretical framework and prove that programmatic verification—testing whether the inverse operation perfectly reverses the forward operation (cycle consistency)—guarantees a unique solution. Through extensive evaluations on mainstream LLMs, we find: (1) Current LLMs exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans on a representative subset (39.8% vs. 68.5%). (2) Current LLMs fall far short of 2D and 1D in the complexity of generated 3D tasks, revealing their lack of understanding of highdimensional tasks. (3) Counterintuitively, inputs with higher information complexity can simplify the reasoning process. Code and data are available at: https://anonymous.4open.science/r/A2Rbench.

Abstract:
Neural rough differential equations (NRDEs) learn continuoustime dynamics from irregularly sampled sequences by encoding the input path with signature features, providing robustness to discretisation and sampling irregularity. However, existing NRDEs implicitly rely on algebraic identities that can fail in two important settings: stochastic dynamics interpreted in the Itô sense, and dynamics evolving on manifolds where curvature renders the effect of repeated derivatives order-dependent. In this work, we propose Branched Neural Rough Differential Equations (B-NRDEs), a unified framework that replaces geometric signature features with tree-based (branched) rough-path lifts, yielding models that remain well-defined under Itô noise and on manifolds. Building on these branched lifts, an Itô-consistent training objective is introduced via the branched signature kernel. We provide an efficient, autodifferentiable package Stochastax for computing branched (log-)signatures and solving (manifold) RDEs. Across various applications, including rough Bergomi volatility modelling, sim-to-real \mathrmSO(3) dynamics forecasting, and SPD covariance dynamics, B-NRDE shows consistently strong results.

Abstract:
Monocular 3D shape recovery is fundamental to geometric understanding, yet achieving robust generalization across arbitrary viewpoints and unseen object categories remains a significant challenge. In this paper, we present a generalizable deformation learning framework that reconstructs 3D objects by explicitly deforming a categorylevel shape template to match the target observation. To address complex shape variations between the template and the target, we introduce a geometry-guided feature modeling mechanism. This process first enriches foundation features with template topology to yield a geometry-aware representation, which is then explicitly correlated with the target observation to guide precise deformation. Furthermore, to bridge the disparity between the fixed template and arbitrary target views, we propose a view-adaptive feature aggregation module. This module leverages multi-view template features and their corresponding camera poses to enrich the canonical template representation, ensuring robust feature alignment regardless of the target's perspective. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in handling large shape variations and diverse viewpoints, exhibiting strong generalization to novel categories and effectively supporting downstream real-world dexterous robotic manipulation tasks.

Abstract:
Hyperrelational knowledge graphs (HKGs) effectively represent complex facts. While inferring new knowledge in HKGs is a critical problem, current methods cast it as a simple link prediction, assuming that nearly all entities and relations within a fact are known, leaving only a single blank to be filled. However, this restricted assumption may not hold in real-world scenarios in which multiple, or even all, constituent components of a fact may be missing simultaneously. To bridge this gap, we introduce a task called fact generation: generating a valid hyper-relational fact from an arbitrarily masked query, i.e., completing a partially observed fact or generating a fact from scratch. We propose KREPE, the first generative representation learning method for HKGs that learns to model the probability distributions of missing components conditioned on the local fact components and global structure of HKGs via a masked discrete diffusion. KREPE models both the intra-fact dependencies by contextual message passing and inter-fact correlations by aggregating stochastically sampled contexts. KREPE seamlessly unifies link prediction and fact generation within a single training framework, achieving state-of-the-art performance on standard HKG link prediction benchmarks and outperforming LLM-based baselines in generating novel and correct facts.

Abstract:
The scarcity of highquality imaging data for coronary angiography (CAG) stenosis limits the clinical translation of automated stenosis detection. Synthetic stenosis data provides a practical avenue to augment training sets, improving data quality, diversity, and distributional coverage, and enhancing detection precision and generalization. However, diffusion-based editing commonly relies on soft guidance in a noise-initialized reverse process, offering limited pixel-level precision and structure preservation. We propose theOT-Bridge Editor, which reframes localized editing as a constrained entropic optimal transport (OT) problem and leverages geometric information to steer the generation path, enabling stronger geometric control. Extensive experiments show that our synthesized angiograms consistently improve downstream stenosis detection, yielding substantial relative gains of 27.8% on the public ARCADE benchmark and 23.0% on our multi-center dataset, supported by consistent qualitative results.

Abstract:
Planning and acting in 3D environments is a fundamental capability for robotic manipulation in the real world. Although prior work has explored predictive flow planners to guide 3D manipulation, existing approaches often rely on modular pipelines stacking multiple submodels, resulting in high computational overhead and limited realtime performance. To address these challenges, we introduce RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space. As an end-to-end framework, RoboFlow4D directly predicts multi-frame 3D flows from visual observations and textual instructions, providing explicit flow-based planning to guide action generation. This design allows seamless integration with general action policies, forming an efficient observation–planning–execution closed loop. Through slow–fast collaboration between flow prediction and action control, RoboFlow4D enables real-time and resource-efficient manipulation. Extensive experiments in both simulation and real-world settings demonstrate that RoboFlow4D consistently improves manipulation success rates and computational efficiency, advancing flow-guided planning for embodied intelligence.

Abstract:
With rapid advances in audiovisual generative models, reliable forgery detection becomes increasingly critical. Existing methods for audio-visual deepfake detection typically rely on cross-modal inconsistencies. In singing, rhythmic vocalization weakens this coupling and introduces a nontrivial domain shift, substantially degrading detection performance. We construct the Singing Head DeepFake (SHDF) dataset using rhythm-aware generative models to fill the gap in singing benchmarks. To cope with cross-scenario domain shifts, we propose a Text-guided Audio-Visual Forgery Detection (T-AVFD) framework that generalizes across both talking and singing scenarios. T-AVFD comprises a facial authenticity pattern learner and a multi-modal differential weight learning module. The pattern learner aligns facial features with multi-granularity textual descriptions to learn generalizable authenticity patterns. The weight learning module preserves intrinsic audio–visual consistency and adaptively integrates it with authenticity patterns via differential weighting. Extensive experiments on multiple talking head deepfake datasets and SHDF show consistent improvements over existing baselines and strong robustness under diverse perturbations.

Abstract:
Federated Learning (FL) offers a privacypreserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user preferences (e.g., helpfulness vs. harmlessness). While Variational Preference Learning (VPL) offers a pathway to personalization, adapting it to decentralized settings presents a fundamental challenge: posterior collapse driven by severe local data scarcity and heterogeneity. In this paper, we propose Federated Variational Preference Alignment with Gumbel-Softmax Prior (FedVPA-GP), a framework designed to disentangle diverse preferences without compromising privacy. To stabilize variational inference, we introduce a Federated Mixture Prior that enables clients to leverage the aggregate population distribution as a dynamic prior. Furthermore, we incorporate an Orthogonal Loss that explicitly enforces the separation of preference prototypes in the latent space. Experiments on the HH-RLHF dataset demonstrate that FedVPA-GP significantly outperforms monolithic baselines, successfully disentangling conflicting user intents and enabling dynamic preference switching.

Abstract:
Deep learning (DL) models, despite their remarkable success, remain vulnerable to small input perturbations that can cause erroneous outputs, motivating probabilistic robustness (PR) as a complementary notion to adversarial robustness (AR) for stochastic reliability assessment. However, existing PR formulations assume a fixed, known perturbation distribution, which is often unavailable or misspecified in practice. To address this limitation, we propose nonparametric probabilistic robustness (NPPR), a more conservative PR estimator over an admissible family of perturbation distributions. We instantiate NPPR with a tractable estimator (GMM-based) that supports four dependency structures (independent, label-, input-, and joint-dependent perturbations). We show that NPPR provably interpolates between AR and PR. Experiments on CIFAR-10/100 and TinyImageNet across multiple architectures show that NPPR yields consistently lower (more conservative) PR estimates than PR computed under common assumed distributions (e.g., Gaussian/Uniform), with up to 40% reduction in representative settings.

Abstract:
Vision–Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLMbased agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.

Abstract:
Tuning stochastic gradient methods such as stochastic gradient descent (SGD) and stochastic gradient Langevin dynamics (SGLD) for approximate sampling and uncertainty quantification remains challenging, particularly in relevant settings using a large batch size or when the model is misspecified. Existing theory that provides tuning guidance relies on continuoustime limits or strong statistical assumptions, which can become quantitatively inaccurate in these regimes. We address these shortcomings by proposing a new discrete-time approximations to SG(L)D with and without momentum, which enables accurate predictions of the stationary covariance, iterate average covariance, and integrated autocorrelation time. Moreover, we prove quantitative, non-asymptotic error bounds showing that these estimates are sufficiently accurate for practical tuning and uncertainty quantification. Numerical experiments demonstrate that our theory yields improved tuning guidance across a range of models and data-generating distributions where existing approaches fail, including when using the \beta-divergence rather than log-loss to obtain statistically robust inferences.

Abstract:
Despite the rapid advancements in VisionLanguage Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows, as existing methods predominantly rely on pixel-based synthesis which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity; instead, we propose a new "Diagram-as-Code" paradigm with symbolic logic that leveragesmxGraphExtensible Markup Language (XML) for precise diagram generation and editing, and we presentVCG-Bench, a unified benchmark for visual-centricmxGraphtasks comprising (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), and (3) a Tailored Evaluation Protocol employing multi-dimensional metrics such asmxGraphExecution Success Rate and Style Consistency Score (SCS), where experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.

Abstract:
Semisupervised learning (SSL) faces significant challenges in realistic scenarios where labeled data is extremely scarce and unlabeled data follows unknown, arbitrary distributions. We formalize this critical yet under-explored paradigm as Universal Semi-supervised Learning (UniSSL). Existing methods typically leverage unlabeled data via pseudo-labeling. However, they often rely on the idealized assumption of a uniform unlabeled data distribution or require sufficient labeled data to estimate it. In the UniSSL setting, such dependencies lead to numerous erroneous pseudo-labels, thereby triggering representation confusion. Fortunately, we observe that inter-sample relations captured by representations are more reliable than pseudo-labels. Leveraging this insight, we shift our focus to the representation-level structural inference to bypass distribution estimation. Accordingly, we propose Simplex Anchored Graph-state Equipartition (SAGE), which captures high-order inter-sample dependencies to establish structural consensus for guiding representation learning. Meanwhile, to mitigate representation confusion, we further employ vectors that satisfy a simplex equiangular tight frame to serve as a coordinate frame for guiding inter-class representation separation. Finally, we introduce a weighting strategy based on distribution-agnostic metrics to prioritize reliable pseudo-labels and an auxiliary branch to isolate potentially erroneous pseudo-labels. Extensive evaluations on five standard benchmarks demonstrate that SAGE consistently outperforms state-of-the-art methods, achieving an average accuracy improvement of8.52%.Code is available in the supplementary material.

Abstract:
Sequential conformal prediction (CP) provides valid uncertainty quantification under the assumption of residual exchangeability. However, this assumption is often violated in realworld time series due to temporal dependencies and distributional shifts. While recent methods attempt to approximate exchangeability through reweighting, identifying optimal weights remains an open challenge. To address this limitation, we proposeDistMatch, a binning-based method that recursively partitions residuals within a binary tree using the Kolmogorov–Smirnov (KS) statistic. We theoretically show that this partitioning induces approximately exchangeable leaves, thereby avoiding the need for reweighting. By applying quantile regression with online updates within each leaf, DistMatch enables locally adaptive inference and improves robustness to distributional shifts. Extensive experiments demonstrate that DistMatch outperforms existing sequential CP methods.

Abstract:
Neural collapse (NC) describes the structured geometry that emerges in the features and weights of trained classifiers. Recent theory suggests NC can be suboptimal in deep architectures, attributing this to an explicit lowrank bias from L2 regularization. We study the deep unconstrained feature model (UFM)—equivalent to a deep linear network with orthogonal inputs—trained without regularization, to isolate how gradient descent and depth alone shape NC. We show that depth induces an implicit low-rank bias: low-rank matrices propagate norm more efficiently through successive multiplications, promoting low-rank alternatives to NC. These alternatives, we argue, correspond to softmax codes: max-margin solutions previously found in width-bottlenecked networks. Analyzing training dynamics under spectral initialization, we identify an early-time repulsion among singular values that drives low-rank emergence, and characterize how depth shrinks NC's basin of attraction. Finally, we show that some effects act in the opposite direction: for randomly initialized networks, increasing width biases training toward higher-rank solutions. Our results provide the first asymptotic and dynamic characterization of implicit bias in deep UFMs trained with unregularized multiclass cross-entropy.

Abstract:
ChatGPT was launched on November 30, 2022; the r/ChatGPT subreddit was created just one day later. Since then, chatbotbased AI products have gone from niche proofs-of-concept to widely-used household names. However, the ways in which adoption has developed, especially among non-experts, remains poorly understood. In this paper, we propose a principled framework for using social media as a data source for understanding the societal impact of widely-adopted consumer AI products, as well as a general approach to monitoring for societally-impactful trends in real time. We apply our framework to conduct what is, to the best of our knowledge, the first longitudinal study of r/ChatGPT. We find that, overall, r/ChatGPT posts over time illustrate the normalization of ChatGPT as an everyday consumer product rather than an exceptional, novel technology. However, our retrospective analysis also finds that posts about using ChatGPT for mental health support, and posts about developing emotional attachments to ChatGPT, both rise steadily in frequency immediately after the launch of GPT-4o in May 2024. We show that our real-time method can detect the increase in emotional engagement as early as October 2024—months before OpenAI made any (public) acknowledgment of this impact.

Abstract:
Realtime and accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis technologies are often encumbered by a tradeoff between generation quality and high inference latency, as well as difficulty in capturing precise spatial information from multimodal inputs. To address these challenges, we propose S3Audio, a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. S3Audio mainly makes the following contributions: 1) We introduce a causal autoregressive diffusion transformer architecture that enables streaming high-quality spatial audio generation. 2) We design a Spatial Video–Audio Contrastive (SVAC) learning strategy to align the video encoder with the acoustic domain, and further employ a multi-objective online direct preference optimization~(ODPO) scheme, resulting in strong spatial perception and robust multimodal spatial audio synthesis. 3) To alleviate the current scarcity of spatial audio datasets, we also develop an automated annotation pipeline for generating detailed spatial captions. Experimental results demonstrate that S3Audio achieves superior performance in both video-to-spatial and text-to-spatial audio generation tasks. Demos can be found at: \urlhttps://s3audio.github.io

Abstract:
Finetuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain computationally expensive and complex. Direct Preference Optimization (DPO) offers a simpler alternative but has limitations such as inconsistent ranking accuracy, high dependence on expensive GPU resources, and sensitive hyperparameter tuning. We propose the Convex Optimization for Alignment and Preference Learning Algorithm (COALA): a novel lightweight strategy with strong theoretical guarantees. By leveraging the convex optimization reformulation of neural networks, COALA eliminates the need for a reference model and obtains significant reduction in both training time and VRAM consumption, thus enabling efficient training on a single GPU. Experiments across three datasets—including a 23,228-sample synthetic Educational Feedback dataset—and five models (including LLaMA-8B) demonstrate COALA's competitive performance and efficiency in utilizing as little as \approx17.6% of DPO's total TFLOPS. COALA exhibits stable, monotonically increasing rewards and reaches peak margins in significantly less time than traditional methods such as DPO and ORPO. To the best of our knowledge, this is the first time convex optimization has been effectively applied to preference fine-tuning of LLMs.

Abstract:
Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goaldirectedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world toward a goal state. Behaviourally, we evaluate the agent against an optimal policy across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and complex goal structures. We then use probing methods to decode the agent's internal representations of the environment state and its multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map of the environment, preserving approximate task-relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from broader environment structural cues toward information supporting immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.

Abstract:
Time Series Forecasting (TSF) plays a critical role across many domains, yet it is vulnerable to backdoor attacks. However, backdoor defenses tailored to TSF remain underexplored, due to data entanglement and taskformulation shift challenges. To fill this gap, we conduct a systematic evaluation of thirteen representative backdoor defenses across the TSF life cycle and analyze their failure modes. Our results reveal two fundamental issues: (1) data entanglement induces channel-level signal dilution, rendering sample-filtering and trigger-synthesis defenses ineffective at localizing backdoors; and (2) task-formulation shift leads to training-loss degeneration, causing poisoned and clean windows to become indistinguishable at training stages. Based on these findings, we propose a training-time backdoor defense for TSF, termed TimeGuard. Our method adopts channel-wise pool training as the core paradigm and initializes a high-confidence pool using time-aware criteria to mitigate signal dilution. Moreover, we introduce distance-regularized loss selection to progressively expand the reliable pool during training and ease loss degeneration. Extensive experiments across multiple datasets, forecasting architectures, and TSF backdoor attacks demonstrate that TimeGuard substantially improves robustness, boosting \mathrmMAE\_\mathrmP by 1.96× over the leading baseline, while preserving clean performance within 5% \mathrmMAE\_\mathrmC.

Abstract:
Crossmodal 2D–3D gait recognition is impeded by inherent domain discrepancies between 2D silhouette and 3D point cloud distributions. While prior methods align only final embeddings, we propose DiffCrossGait, which enforces trajectory-level alignment by driving both modalities with shared noise in a unified latent diffusion process. By driving both modalities with shared Gaussian noise within a latent space, we enable continuous alignment throughout the generative evolution. We introduce a Tri-Phase Alignment Strategy that exploits varying noise intensities to enforce identity anchoring, dynamics consistency, and cross-modal structural recoverability, thereby constraining both modalities to share denoising dynamics and bottleneck structure, which promotes modality-invariant gait features. Crucially, our framework decouples generative alignment from the discriminative backbone; the diffusion mechanism serves exclusively as a training objective, ensuring high inference efficiency by eliminating the computational overhead of iterative denoising. Extensive experiments on the SUSTech1K and FreeGait benchmarks demonstrate that DiffCrossGait achieves state-of-the-art performance.

Abstract:
Contrastive Representation Learning (CRL) has achieved strong empirical success in multiple machine learning disciplines, yet its theoretical sample complexity remains poorly understood. Existing analyses usually assume that input tuples are identically and independently distributed, an assumption violated in most practical settings where contrastive tuples are constructed from a finite pool of labeled data, inducing dependencies among tuples. While one recent work analyzed this learning setting using UStatistics to estimate the population risk, the techniques used therein require the risk of each class to concentrate uniformly, making excess risk bounds scale in the order of \rho_\min^-1/2 where \rho_\min denotes the probability of the rarest class. Such a dependency can be overly pessimistic in the extreme multiclass settings where there are many tail classes which contribute minimally to the overall population risk. Our contributions are two-fold. Firstly, we improve upon the previous work and prove a bound with a sample complexity of the same order as the number of classes R, regardless of the distribution over classes. Furthermore, we formulate a different estimator that captures the concentration of the risk across classes, enabling sharper bounds in extreme multi-class learning scenarios, especially where class distributions are long-tailed. Under mild assumptions on the class distributions, the resulting sample complexity is \mathcalO(k) where k is the number of samples per tuple.

Abstract:
Sampling from a highdimensional probability distribution is a fundamental algorithmic task arising in wide-ranging applications across multiple disciplines, including scientific computing, computational statistics and machine learning. Langevin Monte Carlo (LMC) algorithms are among the most widely used sampling methods in high dimensional settings. This paper introduces a novel higher-order and Hessian-free LMC sampling algorithm based on an efficient stochastic Runge-Kutta method of strong order 1.5 for the overdamped Langevin dynamics. In contrast to the existing Runge-Kutta type LMC (Li et al.,2019) involved with three gradient evaluations, the newly proposed algorithm is computationally cheaper and requires only two gradient evaluations for one iteration. Under certain log-smooth conditions, non-asymptotic error bounds of the proposed algorithms are analyzed in \mathcalW_2-distance. In particular, a uniform-in-time convergence rate of order O(d ^\frac32 h^\frac32) is derived in a non-log-concave setting, matching the convergence rate proved in the aforementioned work but under the log-concavity condition. Numerical experiments are finally presented to demonstrate the effectiveness of the new sampling algorithm.

Abstract:
Diffusion models are increasingly used as powerful conditional generators, yet real deployments often involve multiple target distributions arising from different tasks, e.g., diverse prompt domains in textto-image generation, or multiple environments in robotics with diffusion policies. This naturally leads to a multi-objective learning (MOL) problem. A key challenge is that achieving good Pareto trade-offs can require a generalist model class with substantially larger capacity than what suffices for solving any individual task, thereby increasing statistical cost since sample complexity typically scales with the model complexity. To reconcile this, we develop a principled MOL framework for diffusion models with limited data: a semi-supervised regime where paired (labeled) samples are scarce, but (unlabeled) condition data are abundant. We propose a two-stage training procedure that first fits lightweight specialist models from limited paired data, and then distills them into a generalist model by generating pseudo-samples. We establish generalization bounds showing that the required number of paired samples only depends on the complexity of the specialist model classes. We further extend the theory to diffusion policies for sequential decision making to account for distribution shift in on-policy rollouts. Extensive experiments on robotic control tasks are conducted to verify our theoretical results.

Abstract:
Causal language models factorize sequence probabilities using only preceding context, leaving future information unexploited during training despite its availability in the training data. This paper introduces Regret Pretraining, a self-supervised framework grounded in the Learning Using Privileged Information (LUPI) paradigm. The framework employs a dual-view architecture in which a single model generates both a causal Student distribution and a future-conditioned Teacher distribution. The training objective augments standard language modeling with a regret loss that minimizes the KL divergence from teacher to student, transferring future-aware signals to the causal representations. We investigate two teacher configurations on the OLMoE-1B-7B architecture:LocalRegret, which extends attention by one future token, andGlobalRegret, which conditions on bidirectional context with the target position masked. Experiments on nine downstream tasks following 4 billion tokens of training demonstrate that both configurations consistently outperform the baseline. On average,GlobalRegret andLocalRegret achieve 33.9% and 32.2% accuracy respectively, surpassing the baseline's 30.2%. Most notably,GlobalRegret improves BoolQ performance by 18.1 percentage points (61.0% vs 42.9%). The framework introduces no additional parameters and requires only one extra inference-mode forward pass per training step. The source code for this paper is publicly available at https://anonymous.4open.science/r/ICML13655/ to facilitate reproducibility.

Abstract:
Jailbreak prompts can trigger harmful completions on aligned LLMs, In accordance, safety steering has been proposed: test-time activation interventions that steer jailbreak activations to trig- ger refusal while preserving benign utility. How- ever, existing steering methods are fundamentally supervised and tied to a static, limited training set, whereas real jailbreaks evolve and are often out-of-distributed from the training set, leading to failures on unseen attacks. In this paper, we tackle the failure on unseen jail- breaks problem, base on unsupervised latent di- rection discovery. We propose a bi-level adver- sarial training framework for zero-shot jailbreak defense. In the inner step, we simulate diverse jail- broken activations by extrapolating from refusal- state harmful-request activations via unsupervised latent direction discovery, which expands the cov- erage of real jailbreak activation subspaces. In the outer step, we train a potential-induced steering field to push these adversarial jailbroken states into refusal regions while keeping benign un- changed. Across three LLMs and six classical jailbreak families, our method achieves strong de- fense with attack success rates mostly below 5%, and rising subspace coverage throughout training helps explain the improved generalization.

Abstract:
In existing works on population dynamics inference, there is a focus on flows arising from vector fields that are the gradients of scalar potentials. Among all admissible flows that are compatible with the population dynamic, gradient flows are optimal in a specific sense: they minimize kinetic energy. The selection of fields based on different criteria corresponds to a gauge freedom when determining population dynamics, which we leverage in this work. We propose NonGradient Inference Flows (NGIF), an algorithm to infer non-gradient population dynamics using a weak formulation of the continuity equation. This allows us to parameterize with general vector fields as well as choose other selection criteria beyond minimal energy. We demonstrate on a variety of low- and high-dimensional physics problems that this more general approach improves distributional accuracy over gradient-restricted baselines and better captures non-potential transport.

Abstract:
Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multistep reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. A LLM Judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 64% on FHIR-AgentBench using a smaller and cheaper Qwen 2.5-7B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.

Abstract:
Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the KeyValue (KV) cache grows linearly with the extended dialogue history, causing the model’s memory footprint to quickly exceed device limits. While recent KV cache compression methods attempt to reduce memory usage, most apply cache eviction after processing the entire context, incurring unbounded peak memory usage. Additionally, query-dependent eviction narrows the cache semantics to a single query, leading to failure cases in multi-turn conversations. In this paper, we introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and performs episode-specific KV cache eviction. Across three LongConvQA benchmarks (LongMemEval, Realtalk, and LoCoMo), EpiCache improves accuracy by up to 30%, achieves near-full-cache accuracy under 4–6× compression, and reduces latency and peak memory by up to 2.4× and 3.7×, respectively.

Abstract:
Urgently needed generalizable robot object interaction and manipulation requires highquality Cross-Category object perception. As a pioneer of this area, Generalizable and Actionable Parts (GAParts) understanding has attracted increasing attention from relevant researchers. However, most recent works either have insufficient design regarding the symmetry issue or require rich symmetry annotation, which severely impedes precise GAPart pose estimation in data-lacking scenarios. In this paper, we propose SAFAG, a novel Symmetry Annotation-Free framework for Generalizable and Actionable Parts Pose Estimation. Specifically, we suggest a stepwise refinement two-stage framework for candidate-to-final quaternion regression, and tackle the symmetry prediction as a probability distribution problem with self-supervised learning strategy. The experimental results demonstrate the superior performance and robustness of our SAFAG. We believe that our work has the enormous potential to be applied in many areas of embodied AI system.

Abstract:
Multimodal large language models (MLLMs) inevitably memorize sensitive crossmodal information during pretraining, making post-deployment unlearning crucial for safety. Existing methods often evaluate unlearning based on output deviations, neglecting generation quality, which can lead to hallucinations or rigid refusals, compromising usability and safety. We propose Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models (ASRU), a novel, controllable multimodal unlearning framework that considers generation quality as a key evaluation metric. ASRU first constructs an initial model exhibiting refusal-style responses via activation redirection. It then reinforces the refusal boundary using a customized reward function, effectively unlearning target knowledge while retaining non-target knowledge. Experimental results show that ASRU preserves model utility while markedly enhancing unlearning (+24.61%) and generation quality (5.8×), achieving strong generalization and an efficient forget–retain trade-off with minimal retained supervision.

Abstract:
Predicting model performance at larger scales enables the design of training strategies and architectures tailored to specific performance targets. Empirical scaling law research identifies functional forms to aid this prediction task. These describe the relationship between loss and compute using a losscompute frontier defined by learning curves. Due to the empirical nature of this approach, the computational burden is substantial, making strategic resource allocation essential -- yet it remains surprisingly underexplored. In this work, we address this shortcoming by exploring the suitability of Successive Halving (SH) and SH combined with parametric and non-parametric surrogate models. In addition to enabling a more systematic allocation of a given compute budget, our findings show that SH paired with surrogate models yields a set of learning curves that includes one with a lower loss-compute value than what naive uniform allocation or an SH-only approach can obtain. Our experiments demonstrate mean relative improvements of up to 2.84% and 5.47% on real-world and synthetic learning curve datasets. This strategic resource allocation enables us to obtain accurate scaling laws at significantly reduced computational costs, saving up to 98.7% over the traditional exhaustive approach.

Abstract:
We study the multitask linear regression problem with contaminated tasks. We consider a situation where the unknown parameters of each task are close in the \ell_2-norm, but a certain proportion of tasks are outliers. In the presence of outliers, existing works develop theory under the assumption that the empirical second moment (normalized Gram matrix) of each task has a minimum eigenvalue of order \Omega(1). However, this assumption is violated in many cases, and we propose a novel loss function that operates efficiently under a strictly relaxed assumption. Under this assumption, we obtain an optimal Mean Squared Error (MSE) bound, and even when the assumption is violated, we achieve a favorable rate of the MSE bound. Hence, our methodology adapts to the degree of task similarity and the proportion of outliers, both of which are unknown (adaptivity and robustness), and also enjoys safety against assumption violation.

Abstract:
Outof-distribution (OOD) detection has emerged as a popular technique to enhance the reliability of machine learning models by identifying unexpected inputs from unknown classes. Recent progress in pre-trained vision–language models (VLMs) has enabled zero-shot OOD detection without access to in-distribution (ID) training data; in this setting, existing methods commonly treat text embeddings of class names as class prototypes. In this paper, we challenge this widely adopted “text-as-prototype” paradigm by theoretically showing that off-the-shelf textual prototypes are generally misaligned with the optimal visual prototypes, yielding an intrinsic modality gap that cannot be eliminated by prompt engineering alone. To mitigate this gap under the post-hoc constraint, this paper presents an online pseudo-supervised framework that directly learns class prototypes in the visual feature space using unlabeled test-time data streams and soft predictions from the pre-trained VLMs. We provide theoretical guarantees for the convergence of the online optimization procedure. Extensive experiments empirically manifest that our method achieves a new state of the art across a variety of OOD detection setups.

Abstract:
Singlecell RNA sequencing provides insights into gene expression at single-cell resolution, yet inferring temporal processes from these static snapshot measurements remains a fundamental challenge. Current approaches utilizing neural differential equations and flows are sensitive to overfitting and lack careful considerations of biological variability. In this work, we propose a generative framework that models population trends using a latent heteroscedastic Gaussian process (GP) approximated by Hilbert space methods. To address the absence of genuine cell trajectories, we leverage an optimal transport (OT) objective that aligns generated and observed population distributions. Our method explicitly captures biological heterogeneity by incorporating cell-specific latent time and cell type conditioning to disentangle temporal asynchrony and trajectories to different cell types. We demonstrate state-of-the-art performance on complex interpolation and extrapolation benchmarks and introduce a novel gradient-based strategy for inferring perturbation trajectories.

Abstract:
Reliable microbiomebased diagnosis is critical for precision medicine at scale in inflammatory diseases, yet current post-training pipelines in LLMs often overlook the interaction structure that governs microbial ecosystems. In inflammatory bowel disease (IBD), disease signals arise not only from species-level abundance shifts but also from latent microbe–microbe cross-talk. We propose iLoRA, a parameter-efficient Bayesian LoRA framework that infers latent interaction graphs from microbiome inputs and integrates them into adaptation, enabling joint clinical prediction and interaction discovery. Unlike correlation-based post hoc analysis, iLoRA models microbial interactions as latent variables learned end-to-end, yielding uncertainty-aware estimates of cross-talk. We evaluate iLoRA on (i) interactive question answering with human-annotated interaction graphs to quantify structural recovery and (ii) gut microbiome cohorts for IBD diagnosis. Across both domains, iLoRA consistently improves accuracy over strong LoRA baselines while producing interpretable interaction graphs consistent with annotated relations and conventional microbiome association networks.

Abstract:
Incorporating code into training corpora has become a widely acknowledged practice in the development of modern foundation language models (LMs). Compared with a general Internet corpus, code offers highquality, well-structured signals that substantially augment the coding proficiency of models. Beyond programming skills, prior research has suggested that code data may also contribute to non-coding capabilities. Nevertheless, through a series of rigorous controlled experiments, we demonstrate that the influence of code on other domains, particularly reasoning, remains limited. Our principal findings are as follows: (1) Code corpus yields substantial gains in programming-related abilities but only competed with knowledge-intensive tasks. (2) We identify a core subset that functions as cognitive scaffolding for mathematical reasoning, especially for complex problem-solving scenarios. (3) Formal reasoning provides more pronounced improvements in challenging mathematical reasoning tasks, while natural language–based reasoning proves more effective for simpler reasoning problems. Finally, by probing the internal mechanisms of LMs, we reveal how training data modulates routing patterns, thereby shaping emergent model behavior. As a central driver of model capability, our findings disentangle domain-specific data into finer-grained, cross-domain ability dimensions and underscore promising directions for future data optimization.

Abstract:
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in reasoning and generation tasks and are increasingly deployed in realworld applications. However, their explicit chain-of-thought (CoT) mechanism introduces new security risks, making them particularly vulnerable to jailbreak attacks. Existing approaches often rely on static CoT templates to elicit harmful outputs, but such fixed designs suffer from limited diversity, adaptability, and effectiveness. To overcome these limitations, we propose an adaptive evolutionary CoT jailbreak framework, called AE-CoT. Specifically, the method first rewrites harmful goals into mild prompts with teacher role-play and decomposes them into semantically coherent reasoning fragments to construct a pool of CoT jailbreak candidates. Then, within a structured representation space, we perform multi-generation evolutionary search, where candidate diversity is expanded through fragment-level crossover and a mutation strategy with an adaptive mutation-rate control mechanism. An independent scoring model provides graded harmfulness evaluations, and high-scoring candidates are further enhanced with a harmful CoT template to induce more destructive generations. Extensive experiments across multiple models and datasets demonstrate the effectiveness of the proposed AE-CoT, consistently outperforming state-of-the-art jailbreak methods.

Abstract:
Neural networks (NNs) trained under different hyperparameters can fall into distinct training ``regimes'', with models in the same regime showing homogeneous properties and models across regimes differing qualitatively. In this paper, we analyze multiregime patterns in scientific machine learning (SciML) models by characterizing these regimes and the transitions between them. We show how different regimes affect trainability and generalization, and we demonstrate that loss-landscape analysis enables regime-based diagnostics to understand, evaluate, and improve SciML model training. Our analysis yields three key insights: (1) compared with computer vision (CV) tasks, SciML models exhibit significantly more pathological loss landscapes; (2) optimization methods are regime-specific -- different optimization strategies help in different regimes, but none is uniformly effective; and (3) SciML models exhibit fine-grained failure modes that challenge conventional interpretations of standard loss-landscape metrics. Using this study, we aim to unify our understanding of seemingly different failure modes across SciML tasks and obtain task-oblivious insights and methodologies for addressing these failures. We validate these findings across widely used SciML models, including physics-informed neural networks (PINNs), Fourier neural operators (FNOs), and Neural Ordinary Differential Equations (NeuralODEs), on benchmarks spanning representative ordinary and partial differential equations.

Abstract:
Crosssubject generalization in biomedical time-series (BTS) refers to training on data from some subjects and testing on unseen subjects. The key challenge is to suppress subject-specific variability in BTS representations. Most existing methods implicitly suppress the variability through model building or subject adversarial learning, but rarely model it explicitly. We introduce spectral drift as a new perspective to characterize subject-specific variability. Specifically, BTS signals under the same label often share consistent oscillatory structure, yet exhibit subject-dependent magnitude or phase shifts in specific frequency components, which we interpret as subject-specific variability. Building on this insight, we propose BioFormer. At its core is a Frequency-Band Alignment Module (FBAM) that generates band-wise modulation factors from the spectral distribution and adaptively adjusts amplitude and phase to align spectral structure, thereby mitigating variability. We further pair FBAM with Sample-Conditional Layer Normalization, which infers normalization parameters from intrinsic signal statistics rather than subject identity, stabilizing cross-subject representations. Extensive experiments on six datasets demonstrate that BioFormer outperforms 12 baselines, yielding absolute F1-score improvements of 6%.

Abstract:
Video diffusion models can generate visually stunning content, yet frequently produce motion that violates physical laws, objects accelerate implausibly or vanish midtrajectory. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising, motion dynamics are 8.5× more sensitive to phase corruption than magnitude, yet the refinement process progressively destroys this critical component. Building on this insight, we propose PhaseLock, a training-free framework that locks motion dynamics to fast inference priors. Rather than requiring 50 steps to establish physics, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. This decouples physical consistency from visual refinement, ensuring the final output remains grounded in valid trajectories. PhaseLock achieves strong physical consistency with negligible overhead (1.06× time, 1.02× memory), eliminating the need for expensive external guidance methods (～5× time).

Abstract:
Generative priors in Image SuperResolution (SR) often compromise faithful restoration, we attribute this limitation to a fundamental spectral misalignment between isotropic objectives and the intrinsic natural image manifold. While Direct Preference Optimization offers a path to alignment, its reliance on spectrally flat Gaussian noise fails to distinguish authentic high-frequency details from hallucinations. To bridge this geometric gap, we propose ASASR, a theoretically grounded framework that recasts the generative flow into a Sobolev-induced Riemannian geometry by explicitly coloring the noise transition kernel to mirror natural spectral decay. Driving this geometric alignment, we integrate a parametric adversary grounded in the Riesz Representation Theorem, which synthesizes targeted negative samples equivalent to worst-case Sobolev gradients to direct optimization along the tangent space of plausible structural failures. Extensive evaluations demonstrate that ASASR outperforms leading generative baselines, particularly in preserving spectral consistency and structural fidelity, offering a robust solution that effectively mitigates artifacts.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question answering (VQA) tasks. However, they remain brittle on mechanical engineering drawings, where high annotation density and weak domain knowledge, compounded by unreliable spatial relation reasoning under strict projection rules and geometric constraints, make decisive cues easy to miss and frequently lead to wrong answers. To bridge this gap, we introduce the first comprehensive mechanical drawing understading dataset MechVQA created through an semiautomated construction and quality-control pipeline. MechVQA contains 3.3k high-density pictures with 21K question–answer pairs, spanning 10 different fine-grained tasks across three capability levels: Recognition, Reasoning, and Judging, providing a testbed to evaluate and improve MLLMs understanding on real world mechanical drawings. On top of MechVQA, we then develop the MechVL model through a multi-stage training paradigm, building a strong domain specialized baseline. Extensive experimental results demonstrate that MechVL outperforms strong closed-source MLLMs by 6% on MechVQA total score, significantly enhancing mechanical drawing understanding ability and providing a reusable foundation for deploying MLLMs in mechanical design and inspection scenarios.

Abstract:
Selection bias is pervasive in observational studies. For example, large scale biobanks data can exhibit ``healthy volunteer bias'' when respondents are healthier and of higher socioeconomic status than the population they are meant to represent. Recovering causal effects from such sub-population is an important problem in causal inference, as estimating average treatment effects (ATE) from selected populations can result in a severely biased estimate of the ATE from the whole population. In this paper, we investigate the identifiability of the ATE under selection bias. We providenecessary and sufficient conditionsfor ATE identifiability, leveraging weak assumptions on probability classes to characterize propensity score and selection probability. Compared to previous works, our results extend existing graphical identifiability criteria and offer a more comprehensive understanding of causal effect identificationwith strictly weaker conditionsin the presence of selection bias.

Abstract:
Speculative decoding can significantly accelerate LLM inference, especially given that its cloudedge collaborative deployment offers cloud workload offloading, offline robustness, and privacy enhancement. However, existing collaborative inference frameworks with speculative decoding are constrained by (i) sequential token generation and communication with low resource utilization, and (ii) inflexible cloud non-autoregressive verification (NAV) triggering that induces premature verification or costly rollbacks. In this paper, we propose PipeSD, an efficient cloud-edge collaborative pipeline inference framework with speculative decoding. PipeSD overlaps token generation and communication by a token-batch pipeline scheduling mechanism optimized by dynamic programming, and improves verification flexibility through a dual-threshold NAV triggering mechanism with a lightweight Bayesian optimization autotuner. We implement PipeSD using llama-cpp-python, PyTorch, and FastAPI, and evaluate it on a real-world cloud-edge testbed with two draft-target model pairs across four scenarios. Results show that PipeSD consistently outperforms state-of-the-art baselines, achieving 1.16×-2.16× speedup and reducing energy consumption by 14.3%-25.3%. Our code is available at https://anonymous.4open.science/r/PipeSD.

Abstract:
Existing code reasoning methods primarily supervise final code outputs, ignoring intermediate states, often leading to reward hacking where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit intermediate executionstate supervision. By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment at two levels: inter-trajectory, comparing alternative execution paths, and intra-trajectory, rewarding intermediate accuracy based on its impact on downstream correctness. Extensive experiments demonstrate that StepCodeReasoner achieves SOTA performance in code reasoning. In particular, our 7B model achieves 91.1% on CRUXEval and 86.5% on LiveCodeBench, outperforming the CodeReasoner-7B baseline (86.0% and 77.7%) and GPT-4o (85.6% and 75.1%). Furthermore, on the execution-trace benchmark REval, our model scores 82.9%, outperforming baseline CodeReasoner-7B (72.3%), its 14B counterpart (81.1%), and GPT-4o (77.3%). Additionally, our approach also improves code generation performance, demonstrating that explicit execution modeling enhances both code reasoning and code generation.

Abstract:
As LLMs are globally deployed, aligning their cultural value orientations is critical for safety and user engagement. However, existing benchmarks face the ConstructComposition-Context (C^3) challenge: relying on discriminative, multiple-choice formats that probe value knowledge rather than true orientations, overlook subcultural heterogeneity, and mismatch with real-world open-ended generation. We introduce DOVE, a distributional evaluation framework that directly compares human-written text distributions with LLM-generated outputs. DOVE utilizes a rate-distortion variational optimization objective to construct a compact value-codebook from 14K human documents, mapping text into a structured value space to filter semantic noise. Alignment is measured using unbalanced optimal transport, capturing intra-cultural distributional structures and sub-group diversity. Experiments across 12 LLMs show that DOVE achieves superior predictive validity, attaining a 31.56% correlation with downstream tasks, while maintaining high reliability with as few as 500 samples per culture.

Abstract:
Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the keyvalue (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generations. We first design the Golden Eviction algorithm, which identifies the optimal eviction KV pairs at each step using future attention scores. These traces and the scores at each step are then distilled via supervised training with a Pairwise Ranking Loss. Furthermore, we formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens. Experiments on AIME2024 and AIME2025 benchmarks of three reasoning models demonstrate that ForesightKV consistently outperforms prior methods under only half the cache budget, while benefiting synergistically from both supervised and reinforcement learning approaches.

Abstract:
KolmogorovArnold Networks (KANs) have recently emerged as a structured alternative to standard MLPs, yet a principled theory for their training dynamics, generalization, and privacy properties remains limited. In this paper, we analyze gradient descent (GD) for training two-layer KANs and derive general bounds that characterize their training dynamics, generalization, and utility under differential privacy (DP). As a concrete instantiation, we specialize our analysis to logistic loss under an NTK-separable assumption, where we show that polylogarithmic network width suffices for GD to achieve an optimization rate of order 1/T and a generalization rate of order 1/n, with T denoting the number of GD iterations and n the sample size. In the private setting, we characterize the noise required for (\epsilon,\delta)-DP and obtain a utility bound of order \sqrtd/(n\epsilon) (with d the input dimension), matching the classical lower bound for general convex Lipschitz problems. Our results imply that polylogarithmic width is not only sufficient but also necessary under differential privacy, revealing a qualitative gap between non-private (sufficiency only) and private (necessity also emerges) training regimes. Experiments further illustrate how these theoretical insights can guide practical choices, including network width selection and early stopping.

Abstract:
Evaluating large language models (LLMs) on openended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates—more data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.

Abstract:
Composed image retrieval uses a reference image plus a naturallanguage edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the extended DeepFashion-MM benchmark, Pix2Key improves Recall@10 up to 3.2 points, and adding V-Dict-AE yields an additional 2.3-point gain while improving intent consistency and maintaining high list diversity.

Abstract:
Large language models can memorize information that must be removedranging from copyright-sensitive content (e.g., book chapters) to personally identifiable information (e.g., income)--to ensure responsible and compliant behavior. Unlearning has emerged as an efficient alternative to full retraining, aiming to remove specific knowledge. However, users may still expect model to leverage the removed information when it is re-introduced in the prompt. Existing evaluations of unlearning methods focus on (1) the extent of forgetting of the target knowledge (forget set) and (2) performance preservation on the retain set (i.e., utility), but overlook this critical usability dimension. Through a systematic evaluation of six state-of-the-art unlearning methods, we show that they consistently degrade such \emphcontextual utility--the model's ability to use forgotten knowledge when it is provided in context. To address this, we augment unlearning objectives with a plug-in term that explicitly preserves contextual utility. Extensive experiments demonstrate that our approach restores contextual utility to near original levels while still maintaining effective forgetting and retain-set utility.

Abstract:
LLMbased agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought directly shapes subsequent actions. Small deviations in these thoughts can therefore propagate into unsafe behaviors, yet existing guardrails typically operate only on final outputs or require intrusive model modifications. We introduce Thought-Aligner, a lightweight plug-in safety model that performs causal correction on unsafe thoughts before action execution, without altering the underlying agent. The corrected thoughts are fed back into the agent, steering its decision process and tool use toward safer trajectories. Because it operates solely at the thought level, Thought-Aligner is model-agnostic and can be integrated into diverse agent frameworks. We train Thought-Aligner via two-stage contrastive learning on paired safe and unsafe thoughts generated across ten risk scenarios. Experiments on two agent-safety benchmarks with six LLMs show that Thought-Aligner increases behavioral safety from about 50% without protection to around 90% on average, exceeding state-of-the-art guardrails by roughly 23%, while also improving helpfulness by about 5%. The method incurs low per-step latency and minimal overhead, enabling scalable and practical deployment.

Abstract:
Outlier detection (OD) is widely used in practice; but its effective deployment on new tasks is hindered by lack of labeled outliers, which makes algorithm and hyperparameter selection notoriously hard. Foundation models (FMs) have transformed ML, and OD is no exception: Shen et al. (2025) introduced FOMO0D, the first FM for OD, achieving remarkable performance against numerous baselines. This work introduces OUTFORMER, which advances FOMO-0D with (1) a mixture of synthetic priors and (2) self-evolving curriculum training. OUTFORMER is pretrained solely on synthetic labeled datasets and infers test labels of a new task by using its training data as in-context input. Inference is fast and zero-shot, requiring merely forward pass and no labeled outliers. Thanks to in-context learning, it requires zero additional work—no OD model training or bespoke model selection—enabling truly plug-and-play deployment. OUTFORMER achieves state-of-the-art performance on the prominent ADBench, as well as two new large-scale OD benchmarks that we introduce, comprising over 1,500 datasets, while maintaining speedy inference.

Abstract:
MultiHop Fact Verification (MHFV) necessitates complex reasoning across disparate evidence, posing significant challenges for Large Language Models (LLMs) which often suffer from hallucinations and fractured logical chains. Existing methods, while improving transparency via Chain-of-Thought (CoT), lack explicit modeling of the causal dependencies between evidence and claims. In this work, we introduce a novel framework that grounds reasoning in a Structural Causal Model (SCM), treating verification as a constructive causal inference process. We empirically identify an "inverted U-shaped" correlation between reasoning chain length and accuracy, revealing that excessive structural complexity degrades performance. To address this, we propose a Rule-based Reinforcement Learning strategy using Group Relative Policy Optimization (GRPO). This approach dynamically optimizes the trade-off between structural depth and conciseness. Extensive experiments on HoVer and EX-FEVER demonstrate that our SCM-GRPO framework significantly outperforms state-of-the-art baselines, offering a reliable and interpretable solution for complex fact verification.

Abstract:
Robust classification in noisy environments remains a fundamental challenge in machine learning. Standard approaches typically treat signal enhancement and classification as separate, sequential stages: first enhancing the signal and then applying a classifier. This approach fails to leverage the semantic information in the classifier's output during denoising. In this work, we propose a general, domainagnostic framework that integrates two interacting diffusion models: one operating on the input signal and the other on the classifier's output logits, without requiring any retraining or fine-tuning of the classifier. This coupled formulation enables mutual guidance, where the enhancing signal refines the class estimation and, conversely, the evolving class logits guide the signal reconstruction towards discriminative regions of the manifold. We introduce three strategies to effectively model the joint distribution of the input and the logit. We evaluated our joint enhancement method for image classification and automatic speech recognition. The proposed framework surpasses traditional sequential enhancement baselines, delivering robust and flexible improvements in classification accuracy under diverse noise conditions.

Abstract:
The scarcity of highquality training data presents a fundamental bottleneck to scaling machine learning models. This challenge is particularly acute in recommendation systems, where extreme sparsity in user interactions leads to rugged optimization landscapes and poor generalization. We propose the Recursive Self-Improving Recommendation (RSIR) framework, a paradigm in which a model bootstraps its own performance without reliance on external data or teacher models. RSIR operates in a closed loop: the current model generates plausible user interaction sequences, a fidelity-based quality control mechanism filters them for consistency with user’s approximate preference manifold, and a successor model is augmented on the enriched dataset. Our theoretical analysis shows that RSIR acts as a data-driven implicit regularizer, smoothing the optimization landscape and guiding models toward more robust solutions. Empirically, RSIR yields consistent, cumulative gains across multiple benchmarks and architectures. Notably, even smaller models benefit, and weak models can generate effective training curricula for stronger ones. These results demonstrate that recursive self-improvement is a general, model-agnostic approach to overcoming data sparsity, suggesting a scalable path forward for recommender systems and beyond. Our anonymized code is available at https://anonymous.4open.science/r/RSIR-7C5B.

Abstract:
Spectral gradient methods, such as the Muon optimizer, modify gradient updates by preserving directional information while discarding scale, and have shown strong empirical performance in deep learning. We investigate the mechanisms underlying these gains through a dynamical analysis of a nonlinear phase retrieval model with anisotropic Gaussian inputs, equivalent to training a twolayer neural network with quadratic activation and fixed second-layer weights. Focusing on a spiked covariance setting where the dominant variance direction is orthogonal to the signal, we show that gradient descent (GD) suffers from a variance-induced misalignment: during the early escape phase, high-variance but uninformative directions are multiplicatively amplified, degrading alignment with the true signal under strong anisotropy. In contrast, spectral gradient descent (SpecGD) removes this anisotropic magnification effect, leading to stable alignment and accelerated noise contraction. Numerical experiments confirm the theory and show that these phenomena persist under more general anisotropic covariances.

Abstract:
Decodingbased regression, which reformulates regression as a sequence generation task, has emerged as a promising paradigm of applying large language models for numerical prediction. However, its progress is hindered by the misalignment between discrete token-level objectives (e.g., cross-entropy) and continuous numerical values. Existing approaches relying on token-level constraints often fail to capture the global magnitude of the target value, limiting their precision and generalization. In this paper, we propose to unlock the potential of decoding-based regression via reinforcement learning. We formulate the generation process as a Markov decision process, utilizing sequence-level rewards to enforce global numerical coherence. Under this framework, we present GenRe^2, which combines policy gradient methods to preserve error magnitudes with dense expert supervision, resolving the temporal credit assignment challenge. Extensive experiments across tabular regression, code metric prediction and generative reward modeling demonstrate that GenRe^2 consistently outperforms traditional baselines, establishing a robust paradigm for general-purpose numerical prediction.

Abstract:
Linear bandits have long been a central topic in online learning, with applications ranging from recommendation systems to adaptive clinical trials. Their general learnability has been established when the objective is to minimise the inner product between a cost parameter and the decision variable. While this is highly general, this reliance on an inner product structure belies the name of \emphlinear bandits, and fails to account for problems such as Optimal Transport. Using the Kantorovich formulation of Optimal Transport as an example, this article shows that an inner product structure is \emphnot necessary to achieve efficient learning in linear bandits. We propose a refinement of the classical OFUL algorithm that operates by embedding the action set into a Hilbertian subspace, where confidence sets can be built via leastsquares estimation. Actions are then constrained to this subspace by penalising optimism. The analysis is completed by leveraging convergence results from penalised (entropic) transport to the Kantorovich problem. Up to this approximation term, the resulting algorithm achieves the same trajectorial regret upper bounds as the OFUL algorithm, which we turn into worst-case regret using functional regression techniques. Its regret interpolates between \tilde\mathcal O(\sqrtT) and \mathcal O(T), depending on the regularity of the cost function, and recovers the parametric rate \tilde\mathcal O(\sqrtdT) in finite-dimensional settings.

Abstract:
LLMbased agents increasingly execute real-world workflows via tools and memory. Granting LLMs such powers enables ill-intended adversaries to likewise use these agents to carry out complex misuse scenarios. Existing agent-misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduceSTING(Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion donotconsistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.

Abstract:
VisionLanguage-Action (VLA) models have demonstrated remarkable capabilities and generalization in embodied manipulation. However, their decision-making relies on a fast, instinctive process that lacks deliberation. This strategy often leads to suboptimal or catastrophic actions when facing complex or ambiguous scenarios that require greater consideration. In this paper, we introduce VLA-ATTC, a framework that endows VLA models with adaptive test-time compute (TTC). VLA-ATTC employs an uncertainty-based ``cognitive clutch'' to dynamically transition from reflexive execution to a TTC deliberation phase when necessary. During TTC phase, a novel Relative Action Critic (RAC) model identifies the optimal action from generated candidates via pairwise comparisons. This relative mechanism replaces unstable absolute value estimation, significantly simplifying the learning objective. Furthermore, we introduce an efficient sampling strategy to amortize computational costs and an automated data pipeline that curates preference pairs without manual annotation. On the LIBERO-LONG benchmark, VLA-ATTC reduces the failure rate of the SOTA model PI0.5 by over 50%.

Abstract:
Deep Research (DR) agents extend Large Language Models (LLMs) beyond parametric knowledge by autonomously retrieving and synthesizing evidence from large web corpora into longform reports, enabling a long-horizon agentic paradigm. However, unlike real-time conversational assistants, DR is computationally expensive and time-consuming, creating an autonomy-interaction dilemma: high autonomy on ambiguous user queries often leads to prolonged execution with unsatisfactory outcomes. To address this, we propose IntentRL, a framework that trains proactive agents to clarify latent user intents before starting long-horizon research. To overcome the scarcity of open-ended research data, we introduce a scalable pipeline that expands a few seed samples into high-quality dialogue turns via a shallow-to-deep intent refinement graph. We further adopt a two-stage reinforcement learning (RL) strategy: Stage I applies RL on offline dialogues to efficiently learn general user-interaction behavior, while Stage II uses the trained agent and a user simulator for online rollouts to strengthen adaptation to diverse user feedback. Extensive experiments show that IntentRL significantly improves both intent hit rate and downstream task performance, outperforming the built-in clarify modules of closed-source DR agents and proactive LLM baselines.

Abstract:
Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR^3), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and rewarddependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR^3 maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.

Abstract:
Continual incorporation of new knowledge is essential for the longterm evolution of large language models (LLMs). Existing approaches typically rely on parameter-update algorithms to mitigate catastrophic forgetting, yet they suffer from fundamental limitations: 1) forgetting is unavoidable as the amount of newly injected knowledge grows; and 2) model updates are often irreversible. As modern LLMs become increasingly expressive, it is natural to question whether large-scale weight updates are necessary for acquiring a small amount of new knowledge. In this work, we propose a principled framework that models autoregressive language generation as a Markov process over tokens, where model memory is represented by a Markov transition matrix. Under this formulation, incorporating new knowledge/tokens corresponds to extending the state space, and preserving existing transitions guarantees retention of previously learned knowledge. We then prove a sample complexity bound for incorporating new tokens via a token-to-dictionary mapping strategy. In particular, for learning the transition behavior of each new token, the required number of samples scales linearly with the number of existing tokens it is mapped to. To realize this mapping, we propose an embedding-tuning algorithm that requires minimal parameter updates and induces zero forgetting. Experimental results further demonstrate the effectiveness of our method and validate our theoretical findings.

Abstract:
Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although it is increasingly adopted, are vibe coding outputs really safe to deploy in production? To answer this question, we propose SUSVIBES, a benchmark consisting of 200 featurerequest software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe-coding, particularly in security-sensitive applications.

Abstract:
Finetuning large language models (LLMs) on resource-constrained clients remains a challenging problem. Recent works have fused low-rank adaptation (LoRA) techniques with federated fine-tuning to mitigate challenges associated with client model sizes and data scarcity. Still, the heterogeneity of resources remains a critical bottleneck: while higher-rank modules generally enhance performance, varying client capabilities constrain LoRA's feasible rank range. Existing approaches attempting to resolve this issue either lack analytical justification or impose additional computational overhead, leaving a wide gap for efficient and theoretically-grounded solutions. To address these challenges, we propose federated sketching LoRA (FSLoRA), which leverages a sketching mechanism to enable clients to selectively update submatrices of global LoRA modules maintained by the server. By adjusting the sketching ratios, which determine the ranks of the submatrices on the clients, FSLoRA flexibly adapts to client-specific communication and computational constraints. We provide a rigorous convergence analysis of FSLoRA that characterizes how the sketching ratios affect the convergence rate. Through extensive experiments, we demonstrate that FSLoRA outperforms baselines and significantly improves training efficiency while preserving stable convergence.

Abstract:
The success of large language models (LLMs) in scientific domains has heightened safety concerns, prompting numerous benchmarks to evaluate their scientific safety. Existing benchmarks often suffer from limited risk coverage and a reliance on subjective evaluation. To address thess problems, we introduce SafeSci, a comprehensive framework for safety evaluation and enhancement in scientific contexts. SafeSci comprises SafeSciBench, a multidisciplinary benchmark with 0.25M samples, and SafeSciTrain, a large-scale dataset containing 1.5M samples for safety enhancement. SafeSciBench distinguishes between safety knowledge and risk to cover extensive scopes and employs objective metrics such as deterministically answerable questions to mitigate evaluation bias. We evaluate 21 advanced LLMs, revealing critical vulnerabilities in current models. Finally, we demonstrate that fine-tuning on SafeSciTrain significantly enhances the safety alignment of models. Our work provides both a diagnostic tool and a practical resource for building safer scientific AI systems.

Abstract:
Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation. While often attributed to data scale, this work argues that the limitation arises from the intrinsic circuit complexity of the architecture. By formalizing spatial understanding as a Group Homomorphism Problem—requiring that latent embeddings preserve the algebraic composition of physical transformations acting on images—a fundamental computational bottleneck is identified. Specifically, for nonsolvable groups (e.g., \mathrmSO(3)), maintaining such structure-preserving embeddings is lower-bounded by the Word Problem, which is \mathsfNC^1-complete. In contrast, constant-depth ViTs with polynomial precision are strictly bounded by the complexity class \mathsfTC^0. Under the standard conjecture \mathsfTC^0 \subsetneq \mathsfNC^1, a complexity boundary emerges: constant-depth architectures lack the logical depth required to capture non-solvable spatial structures. This theoretical gap is empirically validated via the Latent Space Algebra (LSA) benchmark, which reveals a structural collapse in ViT representations as the compositional depth of non-solvable tasks increases.

Abstract:
Visual token compression is widely used to accelerate large visionlanguage models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks can substantially overestimate the robustness of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.

Abstract:
Large Language Models (LLMs) have intensified the need for lowprecision formats that enable efficient, large-scale inference. The Open Compute Project (OCP) Microscaling (MX) standard is attractive due to its favorable hardware efficiency, but its 4-bit variant (MXFP4) lags behind NVIDIA’s NVFP4 in accuracy, limiting adoption. We introduce two software-only techniques, Overflow-Aware Scaling (OAS) and Macro Block Scaling (MBS), that improve MXFP4 quantization fidelity without requiring hardware changes. OAS reduces overall errors by increasing effective dynamic range under power-of-two block scaling, while MBS allocates higher-precision scaling at a coarser granularity to better preserve outliers. Across multiple LLMs and standard downstream benchmarks, OAS and MBS reduce the end-to-end accuracy gap between MXFP4 and NVFP4 from about 10% to below 1% on average, while incurring modest GEMM overhead (6.2% on average). These results re-establish MXFP4 as a practical alternative to NVFP4, enabling near-NVFP4 accuracy while retaining MX’s hardware-efficiency advantages (e.g., 12% relative area savings in tensor cores).

Abstract:
The JointEmbedding Predictive Architecture (JEPA) is often seen as a non-generative alternative to likelihood-based self-supervised learning, emphasizing prediction in representation space rather than reconstruction in observation space. We argue that the resulting separation from probabilistic generative modeling is largely rhetorical rather than structural: the canonical JEPA design–coupled encoders with a context-to-target predictor–mirrors the variational posteriors and learned conditional priors obtained when variational inference is applied to a particular class of coupled latent-variable models, and standard JEPA can be viewed as a deterministic specialization in which regularization is imposed via architectural and training heuristics rather than an explicit likelihood. Building on this view, we derive the Variational JEPA (Var-JEPA), which makes the latent generative structure explicit by optimizing a single Evidence Lower Bound (ELBO). This yields meaningful representations without ad-hoc anti-collapse regularizers and allows principled uncertainty quantification in the latent space. We instantiate the framework for tabular data (Var-T-JEPA) and achieve strong representation learning and downstream performance, consistently improving over T-JEPA while remaining competitive with strong raw-feature baselines.

Abstract:
Large language models (LLMs) achieve remarkable performance through everincreasing parameter counts, but scaling incurs steep computational costs. To better understand LLM scaling, we study representational differences between LLMs and their smaller counterparts, with the goal of replicating the representational qualities of larger models in the smaller models. We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in some language models. Through systematic analyses across multiple Transformer families, we show that small models such as \textttGPT2 and \textttQwen3-0.6B exhibit severe condensation, whereas the larger models such as \textttGPT2-xl and \textttQwen3-32B are more resistant to this phenomenon. Additional observations show that embedding condensation is not reliably mitigated by knowledge distillation from larger models. To fight against it, we formulate a dispersion loss that explicitly encourages embedding dispersion during training. Experiments demonstrate that it mitigates condensation, recovers dispersion patterns seen in larger models, and yields performance gains across 10 benchmarks. We believe this work offers a principled path toward improving smaller Transformers without additional parameters.

Abstract:
Multimodal large language models (MLLMs) are pushing recommender systems (RecSys) toward contentgrounded retrieval and ranking via cross-modal fusion. We find that while cross-modal consensus often mitigates conventional poisoning that manipulates interaction logs or perturbs a single modality, it also introduces a new attack surface where synchronised multimodal poisoning can reliably steer fused representations along stable semantic directions during fine-tuning. To characterise this threat, we formalise cross-modal interactive poisoning and propose VENOMREC, which performs Exposure Alignment to identify high-exposure regions in the joint embedding space and Cross-modal Interactive Perturbation to craft attention-guided coupled token–-patch edits. Experiments on three real-world multimodal datasets demonstrate that VENOMREC consistently outperforms strong baselines, achieving 0.73 mean ER@20 and improving over the strongest baseline by +0.52 absolute ER points on average, while maintaining comparable recommendation utility.

Abstract:
Time series forecasting is important in many fields that require accurate predictions for decisionmaking. Patching techniques, commonly used and effective in time series modeling, help capture temporal dependencies by dividing the data into patches. However, existing patch-based methods fail to dynamically select patches and typically use all patches during the prediction process. In real-world time series, there are often low-quality issues during data collection, such as missing values, distribution shifts, anomalies and white noise, which may cause some patches to contain low-quality information, negatively impacting the prediction results. To address this issue, this study proposes a robust time series forecasting framework called SEER. Firstly, we propose an Augmented Embedding Module, which improves patch-wise representations using a Mixture-of-Experts~(MoE) architecture and obtains series-wise token representations through a channel-adaptive perception mechanism. Secondly, we introduce a Learnable Patch Replacement Module, which enhances forecasting robustness and model accuracy through a two-stage process: 1) a dynamic filtering mechanism eliminates negative patch-wise tokens; 2) a replaced attention module substitutes the identified low-quality patches with global series-wise token, further refining their representations through a causal attention mechanism. Comprehensive experimental results demonstrate the SOTA performance of SEER.

Abstract:
Recent advances in formal theorem proving have focused on Olympiadlevel mathematics, leaving undergraduate domains largely unexplored. Optimization, fundamental to machine learning, operations research, and scientific computing, remains underserved by existing provers. Its reliance on domain-specific formalisms (convexity, optimality conditions, and algorithmic analysis) creates significant distribution shift, making naive domain transfer ineffective. We present OptProver, a trained model that achieves robust transfer from Olympiad to undergraduate optimization. Starting from a strong Olympiad-level prover, our pipeline mitigates distribution shift through two key innovations. First, we employ large-scale optimization-focused data curation via expert iteration. Second, we introduce a specialized preference learning objective that integrates perplexity-weighted optimization with a mechanism to penalize valid but non-progressing proof steps. This not only addresses distribution shifts but also guides the search toward efficient trajectories. To enable rigorous evaluation, we construct a novel benchmark in Lean 4 focused on optimization. On this benchmark, OptProver achieves state-of-the-art Pass@1 and Pass@32 among comparably sized models while maintaining competitive performance on general theorem-proving tasks, demonstrating effective domain transfer without catastrophic forgetting.

Abstract:
OpenVocabulary Aerial Detection (OVAD) and Remote Sensing Visual Grounding (RSVG) have emerged as two key paradigms for aerial scene understanding. However, each paradigm suffers from inherent limitations when operating in isolation: OVAD is restricted to coarse category-level semantics, while RSVG is structurally limited to single-target localization. These limitations prevent existing methods from simultaneously supporting rich semantic understanding and multi-target detection. To address this, we propose OTA-Det, the first unified framework that bridges both paradigms into a cohesive architecture. Specifically, we introduce a task reformulation strategy that unifies task objectives and supervision mechanisms, enabling joint training across datasets from both paradigms with dense supervision signals. Furthermore, we propose a dense semantic alignment strategy that establishes explicit correspondence at multiple granularities, from holistic expressions to individual attributes, enabling fine-grained semantic understanding. To ensure real-time efficiency, OTA-Det builds upon the RT-DETR architecture, extending it from closed-set detection to open-text detection by introducing several high efficient modules, achieving state-of-the-art performance on six benchmarks spanning both OVAD and RSVG tasks while maintaining real-time inference at 34 FPS.

Abstract:
Mental rotation—the ability to compare objects seen from different viewpoints—is a fundamental example of mental simulation and spatial world modeling in humans. Here we propose a mechanistic model of human mental rotation, leveraging recent advances in deep, equivariant, and neurosymbolic learning. Our model consists of three stacked components: (1) an equivariant neural encoder, producing 3D spatial representations of objects from images, (2) a neuro-symbolic object encoder, deriving symbolic objects descriptions from these spatial representations, and (3) a neural decision agent, comparing these symbolic descriptions to prescribe rotation simulations in 3D latent space via a recurrent pathway. Our model design is guided by the abundant experimental literature on mental rotation, which we complemented with experiments in VR where participants could at times manipulate the objects to compare. Our model captures well the performance, response times and behavior of participants in our and others' experiments, and through ablation studies we demonstrate the necessity of each component. Our work adds to a recent collection of deep neural models of human spatial reasoning, further demonstrating the potency of integrating deep, equivariant, and symbolic representations to model the human mind.

Abstract:
Scene graphs provide structured abstractions for scene understanding, yet they often overfit to spurious correlations, severely hindering outof-distribution generalization. To address this limitation, we propose CURVE, a causality-inspired framework that integrates variational uncertainty modeling with uncertainty-guided structural regularization to suppress high-variance, environment-specific relations. Specifically, we apply prototype-conditioned debiasing to disentangle invariant interaction dynamics from environment-dependent variations, promoting a sparse and domain-stable topology. Empirically, we evaluate CURVE in zero-shot transfer and low-data sim-to-real adaptation, verifying its ability to learn domain-stable sparse topologies and provide reliable uncertainty estimates to support risk prediction under distribution shifts.

Abstract:
Continual learning, the ability of a model to adapt to an ongoing sequence of tasks without forgetting earlier ones, is a central goal of artificial intelligence. To better understand its underlying mechanisms, we study the limitations of continual learning in a tractable yet representative setting. Specifically, we analyze onehidden-layer quadratic neural networks trained by gradient descent on a sequence of XOR-cluster datasets with Gaussian noise, where different tasks correspond to clusters with orthogonal means. Our analysis is based on a tight characterization of gradient descent dynamics for the training loss, which yields explicit bounds on the rate of train-time forgetting as functions of the number of iterations, sample size, number of tasks, and hidden-layer width. We then leverage an algorithmic stability framework to bound the generalization gap, leading to corresponding guarantees on test-time forgetting. Together, our results provide the first closed-form guarantees for forgetting in continual learning with neural networks and show how key problem parameters jointly govern forgetting dynamics. Numerical experiments corroborate our theoretical results.

Abstract:
Time series forecasting is essential in various domains. Compared to relying solely on endogenous variables (i.e., target variables), considering exogenous variables (i.e., covariates) provides additional predictive information and often leads to more accurate predictions. However, existing methods for time series forecasting with exogenous variables (TSFX) have the following shortcomings: 1) they do not leverage future exogenous variables, 2) they fail to fully account for the correlation between endogenous and exogenous variables. In this study, to better leverage exogenous variables, especially future exogenous variables, we propose DAG, which utilizes \underlineDual correl\underlineAtion network along both the temporal and channel dimensions for time series forecasting with exo\underlineGenous variables. Specifically, we propose two core components: the Temporal Correlation Module and the Channel Correlation Module. Both modules consist of a correlation discovery submodule and a correlation injection submodule. The former is designed to capture the correlation effects of historical exogenous variables on future exogenous variables and on historical endogenous variables, respectively. The latter injects the discovered correlation relationships into the processes of forecasting future endogenous variables based on historical endogenous variables and future exogenous variables.

Abstract:
Transformerbased large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We empirically validate our theoretical results.

Abstract:
Large Language Models (LLMs) have become critical to modern software development, but their reliance on uncurated webscale datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. This risk materialized in November 2024, when a user suffered a \2,500 financial loss after executing code generated by ChatGPT that contained a live scam phishing URL. To systematically evaluate this risk, we introduce Scam2Prompt, a scalable automated auditing framework that identifies the underlying intent of a scam site and then synthesizes developer-style prompts that mirror this intent, allowing us to test whether an LLM will generate malicious code in response to these prompts. In a large-scale study of four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), we found that Scam2Prompt's developer-style prompts triggered malicious URL generation in 4.24% of cases. To test the persistence of this security risk, we constructed Innoc2Scam-bench, a benchmark of 1,377 prompts that consistently elicited malicious code from all four LLMs. When applied to seven additional production LLMs released in 2025, we found the vulnerability is not only present but severe, with malicious code generation rates ranging from 12.9% to 47.3%. Furthermore, existing safety measures like state-of-the-art guardrails or RAG-based agents proved insufficient to prevent this behavior.

Abstract:
Conventional large language model (LLM) safety alignment relies on a reactive, disjoint loop: attackers exploit a static model, then defenders patch exposed vulnerabilities. This sequential setup leads to attackers overfitting obsolete exploits while defenders perpetually lag behind emerging threats. To address this, we introduce SelfRedTeam, the first fully online self-play multi-agent reinforcement learning (MARL) algorithm that continuously co-evolves attacker and defender for robust safety alignment. A single policy self-plays as both attacker and defender, generating adversarial prompts and defending against them, with a reward model adjudicating outcomes. Each role uses hidden chain-of-thought for strategic planning. Grounded in two-player zero-sum game theory, we establish a theoretical safety guarantee: if the game converges to Nash Equilibrium, the defender produces safe responses against any adversarial input. Empirically, Self-RedTeam generalizes across five models from the Llama and Qwen families, uncovering more diverse attacks (+17.80% SBERT) and improving safety of RLHF-trained models by up to 95% across 14 benchmarks. Our work motivates a shift from reactive patching to proactive co-evolution, enabling LLM safety self-improvement via online self-play MARL.

Abstract:
Vafa et al. recently showed that a transformer fails to acquire an internal Newtonian world model when trained on synthetic planetarymotion data. How can we fix this problem? We find that inductive biases are key to learning the veridical world model: (1)Spatial smoothnessis required for any world model to be learned. However, naive tokenization may disrupt smoothness since two close points in physical space may be far apart in token embedding space without sufficient training or data. We fix this by formulating the prediction problem as regression instead of classification. (2)Spatial stabilitymakes the prediction robust to noise, which is not guaranteed by default, but can be taught via correcting in-context noise perturbations. (3) With both spatial smoothness and stability built in, further imposingtemporal localityinduces a Newtonian world model, while the lack of this knowledge induces a Keplerian world model -- fitting elliptical parameters instead of computing gravitational forces. Our results suggest that even simple general inductive biases are powerful enough to induce correct and specific world models. The inductive biases do not need to know that much about the underlying law to be learned, but without them, it is impossible to learn.

Abstract:
Despite recent Multimodal Large Language Models (MLLMs)' linguistic prowess in medical diagnosis, we find even stateof-the-art MLLMs suffer from a critical perceptual deficit:geometric blindness. This failure to ground outputs in objective geometric constraints leads to plausible yet factually incorrect hallucinations, rooted in training paradigms that prioritize linguistic fluency over geometric fidelity. This paper introduces Med-Scout, a novel framework that “cures” this blindness via Reinforcement Learning (RL) that leverages the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. To rigorously quantify this deficit, we present Med-Scout-Bench, a new benchmark specifically designed to evaluate geometric perception. Extensive evaluations show that Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over40%on our benchmark. Furthermore, this enhanced geometric perception generalizes to broader medical understanding, achieving superior results on radiological and comprehensive medical VQA tasks. Code, data, and models will be publicly available.

Abstract:
Visual prostheses hold great promise for restoring vision in blind individuals. While researchers have successfully utilized M/EEG signals to evoke visual perceptions during the brain decoding stage of visual prostheses, the complementary process of converting images into M/EEG signals in the brain encoding stage remains largely unexplored, hindering the formation of a complete functional pipeline. In this work, we present a novel imageto-brain signal framework that generates M/EEG from images by leveraging the diffusion transformer architecture enhanced with cross-attention mechanisms. Specifically, we employ a diffusion transformer (DiT) architecture based on denoising diffusion implicit models (DDIM) to achieve brain signal generation. To realize the goal of image-to-brain signal conversion, we use cross-attention mechanisms to align brain signal embeddings with CLIP image embeddings. Moreover, we leverage large language models (LLMs) to generate image captions, and concatenate the resulting CLIP text embeddings with CLIP image embeddings to form unified embeddings for cross-attention alignment, enabling our model to capture core semantic information. Furthermore, we introduce a learnable spatio-temporal position encoding that combines brain region embeddings with temporal embeddings to capture both spatial and temporal characteristics of brain signals. We evaluate the framework on two multimodal benchmark datasets (THINGS-EEG2 and THINGS-MEG) and demonstrate that it generates biologically plausible brain signals.

Abstract:
Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inferencetime computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95--98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.

Abstract:
Reinforcement learning (RL) based posttraining for large language models (LLMs) is computationally expensive, as it generates many rollout sequences that could frequently share long token prefixes. Existing RL frameworks usually process these sequences independently, repeatedly recomputing identical prefixes during forward and backward passes during policy model training, leading to substantial inefficiencies in computation and memory usage. Although prefix sharing naturally induces a tree structure over rollouts, prior tree-attention–based solutions rely on fully materialized attention masks and scale poorly in RL settings. In this paper, we introduce AReaL-DTA to efficiently exploit prefix sharing in RL training. AReaL-DTA employs a depth-first-search (DFS)–based execution strategy that dynamically traverses the rollout prefix tree during both forward and backward computation, materializing only a single root-to-leaf path at a time. To further improve scalability, AReaL-DTA incorporates a load-balanced distributed batching mechanism that dynamically constructs and processes prefix trees across multiple GPUs. Across the popular RL post-training workload, AReaL-DTA achieves up to 8.31× in \tau^2-bench higher training throughput while reducing peak GPU memory consumption by approximately 30–40%.

Abstract:
Large language model (LLM) inference is often bounded by memory footprint and memory bandwidth in resourceconstrained deployments, making quantization a fundamental technique for efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2–3 bits. Fundamentally, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using approximate second-order information while progressively compensating quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we provide theoretical analysis showing that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimal objective in Hessian-induced geometry. Code is available in the supplementary materials and will be open-sourced.

Abstract:
This paper studies the AdamWstyle Shampoo optimizer, an effective implementation of the classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate \frac1K\sum_k=1^KE[|||\nabla f(X_k)|||]\leq O(\frac\sqrtm+nCK^1/4) measured by nuclear norm (denoted as |||\cdot||| to display correctly in OpenReview), where K represents the iteration number, (m,n) denotes the size of matrix parameters, and C matches the constant in the optimal convergence rate of SGD. Theoretically, we have ||\nabla f(X)||\leq|||\nabla f(X)|||\leq\sqrt\min(m,n)||\nabla f(X)|| (denote ||\cdot|| as the Frobenius norm to display correctly in OpenReview), supporting that our convergence rate can be considered to be analogous to the optimal \frac1K\sum_k=1^K E[||\nabla f(X_k)||]\leq O(\fracCK^1/4) convergence rate of SGD in the ideal case of |||\nabla f(X)|||= \Theta(\sqrt\min(m,n))||\nabla f(X)|| and balanced m and n.

Abstract:
The core learning signal used in language model distillation is the standard KullbackLeibler (KL) divergence between the distribution of the student and the teacher. Traditional KL divergence tends to be dominated by the teacher’s highest-probability modes, thus diminishing the influence of less probable yet potentially informative components of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions, while maintaining the same computational profile as the KL Divergence. Our decoupled approach reduces the impact of the teacher modes and, consequently, increases the contribution of the tail of the distribution. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models across various datasets. Furthermore, the distillation process is efficient and can be performed with a modest academic budget for large datasets, eliminating the need for industry-scale computing.

Abstract:
Matrix functions such as square root, inverse roots, and orthogonalization play a central role in preconditioned gradient methods for neural network training. This has motivated the development of iterative algorithms that avoid explicit eigendecompositions and rely primarily on matrix multiplications, making them well suited for modern GPU accelerators. We present PRISM (Polynomialfitting and Randomized Iterative Sketching for Matrix functions computation), a general framework for accelerating iterative algorithms for computing matrix functions. PRISM combines adaptive polynomial approximation with randomized sketching: at each iteration, it fits a polynomial surrogate to the current spectrum via a sketched least-squares problem, adapting to the instance at hand with minimal overhead. We apply PRISM to accelerate Newton–Schulz-like iterations for matrix square roots and orthogonalization, which are core primitives in machine learning. Unlike prior methods, PRISM requires no explicit spectral bounds or singular value estimates; it adapts automatically to the evolving spectrum. Empirically, PRISM accelerates training when integrated into Shampoo and Muon optimizers.

Abstract:
This paper provides datadependent bounds on the expected error of the Gibbs algorithm in the overparameterized interpolation regime, where low training errors are also obtained for impossible data, such as random labels in classification. The results show that generalization in the low-temperature regime is already signaled by small training errors in the more classical high temperature regime. The bounds are stable under approximation with Langevin Monte Carlo algorithms. The analysis motivates the design of an algorithm to compute bounds, which on the MNIST and CIFAR-10 datasets yield nontrivial, close predictions on the test error for true labeled data, while maintaining a correct upper bound on the test error for random labels.

Abstract:
Knowledgeintensive Visual Question Answering (KI-VQA) frequently suffers from severe knowledge conflicts caused by the inherent limitations of open-domain retrieval. However, existing paradigms face critical limitations, including the lack of generalizable conflict detection and intra-model constraint mechanisms to handle conflicting evidence. To address these challenges, we propose theREAL(Reasoning-PivotAlignment) framework centered on the novel concept of theReasoning-Pivot. Distinct from reasoning steps that prioritize internal self-derivation, a reasoning-pivot serves as an atomic unit (node or edge) in the reasoning chain that emphasizes knowledge linkage, and it typically relies on external evidence to complete the reasoning. Supported by our constructedREAL-VQAdataset, our approach integratesReasoning-Pivot Aware SFT (RPA-SFT)to train a generalizable discriminator by aligning conflicts with pivot extraction, and employsReasoning-Pivot Guided Decoding (RPGD), an intra-model decoding strategy that leverages these pivots for targeted conflict mitigation. Extensive experiments across diverse benchmarks demonstrate that REAL significantly enhances discrimination accuracy and achieves state-of-the-art performance, validating the effectiveness of our pivot-driven resolution paradigm.

Abstract:
Dynamic graph anomaly detection (DGAD) is critical for many realworld applications but remains challenging due to the scarcity of labeled anomalies. Existing methods are either unsupervised or semi-supervised: unsupervised methods avoid the need for labeled anomalies but often produce ambiguous boundary, whereas semi-supervised methods can overfit to the limited labeled anomalies and generalize poorly to unseen anomalies. To address this gap, we consider a largely underexplored problem in DGAD: learning a discriminative boundary from normal/unlabeled data, while leveraging limited labeled anomalies when available without sacrificing generalization to unseen anomalies. To this end, we propose an effective, generalizable, and model-agnostic framework with three main components: (i) residual representation encoding that capture deviations between current interactions and their historical context, providing anomaly-relevant signals; (ii) a restriction loss that constrain the normal representations within an interval bounded by two co-centered hyperspheres, ensuring consistent scales while keeping anomalies separable; (iii) a bi-boundary optimization strategy that learns a discriminative and robust boundary using the normal log-likelihood distribution modeled by a normalizing flow. Extensive experiments demonstrate the superiority of our framework across diverse evaluation settings.

Abstract:
Achieving a provable exponential quantum speedup for an important machine learning task has been a central research goal since the seminal HHL quantum algorithm for solving linear systems and the subsequent quantum recommender systems algorithm by Kerenidis and Prakash. These algorithms were initially believed to be strong candidates for exponential speedups, but a lower bound ruling out similar classical improvements remained absent. In breakthrough work by Tang, it was demonstrated that this lack of progress in classical lower bounds was for good reasons. Concretely, she gave a classical counterpart of the quantum recommender systems algorithm, reducing the quantum advantage to a mere polynomial. Her approach is quite general and was named \emphquantuminspired classical algorithms. Since then, almost all the initially exponential quantum machine learning speedups have been reduced to polynomial via new quantum-inspired classical algorithms. From the current state-of-affairs, it is unclear whether we can hope for exponential quantum speedups for any natural machine learning task. In this work, we present the first such provable exponential separation between quantum and quantum-inspired classical algorithms for the basic problem of solving a linear system when the input matrix is well-conditioned and has sparse rows and columns.

Abstract:
Oversmoothing has long been identified as a major limitation of Graph Neural Networks (GNNs): input node features are smoothed at each layer and converge to a constant noninformative representation,if the weights of the GNN are sufficiently bounded. This assumption is crucial: if, on the contrary, the weights are sufficiently large, then oversmoothing may be compensated. Theoretically, GNN could thuslearnto not oversmooth. However, this does not really happen in practice, which prompts us to examine oversmoothing from anoptimizationpoint of view. In this paper, we analyzebackward oversmoothing, that is, the notion that backpropagated errors are also subject to oversmoothing from output to input. With non-linearities, we outline the key role of theinteractionbetween forward and backward smoothing. Moreover, we show that, due to backward oversmoothing, GNNs provably exhibit manyspurious stationary points: as soon as thelastlayer is trained, thewholeGNN is at a stationary point. As a result, we can exhibit regions where gradients are near-zero while the loss stays high. Additionally, we prove that this isspecificto GNNs, and does not necessarily hold for Multi-Layer Perceptrons. This paper is a step toward a more complete comprehension of the optimization landscape of GNNs.

Abstract:
We investigate decentralized online convex optimization with compressed communication, where n learners connected by a network collaboratively minimize a sequence of global loss functions using only local information and compressed data from neighbors. Prior work has established regret bounds of O(\max\\\omega^2\rho^-4n^1/2,\omega^-4\rho^-8\\n\sqrtT) and O(\max\\\omega^-2\rho^-4n^1/2,\omega^-4\rho^-8\\n\lnT) for convex and strongly convex functions, respectively, where \omega\in(0,1] is the compression quality factor and \rho<1 is the spectral gap of the communication matrix. However, these regret bounds suffer from a prohibitively high quadratic or even quartic dependence on \omega^-1. Moreover, the super-linear dependence on n is also undesirable. To overcome these shortcomings, we propose a novel algorithm that achieves improved regret bounds of \tildeO(\omega^-1/2\rho^-1n\sqrtT) and \tildeO(\omega^-1\rho^-2n\lnT) for convex and strongly convex functions, respectively. The primary idea is to design a two-level blocking update framework incorporating two novel ingredients: an online gossip strategy and an error compensation scheme, which work together to promote a better consensus among learners. Furthermore, we establish the first lower bounds for this problem, justifying the optimality of our results with respect to both \omega and T. Additionally, we consider the bandit feedback scenario and extend our method with the classical gradient estimators to enhance existing regret bounds.

Abstract:
Trajectory Inference (TI) seeks to reconstruct latent dynamical processes from snapshot data, which consist of independent samples from timeindexed marginals of an underlying stochastic system. In applications such as single-cell genomics, destructive measurements preclude direct observation of trajectories, making the induced distribution over paths fundamentally ill-posed given finitely many marginals. However, despite extensive work on modeling approaches, little attention has been paid to evaluating the inferred object itself, namely, a probability measure over trajectories. Since path-space laws are not identifiable from snapshot data, evaluation protocols based on predictive accuracy at held-out marginals provide only limited information and fail to constrain trajectory-level behavior. We introduce a general framework for estimating the Kullback–Leibler divergence (KL) between probability measures on function space: we obtain a tractable estimator that can be approximated from data, is practical, and scales to realistic problem sizes (number and size of snapshot data). We apply this framework to a systematic empirical study of trajectory inference methods on synthetic and real datasets. We show that current evaluation metrics yield inconsistent assessments, whereas path-space KL provides a coherent comparison that reveals discrepancies in inferred dynamics, particularly in regions with sparse or missing data. These results support the use of functional KL as a principled criterion for evaluating TI methods under partial observability.

Abstract:
Mixtureof-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: dynamic structural pruning that converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and downcycling approaches that use pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained neural-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architecture composed of consistently activated universal neurons and dynamically activated specialized neurons. Leveraging this discovery, we introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations. Our experiments demonstrate that ExpertWeaver significantly outperforms existing methods, both as a training-free dynamic structural pruning technique and as a downcycling strategy for superior MoE initialization.

Abstract:
Subbit model compression seeks storage below one bit per weight, where the sign bit becomes a fixed-cost bottleneck as magnitudes are aggressively compressed. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank compression and are spectrally indistinguishable from i.i.d. Rademacher baselines. Despite this apparent randomness, most weights keep their initialization signs, with flips occurring mainly through rare near-zero boundary crossings, suggesting that the randomness in sign patterns is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition for a small neighborhood around zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a gap-based initialization and a lightweight outward-drift regularizer that reduces the effective flip rate to approximately 10^-3 with only about a one-point increase in perplexity.

Abstract:
Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: computeintensive prefill and memory-bound decode phases. Existing approaches either (1) aggregate both phases on shared GPUs, leading to interference between prefill and decode phases, which degrades Time-Between-Tokens (TBT); or (2) disaggregate the two phases across GPUs, improving latency but wasting resources through duplicated models and KV cache transfers. We present DuetServe, a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU. DuetServe operates in aggregated mode by default and dynamically activates SM-level GPU spatial multiplexing when TBT degradation is predicted. Its key idea is to decouple prefill and decode execution only when needed through fine-grained, adaptive SM partitioning that provides phase isolation only when contention threatens latency service level objectives. DuetServe integrates (1) an attention-aware roofline model to forecast iteration latency, (2) a partitioning optimizer that selects the optimal SM split to maximize throughput under TBT constraints, and (3) an interruption-free execution engine that eliminates CPU–GPU synchronization overhead. Evaluations show that DuetServe improves total throughput by up to 1.3× while maintaining low generation latency compared to state-of-the-art frameworks.

Abstract:
Vision–Language–Action (VLA) models adapt large vision–language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed leftto-right order or attach separate diffusion heads outside the backbone, fragmenting information pathways and hindering unified, scalable architectures. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion retaining progressive refinement inside the VLM backbone. Our method achieves an adaptive decoding order that resolves high-confidence (easy) action elements before harder ones and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision-language priors, supports parallel decoding, and improves the efficiency. Discrete Diffusion VLA achieves 96.5% avg.~success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution benchmarks, our method exhibits only 1.4% language degradation versus 8.0% for parallel decoding, and 21.0% vision degradation versus 29.0% for continuous diffusion, demonstrating well retention of pretrained vision-language capabilities. Visualization analysis confirms the learned decoding order adaptively prioritizes high-confidence tokens, validating our refinement strategy.

Abstract:
Vector Quantization (VQ) has recently emerged as a promising approach for learning discrete representations of graphstructured data. However, a fundamental challenge, i.e., codebook collapse, remains underexplored in the graph domain, significantly limiting the expressiveness and generalization of graph tokens. In this paper, we present the first empirical study and observe that codebook collapse consistently occurs when applying VQ to graph data, even with mitigation strategies proposed in vision or language domains. Moreover, we provide a diagnosis of collapse from data and optimization perspectives, showing that collapse is associated with graph data properties such as feature redundancy and connectivity density, and is further reinforced by the training dynamics of deterministic hard assignment. To address these issues, we propose RGVQ, a novel framework that integrates graph topology and feature similarity as explicit regularization signals to enhance codebook utilization and promote token diversity. RGVQ introduces soft assignments via Gumbel-Softmax reparameterization, ensuring that all codewords receive gradient updates. In addition, RGVQ incorporates a structure-aware contrastive regularization to penalize assigning the same token to dissimilar node pairs. Extensive experiments demonstrate that RGVQ substantially improves codebook utilization and consistently boosts the performance of state-of-the-art graph VQ backbones across multiple downstream tasks, enabling more expressive and transferable graph token representations.

Abstract:
Structured dilated attention has an appealing inferencetime efficiency knob: it reduces the FLOPs of the attention and the KV cache size by a factor of the dilation size \mathttD, while preserving long-range connectivity. However, we find a persistent failure mode of them -sparsifying a pretrained attention model to a dilated pattern leads to severe accuracy degradation. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once, then flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at \mathttD=16 and drops by about 2-3 points at \mathttD=64 on commonsense reasoning and LongBench tasks, respectively. Moreover, RAT+ outperforms attention when sparsifying to the top-k block attention. We further scale to 2.6B parameters and 200B tokens and observe the same trend.

Abstract:
The scalinglaw era has transformed artificial intelligence from research into a global industry, but its rapid growth raises concerns over energy usage, carbon emissions, and environmental sustainability. Unlike traditional sectors, the AI industry still lacks systematic carbon accounting methods that support large-scale estimates without reproducing the original model. This leaves open questions about how large the problem is today and how large it might be in the near future. Given that the Hugging Face (HF) platform well represents the broader open-source community, we treat it as a large-scale, publicly accessible, and audit-ready corpus for carbon accounting. We propose a FLOPs-based framework to estimate aggregate training emissions of HF open-source models. Considering their uneven disclosure quality, we introduce a tiered approach to handle incomplete metadata, supported by empirical regressions that verify the statistical significance. Compute is also converted to AI training carbon intensity (ATCI, emissions per compute), a metric to assess the sustainability efficiency of model training. Our results show that training the most popular models (with over 5,000 downloads) has resulted in approximately 5.8×10^4 tons of carbon emissions. This paper provides a framework for large-scale emission estimations and a practical methodology to guide future standards and sustainability strategies in the AI industry.

Abstract:
Statement autoformalization acts as a critical bridge between human mathematics and formal mathematics by translating natural language problems into formal language. While prior works have focused on data synthesis and diverse training paradigms to optimize endto-end Large Language Models (LLMs), they typically treat formal code as flat sequences, neglecting the hierarchical logic inherent in mathematical statements. In this work, we introduce Decompose, Structure, and Repair (DSR), a neuro-symbolic framework that restructures autoformalization into a modular pipeline. DSR decomposes statements into logical components and maps them to structured operator trees, leveraging this topological blueprint to precisely localize and repair errors via sub-tree refinement. Furthermore, we introduce PRIME, a benchmark of 156 undergraduate and graduate-level theorems selected from canonical textbooks and expertly annotated in Lean 4. Experimental results demonstrate that DSR establishes a new state-of-the-art, consistently outperforming baselines under equivalent computational budgets. The benchmark, code, and detailed experimental results are available at https://anonymous.4open.science/r/DSR-4A51.

Abstract:
Diffusion models have become the foundation of modern generative systems, with most research focusing primarily on improving generation efficiency and output quality. The timestep embedding component is a crucial part of the diffusion pipeline, which provides a temporal conditioning signal to the denoising network, enabling it to adapt its predictions across different noise levels throughout the process. Despite their potential to contain substantial information, timestep embeddings remain underexplored in current research, especially for security risks and reliable provenance. To fill this gap, we introduceShadow Timestep Embedding (STE), a novel mechanism that investigates the underutilized temporal space for malicious information injection into diffusion models. In particular, when zooming in on the timestep embedding space, we find that different timesteps exhibit distinct representational capabilities that can encode sidechannel information. Moreover, such encoded information can be utilized for attack and defense purposes through the scheduler interface. We present a theoretical analysis of timestep embeddings as position-encoding mappings and derive a mutual coherence evaluation that explains the separability of disjoint timestep intervals. Our findings reveal the diffusion model's timestep as a powerful side channel for carrying dedicated information, motivating new directions for adversarial generative modeling by understanding the temporal dimension.

Abstract:
Effectively scaling Reinforcement Learning (RL) is crucial for enhancing the reasoning and alignment of Large Language Models. The massive data and complex execution flows inherent in these tasks require a distributed architecture capable of efficient scaling. However, to simplify programming and dependency management, mainstream frameworks often rely on a centralized architecture where a single node dispatches both control and data. This inherent coupling creates significant communication bottlenecks, severely limiting system scalability and efficiency. We present DistFlow, a novel, fully distributed RL framework that adopts a multicontroller paradigm. By decoupling data transmission from control dispatch, DistFlow establishes a parallelism-aware, decentralized Data Coordinator that leverages local caching, load balancing, and asynchronous double buffer to minimize communication overhead and mitigate straggler effects. For control logic, it introduces a task scheduler built upon Directed Acyclic Graph (DAG) that facilitates fine-grained, independent execution. Experimental results demonstrate that DistFlow achieves near-linear scalability up to 512 GPUs and delivers up to a 2.63x throughput improvement over state-of-the-art (SOTA) frameworks.

Abstract:
Foundation models often generate unreliable answers, while heuristic uncertainty estimators fail to fully distinguish correct from incorrect outputs, causing users to accept erroneous answers without statistical guarantees. We address this through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed a target risk level. To this end, we propose LEC, a principled framework that reframes selective prediction as a decision problem governed by a linear expectation constraint over selection and error indicators. Under this formulation, we derive a finitesample sufficient condition that relies only on a held-out set of exchangeable calibration data, enabling the computation of an FDR-constrained, retention-maximizing threshold. Furthermore, we extend LEC to two-model routing systems: if the primary model's uncertainty exceeds its calibrated threshold, the input is delegated to a subsequent model, while maintaining system-level FDR control. Experiments on both closed-ended and open-ended question answering (QA) and vision question answering (VQA) demonstrate that LEC achieves tighter FDR control and substantially improves sample retention compared to prior approaches.

Abstract:
Multimodal large language models (MLLMs) struggle with numerical regression under longtailed target distributions. Tokenlevel supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.

Abstract:
Graph Prompt Learning (GPL) has recently emerged as a promising paradigm for downstream adaptation of pretrained graph models, mitigating the misalignment between pre-training objectives and downstream tasks. Recently, the focus of GPL has shifted from in-domain to cross-domain scenarios, which is closer to the real world applications, where the pre-training source and downstream target often differ substantially in data distribution. However, why GPLs remain effective under such domain shifts is still unexplored. Empirically, we observe that representative GPL methods are competitive with two simple baselines in cross-domain settings: full fine-tuning (FT) and linear probing (LP), motivating us to explore a deeper understanding of the prompting mechanism. We provide a theoretical analysis demonstrating that jointly leveraging these two complementary branches yields a smaller estimation error than using either branch alone, formally proving that cross-domain GPL benefits from the integration between pre-trained knowledge and task-specific adaptation. Based on this insight, we propose GP2F, a dual-branch GPL method that explicitly instantiates the two extremes: (1) a frozen branch that retains pre-trained knowledge, and (2) an adapted branch with lightweight adapters for task-specific adaptation. We then perform adaptive fusion under topology constraints via a contrastive loss and a topology-consistent loss. Extensive experiments on cross-domain few-shot node and graph classification demonstrate that our method outperforms existing methods.

Abstract:
Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing frameworks' reliance on fixed evolutionary rules and static prompt templates often leads to myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived. We propose a novel multiagent reasoning framework, referred to as Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs (PathWise), which formulates heuristic generation as a sequential decision process over an entailment graph serving as a compact, stateful memory of the search trajectory. This approach allows the system to carry forward past decisions and reuse or avoid derivation information across generations. A policy agent plans evolutionary actions, a world model agent generates heuristic rollouts conditioned on those actions, and critic agents provide routed reflections summarizing lessons from prior steps, shifting LLM-based AHD from trial-and-error evolution toward state-aware planning through reasoning. Experiments across diverse COPs show that PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes.

Abstract:
Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to interpretable concepts. A core SAE training hyperparameter is L0: how many SAE features should fire per token on average. Existing work compares SAE algorithms using sparsityreconstruction tradeoff plots, implying L0 is a free parameter with no single correct value aside from its effect on reconstruction. In this work we study the effect of L0 on SAEs, and show that if L0 is not set correctly, the SAE fails to disentangle the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we present a proxy metric that can help guide the search for the correct L0 for an SAE on a given training distribution. We show that our method finds the correct L0 in toy models and coincides with peak sparse probing performance in LLM SAEs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that L0 must be set correctly to train SAEs with correct features.

Abstract:
Ridehailing platforms face the challenge of balancing passenger waiting times with overall system efficiency under highly uncertain supply–demand conditions. Adaptive delayed matching, which controls the holding intervals for batched sets of requests and vehicles, reveals an inherent trade-off between matching and pickup delays. The resulting environment with temporally varying request arrival patterns and dynamic congestion calls for more expressive networks with sufficient capacity to capture their non-stationarity. To address the limitations of existing methods that rely on shallow encoders that cannot capture dynamic supply-demand patterns and congestion effects, we introduce the Regime-Aware Spatio-Temporal Mixture-of-Experts (RAST-MoE) framework, which formalizes adaptive delayed matching as a regime-aware Markov Decision Process and equips RL agents with a self-attention MoE encoder. Instead of relying on a single monolithic network, our design allows different experts to specialize automatically in varying operational conditions, improving representation capacity while maintaining per-sample computation efficiency. Despite its modest size of only 12M parameters, our framework consistently outperforms strong baselines. On real-world Uber trajectory data from San Francisco, it reduces average matching delay by 10%, and pickup delay by 15%. In addition, it demonstrates robustness to unseen demand regimes, stable training behavior without reward hacking, and expert specialization to different regimes. This study shows the strength of MoE-enhanced RL for large-scale decision-making tasks with complex spatiotemporal dynamics.

Abstract:
Large language model (LLM) agents increasingly rely on external tools such as search engines to solve complex, multistep problems, yet their rollouts are structurally heterogeneous: variations in tool-call number, placement, and outcomes induce distinct behaviors and reward distributions. As a result, policy gradient methods with a single global baseline suffer fromcross-stratum bias, an "apples-to-oranges" comparison that distorts credit assignment and impedes exploration. To address this issue, we proposeStratified GRPO. Its core component,Stratified Advantage Normalization(SAN), partitions trajectories into homogeneous strata based on structural properties and computes advantages locally within each stratum, ensuring comparisons only among true peers. We show that SAN eliminates cross-stratum bias, yields conditionally unbiased unit-variance estimates within strata, and preserves the global unbiasedness and unit-variance properties of standard normalization, resulting in a more reliable learning signal. To improve robustness in finite-sample regimes, we further linearly blend SAN with the global estimator. Experiments on factual QA and deep-research agent benchmarks demonstrate that Stratified GRPO consistently outperforms GRPO by up to 12.6 points, achieving higher training rewards, improved training stability, and more effective search policies. These results establish stratification as a principled remedy for structural heterogeneity in RL for LLM search agents.

Abstract:
Recent advances in Large Audio Language Models (LALMs) have extended Textto-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with character profiles and scene descriptions across multi-turn dialogues. A critical bottleneck is the lack of objective metrics for quantifying speaking style. To bridge this gap, we proposeMean Continuation Log-Probability (MCLP)as both an evaluation metric and a reward signal, validated on LALM-based Role-Play TTS (RP-TTS) tasks. Critically, we leverage the In-Context Learning capability of pre-trained LALMs to formulate MCLP via a continuation log-probability prediction. This metric quantifies stylistic consistency by measuring the likelihood of the ground-truth speech conditioned on the generated speech. Furthermore, we employ MCLP as a reinforcement learning reward to enhance the style alignment between generated speech and Role-Play instructions. To facilitate evaluation, we construct an RP-TTS dataset with rich scene and character annotations. Experimental results demonstrate that our method significantly outperforms strong LALM baselines on both objective and subjective metrics. Audio samples are available at https://mclp-tts.github.io/mclp/.

Abstract:
Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledgeintensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110m parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.

Abstract:
Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks, with reinforcement learning under verifiable rewards (RLVR) emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resourceintensive, requiring extensive reward signals and incurring substantial rollout costs during training. In this work, we revisit the fundamental question of data and compute efficiency in RLVR. We first establish a theoretical lower bound on the sample complexity required to unlock reasoning capabilities, and empirically validate that strong performance can be achieved with a surprisingly small number of training instances. To tackle the computational burden, we propose Dynamic One-Shot Policy Refinement (DoPR), a uncertainty-aware RL strategy that dynamically selects a single informative training sample per batch for policy updates, guided by reward volatility and exploration-driven acquisition. DoPR reduces rollout overhead by nearly an order of magnitude while preserving competitive reasoning accuracy, offering a scalable and resource-efficient solution for LLM post-training. This approach offers a practical path toward more efficient and accessible RL-based training for reasoning-intensive LLM applications.

Abstract:
Many advanced Large Language Model (LLM) applications require longcontext processing, but the self-attention module becomes a bottleneck during the prefilling stage of inference due to its quadratic time complexity with respect to sequence length. Existing sparse attention methods accelerate attention computation by skipping less significant regions of the attention map. However, these approaches typically perform coarse-grained inspection of the attention map, resulting in their suboptimal performance. In this paper, we propose SALE, a fine-grained sparse attention method that accelerates the long-context prefilling stage of LLM with negligible loss in model accuracy. SALE achieves fast and accurate fine-grained attention map estimation using low-bit quantized query-key products to approximate attention weights, followed by the application of a novel Relative Attention Score metric to assess the importance of query-key pairs. This design enables us to accurately identify important regions in the attention map, thereby constructing a highly sparse attention mask. We implement a custom CUDA kernel in SALE optimized for hardware efficiency, reducing overhead to approximately 11% of the full attention latency. Notably, SALE requires no parameter training and can be seamlessly integrated into existing systems with trivial code modifications. Experiments on long-context benchmarks demonstrate that our method outperforms existing approaches in accuracy-efficiency trade-offs, achieving at least 3.36× speedups on Llama-3.1-8B for sequences longer than 64K while maintaining model quality.

Abstract:
We present MultiBreak, a scalable and diverse multiturn jailbreak benchmark to evaluate large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, making them easier to bypass safety-aligned LLM than single-turn jailbreaks. Existing multi-turn benchmarks are limited in size or rely heavily on templates, which restrict their diversity. To address this gap, we unify a wide range of harmful jailbreak intents, and introduce an active learning pipeline for expanding high-quality multi-turn adversarial prompts, where a generator is iteratively fine-tuned to produce stronger attack candidates, guided by uncertainty-based refinement. Our MultiBreak includes 10,389 multi-turn adversarial prompts, spans 2,665 distinct harmful intents, and covers the most diverse set of topics to date. Empirical evaluation shows that our benchmark achieves up to a 54.0% and 34.6% higher attack success rate (ASR) than the second-best dataset on DeepSeek-R1-7B and GPT-4.1-mini, respectively. More importantly, safety evaluations suggest that diverse attack categories uncover fine-grained LLM vulnerabilities, and categories that appear benign under single-turn can exhibit substantially higher adversarial effectiveness in multi-turn scenarios. These findings highlight persistent vulnerabilities of LLMs under realistic adversarial settings and establish MultiBreak as a scalable resource for advancing LLM safety.

Abstract:
Visionlanguage models (VLMs) advance video understanding but operate under tight computational budgets, making performance dependent on selecting a small, high-quality subset of frames. Existing frame sampling strategies, such as uniform or fixed-budget selection, fail to adapt to variations in content density or task complexity. To address this, we present FrameOracle, a lightweight, plug-and-play module that predicts both (1) which frames are most relevant to a given query and (2) how many frames are needed. FrameOracle is trained via a curriculum that progresses from weak proxy signals, such as cross-modal similarity, to stronger supervision with FrameOracle-41K, the first large-scale VideoQA dataset with validated keyframe annotations specifying minimal sufficient frames per question. Extensive experiments across five VLMs and six benchmarks show that FrameOracle reduces 16-frame inputs to an average of 10.4 frames without accuracy loss. When starting from 64-frame candidates, it reduces inputs to 13.9 frames on average while improving accuracy by 1.5%, achieving state-of-the-art efficiency–accuracy trade-offs for scalable video understanding.

Abstract:
Transformers can acquire Chainof-Thought (CoT) capabilities to solve complex reasoning tasks through fine-tuning. Reinforcement learning (RL) and supervised fine-tuning (SFT) are two primary approaches to this end. In this work, we examine them specifically for learning k-sparse Boolean functions with a one-layer transformer and intermediate supervision that is akin to CoT. In particular, we consider k-sparse Boolean functions that can be recursively decomposed into fixed 2-sparse Boolean functions. We first analyze the learning dynamics of fine-tuning the transformer via either RL or SFT with CoT in a unified way. This allows us to identify sufficient conditions for the transformer to provably learn the general sparse Boolean functions. We then verify that these conditions hold for three basic examples, including k-PARITY, k-AND, and k-OR, thus demonstrating the learnability of them via both RL and SFT. Notably, we reveal that RL and SFT exhibit distinct learning behaviors: RL learns the whole CoT chain simultaneously, whereas SFT naturally learns the CoT chain step-by-step. Overall, our findings provide theoretical insights into the underlying mechanisms of RL and SFT and how they differ in triggering the CoT capabilities of transformers.

Abstract:
Time series foundation models (TSFMs) are increasingly adopted as generalpurpose time series learners. Although their training corpora are vast, they exclude peta-scale astronomical time series that exhibit unique challenges (e.g., irregular sampling, multiple variates, and heteroskedasticity) and exist in immense quantities. We introduce \textttStarEmbed, the first public benchmark for stellar time series observations ("light curves") on three downstream tasks: unsupervised clustering, supervised classification, and out-of-distribution (OOD) source detection. \textttStarEmbed integrates a catalog of expert-vetted light curves, totaling ～40,000 labeled examples across seven astrophysical classes. We evaluate the zero-shot capabilities of three families of TSFMs (\textttMoirai, \textttChronos, and \textttTime-MoE) and a domain-specific transformer (\textttAstromer). Our results demonstrate that the \textttChronos family, despite being pre-trained on regularly sampled data, outperforms domain-specific baselines and yields state-of-the-art performance in clustering and OOD source detection. While they do not yet strictly surpass hand-crafted features in classification, TSFM models like the Chronos models demonstrate excellent generalization performance, marking a promising step toward universal foundation models in astronomy.

Abstract:
We investigate the extent to which an LLM’s hiddenstate geometry can be recovered from its behavior in psycholinguistic experiments. Across eight instruction-tuned transformer models, we run two experimental paradigms---similarity-based forced choice and free association---over a shared 5,000-word vocabulary, collecting 17.5M+ trials to build behavior-based similarity matrices. Using representational similarity analysis, we compare behavioral geometries to layerwise hidden-state similarity and benchmark against FastText, BERT, and cross-model consensus. We find that forced-choice behavior aligns substantially more with hidden-state geometry than free association. In a held-out-words regression, behavioral similarity (especially forced choice) predicts unseen hidden-state similarities beyond lexical baselines and cross-model consensus, indicating that behavior-only measurements retain recoverable information about internal semantic geometry. Finally, we discuss implications for the ability of behavioral tasks to uncover hidden cognitive states.

Abstract:
Continual Learning (CL) requires models to sequentially adapt to new tasks without forgetting old knowledge. Recently, LowRank Adaptation (LoRA), a representative Parameter-Efficient Fine-Tuning (PEFT) method, has gained increasing attention in CL. Several LoRA-based CL methods reduce interference across tasks by separating their update spaces, typically building the new space from the estimated null space of past tasks. However, they (i) overlook task-shared directions, which suppresses knowledge transfer, and (ii) fail to capture truly effective task-specific directions since these ``null bases" of old tasks can remain nearly inactive for new task under correlated tasks. To address this, we study LoRA learning capability from a projection energy perspective, and propose Low-rank Decomposition and Adaptation (LoDA). It performs a task-driven decomposition to build general and truly task-specific LoRA subspaces by solving two energy-based objectives, decoupling directions for knowledge sharing and isolation. LoDA fixes LoRA down-projections on two subspaces and learns robust up-projections via a Gradient-Aligned Optimization (GAO) approach. After each task, before integrating the LoRA updates into the backbone, LoDA derives a closed-form recalibration for the general update, approximating a feature-level joint optimum along this task-shared direction. Experiments indicate that LoDA outperforms existing CL methods.

Abstract:
Rewardmaximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through an accuracy estimate error–prioritized replay. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for a more sample efficient distribution-matching training for LLMs.

Abstract:
Multitask learning (MTL) has emerged as a pivotal paradigm in machine learning by leveraging shared structures across multiple related tasks. Despite its empirical success, the development of likelihood-based efficiently solvable algorithms—even for shared linear representations—remains largely underdeveloped, primarily due to the non-convex structure intrinsic to matrix factorization. This paper introduces a first-order algorithm that jointly learns a shared representation and task-specific parameters, with guaranteed computational efficiency. Notably, it converges in \tildeO(1) iterations and attains a \emphnear-optimal estimation error of \widetildeO(dk/(TN)), \emphimproving over existing likelihood-based methods by a factor of k, where d, k, T, N denote input dimension, representation dimension, task count, and samples per task, respectively. Our results demonstrate that likelihood-based first-order methods can also efficiently solve the MTL problem.

Abstract:
Large language models (LLMs) can be controlled at inference time through prompts (incontext learning) and internal activations (activation steering). Different accounts have been proposed to explain these methods, yet their common goal of controlling model behavior raises the question of whether these seemingly disparate methodologies can be seen as specific instances of a broader framework. Motivated by this, we develop a unifying, predictive account of LLM control from a Bayesian perspective. Specifically, we posit that both context- and activation-based interventions impact model behavior by altering its belief in latent concepts: steering operates by changing concept priors, while in-context learning leads to an accumulation of evidence. This results in a closed-form Bayesian model that is highly predictive of LLM behavior across context- and activation-based interventions in a set of domains inspired by prior work on many-shot in-context learning. This model helps us explain prior empirical phenomena - e.g., sigmoidal learning curves as in-context evidence accumulates--while predicting novel ones--e.g., additivity of both interventions in log-belief space, which results in distinct phases such that sudden and dramatic behavioral shifts can be induced by slightly changing intervention controls. Taken together, this work offers a unified account of prompt-based and activation-based control of LLM behavior, and a methodology for empirically predicting the effects of these interventions.

Abstract:
Treesearch decoding is an effective form of test-time scaling for large language models (LLMs), but real-world deployment imposes a fixed per-query token budget that varies across settings. Existing tree-search policies are largely budget-agnostic, treating the budget as a termination condition, which can lead to late-stage over-branching or premature termination. We propose Budget-Guided MCTS (BG-MCTS), a tree-search decoding algorithm that aligns its search policy with the remaining token budget: it starts with broad exploration, then prioritizes refinement and answer completion as the budget depletes while reducing late-stage branching from shallow nodes. BG-MCTS consistently outperforms budget-agnostic tree-search baselines across different budgets on MATH500 and AIME24/25 with open-weight LLMs.

Abstract:
Modern large language models are increasingly deployed under compute and memory constraints, making flexible control of model capacity a central challenge. While sparse and lowrank structures naturally trade off capacity and performance, existing approaches often rely on heuristic designs that ignore layer and matrix heterogeneity or require model-specific architectural modifications. We propose SALAAD, a plug-and-play framework applicable to different model architectures that induces sparse and low-rank structures during training. By formulating structured weight learning under an augmented Lagrangian framework and introducing an adaptive controller that dynamically balances the training loss and structural constraints, SALAAD preserves the stability of standard training dynamics while enabling explicit control over the evolution of effective model capacity during training. Experiments across model scales show that SALAAD substantially reduces memory consumption during deployment while achieving performance comparable to ad-hoc methods. Moreover, a single training run yields a continuous spectrum of model capacities, enabling smooth and elastic deployment across diverse memory budgets without the need for retraining.

Abstract:
The prefill stage of longcontext Retrieval-Augmented Generation (RAG) is severely bottlenecked by computational overhead. To mitigate this, recent methods assemble pre-calculated KV caches of retrieved RAG documents (by auser query) and reprocess selected tokens to recover cross-attention between these pre-calculated KV caches. However, we identify a fundamental ``crowding-out effect'' in current token selection criteria: globally salient butuser-query-irrelevant tokens saturate the limited recomputation budget, displacing the tokens truly essential for answering theuser queryand degrading inference accuracy. We propose ProphetKV, a user-query-driven KV Cache reuse method for RAG scenarios. ProphetKV dynamically prioritizes tokens based on their semantic relevance to theuser queryand employs a dual-stage recomputation pipeline to fuse layer-wise attention metrics into a high-utility set. By ensuring the recomputation budget is dedicated to bridging the informational gap between retrieved context and theuser query, ProphetKV achieves high-fidelity attention recovery with minimal overhead. Our extensive evaluation results show that ProphetKV retains 96%–101% of full-prefill accuracy with only a 20% recomputation ratio, while achieving accuracy improvements of 8.8%–24.9% on RULER and 18.6%–50.9% on LongBench over the state-of-the-art approaches (e.g., CacheBlend, EPIC, and KVShare).

Abstract:
Singlecell RNA-seq profiles are high-dimensional, sparse, and unordered, causing autoregressive generation to impose an artificial ordering bias and suffer from error accumulation. To address this, we propose scDiVa, a masked discrete diffusion foundation model that aligns generation with the dropout-like corruption process by defining a continuous-time forward masking mechanism in token space. ScDiVa features a bidirectional denoiser that jointly models discrete gene identities and continuous values, utilizing entropy-normalized serialization and a latent anchor token to maximize information efficiency and preserve global cell identity. The model is trained via depth-invariant time sampling and a dual denoising objective to simulate varying sparsity levels while ensuring precise recovery of both identity and magnitude. Pre-trained on 59 million cells, scDiVa achieves strong transfer performance across major benchmarks, including batch integration, cell type annotation, and perturbation response prediction. These results suggest that masked discrete diffusion serves as a biologically coherent and effective alternative to autoregression.

Abstract:
Reinforcement learning (RL) posttraining is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model's generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10× compared to MCMC-based sampling.

Abstract:
Toolintegrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a likelihood-preserving regularization LLDS that activates only when a response action’s likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference. Our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements across seven benchmarks, including relative improvements of +45.2% on Qwen2.5-3B and +37.1% on Qwen2.5-7B over vanilla GRPO training. Our results establish LLD as a previously overlooked bottleneck in GRPO- based TIRL and provide a practical path toward stable, scalable training of tool-integrated RL.

Abstract:
Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation and thus induce new challenges in aligning dLLMs for human preference. In this work, aim to optimize the dLLM generation process by developing a theoretical formulation and an efficient and effective quantification of the probability of generation trajectory. We prove that (i) under reference policy regularization, the probability ratio of intermediate diffusion states equals to that of the newly unmasked tokens, and (ii) the probability of the entire generation can be estimated using a single forward pass with block attention. Integrating the two estimations into preference optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instructionfollowing and reasoning benchmarks and show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6% on STEM tasks, up to 4.3% on coding tasks, and up to 3.0% on instruction-following tasks.

Abstract:
This paper deals with stochastic optimization problems involving Markovian noise with a zeroorder oracle. We present and analyze a novel derivative-free method for solving such problems in strongly convex smooth and non-smooth settings with both one-point and two-point feedback oracles. Using a randomized batching scheme, we show that when mixing time \tau of the underlying noise sequence is less than the dimension of the problem d, the convergence estimates of our method do not depend on \tau. This observation provides an efficient way to interact with Markovian stochasticity: instead of invoking the expensive first-order oracle, one should use the zero-order oracle. Finally, we complement our upper bounds with the corresponding lower bounds. This confirms the optimality of our results.

Abstract:
Differentiable discrete selection uses soft mixtures during training but hard selection at deployment, resulting in a traininginference gap. We decompose this gap into selection gap (method-dependent, reducible) and computation gap (input-dependent, irreducible). Our key finding: the selection gap is determined by forward-pass structure, not backward-pass gradients. Methods using hard selection during training achieve zero selection gap by construction, while mixture methods exhibit gaps even with identical gradient estimators. This occurs because mixtures reward hedging across options, while deployment requires commitment to one. We propose CAGE (Confidence-Adaptive Gate Exploration), which addresses optimization entirely in the backward pass by adapting temperature based on selection confidence. We also identify a critical failure mode: Gumbel-ST suffers 40--50 percentage point accuracy collapse at low temperatures, which CAGE prevents. Experiments on logic gate networks validate the theory: hard selection achieves 98% accuracy with zero gap across all temperatures.

Abstract:
Large models (LMs) are powerful content generators, yet their open‑ended nature can also introduce potential risks, such as generating harmful or biased content. Existing guardrails mostly perform posthoc detection that may expose unsafe content before it is caught, and the latency constraints further push them toward lightweight models, limiting detection accuracy. In this work, we propose PlugGuard, a novel plug-in framework that enables streaming risk detection within the LM generation pipeline. PlugGuard leverages intermediate LM hidden states through a Streaming Latent Dynamics Head (SLD), which models the temporal evolution of risk across the generated sequence for more accurate real-time risk detection. To ensure reliable streaming moderation in real applications, we introduce an Anchored Temporal Consistency (ATC) loss to enforce monotonic harm predictions by embedding a benign-then-harmful temporal prior. Besides, for a rigorous evaluation of streaming guardrails, we also present StreamGuardBench—a model-grounded benchmark featuring on-the-fly responses from each protected model, reflecting real-world streaming scenarios in both text and vision–language tasks. Across diverse models and datasets, PlugGuard consistently outperforms state-of-the-art post-hoc guardrails and prior plug-in probes (15.61% higher average F1), while using only 20M parameters and adding less than 0.5 ms of per-token latency. The code and StreamGuardBench are released at PlugGuard to facilitate research on streaming guardrails.

Abstract:
Modern visionlanguage models achieve strong performance in static perception, but remain limited in the complex spatiotemporal reasoning required for embodied, egocentric tasks. A major source of failure is their reliance on temporal priors learned from passive video data, which often leads to spatiotemporal hallucinations and poor generalization in dynamic environments. To address this, we present EgoTSR, a curriculum-based framework for learning task-oriented spatiotemporal reasoning. EgoTSR is built on the premise that embodied reasoning should evolve from explicit spatial understanding to internalized task-state assessment and finally to long-horizon planning. To support this paradigm, we construct EgoTSR-Data, a large-scale dataset comprising 46 million samples organized into three stages: Chain-of-Thought supervision for explicit reasoning, weakly supervised tagging for internalizing reasoning, and long-horizon sequences composed of orthogonal atomic subtasks. We further introduce a Reasoning-Enhanced Task Decomposition mechanism that explicitly models causal dependencies between actions and objects, enabling controllable planning from high-level task descriptions. To rigorously validate our approach, we establish a Dual-Level Evaluation Framework that covers both atomic spatial perception and logical planning. Extensive experiments demonstrate that EgoTSR effectively eliminates chronological biases, achieving 92.4% accuracy on long-horizon logical reasoning tasks while maintaining 88.2% fine-grained perceptual precision, significantly outperforming existing open-source and closed-source state-of-the-art models.

Abstract:
Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling testtime compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal ``overthinking,'' leading to performance degradation. In this work, we quantify inference-time effort by identifying \emphdeep-thinking tokens---tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.

Abstract:
We study the thermodynamic memory capacity of modern Hopfield networks (Dense Associative Memory models) with continuous states under geometric constraints, extending classical analyses of pairwise associative memory. We derive thermodynamic phase boundaries for Dense Associative Memory networks with exponential capacity p = e^\alpha N, comparing Gaussian (LSE) and Epanechnikov (LSR) kernels. For continuous neurons on an Nsphere, the geometric entropy depends solely on the spherical geometry, not the kernel. In the sharp-kernel regime, the maximum theoretical capacity \alpha = 0.5 is achieved at zero temperature; below this threshold, a critical line separates retrieval from a spin-glass phase. The two kernels differ qualitatively in their phase boundary structure: for LSE, the retrieval region extends to arbitrarily high temperatures as \alpha \to 0, but interference from spurious patterns is always present. For LSR, the finite support introduces a threshold \alpha_\textth below which no spurious patterns contribute to the noise floor, producing a qualitatively different retrieval regime in this sub-threshold region. These results advance the theory of high-capacity associative memory and clarify fundamental limits of retrieval robustness in modern attention-like memory architectures.

Abstract:
Longform clinical videos are central to visual evidence-based decision-making, with growing importance for applications such as surgical robotics and related settings. However, current multimodal large language models typically process videos with passive sampling or weakly grounded inspection, which limits their ability to iteratively locate, verify, and justify predictions with temporally targeted evidence. To close this gap, we proposeMedScope, a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures. By interleaving intermediate reasoning with targeted tool calls and verification on retrieved observations, MedScope produces more accurate and trustworthy predictions that are explicitly grounded in temporally localized visual evidence. To address the lack of high-fidelity supervision, we buildClinVideoSuite, an evidence-centric, fine-grained clinical video suite. We then optimizeMedScopewithGrounding-AwareGroupRelativePolicyOptimization (GA-GRPO), which directly reinforces tool use with grounding-aligned rewards and evidence-weighted advantages. On full and fine-grained video understanding benchmarks,MedScopeachieves state-of-the-art performance in both in-domain and out-of-domain evaluations. Our approach illuminates a path toward medical AI agents that can genuinely “think with videos” through tool-integrated reasoning. We will release our code, models, and data.

Abstract:
While benchmarks measure the accuracy of LLMpowered agents, they mostly assume full automation, failing to represent the collaborative nature of real-world use cases. In this paper, we make two major steps towards the rigorous assessment of human-agent interactions. First, we propose PULSE, a framework for more efficient human-centric evaluation of agent designs, which comprises collecting user feedback, training an ML model to predict user satisfaction, and computing results by combining human satisfaction ratings with model-generated pseudo-labels. Second, we deploy PULSE n software engineering---one of the highest-impact, real-world domains for LLM agents---via a large-scale web platform built around the open-source agent OpenHands. Across 15k users, we evaluate how three agent design decisions impact developer satisfaction rates. We also show how PULSE can lead to more robust conclusions about agent design, reducing confidence intervals by 40% compared to a standard A/B test. Finally, we find substantial discrepancies between in-the-wild results with benchmark performance (e.g., the anti-correlation between claude-sonnet-4 and gpt-5, underscoring the limitations of benchmark-driven evaluation. Our framework PULSE provides guidance for future evaluations, and our findings identify opportunities for better software agent designs.

Abstract:
Neural operators perform well on structured domains, yet their behaviour on irregular geometries remains poorly understood. We show that this limitation is not merely an encoding issue, but a depthwise failure mode inherent to deep operator architectures. We formalise theGeometric Forgetting Hypothesis: due to the Markovian structure of operator layers and their reliance on global mixing mechanisms, neural operators progressively lose access to domain geometry as depth increases. Using layer-wise geometric probing, we demonstrate that both spectral and attention-based operators systematically lose geometric fidelity. We show that this geometric forgetting degrades accuracy, stability, and generalisation. To counteract it, we introduce a lightweightgeometry memory injectionmechanism that restores geometric constraints at intermediate depths with minimal architectural overhead. This simple intervention consistently mitigates forgetting and exposes ageometric shortcutinstability in transformer-based operators, revealing that geometric retention is a structural requirement rather than a design choice.

Abstract:
Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to overreliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.

Abstract:
RetrievalAugmented Generation (RAG) systems deployed in sensitive domains must provide interpretable evidence selection and robust safeguards against data poisoning, yet current approaches rely on opaque similarity-based retrieval with arbitrary top-k cutoffs that offer no explanation for their selections and remain vulnerable to adversarial manipulation. We propose METEORA, a rationale-driven RAG framework that addresses these fundamental limitations through interpretable, adaptive evidence retrieval. Our framework introduces three synergistic contributions. First, we preference-tune a general-purpose LLM to generate explicit rationales that articulate why specific evidence is needed for a given query. These rationales then guide adaptive evidence selection through a two-step process: rationale-chunk pairing for query-specific relevance assessment, followed by dynamic cutoff detection that eliminates the need for arbitrary k heuristics. Finally, the same rationales enable a verification stage that filters poisoned or misleading evidence before generation. Evaluation across six datasets demonstrates substantial improvements on three critical dimensions. For retrieval quality, METEORA achieves 21.05% higher precision than the best-performing baseline, while its variant with context expansion achieves 13.41% higher recall. In terms of efficiency, the framework reduces the volume of evidence required to reach comparable recall by 80%, which directly translates to a 33.34% improvement in downstream answer generation accuracy. Most notably for adversarial robustness, METEORA increases the F1 score from 0.10 to 0.44 under poisoning attacks, a 4.4× improvement that makes RAG systems substantially more resilient to adversarial manipulation. Human evaluation with four experienced annotators confirms genuine interpretability, achieving a mean confidence score of 3.64/5 and demonstrating that humans can reliably reconstruct evidence-level decisions with 86% accuracy. These results demonstrate that rationale-driven retrieval can simultaneously enhance interpretability, efficiency, and safety in RAG systems for sensitive domains. The code is available in the anonymous GitHub repository \urlhttps://anonymous.4open.science/r/METEORA-DC46/README.md

Abstract:
Statespace language models such as Mamba match Transformer quality while permitting linear complexity inference, yet still comprise billions of parameters that hinder deployment. While existing one-shot pruning methods are effective for generic linear and attention blocks, they are not designed with the overall Mamba architecture in mind and fail to account for the time-shared and discretized state-transition matrix at the heart of the selective state-space module (SSM). In this paper, we introduce SparseSSM, the first training-free pruning framework that extends the classic optimal brain surgeon (OBS) framework to state space architectures. Our layer-wise algorithm (i) derives an approximate second-order saliency score that aggregates Hessian-trace information across time steps, (ii) incorporates a component sensitivity analysis to guide feed-forward network (FFN) pruning, which also sheds light on where redundancy resides in mamba architecture, (iii) can be easily extended to semi-structured and structured sparsity, and generalized to other SSM-based architectures. Empirically, we prune 50% of SSM weights without fine-tuning and observe no zero-shot accuracy loss, achieving the current state-of-the-art pruning algorithm for Mamba-based LLMs.

Abstract:
While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedbackdriven training, leaving the direct modeling of the generative reasoning process, P(\texthypothesis|\textbackground) (P(h|b)), unexplored. We demonstrate that directly training P(h|b) is mathematically intractable due to the combinatorial complexity (O(N^k)) inherent in retrieving and composing inspirations from a vast knowledge base. To break this barrier, we introduce MOOSE-Star, a unified framework enabling tractable training and scalable inference. In the best case, MOOSE-Star reduces complexity from exponential to logarithmic (O(\log N)) by (1) training on decomposed subtasks derived from the probabilistic equation of discovery, (2) employing motivation-guided hierarchical search to enable logarithmic retrieval and prune irrelevant subspaces, and (3) utilizing bounded composition for robustness against retrieval noise. To facilitate this, we release TOMATO-Star, a benchmark of 120,000 decomposed papers (38,400 GPU hours) for training. Furthermore, we show that while brute-force sampling hits a ``complexity wall,'' MOOSE-Star exhibits continuous test-time scaling.

Abstract:
Large language models (LLMs) are capable of emulating reasoning and using tools, creating opportunities for autonomous agents that execute complex scientific tasks. Protein design provides a natural testbed: although machine learning (ML) methods achieve strong results, these are largely restricted to canonical amino acids and narrow objectives, leaving unfilled need for a generalist tool for broad design pipelines. We introduce Agent Rosetta, an LLM agent paired with a structured environment for operating Rosetta, the leading physicsbased heteropolymer design software, capable of modeling non-canonical building blocks and geometries. Agent Rosetta iteratively refines designs to achieve user-defined objectives, combining LLM reasoning with Rosetta's generality. We evaluate Agent Rosetta on design with canonical amino acids, matching specialized models and expert baselines, and with non-canonical residues-where ML approaches fail-achieving comparable performance. Critically, prompt engineering alone often fails to generate Rosetta actions, demonstrating that environment design is essential for integrating LLM agents with specialized software. Our results show that properly designed environments enable LLM agents to make scientific software accessible while matching specialized tools and human experts.

Abstract:
Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with realworld developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, we hope SWE-Compass can provide a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.

Abstract:
Endto-end speech-in, speech-out dialogue systems are emerging as a powerful alternative to traditional ASR–LLM–TTS pipelines but remain prone to hallucinations due to limited factual grounding. While text-based dialogue models have effectively mitigated this issue through tools such as web search APIs, extending such capabilities to speech-in, speech-out systems remains underexplored. A key challenge is that tool integration increases latency, disrupting conversational flow. To mitigate this, we propose Streaming Retrieval-Augmented Generation (Stream RAG), a novel framework that reduces latency by predicting tool queries in parallel with user speech, even before the user finishes speaking. Specifically, we develop a post-training pipeline that teaches the model when to issue tool calls and how to generate spoken summaries using retrieved text results, thereby improving both accuracy and responsiveness. To evaluate our approach, we construct AudioCRAG, a benchmark created by converting queries from the publicly available CRAG dataset into speech form. Experimental results show that Stream RAG improves QA accuracy by over 20.0% absolute on AudioCRAG and achieves state-of-the-art performance, including outperforming cascaded systems, on the SLUE-SQA benchmark, while reducing latency by up to 57%. Stream RAG is modality-agnostic and can be applied equally to typed input, paving the way for more agentic, real-time AI assistants.

Abstract:
Large language models (LLMs) are highly sensitive to even small amounts of unsafe training data, making effective detection and filtering essential for trustworthy model development. Current stateof-the-art (SOTA) detection approaches primarily rely on moderation classifiers, which require significant computation overhead for training and are limited to predefined taxonomies. In this work, we explore data attribution approaches that measure the similarity between individual training samples and a small set of unsafe target examples, based on data representations such as hidden states or gradients. We identify a key limitation in existing methods: unsafe target texts contain both critical tokens that make them unsafe and neutral tokens (e.g., stop words or benign facts) that are necessary to form fluent language, and the latter of which makes the overall representations noisy for the purpose of detecting unsafe training data. To address this challenge, we propose Denoised Representation Attribution (DRA), a novel representation-based data attribution approach that denoises training and target representations for unsafe data detection. Across tasks of filtering jailbreaks and detecting gender bias, the proposed approach leads to significant improvement for data attribution methods, outperforming SOTA methods that are mostly based on moderation classifiers.

Abstract:
We study online bilateral trade, where a learner facilitates repeated exchanges between a buyer and a seller to maximize the Gain From Trade (GFT), i.e., the social welfare. In doing so, the learner must guarantee not to subsidize the market. This constraint is usually imposed per round through Weak Budget Balance (WBB). Despite that, (Bernasconi et al.,2024) shows that a Global Budget Balance (GBB) constraint on the profitenforced over the entire time horizon--can improve the GFT by a multiplicative factor of two. While this might appear to be a marginal relaxation, this implies that all existing WBB-focused algorithms suffer linear regret when measured against the GBB optimum. In this work, we provide the first algorithm to achieve sublinear regret against the GBB benchmark in stochastic environments under one-bit feedback. In particular, we show that when the joint distribution of valuations has a bounded density, our algorithm achieves \widetilde\mathcalO(T^3/4) regret. Our result shows that there is no separation between the one-dimensional problem of learning the optimal WBB price and the two-dimensional problem of learning the optimal GBB distribution over couples of prices.

Abstract:
Testtime compute (TTC) has become an increasingly prominent paradigm for enhancing large language models (LLMs). Despite the empirical success of methods such as best-of-n (BoN) sampling and sequential revision, their fundamental limits remain unclear. We address this gap by analyzing a mixture-of-reference policy model and proving that standard BoN is inherently suboptimal. To move closer to the optimal frontier, we study reward-filtered sequential inference, a simple procedure that selectively incorporates only high-reward generations into the context. This mechanism concentrates computation on superior policy candidates and suppresses inferior ones. On the theoretical side, we show that reward-filtered sequential inference yields strictly stronger guarantees than standard TTC paradigms. On the empirical side, we evaluate such an inference strategy across diverse benchmarks and observe consistent improvements over widely used approaches, demonstrating the practical effectiveness of our framework.

Abstract:
In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarsegrained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, proving that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.

Abstract:
Realworld data is frequently noisy and ambiguous. In crowdsourcing, for example, human annotators may assign conflicting class labels to the same instances. Partial-label learning (PLL) addresses this challenge by training classifiers when each instance is associated with a set of candidate labels, only one of which is correct. While early PLL methods approximate the true label posterior, they are often computationally intensive. Recent deep learning approaches improve scalability but rely on surrogate losses and heuristic label refinement. We introduce a novel probabilistic framework that directly approximates the posterior distribution over true labels using amortized variational inference. Our method employs neural networks to predict variational parameters from input data, enabling efficient inference. This approach combines the expressiveness of deep learning with the rigor of probabilistic modeling, while remaining architecture-agnostic. Theoretical analysis and extensive experiments on synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance in both accuracy and efficiency.

Abstract:
Transformer architectures, and their attention mechanisms in particular, form the foundation of modern large language models. While transformer models are widely believed to operate in highdimensional hidden spaces, we show that attention outputs are confined to a surprisingly low-dimensional subspace, with an effective dimensionality of only about 60% of the full space---a phenomenon that is consistently observed across diverse model families and datasets, and is strongly influenced by the attention output projection matrix. Critically, we find this low-rank structure as a key factor of the prevalent dead feature problem in sparse dictionary learning, where it creates a mismatch between randomly initialized features and the intrinsic geometry of the activation space. Building on this insight, we propose a subspace-constrained training method for sparse autoencoders (SAEs), initializing feature directions into the active subspace of activations. Our approach reduces dead features from 87% to below 1% in Attention Output SAEs with 1M features, and can further extend to other sparse dictionary learning methods. Our findings provide both new insights into the geometry of attention and practical tools for improving sparse dictionary learning in large language models. Code is available at \urlhttps://anonymous.4open.science/r/Language-Model-SAEs-C015.

Abstract:
Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence of predictable scaling laws, which are crucial for guiding research and optimizing resource allocation. We hypothesize that this may be attributed to the inherent noise, bias, and incompleteness of raw user interaction data in prior continual pretraining (CPT) efforts. This paper introduces a novel, layered framework for generating high-quality synthetic data that circumvents such issues by creating a curated, pedagogical curriculum for the LLM. We provide powerful, direct evidence for the utility of our curriculum by showing that standard sequential models trained on our principled synthetic data significantly outperform (+130% on recall@100 for SasRec) models trained on real data in downstream ranking tasks, demonstrating its superiority for learning generalizable user preference patterns. Building on this, we empirically demonstrate, for the first time, robust power-law scaling for an LLM that is continually pre-trained on our high-quality, recommendation-specific data. Our experiments reveal consistent and predictable perplexity reduction across multiple synthetic data modalities. These findings establish a foundational methodology for reliable scaling LLM capabilities in the recommendation domain, thereby shifting the research focus from mitigating data deficiencies to leveraging high-quality, structured information.

Abstract:
Rapid advances in AIgenerated image (AIGI) technology enable highly realistic synthesis, threatening public information integrity and security. Recent studies have demonstrated that incorporating texture-level artifact features alongside semantic features into multimodal large language models (MLLMs) can enhance their AIGI detection capability. However, our preliminary analyses reveal that artifact features exhibit high intra-feature similarity, leading to an almost uniform attention map after the softmax operation. This phenomenon causes attention dilution, thereby hindering effective fusion between semantic and artifact features. To overcome this limitation, we propose a lightweight fusion adapter, TranX-Adapter, which integrates a Task-aware Optimal-Transport Fusion that leverages the Jensen-Shannon divergence between artifact and semantic prediction probabilities as a cost matrix to transfer artifact information into semantic features, and an X-Fusion that employs cross-attention to transfer semantic information into artifact features. Experiments on standard AIGI detection benchmarks upon several advanced MLLMs, show that our TranX-Adapter brings consistent and significant improvements (up to +6% accuracy). Code can be found in supplementary material.

Abstract:
Commercial Microwave Links (CMLs) offer dense spatial coverage for rainfall sensing but produce pathintegrated measurements that make accurate ground-level reconstruction challenging. Existing methods typically oversimplify CMLs as point sensors and neglect the physical power-law relating rainfall to signal attenuation, resulting in degraded performance under heterogeneous precipitation. In this work, we view rain field reconstruction as a Bayesian inverse problem with Diffusion Models (DMs) as high-fidelity spatial priors. We show that diffusion models better preserve key rainfall statistics compared to censored Gaussian processes. Framing rainfall estimation as a Bayesian inverse problem with a DM prior enables training-free posterior sampling using a broad family of methods, including Plug-and-Play, Sequential Monte Carlo, and Replica Exchange methods. Experiments on synthetic and real-world datasets demonstrate consistent improvements over established CML-based reconstruction baselines.

Abstract:
Recent advances in large language models have enabled deep research systems that generate expertlevel reports through multi-step reasoning and evidence-based synthesis. However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and because deep research relies on retrieved evidence, report-wide claim verification is also necessary. To address these issues, we propose DEER, a benchmark for evaluating expert-level deep research reports. DEER systematizes evaluation criteria with an expert-developed taxonomy (7 dimensions, 25 subdimensions) operationalized as 101 fine-grained rubric items. We also provide task-specific Expert Evaluation Guidance to support LLM-based judging. Alongside rubric-based assessment, we propose a claim verification architecture that verifies both cited and uncited claims and quantifies evidence quality. Experiments show that while current deep research systems can produce structurally plausible reports that cite external evidence, there is room for improvement in fulfilling expert-level user requests and achieving logical completeness. Beyond simple performance comparisons, DEER makes system strengths and limitations interpretable and provides diagnostic signals for improvement.

Abstract:
Scaling testtime computation during language model inference, such as generating intermediate thoughts or sampling multiple candidate answers, has proven effective in improving model performance. While these techniques inherently rely on the stochastic nature of inference to explore diverse reasoning paths, prior theoretical works typically build on adeterministic decodingframework, overlooking the stochastic nature of practical language model inference. This work takes an initial step to bridge this gap by establishing a new theoretical framework, incorporatingrandomness and samplingdirectly into the decoding analysis. To demonstrate the framework's effectiveness, we apply it to the canonical in-context linear regression task with continuous and binary coefficients, simulating decoding via noise injection and sampling to analyze widely adopted inference techniques. We validate our theoretical findings through numerical simulations, with additional experiments on real-world tasks substantiating the framework's potential for practical applications.

Abstract:
Zeroshot methods detect LLM-generated text by computing statistical signatures using a surrogate model. Existing approaches typically employ a fixed surrogate for all inputs regardless of the unknown source. We systematically examine this design and find that detection performance varies substantially depending on surrogate-source alignment. We observe that while no single surrogate achieves optimal performance universally, a well-matched surrogate typically exists within a diverse pool for any given input. This finding transforms robust detection into a routing problem: selecting the most appropriate surrogate for each input. We propose DetectRouter, a prototype-based framework that learns text-detector affinity through two-stage training. The first stage constructs discriminative prototypes from white-box models; the second generalizes to black-box sources by aligning geometric distances with observed detection scores. Experiments on EvoBench and MAGE benchmarks demonstrate consistent improvements across multiple detection criteria and model families.

Abstract:
The ability to use tools is fundamental for large language model (LLM) agents. Given a task, existing systems use LLMs to plan and generate tool calls, which are executed by realworld tools to complete the task. However, tool calls are prone to errors because they are derived merely from LLM intrinsic capabilities. What is more, while it is useful to let LLMs iteratively refine the tool-call sequence using execution results from real tools, this process can be expensive and lead to unsafe results. To improve LLM tool calls and address issues caused by using real tools for refinement, we introduce Gecko, a comprehensive environment that simulates tool responses using a combination of rules and LLMs. Specifically, Gecko checks the validity of tool calls including input arguments and tool names, synthesizes reasonable responses that adhere to the output schema, and assesses whether all task objectives have been achieved. These three types of feedback provided by Gecko allow LLMs to refine their tool calls, forming a simple yet effective test-time scaling method named GATS. On BFCLv3 and \tau^2-bench, GATS consistently improves the tool calling performance of various LLMs including GPT-4o, GPT-5, and Gemini-3.0-pro. We further discuss working mechanisms of our method and share future possibilities.

Abstract:
Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing humanlevel performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments using math-only data with two widely-used methods: Reinforcement Learning (RL) and Supervised Finetuning (SFT) with detailed ablations. On top of the observation that RL-tuned models transfer better than SFT-tuned model, we identify on-policy fine-tuning as the key mechanism underlying cross-domain transfer, regardless of whether the training signal comes from RL or supervised learning. Latent-space representation and token-space distribution shift analyses reveal that off-policy SFT induces substantial representation and output drift, while on-policy RL preserves general-domain structure. Our results suggest a need to rethink the post-training recipes, particularly the reliance on off-policy SFT-distilled data to advance reasoning models.

Abstract:
The integration of Large Language Models (LLMs) with automated theorem proving has shown immense promise, yet is constrained by challenges in scaling up both trainingtime reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a step-level theorem proving system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of the LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent system that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. BFS-Prover-V2 achieves 95.08% and 41.4% on the miniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.

Abstract:
While Large Language Models (LLMs) have achieved remarkable success across diverse tasks, they remain vulnerable to jailbreak attacks, which pose significant risks to their secure deployment. Current safetymechanisms primarily rely on output guardrails to filter harmful outputs, yet these defenses are not impenetrable. Due to LLMs' inherent reliance on autoregressive, tokenby-token inference, their semantic representations lack robustness to spatially structured perturbations, such as redistributing tokens across different rows, columns, or diagonals. Exploiting the Transformer's spatial weakness, we propose SpatialJB to disrupt the model's output generation process, allowing harmful content to bypass guardrails without detection. Comprehensive experiments conducted on leading LLMs get nearly 100% ASR, demonstrating the high effectiveness of SpatialJB. Even after adding advanced output guardrails, like the OpenAI Moderation API, SpatialJB consistently maintains a success rate exceeding 75%, outperforming current jailbreak techniques by a significant margin. The proposal of SpatialJB exposes a key weak- ness in current guardrails and emphasizes the importance of spatial semantics, offering new insights to advance LLM safety research. To prevent potential misuse, we also present baseline defense strategies against SpatialJB and evaluate their effectiveness inmitigating such attacks.

Abstract:
Looping, reusing a block of layers across depth, and depth growing, training shallowto-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear. We provide a mechanistic unification: looped and depth-grown models exhibit convergent depth-wise signatures, including increased reliance on late layers and recurring patterns aligned with the looped or grown block. These shared signatures support the view that their gains stem from a common form of iterative computation. Building on this connection, we show that the two techniques are adaptable and composable: applying inference-time looping to the middle blocks of a depth-grown model improves accuracy on some reasoning primitives by up to 2×, despite the model never being trained to loop. Both approaches also adapt better than the baseline when given more in-context examples or additional supervised fine-tuning data. Additionally, depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy cooldown mixtures, which can be further boosted by adapting a middle block to loop. Overall, our results position depth growth and looping as complementary, practical methods for inducing and scaling iterative computation to improve reasoning.

Abstract:
We introduce ELECTRAFI, a fast, endto-end differentiable model for predicting periodic charge densities in crystalline materials. ELECTRAFI constructs anisotropic Gaussians in real space and exploits their closed-form Fourier transforms to analytically evaluate plane-wave coefficients via the Poisson summation formula. This formulation delegates non-local and periodic behavior to analytic transforms, enabling reconstruction of the full periodic charge density with a single inverse FFT. By avoiding explicit real-space grid probing, periodic image summation, and spherical harmonic expansions, ELECTRAFI matches or exceeds state-of-the-art accuracy across periodic benchmarks while being up to 633× faster than the strongest competing method, reconstructing crystal charge densities in a fraction of a second. When used to initialize DFT calculations, ELECTRAFI reduces total DFT compute cost by up to ～20 %, whereas slower charge density models negate savings due to high inference times. Our results show that accuracy and inference cost jointly determine end-to-end DFT speedups, and motivate our focus on efficiency.

Abstract:
We study the implicit bias of momentumbased optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD (\ell_2 norm), and Signum (\ell_\infty norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the \ell_\infty margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.

Abstract:
While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this longsequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers to practical deployment. Although sparse attention methods attempt to address this challenge, existing approaches either rely on oversimplified static patterns or require computationally expensive sampling operations to achieve dynamic sparsity, resulting in inaccurate pattern predictions and degraded generation quality. To overcome these limitations, we propose a \underlineMixture-\underlineOf-\underlineDistribution DiT (MOD-DiT), a novel sampling-free dynamic attention framework that accurately models evolving attention patterns through a two-stage process. First, MOD-DiT leverages prior information from early denoising steps and adopts a distributed mixing approach to model an efficient linear approximation model, which is then used to predict mask patterns for a specific denoising interval. Second, an online block masking strategy dynamically applies these predicted masks while maintaining historical sparsity information, eliminating the need for repetitive sampling operations. Extensive evaluations demonstrate consistent acceleration and quality improvements across multiple benchmarks and model architectures, validating MOD-DiT's effectiveness for efficient, high-quality video generation while overcoming the computational limitations of traditional sparse attention approaches.

Abstract:
We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVEGym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We will release our code publicly.

Abstract:
VisionLanguage-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints. To address this, we propose LangForce, enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior p(a \mid v) and a language-conditioned posterior \pi(a \mid v, \ell). We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Extensive experiments across on three benchmarks demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of LangForce to robustly ground language in action.

Abstract:
While Diffusion Large Language Models (dLLMs) offer structural advantages for global planning, efficiently verifying that they arrive at correct answers via valid reasoning traces remains a critical challenge. In this work, we propose a geometric perspective: Reasoning on the Manifold. We hypothesize that valid generation trajectories reside as stable attractors on the highdensity manifold of the learned distribution, whereas invalid paths exhibit off-manifold drift. To operationalize this, we introduce Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of the generated sequence through a forward-masking and backward-reconstruction cycle. Empirically, we demonstrate BMC's versatility across the full reasoning lifecycle: (1) in Diagnosis, it serves as a robust discriminator of solution validity without ground truth answer; (2) in Inference, it enables rejection resampling to effectively concentrate computational resources on complex reasoning tasks; and (3) in Alignment, it functions as a dense geometric reward that transforms sparse outcome supervision into fine-grained guidance, empowering models to self-evolve beyond standard baselines. Our results establish intrinsic geometric stability as a robust indicator of correctness for dLLMs.

Abstract:
Worldmodel-based imagine-then-act becomes a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model “imagines” the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn the multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.

Abstract:
Dynamic relational data arise in many machine learning applications, yet their evolving structure poses challenges for learning representations that remain consistent and interpretable over time. A common approach is to learn time varying node embeddings, whose usefulness depends on well defined stability properties across nodes and across time. We introduce Unfolded Laplacian Spectral Embedding (ULSE), a principled extension of unfolded adjacency spectral embedding to normalized Laplacian operators, a setting where stability guarantees have remained out of reach. We prove that ULSE satisfies both crosssectional and longitudinal stability under a dynamic stochastic block model. Moreover, the Laplacian formulation yields a dynamic Cheeger-type inequality linking the spectrum of the unfolded normalized Laplacian to worst case conductance over time, providing structural insight into the embeddings. Empirical results on synthetic and real world dynamic networks validate the theory.

Abstract:
Diffusion Language Models (DLMs) have recently emerged as a strong alternative to autoregressive language models (ARLMs), due to their comparable accuracy and faster inference speed via parallel decoding. However, standard DLM decoding strategies, which rely on unmasking only high-confidence tokens, encounter an inherent information-theoretic bottleneck that restricts decoding progress and ultimately slows down generation. We demonstrate this through an information-theoretic lower bound that the number of decoding rounds must grow linearly with the sample's total information and inversely with the per-round information budget, establishing a bits-to-rounds principle. Motivated by this theory, we propose Explore-Then-Exploit (ETE), a training-free decoding strategy that maximizes information throughput and decoding efficiency. ETE combines cross-block decoding with targeted exploration of high-uncertainty tokens to reshape the conditional distribution and trigger cascades of confident predictions. Experiments across diverse benchmarks verify our theoretical bounds and demonstrate that ETE consistently reduces the number of decoding rounds compared to confidence-only baselines without compromising generation quality. Furthermore, ETE integrates efficiently with KV caching, translating these algorithmic gains into improved tokens-per-second throughput.

Abstract:
MultiAgent Debate (MAD) has shown promise in leveraging collective intelligence to improve reasoning and reduce hallucinations, yet it remains unclear how information exchange shapes the underlying ability. Empirically, MAD exhibits paradoxical phenomena, such as accuracy improvement accompanied by substantial increase in token entropy, and remarkable divergence between homogeneous and heterogeneous model combinations. In this paper, we propose a Bayesian uncertainty analysis framework for MAD, which decomposes total predictive uncertainty into epistemic uncertainty reducible by debate context and aleatoric uncertainty induced by internal model noise. Across multiple model configurations, we find that effective debate hinges on achieving high epistemic gain under controlled aleatoric cost. Building on this insight, we design an uncertainty-guided multi-agent reinforcement learning (MARL) algorithm that explicitly optimizes aleatoric noise reduction and epistemic information utilization. Experiments show that our training significantly improves post-debate accuracy and stability, and enhances individual reasoning beyond single-agent RL, providing a unified Bayesian uncertainty perspective for understanding and improving MAD.

Abstract:
Joint planning through languagebased interactions is a key area of human-AI teaming. Planning problems in the open world often involve various aspects of incomplete information and unknowns, e.g., objects involved, human goals/intents -- thus leading to knowledge gaps in joint planning. We consider the problem of discovering optimal interaction strategies for AI agents to actively elicit human inputs in object-driven planning. To this end, we propose Minimal Information Neuro-Symbolic Tree (MINT) to reason about the impact of knowledge gaps and leverage self-play with MINT to optimize the AI agent’s elicitation strategies and queries. More precisely, MINT builds a symbolic tree by making propositions of possible human-AI interactions and by consulting a neural planning policy to estimate the uncertainty in planning outcomes caused by remaining knowledge gaps. Finally, we leverage LLM to search and summarize MINT’s reasoning process and curate a set of queries to optimally elicit human inputs for best planning performance. By considering a family of extended Markov decision processes with knowledge gaps, we analyze the return guarantee for a given MINT with active human elicitation. Our evaluation on three benchmarks involving unseen/unknown objects of increasing realism shows that MINT-based planning attains near-expert returns by issuing a limited number of questions per task while achieving significantly improved rewards and success rates.

Abstract:
Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and TopK accuracy based on single ground-truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.

Abstract:
Recently,steering vectors (SVs)have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which finetuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate asfull-sequence SVs (FSSVs), which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process. To address the first limitation, we proposejoint trainingof steering factors and directions, such that post-hoc factor selection is no longer required. Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering factors are essential for stability and efficiency of joint training. To tackle the second limitation, we draw inspiration fromrepresentation fine-tuningand introducePrompt-Only Steering Vector (PrOSV), an SV that intervenes only on a few prompt tokens. Our empirical results show that PrOSV outperforms traditional FSSVs on AxBench when using our joint training scheme. We also find that PrOSV achieves a better tradeoff between general model utility and adversarial robustness than FSSV.

Abstract:
Spatial understanding remains a key challenge in visionlanguage models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image–text testbed to probe how left–right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left–right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting. To gain the mechanistic understanding, we perform an attention decomposition and show that interactions between positional and token embeddings induce a horizontal attention gradient that breaks left–right symmetry in the encoders; ablating this contribution substantially reduces left–right discrimination. Our results provide a mechanistic insight of when and how CLIP-style models acquire relational competence.

Abstract:
We identify the Spectral Energy Gain in extreme model compression, where lowrank binary approximations outperform tiny-rank floating-point baselines for heavy-tailed spectra. However, prior attempts fail to realize this potential, trailing state-of-the-art 1-bit methods. We attribute this degradation to Latent Geometry Misalignment: standard singular vectors exhibit high coherence (spiky distribution), the worst-case geometry for binary quantization. To realize this gain, we propose LittleBit-2, a framework employing Internal Latent Rotation and Joint Iterative Quantization (Joint-ITQ). This approach acts as a geometric preconditioner, aligning coherent latent distributions with the binary hypercube with zero inference overhead. Empirically, LittleBit-2 establishes a new state-of-the-art in the sub-1-bit regime (1～0.1 bpp) on Llama-2 and Llama-3, matching the fidelity of leading 1-bit baselines.

Abstract:
Designing highperformance neural networks for new tasks requires balancing optimization quality with search efficiency. Current methods fail to achieve this balance: neural architectural search is computationally expensive, while model retrieval often yields suboptimal static checkpoints. To resolve this dilemma, we model the performance gains induced by fine-grained architectural modifications as edit-effect evidence and build evidence graphs from prior tasks. By constructing a retrieval-augmented model refinement framework, our proposed M-DESIGN dynamically weaves historical evidence to discover near-optimal modification paths. M-DESIGN features an adaptive retrieval mechanism that quickly calibrates the evolving transferability of edit-effect evidence from different sources. To handle out-of-distribution shifts, we introduce predictive task planners that extrapolate gains from multi-hop evidence, thereby reducing reliance on an exhaustive repository. Based on our model knowledge base of 67,760 graph neural networks across 22 datasets, extensive experiments demonstrate that M-DESIGN consistently outperforms baselines, achieving the search-space best performance in 26 out of 33 cases under a strict budget. Code and data are available at: https://anonymous.4open.science/r/M-DESIGN-245/.

Abstract:
Despite the advanced capabilities of Large VisionLanguage Models (LVLMs), they frequently suffer from object hallucination. One reason is that visual features and pretrained textual representations often become intertwined in the deeper network layers. To address this, we propose REVIS, a training-free framework designed to explicitly re-activate this suppressed visual information. Rooted in latent space geometry, REVIS extracts the pure visual information vector via orthogonal projection and employs a calibrated strategy to perform sparse intervention only at the precise depth where suppression occurs. This surgical approach effectively restores visual information with minimal computational cost. Empirical evaluations on standard benchmarks demonstrate that REVIS reduces object hallucination rates by approximately 19% compared to state-of-the-art baselines, while preserving general reasoning capabilities.

Abstract:
Mixtureof-Experts architectures have become the standard for efficient LLM scaling, typically employing expert parallelism to distribute experts across devices. However, the absence of explicit load balancing constraints during inference allows adversarial inputs to trigger severe routing concentration. We demonstrate that out-of-distribution prompts can manipulate the routing mechanism such that all tokens are routed to the same set of top-k experts, which creates computational bottlenecks on certain devices while forcing others to idle. This converts an efficiency mechanism into a denial-of-service attack vector, leading to violations of service-level agreements for time-to-first-token (TTFT). We propose RepetitionCurse, a black-box strategy to exploit this vulnerability. By identifying a universal flaw in MoE router behavior, RepetitionCurse constructs attack prompts using simple repetitive token patterns in a model-agnostic manner. On widely deployed MoE models hosted on 8-GPU clusters, our method increases TTFT by 20% to 148%, significantly degrading service quality.

Abstract:
Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decisionmaking processes remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to externalize LLMs' thinking, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective to improve the monitorability of their latent thinking. However, previous methods only try to develop external modules instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, improving the transparency of LLMs and helping monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the effectiveness of TELLME on detoxification tasks, where LLMs achieve consistent improvement among multimodal test sets, distinct architectures, and varying parameter scales. We further analyze TELLME's improvement on LLMs' generalization ability from both optimal transport theory and empirical perspectives.

Abstract:
Transformers used for evidencegrounded question answering with binary adjudication (e.g., support/refute or yes/no) can be highly sensitive to the order in which exchangeable evidence is presented, producing dispersion across permutations and unreliable attempted answers (“hallucinations” under a Bernoulli predicate). We treat evidence order as a nuisance variable and show that next-token training minimizes expected conditional description length over orderings, which can be close to Bayes-optimal in expectation while deviating under any fixed ordering. We quantify this expectation–realization gap via a Quantified Martingale Violation (QMV) bound that predicts O(\log n) growth in permutation dispersion under harmonic positional sensitivity. We then derive the Expectation-level Decompression Law (EDFL), relating expected information budget to achievable reliability for Bernoulli predicates, and use it to define Bits-to-Trust (B2T), Risk-of-Hallucination (RoH), and the Information Sufficiency Ratio (ISR), with a fixed ISR-gating rule for answer/abstain decisions under permutation mixtures. On 3,059 grounded items from a five-benchmark evidence-grounded QA suite (FEVER, HotpotQA, NQ-Open, PopQA, and Controls), we observe logarithmic dispersion and Jensen gains from uniform permutation mixtures. In a pre-specified held-out audit (528 items), an ISR=1 gate attains 0.0–0.7% hallucination with 20.6–27.9% abstention (95% CIs).

Abstract:
Speculative decoding accelerates inference for (M)LLMs, yet a trainingdecoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We proposeVariational Speculative Decoding(VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws MCMC samples from an oracle-filtered posterior, while the M-step maximizes weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.58% speedup over EAGLE-3 and 8.80% over ViSpec, significantly improving decoding efficiency.

Abstract:
Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and primary empirical results showing the advantages of latent reasoning with looped transformers or continuous CoT, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to their practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, they are harder to generate and decode tokens in the continuous representation space compared with discrete states. We therefore proposeCoevolutionaryContinuousDiscreteDiffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, as well as good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which reveals strong empirical performance in extensive language modeling experiments on realworld tasks.

Abstract:
Semisupervised imitation learning (SSIL) consists in learning a policy from a small dataset of action-labeled trajectories and a much larger dataset of action-free trajectories. Some SSIL methods learn an inverse dynamics model (IDM) to predict the action from the current state and the next state. An IDM can act as a policy when paired with a video model (VM-IDM) or as a label generator to perform behavior cloning on action-free data (IDM labeling). In this work, we first show that VM-IDM and IDM labeling learn the same policy in a limit case, which we call the IDM-based policy. We then argue that the previously observed advantage of IDM-based policies over behavior cloning is due to the superior sample efficiency of IDM learning, which we attribute to two causes: (i) the ground-truth IDM tends to be contained in a lower complexity hypothesis class relative to the expert policy, and (ii) the ground-truth IDM is often less stochastic than the expert policy. We argue these claims based on insights from statistical learning theory and novel experiments, including a study of IDM-based policies using recent architectures for unified video-action prediction (UVA). Motivated by these insights, we finally propose an improved version of the existing LAPO algorithm for latent action policy learning. We experiment on the ProcGen, Push-T and Libero benchmarks.

Abstract:
Evaluating LLM reliability via scalar probabilities often fails to capture the structural dynamics of reasoning. We introduce TRACED, a framework that assesses reasoning quality through theoretically grounded geometric kinematics. By decomposing reasoning traces into Progress (displacement) and Stability (curvature), we reveal a distinct topological divergence: correct reasoning manifests as highprogress, stable trajectories, whereas hallucinations are characterized by low-progress, unstable patterns (stalled displacement with high curvature fluctuations). Leveraging these signatures, our probabilistic framework achieves competitive performance and superior robustness across diverse benchmarks. Crucially, TRACED bridges geometry and cognition by mapping high curvature to "Hesitation Loops'' and displacement to ''Certainty Accumulation'', offering a physical lens to decode the internal dynamics of machine thought.

Abstract:
Lung cancer remains the leading cause of cancer mortality, driving the development of automated screening tools to alleviate radiologist workload. Standing at the frontier of this effort is Sybil, a deep learning model capable of predicting future risk solely from computed tomography (CT) with high precision. However, despite extensive clinical validation, current assessments rely purely on observational metrics. This correlationbased approach overlooks the model's actual reasoning mechanism, necessitating a shift to causal verification to ensure robust decision-making before clinical deployment. We propose S(H)NAP, a model-agnostic auditing framework that constructs generative interventional attributions validated by expert radiologists. By leveraging realistic 3D diffusion bridge modeling to systematically modify anatomical features, our approach isolates object-specific causal contributions to the risk score. Providing the first interventional audit of Sybil, we demonstrate that while the model often exhibits behavior akin to an expert radiologist, differentiating malignant pulmonary nodules from benign ones, it suffers from critical failure modes, including dangerous sensitivity to clinically unjustified artifacts and a distinct radial bias.

Abstract:
Urban wind flow reconstruction is essential for assessing air quality, heat dispersion, and pedestrian comfort, yet remains challenging when only sparse sensor data are available. We propose GenDA, a generative data assimilation framework that reconstructs highresolution wind fields on unstructured meshes from limited observations. The model employs a multiscale graph-based diffusion architecture trained on computational fluid dynamics (CFD) simulations and interprets classifier-free guidance as a learned posterior reconstruction mechanism: the unconditional branch learns a geometry-aware flow prior, while the sensor-conditioned branch injects observational constraints during sampling. This formulation enables obstacle-aware reconstruction and generalization across unseen geometries, wind directions, and mesh resolutions without retraining. We consider both sparse fixed sensors and trajectory-based observations using the same reconstruction procedure. When evaluated against supervised graph neural network (GNN) baselines and classical reduced-order data assimilation methods, GenDA reduces the relative root-mean-square error (RRMSE) by 25-57% and increases the structural similarity index (SSIM) by 23-33% across the tested meshes. Experiments are conducted on Reynolds-averaged Navier-Stokes (RANS) simulations of a real urban neighborhood in Bristol, United Kingdom, at a characteristic Reynolds number of \mathrmRe\approx2×10^7, featuring complex building geometry and irregular terrain. The proposed framework provides a scalable path toward generative, geometry-aware data assimilation for environmental monitoring in complex domains.

Abstract:
The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semisupervised setting in which pretrained unimodal encoders are aligned using a small number of image–text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.

Abstract:
Quantizationaware training (QAT) is an effective method to drastically reduce the memory footprint of LLMs while keeping performance degradation at an acceptable level. However, the optimal choice of quantization format and bit-width presents a challenge in practice. The full design space of quantization is not fully explored in the context of QAT, and the precise trade-off between quantization and downstream performance is poorly understood, as comparisons often rely solely on perplexity-based evaluations. In this work, we address these shortcomings with an empirical study of QAT in the low-bit regime. We show that k-means based weight quantization outperforms integer formats and can be implemented efficiently on standard hardware. Furthermore, we find that, under a fixed inference memory budget, the best performance on generative downstream tasks is achieved with 1-bit quantized weights.

Abstract:
In pursuit of data privacy, federated learning (FL) collaboratively trains a global model by aggregating local models learned from decentralized data. However, FL heavily depends on highquality labels, which are often impractical in the real world, leading to the federated label-noise (F-LN) problem. Unlike traditional noisy labels, the F-LN problem is exacerbated by the inherent heterogeneity of FL, where clients experience varying levels and types of label errors. In this study, we observe that the global model of FL exhibits slow memorization of noisy labels, suggesting its ability to maintain reliable predictions and robust representations in FL. Based on this insight, we propose a novel method termed Global Reviser for Federated Learning with Noisy Labels (FedGR) to improve the robustness of FL against the F-LN problem. Specifically, FedGR first leverages the label-noise-robust characteristics of the global model to filter and refine the noisy labels on each client using the sieving-and-refining module. Then, it regularizes local model training with the assistance of the global model through the following two modules: the globally revised exponential moving average (EMA) distillation module and the global representation regularization module. Extensive experiments on three widely used F-LN benchmarks demonstrate the superior performance of FedGR, outperforming seven state-of-the-art baselines even in complicated label-noise and data heterogeneity. The code will be released upon acceptance.

Abstract:
We study a class of iterated empirical risk minimization (ERM) procedures in which two successive ERMs are performed on the same dataset, and the predictions of the first estimator enter as an argument in the loss function of the second. This setting, which arises naturally in active learning and reweighting schemes, introduces intricate statistical dependencies across samples and fundamentally distinguishes the problem from classical singlestage ERM analyses. For linear models trained with a broad class of convex losses on Gaussian mixture data, we derive a sharp asymptotic characterization of the test error in the high-dimensional regime where the sample size and ambient dimension scale proportionally. Our results provide explicit, fully asymptotic predictions for the performance of the second-stage estimator despite the reuse of data and the presence of prediction-dependent losses. We apply this theory to revisit a well-studied pool-based active learning problem, removing oracle and sample-splitting assumptions made in prior work. We uncover a fundamental tradeoff in how the labeling budget should be allocated across stages, and demonstrate a double-descent behavior of the test error driven purely by data selection, rather than model size or sample count.

Abstract:
“Codeas-Policy” considers how executable code can complement data-intensive Vision-LanguageAction (VLA) methods, yet their effectiveness as autonomous controllers for embodied manipulation remains underexplored. We present CaPX, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents control robots by synthesizing and executing programs that compose perception and control primitives. Building on this foundation, CaP-Bench evaluates frontier language and vision-language models across varying levels of abstraction, interaction, and perceptual grounding. Across 7 simulation tasks and 12 models, CaP-Bench reveals a consistent trend: performance improves with human-crafted abstractions but degrades as these priors are removed, exposing a dependence on designer scaffolding. At the same time, we observe that this gap can be mitigated through scaling agentic test-time computation–through multi-turn interaction, structured execution feedback, visual differencing, automatic skill synthesis, and ensembled reasoning–substantially improves robustness even when agents operate over low-level primitives. These findings allow us to derive CaP-Agent0, a training-free framework that recovers human-level reliability on several manipulation tasks in simulation and on real embodiments. We further introduce CaP-RL, showing reinforcement learning with verifiable rewards improves success rates and transfers from sim2real with minimal gap. Together, CaP-X provides a principled, open-access platform for advancing embodied coding agents. Project page: https://cap-x-anonymous.github.io

Abstract:
Robotic policies deployed in realworld environments often encounter post-training faults, where retraining, exploration, or system identification are impractical. We introduce an inference-time, cerebellar-inspired residual control framework that augments a frozen reinforcement learning policy with online corrective actions, enabling fault recovery without modifying base policy parameters. The framework instantiates core cerebellar principles, including high-dimensional pattern separation via fixed feature expansion, parallel microzone-style residual pathways, and local error-driven plasticity with excitatory and inhibitory eligibility traces operating at distinct time scales. These mechanisms enable fast, localized correction under post-training disturbances while avoiding destabilizing global policy updates. A conservative, performance-driven meta-adaptation regulates residual authority and plasticity, preserving nominal behavior and suppressing unnecessary intervention. Experiments on MuJoCo benchmarks under actuator, dynamic, and environmental perturbations show improvements of up to +66% on \textttHalfCheetah-v5 and +53% on \textttHumanoid-v5 under moderate faults, with graceful degradation under severe shifts and complementary robustness from consolidating persistent residual corrections into policy parameters.

Abstract:
Equipping agents with interactive environments and verifiable tasks for selfexploration is essential for cultivating generalist agents capable of adapting to diverse scenarios. However, high-quality agentic data remain critically scarce, and existing synthesis methods suffer from significant limitations regarding environmental diversity and scalability. To address these challenges, we introduce ScaleEnv, a framework that constructs fully interactive environments and verifiable tasks entirely from scratch. Specifically, ScaleEnv ensures environment reliability through procedural testing, and guarantees task completeness and solvability via tool dependency graph expansion and executable action verification. By enabling agents to learn through exploration within ScaleEnv, we demonstrate significant performance improvements on unseen, multi-turn tool-use benchmarks such as \tau^2-Bench and VitaBench, highlighting strong generalization capabilities. Furthermore, we investigate the relationship between environment and task scaling, providing empirical evidence that scaling environmental diversity is critical for robust agent learning.

Abstract:
While multiagent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity – agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity – and (2) efficacy uncertainty – MAS are deployed without understanding if there are tangible benefits compared to single-agent systems (SAS). We propose MASOrchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10× efficiency over strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.

Abstract:
Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of realworld queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to just lookups. Deeper analysis identifies that accurate intent disambiguation serves as the prerequisite for leading these predictive behaviors. Furthermore, elevating the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.

Abstract:
Large language models frequently exhibit hallucinations: fluent and confident outputs that are factually incorrect or unsupported by the input context. While recent hallucination detection methods have explored various features derived from attention maps, the underlying mechanisms they exploit remain poorly understood. In this work, we propose SinkProbe, a hallucination detection method grounded in the observation that hallucinations are deeply entangled with attention sinks tokens that accumulate disproportionate attention mass during generation - indicating a transition from distributed, input-grounded attention to compressed, prior-dominated computation. Importantly, although sink scores are computed solely from attention maps, we find that the classifier preferentially relies on sinks whose associated value vectors have large norms. Moreover, we show that previous methods implicitly depend on attention sinks by establishing their mathematical relationship to sink scores. Our findings yield a novel hallucination detection method grounded in theory that produces state-of-the-art results across popular datasets and LLMs.

Abstract:
We propose a system for marking sensitive or copyrighted texts to detect their use in finetuning large language models (LLMs) under black-box access with statistical guarantees. Our method builds digital "marks" using invisible Unicode characters organized into ("cue", "reply") pairs. During an audit, prompts containing only "cue" fragments are issued to trigger regurgitation of the corresponding "reply", indicating document usage. To control false positives, we compare against held-out counterfactual marks and apply a ranking test, yielding a verifiable bound on the false positive rate. The approach is minimally invasive, scalable across many sources, robust to standard processing pipelines, and achieves high detection power even when marked data is a small fraction of the fine-tuning corpus.

Abstract:
This paper considers a novel variant of the online fair division problem involving multiple agents in which a learner sequentially observes an indivisible item that has to be irrevocably allocated to one of the agents while satisfying a desired balance between fairness and efficiency. Existing algorithms assume a small number of items with a sufficiently large number of copies, which ensures a good utility estimation for all itemagent pairs from noisy observed utilities. However, this assumption may not hold in many real-life applications, for example, an online platform that has a large number of users (items) who use the platform's service providers (agents) only a few times (a few copies of items), which makes it difficult to accurately estimate utilities for all item-agent pairs. To address this limitation, we assume utility is an unknown function of item-agent features. We propose algorithms that model online fair division as a contextual bandit problem, with provable sub-linear regret upper bound guarantees. Our experimental results further validate the effectiveness of the proposed algorithms.

Abstract:
Featuremap knowledge distillation (KD) transfers internal representations well between comparably sized Vision Transformers (ViTs), but it often fails in compression. We revisit this failure and uncover a paradox. Sample-wise SVD shows that each image is highly compressible, which seems to suggest that a narrow student with a linear projector should match the teacher "in principle". However, a dataset-level view contradicts this intuition: PCA shows that the teacher is a union of low-rank subspaces with significant subspace rotation across inputs. We further introduce token-level Spectral Energy Patterns (SEP) and find an architecture-invariant encoding law: tokens spread energy broadly across channel modes even when they live in low-rank subspace, creating a bandwidth mismatch. We refer to this combined phenomenon as an encoding mismatch. We propose two minimal remedies, Lift or WideLast: (i) Lift retains a lightweight lifting projector at inference to provide wider channel, or (ii) WideLast widens only the student’s last block, enabling an input-dependent expansion. On ImageNet-1K, these fixes revive feature KD for ViT compression, improving DeiT-Tiny distilled from CaiT-S24 from 74.86% to 77.53%/78.23% top-1 accuracy, and they also strengthen students trained without distillation. Our analyses clarify when and why feature-map KD fails and then how to fix it. Code and raw data are provided in the supplementary materials.

Abstract:
Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances large language models (LLMs) reasoning but severely suffers from calibration degeneration, where models become excessively overconfident in incorrect answers. Previous studies devote to directly incorporating calibration objective into existing optimization target. However, our theoretical analysis demonstrates that there exists a fundamental gradient conflict between the optimization for maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples reasoning and calibration objectives. Extensive experiments demonstrate that our DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and practical solution for more reliable LLM deployment.

Abstract:
Spoken dialogues with and between voice agents are becoming increasingly common, yet assessing them for their socially harmful content such as violence, harassment, and hate remains textcentric and fails to account for audio-specific cues and transcription errors. We present LALM-as-a-Judge, the first controlled benchmark and systematic study of large audio-language models (LALMs) as safety judges for multi-turn spoken dialogues. We generate 24,000 unsafe and synthetic spoken dialogues in English that consist of 3-10 turns, by having a single dialogue turn including content with one of 8 harmful categories (e.g., violence) and on one of 5 grades, from very mild to severe. On 160 dialogues, 5 human raters confirmed reliable unsafe detection and a meaningful severity scale. We benchmark three open-source LALMs: Qwen2-Audio, Audio Flamingo 3, and MERaLiON as zero-shot judges that output a scalar safety score in \left[0,1\right] across audio-only, transcription-only, or multimodal inputs, along with a transcription-only LLaMA baseline. We measure the judges' sensitivity to detecting unsafe content, the specificity in ordering severity levels, and the stability of the score in dialogue turns. Results reveal architecture- and modality-dependent trade-offs: the most sensitive judge is also the least stable across turns, while stable configurations sacrifice detection of mild harmful content. Transcription quality is a key bottleneck: Whisper-Large may significantly reduce sensitivity for transcription-only modes, while largely preserving severity ordering. Audio becomes crucial when paralinguistic cues or transcription fidelity are category-critical. We summarize all findings and provide actionable guidance for practitioners.

Abstract:
While PhysicsInformed Neural Networks (PINNs) offer a mesh-free approach to solving PDEs, standard point-wise residual minimization suffers from convergence pathologies in topologically complex domains like Triply Periodic Minimal Surfaces (TPMS). The locality bias of point-wise constraints fails to propagate global information through tortuous channels, causing unstable gradients and conservation violations. To address this, we propose the Multi-scale Weak-form PINN (MUSA-PINN), which reformulates PDE constraints as integral conservation laws over hierarchical spherical control volumes. We enforce continuity and momentum conservation via flux-balance residuals on control surfaces. Our method utilizes a three-scale subdomain strategy—comprising large volumes for long-range coupling, skeleton-aware meso-scale volumes aligned with transport pathways, and small volumes for local refinement—alongside a two-stage training schedule prioritizing continuity. Experiments on steady incompressible flow in TPMS geometries show MUSA-PINN outperforms state-of-the-art baselines, reducing relative errors by up to 93% and preserving mass conservation.

Abstract:
Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a trainingfree inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6× higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.

Abstract:
Multimodal Large Language Models (MLLMs) possess intrinsic reasoning and worldknowledge capabilities, yet adapting them for dense retrieval remains challenging. Existing approaches typically rely on invasive parameter updates, such as full fine-tuning and LoRA, which risk disrupting the pre-trained semantic manifold and degrading the complex knowledge structures crucial for logical inference. To address this, we proposeSLQ, a parameter-efficient tuning framework that adapts MLLMs for retrieval while keeping the backbone entirely frozen. SLQ introduces a small set ofShared Latent Queriesthat are appended to both text and image tokens, leveraging the model’s native causal attention to aggregate multimodal context into a unified embedding space. Furthermore, to rigorously evaluate retrieval beyond superficial pattern matching, we constructKARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments demonstrate that SLQ achieves better performance compared to full fine-tuning and LoRA baselines on COCO and Flickr30K, while significantly outperforming them on KARR-Bench, validating that preserving the frozen semantic manifold via non-invasive adaptation is an effective strategy for MLLM-based retrieval.

Abstract:
Recent advancements in Large Language Models (LLMs) have successfully employed searchbased strategies to enhance code generation. However, existing methods typically rely on static, sparse public test cases for verification, leading to pseudo-correctness—where solutions overfit the visible public tests but fail to generalize to hidden test cases. We argue that optimizing against a fixed, weak environment inherently limits robustness. To address this, we propose AdverMCTS, a novel adversarial Monte Carlo Tree Search framework that combats pseudo-correctness by coupling code search with active vulnerability discovery. AdverMCTS formulates generation as a minimax-style game between a Solver agent, which synthesizes code candidates, and an Attacker agent, which evolves to generate targeted test cases that exploit logical divergences in the current solution pool. These discovered tests form a dynamic, progressively hostile filter that penalizes fragile reasoning. Extensive experiments demonstrate that AdverMCTS significantly outperforms state-of-the-art baselines, effectively reducing false positive rates and forcing the model to generalize beyond the initial constraints. The resources of this work are available at https://anonymous.4open.science/r/AdverMCTS_open-A255.

Abstract:
Progress in neural combinatorial optimization for Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodological tension: static benchmarks encourage benchmark overfitting, while uncalibrated generators obscure algorithmic difficulty with stochastic noise. To resolve this, we introduce DynaSchedBench, a diagnostic framework for DFJSP that rigorously controls the instancegeneration process for evaluation. Instead of relying on naive parameter sampling, our approach utilizes Sequential Event-Space Calibrator (SESC) that computes a novel Schedule Stress Index (SSI) to stratify instances by difficulty. We demonstrate that SESC is substantially more computationally efficient than evolutionary baselines while converging reliably to the target metrics. The framework integrates modular components for instance generation, snapshot-based simulation, agents, evaluation, and visualization, enabling rigorous testing of reactive and lookahead-based policies. Leveraging this calibrated environment, we identify key limitations of LLM-based scheduling agents. Specifically, in step-wise online decision-making for dynamic scheduling, we identify an ``Observability Paradox'': providing agents with oracle access to full structural information can hurt policy performance, underperforming concise statistical summaries. Furthermore, despite substantial token overhead, tool-augmented and refinement strategies do not reliably improve performance, and most LLM agents do not consistently surpass strong dispatching baselines—behaving more like robust heuristic approximators than superior optimizers.

Abstract:
Understanding \emphmodality interaction in multimodal large language models (MLLMs) remains a central challenge for reliable and interpretable deployment. We introduce Partial Information Decomposition (PID) as a unified, decisionlevel framework that separates \emphunique, \emphredundant, and \emphsynergistic contributions of sensory and linguistic inputs, moving beyond representation alignment and outcome-based evaluation. Across vision–language benchmarks, PID reveals stable \emphinteraction regimes: reasoning-oriented tasks consistently exhibit high cross-modal synergy, whereas knowledge-oriented tasks are dominated by language-unique information. These regimes generalize across architectures and scales and predict causal sensitivity to modality-level interventions. We extend this framework to tri-modal systems with Sensory PID, treating language as a control variable to decompose information gain from video and audio. Applied to omni-modal models, this analysis uncovers a persistent \emphsensory synergy bottleneck, where decisions remain dominated by visual information even on fusion-dependent tasks. Layer-wise analysis further show that sensory integration emerges late and is instruction-gated, following early visual saturation.

Abstract:
Graph collaborative filtering (GCF) is a dominant paradigm in recommender systems, where contrastive learning (CL) objectives such as the Sampled Softmax (SSM) loss are widely used for optimization. However, it remains unclear how CL interacts with the prediction mechanism of GCF. By unfolding the prediction mechanism of GCF, we show that the useritem prediction score is computed by aggregating learnable weights over a large number of neighbor pairs formed by the multi-hop neighbors of the user and the item. This analysis implies that effective optimization critically depends on which neighbor pairs are upweighted during training. Empirically, we find that effective recommendation is achievable by selectively upweighting only a small subset of neighbor pairs whose constituent neighbors are structurally similar to the target user and item, and that the effect of such selective upweighting varies across different neighbor pair types. Based on these findings, we analyze SSM and identify key limitations in its neighbor pair weight update dynamics. To address these limitations, we propose NT-SSM, an effective and principled CL objective that induces type-aware neighbor pair weight update dynamics. Experiments demonstrate consistent performance improvements over SSM across multiple datasets and GCF models.

Abstract:
Large Language Models (LLMs) offer a promising avenue for scientific discovery, yet their application to symbolic regression is often constrained by inefficient search strategies and coarse feedback signals. Current methods typically guide LLMs using scalar metrics (e.g., global Mean Squared Error), which fail to identify which specific components of a proposed equation are driving performance or causing error. We introduce InfluenceGuided Symbolic Regression (IGSR), a method that frames equation discovery as an iterative two-step process combining diverse term generation with rigorous selection: an LLM generates candidate basis functions \psi_j(\mathbfx) for a linear model, which are then evaluated using granular influence scores \Delta_j. These scores quantify each term's marginal contribution to generalization accuracy, enabling an influence-guided pruning process that systematically refines the model structure. Integrating this mechanism into a Monte Carlo Tree Search (MCTS) enables navigating the combinatorial search space while balancing exploration of novel functional forms with exploitation of high-influence components. We demonstrate IGSR's effectiveness across a diverse suite of benchmarks, including LLM-SRBench, pharmacological PKPD models, an epidemiological simulation, and real-world genomic data. Notably, we validate the framework's capacity for genuine discovery in a case study using a high-dimensional biological dataset, in which IGSR identified a novel relationship between DNA methylation and RNA Polymerase II pausing; a hypothesis that was subsequently supported via wet-lab experimentation.

Abstract:
We theoretically study the hallucination phenomena in two canonical diffusion samplers: the stochastic Denoising Diffusion Probabilistic Model (DDPM) and the deterministic Denoising Diffusion Implicit Model (DDIM). We analyze the reverse ODE (DDIM) and SDE (DDPM) for a Gaussian mixture target, proving that after a critical time \tau, (a) DDIM can become stuck on the segment connecting the two nearest modes and (b) the stochasticity of DDPM helps DDPM become unstuck from this region, thus avoiding hallucination. Our empirical validation verifies that DDPM has a significantly lower hallucination rate than DDIM when this region is entered. Building on our observations, we exhibit how using additional stochastic steps can help DDIM avoid hallucinations and offer new insights on how to design improved samplers.

Abstract:
Looped Language Models (LoopLMs) enable efficient latent reasoning through depth recurrence, yet exhibit unreliable testtime scaling behavior: performance often peaks at a certain iteration depth and then collapses with further recurrence. Through latent dynamics analysis, we find an inherent trade-off between stability and effectiveness in existing architectures and strategies. By conceptualizing reasoning as uncertainty reduction, we propose that convergence toward stable fixed points while preserving effectiveness represents a promising way. To this end, we propose STARS (STAbility-driven Recurrent Scaling), a training framework that constrains latent states to approach asymptotically stable fixed points. This is realized via efficient Jacobian Spectral Radius Regularization with random loop sampling, enabling STARS to maximize effectiveness while ensuring rigorous stability. Experiments on arithmetic tasks show that STARS achieves reliable test-time scaling, and on complex mathematical reasoning it substantially mitigates performance degradation as recurrence depth increases while also improving peak performance.

Abstract:
Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that these issues can be addressed by exploiting the shared information between modalities to compensate for the impaired one. To this end, we analyze multimodal interactions redundant (shared), unique (exclusive), and synergistic (emergent) task-relevant information provided by the modalities -- to determine their impacts on model reliability. Specifically, amplifying redundant interactions would increase this exploitable shared information to resolve these issues; yet, modern instruction datasets often eliminate redundancies to prioritize visual grounding. We bridge this gap through a self-captioning workflow featuring a \textscMultimodal Interaction Gate: a mechanism to convert unique interactions into redundant interactions. Our findings suggest that increasing redundancy can reduce visual induced errors by 38.3% and improve consistency by 16.8%.

Abstract:
Heterogeneous LoRArank methods address system heterogeneity in federated fine-tuning of foundation models by assigning client-specific ranks based on computational capabilities. However, these methods achieve only marginal computational savings, as dense feed-forward computations dominate. Sparse Mixture-of-Experts (SMoE) provides a promising alternative through conditional computation, yet we identify that its naive application to heterogeneous federated settings introduces two critical discordances: (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing. Our convergence analysis demonstrates that these discordances lead to degraded convergence, particularly for resource-constrained clients. To address these challenges, we propose Universally Balanced Sparse Mixture-of-Experts (UB-SMoE), which introduces Dynamic Modulated Routing (DMR) to rebalance expert utilization, and Universal Pseudo-Gradient (PG) to reconstruct learning signals for non-activated experts. These mechanisms form a self-reinforcing cycle that maintains expert viability across heterogeneous clients. Experiments on benchmarks show that UB-SMoE achieves up to 45.0% computational reduction on low-resource clients while improving their performance by 8.7 × compared to existing heterogeneous LoRA-rank methods.

Abstract:
Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameterefficient fine-tuning as isolated processes, our empirical analysis suggests they may be intrinsically coupled. We posit the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation, acting as keys that unlock specific data patterns. Building on this observation, we propose From Parameters to Data (P2D), a unified framework that leverages these task-sensitive attention heads as a dual compass for both sample mining and structural pruning. To rigorously quantify the total pipeline cost, we introduce the Alignment Efficiency Ratio (AER) metric for both selection latency and training time. Mechanistically, P2D identifies critical heads via a lightweight proxy and uses them as a functional filter to curate high-affinity data, establishing a synergistic pipeline. Empirically, by updating merely 10% of attention heads on 10% of the data, P2D achieves an 8.3 pp performance gain over strong baselines and delivers a 7.0x end-to-end time speedup. These results validate that precise parameter-data synchronization eliminates redundancy, offering a new paradigm for efficient alignment.

Abstract:
Offline MetaReinforcement Learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta-learning adaptability, yet it faces fundamental challenges from context and policy distribution shifts. These issues hinder agents trained on offline datasets from adapting to online environments, and are further exacerbated under sparse-reward settings. As a result, agents often become trapped in an inherent pattern dilemma, failing to achieve robust generalization. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to behavior policy, thereby effectively mitigating the context distribution shift. To further handle policy shift and model exploitation, we incorporate conservative value regularization into imagination-based rollouts, fully leveraging task representations that are sufficient for reliable adaptation. Our method is evaluated on multiple offline environments, where it consistently outperforms state-of-the-art approaches, achieving superior stability and generalization under severe out-of-distribution and sparse-reward settings.

Abstract:
Understanding the capability bottlenecks of embodied multimodal large language models (MLLMs) is crucial for improvement. However, existing embodied benchmarks fail to provide actionable insights because they focus on tasklevel evaluation rather than discovering capability bottlenecks. To address this, we introduce BEAR, where we divide embodied tasks into 14 atomic skills for skill-level evaluation. BEAR comprises 4,469 interleaved image–video–text entries across 14 skills in 6 categories, ranging from low-level perception to high-level planning. We evaluate 20 MLLMs on BEAR under a hierarchical skill-level diagnosis framework and discover that (1) perceptual capabilities are major bottlenecks behind reasoning failures, and (2) models fail due to unstable spatiotemporal modeling which remain unexposed in previous benchmarks. Furthermore, building on these insights, we propose BEAR-Agent, a multimodal conversable agent that augments MLLMs with visual and spatial tools. It substantially enhances MLLMs’ performance across skills, yielding a relative improvement of 17.5% on GPT-5 on BEAR and outperform baselines by a large margin in both simulation and real-robot experiments across models.

Abstract:
Fewshot molecular property prediction (FSMPP) is essential in drug discovery and materials design, where high-quality labeled data are often scarce and expensive to obtain. Despite the promising performance of existing methods, especially in the context-aware methods, they still face two-fold severe challenges with insufficient structural context modeling & redundant auxiliary context learning, leading to inadequate context graph exploration and ineffective information utilization for effective molecule representation learning. To address these, in this paper, we propose a novel framework by learning on \underlineRelational and \underlineCompact c\underlineontext \underlineGraph, named \method, to comprehensively exploit the context graph for expressive molecular property prediction. Specifically, the proposed \method contains two core modules: a (1) cross-property relational learning module to better model the structural and relational context information, and a (2) context graph information bottleneck module to adaptively suppress irrelevant auxiliary signals for compact context information utilization, followed by a detailed theoretical demonstration regarding the importance of joint relational and compact knowledge extraction in context graphs. Extensive experiments across multiple datasets demonstrate that \method consistently outperforms state-of-the-art methods, validating its superiority. Code is available at~\urlhttps://anonymous.4open.science/r/ReCoG-main-40C7/.

Abstract:
VisionLanguage Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational costs by connecting pretrained encoders through lightweight mappings, yet still demand substantial paired data. In this work, we investigate the potential of repurposing the classification heads of pretrained vision models as semantic prototypes. The recycling of these weights, typically discarded after pretraining, unlocks two distinct capabilities: it enables zero-shot alignment by using weights as semantic anchors, and serves as a robust data augmentation strategy by mixing these prototypes with real image-text pairs. We demonstrate that integrating our approach with several state-of-the-art post-hoc alignment techniques consistently boosts accuracy in cross-modal retrieval, zero- and few-shot classification tasks.

Abstract:
Tabular foundation models via pretrained priordata fitted networks (PFNs) achieve remarkable generalization performance on arbitrary testing tabular data, when sample distributions are independent of the deployed classifiers, i.e., a non-strategic regime. In a variety of real-world scenarios, however, once a classifier is deployed, individuals corresponding to tabular samples strategically manipulate their features to obtain favorable results, inducing feature distribution shifts at deployment, i.e., a strategic regime. As concurrent tabular foundation models exclusively overlook the strategic tabular data, we systematically explore the boundary of PFNs on strategic tabular data, characterizing their theoretical properties and empirical performance towards such a commonly encountered type of tabular data, offering a pioneer analysis on bridging PFNs and the society domain. To be first, we inform that such strategic manipulation creates a mismatch between the grounding, strategic prior and the pretrained prior. Subsequently, the prior mismatch leads to an inevitable posterior prediction bias of current tabular foundation models when applied to strategic environments. To address this challenge, we propose Strategic Prior-data Fitted Network (SPN), a strategy-aware framework that adapts tabular foundation models to strategic environments at inference time. SPN uses in-context learning to approximate post-manipulation inputs and then performs prediction for strategic tabular data. Experiments on real-world and synthetic tabular data show that SPN consistently improves performance and robustness under strategic manipulation compared to both tabular foundation models and classical tabular methods.

Abstract:
Conditioned Sequence Models (CSMs) learn policies by treating returnto-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the Q-value of the output policy is consistent with the input RTG. By leveraging a Q function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that \methodname achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.

Abstract:
Extended chainof-thought reasoning can degrade performance on deterministic state-tracking tasks—not due to preference biases, but fundamental information-theoretic limits in decoder-only transformers. We establish: (1) an Attention Bottleneck Theorem with matching lower bound, proving state-tracking capacity scales as O(H \cdot \log(L/H) \cdot \sqrtd_h); (2) a context-dependent error model yielding super-exponential accuracy decay; (3) the State-Space Jaccard metric distinguishing capability from preference failures; (4) a Deterministic Horizon d^ \in [19, 31] beyond which tool delegation becomes necessary. Across 12 models and 8 task domains—including SWE-Bench, WebArena, and SQL-Multi—tool-integrated reasoning achieves 86–94% accuracy versus 24–42% for neural chain-of-thought. Fine-tuning on optimal-length traces yields <5% improvement, confirming an architectural ceiling. High cross-model correlation (r = 0.81–0.91) demonstrates these failures are architectural, not training-specific. Our results provide principled guidance for when pure neural reasoning should yield to hybrid approaches in agentic systems.

Abstract:
Sparse autoencoders (SAEs) decompose neural network activations into interpretable features, but many features never activatea problem called feature death. Death rates vary dramatically across models: near-zero on GPT-2, over 70% on AlphaFold3 with identical SAE configurations. Why? We find that dimension-level activation outliers (dimensions where mean magnitude is large relative to per-token variation) shift pre-activations at initialization, making feature fate depend on weight-outlier alignment rather than input content. We derive \gamma = \|\boldsymbol\mu\|/\|\boldsymbol\sigma\| from this mechanism; it predicts initial death rates (Spearman \rho > 0.9) across 275 model-layer combinations spanning language, vision, and protein models. This creates two death pathways; we trace their recovery mechanisms and find one resolves naturally while the other bottlenecks on the SAE slowly learning to mean-center. Initializing the SAE to mean-center from the start eliminates this outlier-induced death, confirming the mechanism.

Abstract:
As the computational demands for pretraining Large Language Models (LLMs) continue to surge, the need for efficient training paradigms becomes critical. Despite the vast resources already invested in existing pre-trained checkpoints, these assets often remain under-leveraged due to architectural limitations. We introduce an "orthogonal growth" strategy designed to "recycle" these checkpoints by strategically expanding their parameters prior to continued training. Our method focuses on optimizing converged Mixture-of-Experts (MoE) models through two dimensions: interpositional layer copying for increased depth and noisy expert duplication for expanded width. Through extensive scaling laws analysis, we demonstrate a strong positive correlation between the "sunk cost" (prior investment) and the final model accuracy. Empirical results on models up to 70B parameters and 1T tokens show that our recycling approach yields a 10.6% accuracy improvement compared to training from scratch under identical extra compute budgets. This work provides a cost-effective blueprint for sustainable large-scale LLM development.

Abstract:
Layer pruning efficiently reduces Large Language Model (LLM) computational costs but often triggers sudden performance collapse. Existing representationbased analyses struggle to explain this mechanism. We propose studying pruning through decision representation. Focusing on multiple-choice tasks, we introduce two metrics, Decision Margin and Option Frequency, and an Iterative Pruning method to analyze layer-wise decision dynamics. Our findings reveal a sharp decision transition that partitions the network into two stages: a Silent Phase, where the model cannot yet predict the correct answer, and a Decisive Phase, where the correct prediction emerges. We also find that pruning the Decisive Phase has minimal impact, whereas pruning the Silent Phase triggers immediate performance collapse, highlighting its extreme sensitivity to structural changes. Therefore, we conclude that pruning-induced collapse stems from disrupting the Silent Phase, which prevents the critical decision transition from occurring.

Abstract:
Large VisionLanguage Models have achieved unprecedented success in zero-shot recognition by aligning visual features with broad semantic concepts. However, this semantic abstraction creates a critical vulnerability in open-world deployment: the "Hubris of Semantics", where models force-fit unknown anomalies into known categories with high confidence due to the lack of explicit negative knowledge. To address this Open-World Trustworthiness Paradox, we propose Immuno-VLM, a bio-inspired framework that adapts the biological principle of Immunological Negative Selection to high-dimensional latent spaces. Departing from traditional Open-Set Recognition methods that rely on passive density estimation or inefficient pixel-space outlier generation, Immuno-VLM leverages the generative reasoning of Large Language Models to actively hallucinate ``Semantic Antibodies''—textual descriptions of near-distribution outliers (e.g., look-alikes, contextual anomalies) that effectively bound the decision space of known classes. Extensive experiments on ImageNet-1K and four challenging OOD benchmarks reveal that Immuno-VLM establishes a new state-of-the-art.

Abstract:
Timeseries forecasting is critical in various scenarios, such as energy, transportation, and public health. However, most existing forecasters rely primarily on one-way inference, i.e., mapping history to target, and overlook the structural information provided by a revised natural chain (``history (model input) -- target (ground-truth output) -- post-target continuation''). The post-target continuation records how trajectories evolve after the target, which can help stabilize forecasting, but it is not observable at inference time. In this work, we aim to obtain an approximate proxy of the post-target continuation for the current input, providing structural knowledge for bidirectional forecasting. This idea is instantiated as KUP-BI (Knowledge Utilization Paradigm with Bidirectional Inspiration), a new time-series modeling paradigm that distills continuation-style knowledge (as an approximate post-target continuation proxy) from a \emphtrain-only historical library and integrates it into standard forecasting backbones. The input stream and the continuation-proxy stream are fused via a lightweight feature-level gating module. This design does not introduce information beyond what is already contained in the training trajectories; instead, it provides a structured inductive bias that helps backbones exploit typical continuation patterns rather than relying solely on parametric extrapolation. Experimental results on six public datasets show that KUP-BI consistently improves the forecasting performance of state-of-the-art models, with small additional overhead.

Abstract:
Reinforcement Learning (RL) has become a cornerstone for improving the performance of Large Language Models (LLMs). However, its rollout phase constitutes a significant efficiency bottleneck, mainly arising from the longtail bubbles across data parallel ranks, particularly in long-context scenarios where faster GPUs remain idle while waiting for stragglers. Existing solutions, such as partial rollout or asynchronous RL, mitigate these bubbles by compromising the algorithm's strict synchronous nature. Instead, we propose BubbleSpec, a novel framework that accelerates RL rollouts while strictly keeping the mathematical exactness. Instead of attempting to eliminate bubbles, BubbleSpec exploits them. We exploit the idle time windows of faster ranks to pre-generate rollout results for subsequent steps, serving as drafts for speculative decoding. Unlike prior speculative methods that rely on historical epoch similarity and warm-ups, BubbleSpec is agnostic to dataset size and provides immediate acceleration from the onset of training. Extensive evaluations demonstrate that BubbleSpec reduces decoding steps by ～50% and increases rollout throughput by up to 1.8×. Critically, BubbleSpec is seamlessly compatible with various RL frameworks and strategies as it sustains the strict synchronous property of RL algorithms.

Abstract:
The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term thisinvestigatory intelligence, distinguishing it fromexecutional intelligence, which merely completes assigned tasks. Data Science provides a natural testbed, as realworld analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduceDeep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, andDDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or merely scaling, but also on intrinsic strategies of agentic models.

Abstract:
Model merging has emerged as a lightweight paradigm for enhancing Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. In this work, we analyze latestage pre-training trajectories and uncover a Rank-1 Subspace phenomenon: while raw optimization steps oscillate violently, consecutive \emphmerged checkpoints collapse onto a stable, approximately one-dimensional linear manifold. We theoretically ground this observation in a \emphriver-valley landscape analysis: averaging acts as a geometric low-pass filter that dampens high-curvature noise to reveal the optimal descent direction. Capitalizing on this insight, we propose Extra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss without additional gradient updates. Extensive experiments across GPT-2 and LLaMA families (124M to 2B) demonstrate that Extra-Merge consistently outperforms standard merging baselines. Notably, it yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes effectively to the Muon optimizer \citepjordan2024muon.

Abstract:
Current VideoLLM approaches for Video Temporal Grounding (VTG) typically rely on direct timestamp generation from an unstructured visual-token stream, often resulting in brittle numerics and inconsistent boundaries. To address this, we propose Foresee-to-Ground (F2G), a framework that enforces a verifiable Identify-then-Measure routine. F2G couples predictive temporal perception with evidence-driven reasoning: it learns boundary-sensitive temporal representations to constructs a video-wide evidence pool of candidate event segments, and then augments the LLM input with citable evidence units and enforces identifying the moment by citing the evidence before measuring final metric boundaries under the cited hypothesis. This design decouples event identification from precise measurement, effectively stabilizing the reasoning process. Extensive experiments demonstrate that F2G consistently improves grounding accuracy across diverse benchmarks, transfers robustly across different Video-LLM backbones, and preserves general video understanding capabilities.

Abstract:
Building mathematical optimization models is critical in operations research (OR), while it requires substantial human expertise. Recent advancements have utilized large language models (LLMs) to automate this modeling process. However, existing works often struggle to verify the correctness of the generated optimization models, without checking the rationality of the constraints and variables or the validity of solutions to the generated models. This hampers the subsequent verification and correction steps, and thus it severely hurts the modeling accuracy. To address this challenge, we propose a novel LLMbased framework with Dual-side Verification (OptiVer) from both structure and solution perspectives, thereby improving the modeling accuracy. The structure-side verification ensures that the modeling structure of the generated optimization models aligns with the original problem description, accurately capturing the problem's constraints and requirements. Meanwhile, the solution-side verification interprets and evaluates the validity of the solutions, confirming that the optimization models are logically and mathematically sound. Extensive experiments on several popular benchmarks demonstrate that our approach significantly outperforms the state-of-the-art, achieving over 20% improvement in accuracy.

Abstract:
Incontext localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on the fly without training or parameter updates. Despite rapid advances in vision–language models (VLMs), achieving category-agnostic and visually grounded ICL remains an open problem, even though it is essential for applications such as image editing, personalized visual search, and retrieval. Existing methods are fragile and rely on explicit category supervision, which not only limits applicability in realistic settings with unnamed or instance-specific objects but also introduces category bias that steers predictions toward semantic priors rather than visual evidence. We introduce a two-stage training framework that explicitly optimizes in-context attention between support bounding boxes and query images without category supervision. We further refine localization via reinforcement learning using Group Relative Policy Optimization (GRPO) to directly minimize localization error. This formulation enforces visual correspondence over semantic priors, yielding robust instance-level localization. Empirically, a 7B-parameter model trained with our objectives outperforms models up to 72B parameters, demonstrating that context-aware localization objectives can surpass scaling alone. Comprehensive ablations validate the contribution of each component.

Abstract:
Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and financial losses. While LLMbased approaches show promise in code analysis, they struggle with deep protocol-level logic bugs involving complex state-dependent behaviors across multiple execution stages. We present Agora, a domain-aware multi-agent framework that integrates hypothesis-driven testing with LLM capabilities for systematic protocol verification. Agora employs specialized agents that collaboratively explore protocol state spaces, synthesize attack scenarios using domain-specific constraints, and validate findings through iterative refinement. This explicit role separation enables reasoning about global protocol invariants beyond single-function code analysis. We evaluate Agora on four consensus implementations (Raft, EPaxos, HotStuff, BullShark) using four state-of-the-art LLMs. Agora discovers 15 previously unknown protocol-level logic bugs that violate safety properties, while existing LLM-based agents fail to detect any such protocol-level logic bugs. Our results demonstrate that domain-aware multi-agent collaboration is essential for detecting deep logic bugs in complex protocols.

Abstract:
Diffusionbased neural solvers for combinatorial optimization repeatedly re-evaluate dense edge/factor interactions, making inference expensive in wall-clock time and often memory-bound at scale. We introduce LoRe, a training-free, inference-time drop-in wrapper that enforces per-step interaction-evaluation budgeting: at each iteration, it evaluates only a fixed fraction of interactions by dynamically routing computation to high-conflict or high-uncertainty interactions, instead of using a fixed sparsification (e.g., static kNN graphs or static masks). Under fully inclusive end-to-end wall-clock accounting, LoRe substantially improves scalability on maximum independent set, pushing feasible inference beyond the baseline out-of-memory boundary by 2.5× while delivering 4--5× speedup, about 81% peak-memory reduction, and 86.9% MIS-size retention. As an auxiliary study on large-scale TSP, LoRe achieves 7.5× median speedup at n=1000 with 97% median memory reduction and a median gap difference of -0.22 percentage points versus the baseline, supporting its generality across problem families.

Abstract:
Skeletonbased action recognition aims to understand human behaviors from body joint sequences and is especially challenging in the one-shot setting, where only a single labeled exemplar is available for each novel action. A key challenge is learning representations that capture the hierarchical and compositional structure of human motion while aligning effectively with high-level action semantics under extreme data scarcity. Existing approaches, largely based on Euclidean embeddings and low-level motion cues, struggle to model the tree-like organization of skeleton data, limiting cross-modal alignment and generalization to unseen action categories. We propose SkelHCC, a unified skeleton hyperbolic CLIP-driven cache adaptation framework for one-shot skeleton-based action recognition. SkelHCC introduces an Explicitly Hierarchical Hyperbolic CLIP (EH-HCLIP) module that embeds skeleton sequences and action language into a shared hyperbolic space. By leveraging the negative curvature and exponential volume growth of hyperbolic geometry, EH-HCLIP naturally encodes the joint–part–body hierarchy of human anatomy and yields structurally consistent cross-modal representations. To support efficient one-shot adaptation, SkelHCC further integrates a training-free LLM-guided Multi-granularity Voting Cache (LMV-Cache) for context-aware inference. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II demonstrate that SkelHCC consistently outperforms state-of-the-art methods.

Abstract:
Users increasingly face the challenge of selecting an appropriate LLM for a given task from a rapidly growing pool of LLMs, each with distinct but often opaque latent properties. Compounding this challenge, users may lack the vocabulary or awareness to explicitly articulate the characteristics they value in an LLM's responses or deployment. We propose an interactionefficient active learning framework in which a dueling bandit algorithm iteratively selects pairs of LLMs, collects user feedback about their responses, and updates its belief about the user's latent preferences. We introduce a novel belief-aware upper confidence bound strategy that balances exploration of the model pool with exploitation of inferred preferences, enabling efficient alignment between user needs and LLM capabilities under user-specified cost and time budgets. Through diverse experiments on LLMs and human studies, we experimentally verify that our model can efficiently match users to LLMs at a lower cost.

Abstract:
Foundation models are increasingly trained on synthetic data generated by prior model iterations rather than exclusively on real data. Thisselfconsumingtraining paradigm can lead to model collapse, divergence, or bias amplification. Recent work (Ferbach et al., 2024) shows that incorporating human curation into the loop can steer a self-consuming model toward human-aligned behavior, but these analyses focus on a single, isolated model that solely consumes its own outputs. In practice, however, models often interact and train on input output pairs produced by other models. This paper studies self-consuming training in themulti-modelregime. We first formalize a framework for interacting self-consuming models and characterize when the resulting dynamical system converges to a stable point. We then examine how human curation of one model affects its own alignment (self-influence) and how such effects propagate to other models (cross-influence). Unlike isolated settings where human curation always enhances model alignment, we show that cross-model interactions can dampen or even invert this effect, ultimately degrading long-term alignment.

Abstract:
Multivariate time series forecasting is critical in many realworld systems, and thus modeling cross-channel dependencies is essential. Although existing methods improve overall accuracy by enhancing representations and cross-channel interactions, it remains challenging to reliably capture inter-variable dependencies under specific conditions. We observe that dependencies in real data are often state-dependent and noisy; in such cases, dense interactions can amplify spurious correlations and lead to representation over-smoothing, which may yield unreliable predictions in certain scenarios. Motivated by this, we propose MS-FLOW, a sparse-bottleneck framework that explicitly models inter-variable interaction as capacity-limited information flow. Specifically, MS-FLOW replaces fully connected communication with selective sparse routing, retaining only a few critical dependency paths and injecting cross-variable signals under a strict communication budget, thereby suppressing redundant connections and spurious-correlation propagation. Extensive experiments demonstrate that MS-FLOW learns more reliable multivariate correlations, achieving state-of-the-art forecasting accuracy on 12 real-world benchmarks while producing fewer yet more reliable dependencies, shifting multivariate forecasting from “more interaction” to “more effective interaction”.

Abstract:
Modern learning systems excel at interpolation but struggle to generalize to unseen tasks outside the training distribution's support. This failure occurs even in simple settings, such as handling task parameters beyond the training range, and persists despite advances in foundation models. To this end, we develop the Relational Task Extrapolator (RTE), an algorithm designed to enable systematic extrapolation to novel tasks. The key observation is that extrapolation is inherently relational: extrapolating to unseen tasks requires learning how tasks transform into one another. If a model learns the transformation between tasks A and B during training, it can apply that same transformation to relate known tasks to unseen ones at test time. RTE operationalizes this idea by decomposing each target task into a known anchor task and a transformation linking the anchor and target. It then learns a relational operator, mapping an anchor–transformation pair to predictions for the target task. We instantiate RTE across multiple task extrapolation regimes in function prediction, e.g. where target tasks use outof-range parameters (parameter extrapolation), has greater compositional depth (length extrapolation), and/or recombine function primitives in unseen ways (compositional extrapolation). We further extend RTE to sequence prediction, integrating it into fine-tuning algorithms for foundation models. Across empirical studies, we find that RTE substantially outperforms existing approaches on extrapolation to novel, unseen tasks.

Abstract:
Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In realworld scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce ComplexMCP, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale; (2) over-confidence, where agents skip essential environment verifications; and (3) strategic defeatism, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.

Abstract:
Semisupervised referring expression segmentation (SS-RES) aims to achieve precise pixel-level language grounding under limited annotation, yet suffers from limited supervision and unreliable pseudo-labels when exploiting unlabeled image–text pairs. In this work, we propose Learning to Label, a reinforced self-evolving framework (L2L) that casts pseudo-label construction as a learnable decision-making process. To build foundational understanding, we leverage a multimodal large language model to extract semantic–spatial priors, which are instantiated as initial soft segmentation proposals and elevated—together with textual cues—into learnable guidance signals that condition a hierarchical segmentation network. To ensure stable learning, a reinforced pseudo-label selection is further formulated as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions. This reinforced self-evolving loop enables joint optimization of the segmentation model and pseudo-labels, progressively enhancing label reliability under sparse supervision. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate improvements over existing methods, validating its effectiveness and generalization.

Abstract:
In this paper, we study a general class of nonconvex constrained stochastic problems with differenceof-convex (DC) regularization, where the feasible set is possibly nonconvex, and the concave part of the DC regularizer is allowed to be nonsmooth. The fundamental challenge lies in maintaining feasibility for nonconvex constraints while achieving favorable oracle complexity. Although single-loop algorithms are efficient in solving unconstrained DC optimization problems, their potential for constrained optimization with DC structure remains largely unexplored. To address this gap, we develop MoSSP, a Momentum-based Single-loop Stochastic Penalty method for such problems with provable complexity guarantee. The key idea is to perform a single stochastic proximal-gradient update that approximates the gradient of the Moreau envelope of the composite term, which consists of the penalty function and the convex component of the DC regularizer. Simultaneously, the proximal mapping of its concave component is computed in parallel. We derive two algorithm variants: a Polyak-momentum version with \mathcalO(\varepsilon^-4) oracle complexity for finding stochastic \varepsilon-KKT points, and an improved \mathcalO(\varepsilon^-3) version incorporating recursive momentum. Experiment results demonstrate the effectiveness of our proposed algorithms.

Abstract:
We study realtime audio-responsive character control as a deployment-faithful problem: strictly causal, bounded-latency streaming that must generate coherent full-body motion at interactive frame rates while the audio condition can change abruptly (tempo shifts, drops, or user edits). Prior music-to-motion systems are largely optimized for offline generation with global context, and degrade in streaming rollouts where conditioning history becomes stale or unreliable. We introduce DiscoForcing, a streaming audio-driven diffusion framework that combines a causal music encoder that captures rhythmic structure and phase dynamics with a diffusion-forcing sequence model trained under heterogeneous noise levels across the temporal horizon. Building on this, we design a hybrid temporal schedule and a history-guided streaming sampler to explicitly trade off responsiveness against long-horizon consistency under non-stationary audio. Implemented in an end-to-end real-time interactive system with online avatar playback and humanoid deployment workflows, DiscoForcing delivers more stable long-horizon rollouts and sharper audio–motion alignment than prior baselines under matched causality and latency constraints while maintaining real-time throughput.

Abstract:
While KeyValue (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought (CoT) coherence is critical. We introduce KVFundaBench to systematically evaluate this gap, revealing a sharp dichotomy: while retrieval tasks remain robust, reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. Extending our analysis to the DeepSeek-R1 model, we uncover that its specialized attention patterns offer unique insights into the fragility of reasoning chains. Guided by these findings—specifically the necessity of preserving few-shot examples as indivisible Semantic Units—we propose ShotKV. This approach explicitly separates prefill and decoding phases to prioritize semantic integrity. Empirical results demonstrate that ShotKV achieves 9%-18% accuracy improvements on long-context generation tasks and effectively generalizes to document QA, all while delivering an 11% latency reduction compared to full cache inference.

Abstract:
Visuomotor policies aim to learn complex manipulation tasks from expert demonstrations. However, generating smooth and coherent trajectories remains challenging, as it requires balancing proximal precision with distal foresight. Existing approaches typically focus on optimizing intrachunk action distributions, often neglecting the inter-chunk coherence. Consequently, inter-chunk discontinuities significantly impede the learning of coherent long-horizon actions. To overcome this limitation and achieve a synergetic balance between precision and foresight, we propose FocalPolicy, a foresight-aware visuomotor policy that combines Frequency-Optimized Chunking with Locally Anchored flow matching. We introduce a foresight composite objective that supervises time-domain alignment within the proximal actions while regularizing frequency-domain structure over multiple future action chunks to improve cross-chunk coherence. To efficiently learn complex action distributions, we design locally anchored campling to enhance target signal propagation efficiency during consistency flow matching training. Extensive experiments demonstrate that our method consistently outperforms existing approaches while further validating the effective generalizability of our proposed modules to other baseline models. The project will be released as open source.

Abstract:
As Large Language Models (LLMs) more deeply integrate into human life across various regions, aligning them with pluralistic cultures is crucial for improving user engagement and mitigating cultural conflicts. For this purpose, recently, different culturespecific corpora have been carefully curated, either synthesized or manually annotated. Nevertheless, inspired by culture theories, we identify two key challenges faced by these datasets: (1) Representativeness: These corpora fail to fully capture the target culture's core characteristics, causing insufficient cultural coverage with redundancy; (2) Distinctiveness: They struggle to distinguish the unique nuances of a given culture from shared patterns across other relevant ones, hindering precise cultural modelling. To handle these challenges, we introduce CAReDiO, a novel data optimization framework, which alternatively refines culture-sensitive questions and responses according to information-theoretic objectives in an in-context optimization manner, enhancing the cultural informativeness and distinguishability of constructed data. Extensive experiments on 15 distinct cultures demonstrate that CAReDiO can create high-quality data with richer cultural information and enable efficient alignment of small open-source or large proprietary LLMs with as few as 200 training samples, consistently outperforming previous datasets in both multi-choice and open-ended cultural benchmarks.

Abstract:
We present a unified algorithmic framework for the numerical solution, constrained optimization, and physicsinformed learning of PDEs with a variational structure. Our framework is based on a Galerkin discretization of the underlying variational forms, and its high efficiency stems from a novel highly-optimized and GPU-compliant TensorGalerkin framework for linear system assembly (stiffness matrices and load vectors). TensorGalerkin operates by tensorizing element-wise operations within a Python-level Map stage and then performs global reduction with a sparse matrix multiplication that performs message passing on the mesh-induced sparsity graph. It can be seamlessly employed downstream as i) a highly-efficient numerical PDEs solver, ii) an end-to-end differentiable framework for PDE-constrained optimization, and iii) a physics-informed operator learning algorithm for PDEs. With multiple benchmarks, including 2D and 3D elliptic, parabolic, and hyperbolic PDEs on unstructured meshes, we demonstrate that the proposed framework provides significant computational efficiency and accuracy gains over a variety of baselines in all the targeted downstream applications.

Abstract:
While diffusion models have emerged as a powerful class of generative models, their learning dynamics remain poorly understood. We address this issue first by empirically showing that standard diffusion models trained on natural images exhibit a simplicity bias, learning simple, pairwise input statistics before specializing to higher-order correlations. We reproduce this behaviour in simple denoisers trained on a minimal data model, the mixed cumulant model, where we precisely control both pair-wise and higher-order correlations of the inputs. We identify a scalar invariant of the model that governs the sample complexity of learning pair-wise and higher-order correlations that we call thediffusion information exponent, in analogy to related invariants in different learning paradigms. Using this invariant, we prove that the denoiser learns simple, pair-wise statistics of the inputs at linear sample complexity, while more complex higher-order statistics, such as the fourth cumulant, require at least cubic sample complexity. We also prove that the sample complexity of learning the fourth cumulant is linear if pair-wise and higher-order statistics share a correlated latent structure. Our work describes a key mechanism for how diffusion models can learn distributions of increasing complexity.

Abstract:
While multimodal large language models (MLLMs) have made substantial progress in singleimage spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms:cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, andstepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that \method consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.

Abstract:
In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex realworld tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.

Abstract:
We present a finegrained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on type of model mis-specifications. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.

Abstract:
Reinforcement learning (RL) based posttraining for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60 % (e.g., LlamaV-o1: 63.33% \rightarrow5.74% on JailbreakV-28K, R1-OneVision: 69.07%\rightarrow5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20%\rightarrow65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1–3 reasoning steps typically suffices to redirect the full generation toward safe completions.

Abstract:
Conformal prediction (CP) has become a cornerstone of distributionfree uncertainty quantification, conventionally evaluated by its coverage and interval length. This work critically examines the sufficiency of these standard metrics. We demonstrate that the interval length might be deceptively improved through a counter-intuitive approach termed Prejudicial Trick (PT), while the coverage remains valid. Specifically, for any given test sample, PT probabilistically returns an interval, which is either null or constructed using an adjusted confidence level, thereby preserving marginal coverage. While PT potentially yields a deceptively lower interval length, it introduces practical vulnerabilities: the same input can yield completely different prediction intervals across repeated runs of the algorithm. We formally derive the conditions under which PT achieves these misleading improvements and provide extensive empirical evidence across various regression and classification tasks. Furthermore, we introduce a new metric interval stability which helps detect whether a new CP method implicitly improves the length based on such PT-like techniques.

Abstract:
Graph neural networks (GNNs) have been widely used in various graph machine learning scenarios. Existing literature primarily assumes wellannotated training graphs, while the reliability of labels is not guaranteed in real-world scenarios. Recently, efforts have been made to address the problem of graph learning with label noise. However, existing methods often (i) struggle to distinguish between reliable and unreliable nodes, and (ii) overlook the relational information embedded in the graph topology. To tackle this problem, this paper proposes a novel method, Dual-Standard Semantic Homogeneity with Dynamic Optimization (DREAM), for reliable, relation-informed optimization on graphs with label noise. Specifically, we design a relation-informed dynamic optimization framework that iteratively reevaluates the reliability of each labeled node in the graph during the optimization process according to the relation of the target node and other nodes. To measure this relation comprehensively, we propose a dual-standard selection strategy that selects a set of anchor nodes based on both node proximity and graph topology. Subsequently, we compute the semantic homogeneity between the target node and the anchor nodes, which serves as guidance for optimization. We also provide a rigorous theoretical analysis to justify the design of DREAM. Extensive experiments are performed on six graph datasets across various domains under three types of graph label noise against competing baselines, and the results demonstrate the effectiveness of the proposed DREAM.

Abstract:
Medical foundation models (MFMs) aim to learn universal representations from multimodal medical images that can generalize effectively to diverse downstream clinical tasks. However, most existing MFMs suffer from information ambiguity that blend multimodal representations in a single embedding space, leading to the degradation of modality specificity and diversity. In this paper, we propose MIDoL, a self-supervisedMFM that introducesInformationDecomposition for multimodal representationLearning via two objectives: i) maximize inter-modality entropy by dispersing multimodal representation into separable Mixture-of-Experts (MoE) subspaces to achieve representation specificity across modalities; and ii) minimize intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. By pre-training on 1.15 million medical images, M-IDoL i) delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (e.g., X-ray, fundus, OCT, dermoscopy and pathology), and ii) learns modality-specific and diverse representations, showing clearer separation of feature cluster across modalities and finer-grained feature discrimination within each modality.

Abstract:
Publishing a large language model (LLM) benchmark (especially its groundtruth answers) on the Internet risks contaminating future LLMs and enabling evaluation gaming: it may be unintentionally (or intentionally) used to train or select a model, or exploited to overfit and hack leaderboards when labels are accessible. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers, but this still permits test-set overfitting through feedback loops. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. The main underlying idea is to reduce the best possible accuracy, i.e., Bayes accuracy, by injecting randomness to the answers by preparing several logically correct answers, and only include one of them as the solution in the benchmark. Not only is this helpful to keep us from disclosing the ground truth, but this also offers a test for detecting contamination. In principle, even fully capable models should not surpass the Bayes accuracy. If a model surpasses this ceiling despite this expectation, this is a strong signal of leakage or gaming. We present theoretical and experimental evidence that our method can accurately detect data contamination or hacking on a wide range of benchmarks, models, training methodologies, and scenarios including feedback loops.

Abstract:
In standard Transformer architectures, semantic importance is often conflated with activation magnitude, obscuring the geometric structure of latent representations. To disentangle these factors, we introduce PRISM, a complexvalued architecture designed to isolate the computational role of phase. By enforcing a strict unit-norm constraint (|z| = 1) and replacing attention with gated harmonic convolutions, the model is compelled to utilize subtractive interference in the frequency domain to suppress noise, rather than relying on magnitude-based gating. We utilize this constrained regime to demonstrate that a hybrid architecture—fusing phase-based routing with standard attention—achieves superior parameter efficiency and representation quality compared to unconstrained baselines. Mechanistically, we identify geometric phase clustering, where tokens naturally self-organize to resolve semantic ambiguities. This establishes an O(N log N) reasoning framework based on spectral interference, providing an algorithmic existence proof that subtractive logic is a logical primitive for deep reasoning.

Abstract:
Most evaluations of External Memory Module assume a static setting: memory is built offline and queried at a fixed state. In practice, memory is streaming: new facts arrive continuously, insertions interleave with retrievals, and the memory state evolves while the model is serving queries. In this regime, accuracy and cost are governed by the full memory lifecycle, which encompasses the ingestion, maintenance, retrieval, and integration of information into generation. We present Neuromem, a scalable testbed that benchmarks External Memory Module under an interleaved insertionand-retrieval protocol and decomposes its lifecycle into five dimensions including memory data structure, normalization strategy, consolidation policy, query formulation strategy, and context integration mechanism. Using three representative datasets LoCoMo, LONGMEMEVAL, and MemAgentBench, Neuromem evaluates interchangeable variants within a shared serving stack, reporting token-level F1 and insertion/retrieval latency.Overall, we observe that performance typically degrades as memory grows across rounds, and time-related queries remain the most challenging category. The memory data structure largely determines the attainable quality frontier, while aggressive compression and generative integration mechanisms mostly shift cost between insertion and retrieval with limited accuracy gain.

Abstract:
While reasoning over long context is crucial for various realworld applications, it remains challenging for large language models (LLMs) as they suffer from performance degradation as the context length grows. Recent work MemAgent has tried to tackle this by processing context chunk-by-chunk in an RNN-like loop and updating a textual memory for final answering. However, this naive recurrent memory update faces two crucial drawbacks: (i) memory can quickly explode because it can update indiscriminately, even on evidence-free chunks; and (ii) the loop lacks an exit mechanism, leading to unnecessary computation after even sufficient evidence is collected. To address these issues, we propose GRU-Mem, which incorporates two text-controlled gates for more stable and efficient long-context reasoning. Specifically, in GRU-Mem, the memory only updates when the update gate is open and the recurrent loop will exit immediately once the exit gate is open. To endow the model with such capabilities, we introduce two reward signals r^\textupdate and r^\textexit within end-to-end RL, rewarding the correct updating and exiting behaviors respectively. Experiments on various long-context reasoning tasks demonstrate the effectiveness and efficiency of GRU-Mem, which generally outperforms the vanilla MemAgent with up to 400% times inference speed acceleration.

Abstract:
Generating highfidelity 3D geometries that satisfy specific parameter constraints has broad applications in design and engineering. However, current methods typically rely on large training datasets and struggle with controllability and generalization beyond the training distributions. To overcome these limitations, we introduce LAMP (Linear Affine Mixing of Parametric shapes), a data-efficient framework for controllable and interpretable 3D generation. LAMP first aligns signed distance function (SDF) decoders by overfitting each exemplar from a shared initialization, then synthesizes new geometries by solving a parameter-constrained mixing problem in the aligned weight space. To ensure robustness, we further propose a safety metric that detects geometry validity via linearity mismatch. We evaluate LAMP on two 3D parametric benchmarks: DrivAerNet++ and BlendedNet. We found that LAMP enables (i) controlled interpolation within bounds with as few as 100 samples, (ii) safe extrapolation by up to 100% parameter difference beyond training ranges, (iii) physics performance-guided optimization under fixed parameters. LAMP significantly outperforms conditional autoencoder and Deep Network Interpolation (DNI) baselines in both extrapolation and data efficiency. Our results demonstrate that LAMP advances controllable, data-efficient, and safe 3D generation for design exploration, dataset generation, and performance-driven optimization.

Abstract:
The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to stateof-the-art methods on smooth convex and non-convex optimization problems, including deep neural network training. However, extensions of this approach to non-smooth settings remain in their early stages, often relying on interpolation assumptions or requiring knowledge of the optimal solution. In this work, we propose a novel SPS variant, Safeguarded SPS (SPS_safe), for the stochastic subgradient method, and provide rigorous convergence guarantees for non-smooth convex optimization with no need for strong assumptions. We further incorporate momentum into the update rule, yielding equally tight theoretical results. Comprehensive experiments on convex benchmarks and deep neural networks corroborate our theory: the proposed step size achieves competitive performance to existing adaptive baselines and exhibits stable behavior across a wide range of problem settings. Finally, in the context of deep neural network training, the gradient norms under our step size do not collapse to (near) zero, indicating robustness to vanishing gradients.

Abstract:
The effectiveness of unlabeled data in Semi/SelfSupervised Learning (SSL) depends on appropriate assumptions for specific scenarios, thereby enabling the selection of beneficial unsupervised pretext tasks. However, existing research has paid limited attention to assumptions in SSL, resulting in practical situations where the compatibility between the unsupervised pretext tasks and the target scenarios can only be assessed after training and validation. This paper centers on the assumptions underlying unsupervised pretext tasks and explores the feasibility of preemptively estimating the impact of unsupervised pretext tasks at low cost. Through rigorous derivation, we show that the impact of unsupervised pretext tasks on target performance depends on three factors: assumption learnability with respect to the model, assumption reliability with respect to the data, and assumption completeness with respect to the target. Building on this theory, we propose a low-cost estimation method that can quantitatively estimate the actual target performance. We build a benchmark of over one hundred pretext tasks and demonstrate that our estimated performance strongly correlates with the actual performance obtained through large-scale training and validation.

Abstract:
Scaling factors in residual branches have emerged as a prevalent method for boosting neural network performance, especially in normalizationfree architectures. While prior work has primarily examined scaling effects from an optimization perspective, this paper investigates their role in residual architectures through the lens of generalization theory. Specifically, we establish that wide residual networks (ResNets) with constant scaling factors become asymptotically unlearnable as depth increases. In contrast, when the scaling factor exhibits rapid depth-wise decay combined with early stopping, over-parameterized ResNets achieve minimax-optimal generalization rates. To establish this, we demonstrate that the generalization capability of wide ResNets can be approximated by the kernel regression associated with a specific kernel. Our theoretical findings are validated through experiments on synthetic data and real-world classification tasks, including MNIST and CIFAR-100.

Abstract:
We propose UniDFlow, a unified discrete flowmatching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.

Abstract:
Reasoning segmentation is an emerging visionlanguage task that requires reasoning over intricate text queries to precisely segment objects. However, existing methods typically suffer from overthinking, generating verbose reasoning chains that interfere with object localization in multimodal large language models (MLLMs). To address this issue, we propose DR^2Seg, a self-rewarding framework that improves both reasoning efficiency and segmentation accuracy without requiring extra thinking supervision. DR^2Seg employs a two-stage rollout strategy that decomposes reasoning segmentation into multimodal reasoning and referring segmentation. In the first stage, the model generates a self-contained description that explicitly specifies the target object. In the second stage, this description replaces the original complex query to verify its self-containment. Based on this design, two self-rewards are introduced to mitigate overthinking and the associated attention dispersion. Extensive experiments conducted on 3B and 7B variants of Qwen2.5-VL, as well as on both SAM2 and SAM3, demonstrate that DR^2Seg consistently improves reasoning efficiency and overall segmentation accuracy. Codes are available in supplementary materials.

Abstract:
Agentic theorem proverspipelines that couple a mathematical reasoning model with library retrieval, decomposition/search, and a proof assistant verifier---have recently achieved striking empirical success, yet it remains unclear which components drive performance and why such systems work at all despite classical hardness of proof search. We propose a distributional viewpoint and introduce statistical provability, defined as the finite-horizon success probability of reaching a verified proof, averaged over an instance distribution, and formalize modern theorem-proving pipelines as time-bounded reachability MDPs. Exploiting Bellman structure, we prove existence of optimal policies under mild regularity, derive provability certificates via sub-/super-solution inequalities, and bound the performance gap of score-guided planning (greedy/top-(k)/beam/rollouts) in terms of approximation error, sequential statistical complexity, representation geometry (metric entropy/doubling structure), and action-gap margin tails. Together, our theory provides a principled, component-sensitive explanation of when and why agentic theorem provers succeed on biased real-world problem distributions, while clarifying limitations in worst-case or adversarial regimes.

Abstract:
Difficulty replicating baselines, high computational costs, and required domain expertise create persistent barriers to clinical AI research. To address these challenges, we introduce PyHealth 2.0, an enhanced clinical deep learning toolkit that enables predictive modeling in as few as 7 lines of code. PyHealth 2.0 offers three key contributions: (1) a comprehensive toolkit addressing reproducibility and compatibility challenges by unifying 15+ datasets, 20+ clinical tasks, 25+ models, 5+ interpretability methods, and uncertainty quantification including conformal prediction within a single framework that supports diverse clinical data modalities—signals, imaging, and electronic health records—with translation of 5+ medical coding standards; (2) accessibilityfocused design accommodating multimodal data and diverse computational resources with up to 39× faster processing and 20× lower memory usage, enabling work from 16GB laptops to production systems; and (3) an active open-source community of 400+ members lowering domain expertise barriers through extensive documentation, reproducible research contributions, and collaborations with academic health systems and industry partners, including multi-language support via RHealth. PyHealth 2.0 establishes an open-source foundation and community advancing accessible, reproducible healthcare AI.

Abstract:
In DRL, an agent is trained from a stream of experience. In a continual learning setting, such agents can suffer from \emphplasticity loss: their ability to learn new skills from new experiences diminishes over training. Recently, Mixtureof-Experts (MoE) networks have been reported to enable scaling laws and facilitate the learning of diverse skills. However, in continual reinforcement learning settings, their performance can degenerate as learning proceeds, indicating a loss of plasticity. To address this, building on Neural Tangent Kernel (NTK) theory, we formalize the plasticity loss in MoE policies as a loss of \emphspectral plasticity. We then derive a tractable proxy for spectral plasticity, one expressible in terms of individual expert feature matrices. Leveraging this proxy, we introduce \emphSPHERE, a practical Parseval penalty tailored for MoE-based policies that alleviates the loss of spectral plasticity. On MetaWorld and HumanoidBench, SPHERE improves average success under continual RL by 133% and 50% over an unregularized MoE baseline, while maintaining higher spectral plasticity throughout training.

Abstract:
In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. While this minimizes resource usage for ondevice applications, it makes agents notoriously sample-inefficient, since value-based losses alone struggle to extract meaningful representations from transient data. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. However, due to the highly correlated samples induced by the streaming regime, naively applying this auxiliary loss results in training instabilities. Thus, we introduce orthogonal gradient updates relative to the momentum target and resolve gradient conflicts arising from streaming-specific optimizers. Validated across the Atari, MinAtar, and Octax suites, our approach systematically outperforms existing streaming baselines. Latent-space analysis, including t-SNE visualizations and effective-rank measurements, confirms that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer, while remaining efficient enough to train on just a few CPU cores.

Abstract:
Large language model (LLM)powered agents can translate high-level user intents into plans and actions in an environment. Yet after observing an outcome, users may wonder: What if I had phrased my intent differently? We introduce a framework that enables such counterfactual reasoning in agentic LLM-driven control scenarios, while providing formal reliability guarantees. Our approach models the closed-loop interaction between a user, an LLM-based agent, and an environment as a structural causal model (SCM), and leverages test-time scaling to generate multiple candidate counterfactual outcomes via probabilistic abduction. Through an offline calibration phase, the proposed conformal counterfactual generation (CCG) yields sets of counterfactual outcomes that are guaranteed to contain the true counterfactual outcome with high probability. We showcase the performance of CCG on a wireless network control use case, demonstrating significant advantages compared to naive re-execution baselines.

Abstract:
Inferencetime steering adapts pretrained diffusion and flow models to new tasks without retraining, often utilizing ratio-of-densities constructions that reweight time-indexed marginals with fixed exponents. We identify Marginal Path Collapse, a failure mode in which the intermediate density defined by such compositions becomes non-normalizable despite valid endpoints. This collapse can arise when composing heterogeneous experts trained with mismatched noise schedules (and/or negative exponents / partial supports). To address this, we provide (i) a necessary-and-sufficient Path Existence Criterion that characterizes when the composed intermediate densities are mathematically well-defined, and (ii) Adaptive path Correction with Exponents (ACE), which generalizes Feynman–Kac steering to support time-varying exponents. Our analysis reveals that ACE controls the quantile radius of the intermediate distributions, providing a theoretical mechanism for path stabilization observed in experiments. On flexible-pose scaffold decoration, a drug design task composed of de-novo, conformer, and protein-conditioned experts, ACE prevents collapse and significantly outperforms constant-exponent baselines. Furthermore, ACE improves attribute success rates in compositional image generation, establishing it as a general framework for compositional sampling.

Abstract:
Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely textbased tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate vision-language models' ability to track entity states across modalities. Using two structured domains, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based entity tracking. We empirically show this discrepancy primarily stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet limitations remain in long-horizon multimodal tasks. We develop a reinforcement learning method to improve performance on MET-Bench. Applying our method to open-source VLMs achieves competitive performance with advanced closed models. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking.

Abstract:
A truly universal AIGenerated Image (AIGI) detector must simultaneously generalize across diverse generative models and varied semantic content. Current state-of-the-art methods learn a single, entangled forgery representation, conflating content-dependent flaws with content-agnostic artifacts, and are further constrained by outdated benchmarks. To overcome these limitations, we propose OmniAID, a novel framework centered on a decoupled Mixture-of-Experts (MoE) architecture. The core of our method is a hybrid expert system designed to decouple: (1) semantic flaws across distinct content domains, and (2) content-dependent flaws from content-agnostic universal artifacts. This system employs a set of Routable Specialized Semantic Experts, each for a distinct domain (e.g., human, animal), complemented by a Fixed Universal Artifact Expert. This architecture is trained using a novel two-stage strategy: we first train the experts independently with domain-specific hard-sampling to ensure specialization, and subsequently train a lightweight gating network for effective input routing. By explicitly decoupling "what is generated" (content-specific flaws) from "how it is generated" (universal artifacts), OmniAID achieves robust generalization. To address outdated benchmarks and validate real-world applicability, we introduce Mirage, a new large-scale, contemporary dataset. Extensive experiments, using both traditional benchmarks and our Mirage dataset, demonstrate our model surpasses existing monolithic detectors, establishing a new and robust standard for AIGI authentication against modern, in-the-wild threats.

Abstract:
We study dynamic regret minimization in nonstationary online learning, with a primary focus on follow-the-regularized-leader (FTRL) methods. FTRL is important for curved losses and for understanding adaptive algorithms, yet existing dynamic regret analyses are less explored for FTRL. To address this, we build on the discounted-to-dynamic reduction and present a modular way to obtain dynamic regret bounds. The reduction simplifies prior proofs for online linear regression, recovers optimal rates, and provides new guarantees for online logistic regression, covering two representative curved losses. Beyond online convex optimization, we apply the reduction to analyze the Adam optimizers, obtaining optimal convergence rates in stochastic, non-convex, and non-smooth settings. The reduction also enables a more detailed treatment of Adam with two discount parameters (\beta_1,\beta_2), leading to new results for both clipped and clip-free variants.

Abstract:
Crosstask generalization is a core challenge in open-world robotic manipulation, and the key lies in extracting transferable manipulation knowledge from seen tasks. Recent in-context learning approaches leverage seen task demonstrations to generate actions for unseen tasks without parameter updates. However, existing methods provide only low-level continuous action sequences as context, failing to capture composable skill knowledge and causing models to degenerate into superficial trajectory imitation. We propose Decompose and Recompose, a skill reasoning framework using atomic skill-action pairs as intermediate representations. Our approach decomposes seen demonstrations into interpretable skill--action alignments, enabling the model to recompose these skills for unseen tasks through compositional reasoning. Specifically, we construct a task-adaptive dynamic demonstration library via visual-semantic retrieval combined with skill sequences from a planning agent, complemented by a coverage-aware static library to fill missing skill patterns. Together, these yield skill-comprehensive demonstrations that explicitly elicit compositional reasoning for skill composition and execution ordering. Experiments on the AGNOSTOS benchmark and real-world environments validate our method's zero-shot cross-task generalization capability.

Abstract:
We propose Obfuscated Semantic Null space Injection for Privacy (OSNIP), a lightweight clientside encryption framework for privacy-preserving LLM inference. Generalizing the geometric intuition of linear kernels to the high-dimensional latent space of LLMs, we formally define the ``Obfuscated Semantic Null Space'', a high-dimensional regime that preserves semantic fidelity while enforcing near-orthogonality to the original embedding. By injecting perturbations that project the original embedding into this space, OSNIP ensures privacy without any post-processing. Furthermore, OSNIP employs a key-dependent stochastic mapping that synthesizes individualized perturbation trajectories unique to each user. Evaluations on 12 generative and classification benchmarks show that OSNIP achieves state-of-the-art performance, sharply reducing attack success rates while maintaining strong model utility under strict security constraints.

Abstract:
Textguided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat it as a single forward pass, where the model directly predicts pixel prompts to a segmentation model, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we proposeRSAgent, an agentic Multimodal Large Language Model (MLLM) which interleavesreasoning and actionfor segmentation via multi-turn tool invocations. RSAgent queries a visual toolbox, observes feedback, and revises its pixel hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.

Abstract:
Stochastic variancereduced algorithms such as Stochastic Average Gradient (SAG) and SAGA, and their deterministic counterparts like the Incremental Aggregated Gradient (IAG) method, have been extensively studied in large-scale machine learning. Despite their popularity, existing analyses for these algorithms are disparate, relying on different proof techniques tailored to each method. Furthermore, the original proof of SAG is known to be notoriously involved, requiring computer-aided analysis. Focusing on finite-sum optimization with smooth and strongly convex objectives, our main contribution is to develop a single unified convergence analysis that applies to all three algorithms: SAG, SAGA, and IAG. Our analysis features two key steps: (i) establishing a bound on delays due to sub-sampling using simple concentration tools, and (ii) carefully designing a novel Lyapunov function that accounts for such delays. The resulting proof is short and modular, providing high-probability bounds for SAG and SAGA that can be seamlessly extended to non-convex objectives and Markovian sampling. As an immediate byproduct of our new analysis technique, we obtain the best known rates for the IAG algorithm, significantly improving upon prior bounds.

Abstract:
Effectively managing missing modalities is a fundamental challenge in realworld multimodal learning scenarios, where data incompleteness often results from systematic collection errors or sensor failures. Sparse Mixture-of-Experts (SMoE) architectures have the potential to naturally handle multimodal data, with individual experts specializing in different modalities. However, existing SMoE approach often lacks proper ability to handle missing modality, leading to performance degradation and poor generalization in real-world applications. We propose ConfSMoE to introduce a two-stage imputation module to handle the missing modality problem for the SMoE architecture by taking the opinion of experts and reveal the insight of expert collapse from theoretical analysis with strong empirical evidence. Inspired by our theoretical analysis, ConfSMoE propose a novel expert gating mechanism by detaching the softmax routing score to task confidence score w.r.t ground truth signal. This naturally relieves expert collapse without introducing additional load balance loss function. We show that the insights of expert collapse aligns with other gating mechanism such as Gaussian and Laplacian gate. The proposed method is evaluated on four different real world dataset with three distinct experiment settings to conduct comprehensive analysis of ConfSMoE on resistance to missing modality and the impacts of proposed gating mechanism.

Abstract:
Offlineto-online reinforcement learning (O2O RL) faces a central challenge between retaining offline conservatism and adapting to online feedback under distribution shift. This challenge arises because data behavior evolves during fine-tuning, rendering data origin a misleading basis for constraint handling and thereby leading to objective–data mismatch. We therefore propose Dynamic Alignment for RElease (DARE), a distribution-aware framework for sample-level constraint release based on the behavioral consistency with a behavior model. To our knowledge, DARE is the first to condition constraint release on behavioral consistency via a posterior-induced exchange mechanism, moving beyond a binary offline/online data distinction. Importantly, DARE requires only per-sample behavioral alignment, enabling instantiation on top of many offline algorithms with flexible choices of behavior models and fine-tuning objectives. We provide a theoretical analysis showing that behavior-based sample exchange consistently improves the distinction between offline-like and online-like subsets. Experiments on D4RL demonstrate that DARE consistently improves fine-tuning stability and achieves superior final performance over strong offline-to-online baselines. (Anonymized code archive is included in the supplementary material.)

Abstract:
Pruning at Initialisation methods discover sparse, trainable subnetworks before training, but their theoretical mechanisms remain elusive. Existing analyses are often limited to finitewidth statistics, lacking a rigorous characterisation of the global sparsity patterns that emerge as networks grow large. In this work, we connect discrete pruning heuristics to graph limit theory via graphons, establishing thegraphon limit of PaI masks. We introduce aFactorised Saliency Modelthat encompasses popular pruning criteria and prove that, under regularity conditions, the discrete masks generated by these algorithms converge to deterministic bipartite graphons. This limit framework establishes a novel topological taxonomy for sparse networks: while unstructured methods (e.g., Random, Magnitude) converge to homogeneous graphons representing uniform connectivity, data-driven methods (e.g., SNIP, GraSP) converge to heterogeneous graphons that encode implicit feature selection. Leveraging this continuous characterisation, we derive two fundamental theoretical results: (i) a Universal Approximation Theorem for sparse networks that depends only on the intrinsic dimension of active coordinate subspaces; and (ii) a Graphon-NTK generalisation bound demonstrating how the limit graphon modulates the kernel geometry to align with informative features. Our results transform the study of sparse neural networks from combinatorial graph problems into a rigorous framework of continuous operators, offering a new mechanism for analysing expressivity and generalisation in sparse neural networks.

Abstract:
We analyze the Accelerated Noisy Power Method, an algorithm for Principal Component Analysis in the setting where only inexact matrixvector products are available, which can arise for instance in decentralized PCA. While previous works have established that acceleration can improve convergence rates compared to the standard Noisy Power Method, these guarantees require overly restrictive upper bounds on the magnitude of the perturbations, limiting their practical applicability. We provide an improved analysis of this algorithm, which preserves the accelerated convergence rate under much milder conditions on the perturbations. We show that our new analysis is worst-case optimal, in the sense that the convergence rate cannot be improved, and that the noise conditions we derive cannot be relaxed without sacrificing convergence guarantees. We demonstrate the practical relevance of our results by deriving an accelerated algorithm for decentralized PCA, which has similar communication costs to non-accelerated methods. To our knowledge, this is the first decentralized algorithm for PCA with provably accelerated convergence.

Abstract:
We introduce \textscMosaic, a probabilistic weather forecasting model that addresses two sources of spectral degradation in MLbased weather prediction: training to predict the ensemble mean deterministically and compressive encoding creating an information bottleneck. \textscMosaic combines learned functional perturbations for ensemble forecasting with block-sparse attention, a hardware-aligned formulation that shares keys and values across spatially adjacent queries, enabling each block to dynamically attend to the most relevant regions. By capturing arbitrarily long-range dependencies at linear cost, \textscMosaic processes high-resolution weather data without compression. On IFS HRES data, \textscMosaic at 1.5° resolution matches or outperforms models trained on 0.25° data, with individual ensemble members exhibiting near-perfect spectral alignment across all resolved frequencies.

Abstract:
Accurate prediction of nuclear magnetic resonance (NMR) chemical shifts is fundamental to spectral analysis and molecular structure elucidation, yet existing machine learning methods rely on limited, laborintensive atom-assigned datasets. We propose a semi-supervised framework that learns NMR chemical shifts from millions of literature-extracted spectra without explicit atom-level assignments, integrating a small amount of labeled data with large-scale unassigned spectra. We formulate chemical shift prediction from literature spectra as a permutation-invariant set supervision problem, and show that under commonly satisfied conditions on the loss function, optimal bipartite matching reduces to a sorting-based loss, enabling stable large-scale semi-supervised training beyond traditional curated datasets. Our models achieve substantially improved accuracy and robustness over state-of-the-art methods and exhibit stronger generalization on significantly larger and more diverse molecular datasets. Moreover, by incorporating solvent information at scale, our approach captures systematic solvent effects across common NMR solvents for the first time. Overall, our results demonstrate that large-scale unlabeled spectra mined from the literature can serve as a practical and effective data source for training NMR shift models, suggesting a broader role of literature-derived, weakly structured data in data-centric AI for science.

Abstract:
Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, with interactive world models providing scalable environments for perception, reasoning, and action. Yet current research still lacks largescale datasets and unified benchmarks to evaluate their physical interaction capabilities. To address this, we propose iWorld-Bench, a comprehensive benchmark for training and testing world models on interaction-related abilities such as distance perception and memory. We construct a diverse dataset with 330k video clips and select 2.1k high-quality samples covering varied perspectives, weather, and scenes. As existing world models differ in interaction modalities, we introduce an Action Generation Framework to unify evaluation and design six task types, generating 4.9k test samples. These tasks jointly assess model performance across visual generation, trajectory following, and memory. Evaluating 14 representative world models, we identify key limitations and provide insights for future research. The iWorld-Bench model leaderboard is publicly available at iWorld-Bench.com.

Abstract:
Despite the strong performance of Large Language Models (LLMs) on complex instructionfollowing tasks, precise control of output length remains a persistent challenge. Existing methods primarily attempt to enforce length constraints by externally imposing length signals or optimization objectives, while largely overlooking the underlying limitation: the model's intrinsic deficit in length cognition. To address this, we propose LARFT (Length-Aware Reinforcement Fine-Tuning), a training framework that aligns the model's length cognition with its action. Specifically, LARFT integrates length-oriented reinforcement learning with a hindsight length awareness. By transforming on-policy data into hindsight self-awareness tasks where the model learns to identify the actual length of its own generation, LARFT jointly optimizes the model’s internal representation of length information and refines its policy to satisfy length constraints, thereby achieving precise and reliable length instruction following. Extensive experiments across four base models demonstrate that LARFT outperforms existing baselines, achieving an average improvement of +20.92 points across three length instruction following benchmarks with only a marginal decline of -1.45 points on four general capability benchmarks.

Abstract:
While Mamba2's expanded state dimension enhances temporal modeling, it incurs substantial inference overhead that saturates bandwidth during autoregressive generation. Standard pruning methods fail to address this bottleneck: unstructured sparsity leaves activations dense, magnitudebased selection ignores runtime dynamics, and gradient-based methods impose prohibitive costs. We introduce GHOST (Grouped Hidden-state Output-aware Selection and Truncation), a structured pruning framework that approximates control-theoretic balanced truncation using only forward-pass statistics. By jointly measuring controllability and observability, GHOST rivals the fidelity of gradient-based methods without requiring backpropagation. As a highlight, on models ranging from 130M to 2.7B parameters, our approach achieves a 50% state-dimension reduction with approximately 1 perplexity point increase on WikiText-2. Code is available at https://anonymous.4open.science/r/mamba2_ghost-7BCB/.

Abstract:
The choice of loss function in classification involves a fundamental tradeoff: smooth losses (like Cross-Entropy) enable fast optimization rates but yield slow square-root consistency bounds, while piecewise-linear losses (like Hinge) offer fast linear consistency rates but suffer from non-differentiability. We propose Linear-Core (LC) Surrogates, a new family of convex loss functions that resolve this tension by stitching a linear core to a smooth tail. We prove that these surrogates are differentiable everywhere while retaining strict linear H-consistency bounds, effectively combining the optimization benefits of smoothness with the statistical efficiency of margin-based losses. In the structured prediction setting, we show that this smoothness unlocks a massive computational and energy advantage: it allows for an unbiased stochastic gradient estimator that bypasses the quadratic complexity O(|\mathcalY|^2) of exact inference (e.g., Viterbi). Empirically, our method achieves a 23× speedup over Structured SVMs on large-vocabulary sequence tagging tasks and demonstrates superior robustness to instance-dependent label noise, outperforming Cross-Entropy by 2.6% on corrupted CIFAR-10.

Abstract:
We study linear dueling bandits in volatile environments characterized by the simultaneous presence of postserving contexts, delayed feedback, and adversarial corruption. Feedback is subject to unknown stochastic or adversarial delays and a cumulative corruption budget \mathcalC. To address these challenges, we propose \term, which integrates a learned approximator that predicts post-serving contexts from pre-serving information. It further employs an adaptive weighting strategy that clips feature vectors to mitigate the impact of corrupted and delayed observations simultaneously. Under standard regularity conditions and a parametric post-serving mapping, we rigorously establish that our algorithm is delay-regime-agnostic, achieving a regret upper bound of \widetilde\mathcalO(d(\sqrtT + \mathcalC + \mathcalD)), where d is the total feature dimension and \mathcalD encapsulates the delay complexity. Crucially, our analysis reveals an additive cost structure between corruption and delay, avoiding the multiplicative degradation typical of prior works. We further establish lower bounds that nearly match our upper bounds up to a \sqrtd factor for adversarial delays in the absence of post-serving contexts.

Abstract:
While explicit Chainof-Thought (CoT) equips Large Language Models (LLMs) with strong reasoning capabilities, it requires models to verbalize every intermediate step in text tokens, constraining the model thoughts to the discrete vocabulary space. Recently, reasoning in continuous latent space has emerged as a promising alternative, enabling more robust inference and flexible computation beyond discrete token constraints. However, current latent paradigms often suffer from feature collapse and instability, stemming from distribution mismatches when recurrently using hidden states as the input embeddings, or alignment issues when relying on assistant models. To address this, we proposeLatent Thoughts Tuning (LT-Tuning), a framework that redefines how latent thoughts are constructed and deployed. Instead of relying solely on raw hidden states, our method introduces aContext-Prediction-Fusionmechanism that jointly leveraging contextual hidden states and predictive semantic guidance from the vocabulary embedding space. Combined with a progressive three-stage curriculum learning pipeline, LT-Tuning also enables dynamically switching between latent and explicit thinking modes. Experiments demonstrate that our method outperforms existing latent reasoning baselines, effectively mitigating feature collapse and achieving robust reasoning accuracy. Code is available at https://anonymous.4open.science/r/LT_Tuning-F35E.

Abstract:
Large language models (LLMs) using chainof-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning tasks support our theory and demonstrate improved selective accuracy over existing methods.

Abstract:
While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and computational overhead. To address these challenges, we propose CoSMo (ConsistencyGuided Split-Merge Optimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume. Specifically, CoSMo utilizes a split-merge algorithm that dynamically refines reasoning chains by merging redundant segments and splitting logical gaps to ensure coherence. We then employ structure-aligned reinforcement learning with a novel segment-level budget to supervise the model in maintaining efficient reasoning structures throughout training. Extensive experiments across multiple benchmarks and backbones demonstrate that CoSMo achieves superior performance, improving accuracy by 3.3 points while reducing segment usage by 28.7% on average compared to reasoning efficiency baselines.

Abstract:
Continual learning (CL) with large pretrained models is challenged by task interference and catastrophic forgetting. Existing LoRA-based Mixture-of-Experts (MoE) methods mitigate forgetting by adding new task-specific adapters and freezing old ones, but often suffer from redundancy, interference, and ambiguous routing due to coarse-grained experts and routing. Coarse-grained experts (i.e., full LoRA adapters with large rank) encode low-specialty information. Newly added experts often duplicate or conflict with existing ones, causing redundancy and interference. Their low specialization further confuses the router, accelerating routing degradation and forgetting as experts accumulate. In this work, we propose MoRAM (Mixture of Rank-1 Associative Memory). Grounded in the view that weight matrices function as linear associative memories, MoRAM achieves CL as gradual incrementing of atomic rank-1 memory experts. Each rank-1 adapter acts as a fine-grained MoE expert or an associative memory unit. By viewing rank-1 adapters as key–value pairs, we eliminate explicit routers in MoE-LoRA, using a self-activation mechanism where each memory atom evaluates its own relevance via its intrinsic key. This transforms the adaptation process into robust, content-addressable retrieval. Extensive experiments on CLIP and LLMs demonstrate that \ours significantly outperforms state-of-the-art baselines, achieving superior plasticity-stability trade-offs, improving generalization while mitigating forgetting.

Abstract:
As increasingly capable agents are deployed, a central safety challenge is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface in which an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or engage in oversight (oversee), and model this interaction as a twoplayer Markov Game. When this game forms a Markov Potential Game, we prove an alignment guarantee: any increase in the agent's utility from acting more autonomously cannot decrease the human's value. This establishes a form of intrinsic alignment where the agent's incentive to seek autonomy is structurally coupled to the human's welfare. Practically, the framework induces a transparent control layer encouraging the agent to defer when risky and act when safe. While we use gridworld simulations to illustrate the emergence of this collaboration, our primary validation involves an agentic tool-use task where two 30B-parameter language models are fine-tuned via independent policy gradient. We demonstrate that even as the agents learn to coordinate on the fly, this framework effectively reduces safety violations in realistic, open-ended environments.

Abstract:
Safety alignment of Large Language Models (LLMs) is extremely fragile, finetuning on small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain this phenomenon by comparing parameters and hidden states before and after fine-tuning, but overlook their dynamic evolution during fine-tuning. In this work, we analyze parameter dynamics and uncover a critical mechanism underlying safety degradation, where benign fine-tuning causes parameters cumulatively drift toward danger-aligned directions, progressively undermining the model's safety. Inspired by these findings, we propose Sample-Level Quantification of Safety Degradation (SQSD), a method that quantifies each training sample's influence on safety degradation. Specifically, SQSD assigns continuous risk scores to individual samples by measuring their induced parameter updates along safety and danger directions. Extensive experiments across three models and two datasets show that SQSD outperforms baselines in better separating high-risk and low-risk samples, with risk scores that consistently predict the severity of safety degradation. In particular, SQSD exhibits strong transferability across architectures, parameter scales, and parameter-efficient methods.

Abstract:
To develop socially intelligent AI, existing approaches typically model behavioral dimensions (e.g., affective, cognitive, or social attributes) in isolation. Although useful, this taskspecific modeling increases training costs and limits generalization across behavioral settings. Recent reasoning RL methods facilitate training a single unified model across multiple behavioral tasks, but do not explicitly address learning across different heterogeneous behavioral data. To address this gap, we introduce Heterogeneity-Aware Relative Policy Optimization (HARPO), a RL method that balances leaning across heterogeneous tasks and samples. This is achieved by modulating advantages to ensure that no single task or sample carries disproportionate influence during policy optimization. Using HARPO, we develop and release OMNISAPIENS-7B 2.0, a foundation model for social behavior processing. Relative to existing behavioral foundation models, OMNISAPIENS-7B 2.0 achieves the strongest performance across behavioral tasks, with gains of up to +16.85% and +9.37% on multitask and held-out settings respectively, while producing more explicit and robust reasoning traces. We also validate HARPO against recent RL methods, where it achieves the most consistently strong performance across behavioral tasks.

Abstract:
Selfdistillation (SD), retraining a student on a mixture of ground-truth labels and a teacher’s own predictions using the same architecture and training data, often improves generalization empirically, but it is unclear when improvement is guaranteed. We study SD for ridge regression with an unconstrained mixing weight \xi \in \mathbbR. Conditional on the training data, without any distributional assumptions, we prove that for any squared prediction risk R (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher at every regularization level \lambda where the teacher risk is not stationary (R'(\lambda) \neq 0). We also characterize the optimal mixing weight \xi^\star in terms of the risk derivative R', showing that it can surprisingly be negative. To further quantify SD risk improvements, we derive exact risk asymptotics in the proportional asymptotics regime for general anisotropic covariance and deterministic signals. Finally, we propose a consistent one-shot tuning method to estimate \xi^\star without retraining, sample splitting, or grid search. Experiments on real-world tasks and pre-trained neural network features validate our theory and tuning method.

Abstract:
We study the sample complexity of bestcase Empirical Risk Minimizer in the setting of Stochastic Convex Optimization. We show that there exists an instance, where sample size is linear in dimension, learning is possible, but an Empirical Risk Minimizer is likely to be unique and overfits. This resolves an open question by Feldman. We also extend this to approximate ERMs. Building on our construction we also show that (constrained) Gradient Descent potentially overfits when horizon and learning rate grow w.r.t sample size. Specifically we provide a novel generalization lower bound of \Omega\left(\eta T/(m\sqrtm)\right) for Gradient Descent, where \eta is the learning rate, T is the horizon and m is the sample size. This narrows down, exponentially, the gap between the best known upper bound of O(\eta T/m) and existing lower bounds from previous constructions.

Abstract:
Deep learning models for clinical event prediction on electronic health records (EHR) often suffer performance degradation when deployed under different data distributions. While domain adaptation (DA) methods can mitigate such shifts, its "blackbox" nature prevents widespread adoption in clinical practice where transparency is essential for trust and safety. We propose ExtraCare to decompose patient representations into invariant and covariant components. By supervising these two components and enforcing their orthogonality during training, our model preserves label information while exposing domain-specific variation at the same time for more accurate predictions than most feature alignment models. More importantly, it offers human-understandable explanations by mapping sparse latent dimensions to medical concepts and quantifying their contributions via targeted ablations. ExtraCare is evaluated on two real-world EHR datasets across multiple domain partition settings, demonstrating superior performance along with enhanced transparency, as evidenced by its accurate predictions and explanations from extensive case studies.

Abstract:
Traditional ecommerce search systems employ multi-stage cascading architectures (MCA) that suffer from fragmented computation and optimization objective collisions across stages, ultimately limiting their performance ceiling. We propose OneSearch, the first industrial-deployed end-to-end generative framework for e-commerce search, featuring three key innovations: (1) Keyword-enhanced Hierarchical Quantization Encoding (KHQE) to preserve hierarchical semantics and distinctive item attributes while maintaining strong query-item relevance constraints; (2) multi-view user behavior sequence injection that constructs behavior-driven user IDs and incorporates both explicit short-term and implicit long-term sequences; and (3) a Preference-Aware Reward System (PARS) with multi-stage supervised fine-tuning and adaptive reward-weighted ranking to capture fine-grained user preferences. Extensive offline evaluations demonstrate its superior performance, while online A/B tests achieve statistically significant improvements: +1.67% item CTR, +2.40% buyer, and +3.22% order volume. OneSearch reduces operational expenditure by 75.40%, improves Model FLOPs Utilization from 3.26% to 27.32%, and has been successfully deployed across multiple search scenarios in TEST, serving millions of users daily. Code and datasets will be made publicly available.

Abstract:
Large language models can express values in two main ways: (1) intrinsic expression, reflecting the model's inherent values learned during training, and (2) prompted expression, elicited by explicit prompts. Given their widespread use in value alignment, it is paramount to clearly understand their underlying mechanisms, particularly whether they mostly overlap (as one might expect) or rely on distinct mechanisms, but this remains largely understudied. We analyze this at the mechanistic level using two approaches: (1) value vectors, feature directions representing value mechanisms extracted from the residual stream, and (2) value neurons, MLP neurons that contribute to value vectors. We demonstrate that intrinsic and prompted value mechanisms partly share common components crucial for inducing value expression, generalizing across languages and reconstructing theoretical intervalue correlations in the model's internal representations. Yet, as these mechanisms also possess unique elements that fulfill distinct roles, they lead to different degrees of response diversity (intrinsic > prompted) and value steerability (prompted > intrinsic). In particular, components unique to the intrinsic mechanism promote lexical diversity in responses, whereas those specific to the prompted mechanism strengthen instruction following, taking effect even in distant tasks like jailbreaking.

Abstract:
Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high endto-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise on automated GPU kernel generation, prior works mainly focus on single-kernel optimization and do not extend to end-to-end programs, hindering practical deployment. To address the challenge, in this work, we propose \textscStitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it step-by-step, and a Verifier for correctness check and performance profiling using Nsys/NCU. To fundamentally improve the Coder's ability in end-to-end GPU programming, \textscStitchCUDA integrates rubric-based agentic reinforcement learning over two atomic skills, task-to-code generation and feedback-driven code optimization, with combined rubric reward and rule-based reward from real executions. Therefore, the Coder learns how to implement advanced CUDA programming techniques (e.g., custom kernel fusion, cublas epilogue), and we also effectively prevent Coder's reward hacking (e.g., just copy PyTorch code or hardcoding output) during benchmarking. Experiments on KernelBench show that \textscStitchCUDA achieves nearly 100% success rate on end-to-end GPU programming tasks, with 1.72× better speedup over the multi-agent baseline and 2.73× than the RL model baselines.

Abstract:
GUI agents are rapidly shifting from multimodule pipelines to end-to-end, native vision-language models (VLMs) that perceive raw screenshots and directly interact with digital devices. Despite rapid progress on general GUI tasks, CAPTCHA solving remains a major challenge. On the other hand, although specialized CAPTCHA solving pipelines exist, they cannot handle general GUI tasks. To address this gap, we introduce ReCAP: a CAPTCHA-capable native GUI agent that can robustly solve modern, interactive CAPTCHA challenges, while preserving their performance as a general GUI agent. We first develop a dynamic CAPTCHA system spanning seven representative CAPTCHA types, designed to stress primitive and complementary capabilities for CAPTCHA solving (e.g., robust OCR under heavy noise and text stylization, fine-grained visual understanding, and precise control). Then, we develop an automated data collection and curation pipeline that generates large-scale CAPTCHA interaction trajectories paired with reasoning traces. As CAPTCHA solving often requires multi-step interaction and recovery from intermediate mistakes, we further leverage failed trajectories to construct self-correction data, training agents to reflect on errors and correct their actions online. %To mitigate the imbalance between reasoning tokens and action tokens, we also introduce a weighted training objective that up-weights action tokens. Across held-out test sets, ReCAP improves CAPTCHA-solving success from roughly 30% to 80%, while maintaining strong performance on general GUI-agent benchmarks.

Abstract:
Attention is the critical component of a transformer. Yet the quadratic computational complexity of vanilla full attention in the input size and the inability of its linear attention variant to focus have been challenges for computer vision tasks. We provide a mathematical definition of generalized attention and formulate both vanilla softmax attention and linear attention within the general framework. We prove that generalized attention disperses, that is, as the number of keys tends to infinity, the query assigns equal weights to all keys. Motivated by the dispersion property and recent development of Mamba form of attention, we design Scalable and Efficient Mamba like Attention (SEMA) which utilizes token localization to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture global aspect of attention. We support our approach on Imagenet1k where classification results show that SEMA is a scalable and effective alternative beyond linear attention, outperforming recent vision Mamba models on increasingly larger scales of images at similar model parameter sizes.

Abstract:
Federated learning (FL) offers a promising framework for collaborative digital pathology by enabling model training across institutions. However, realworld deployments face heterogeneity arising from diverse multiple instance learning (MIL) architectures and heterogeneous feature extractors across institutions. We propose FedHD, a novel FL framework that performs local Gaussian-mixture feature alignment tailored for WSI analysis. Instead of exchanging model parameters, each client independently distills semantically rich synthetic feature representations aligned with the distribution of real WSIs. To preserve diagnostic diversity, FedHD adopts a one-to-one distillation strategy, generating a synthetic counterpart for each real slide to avoid over-compression. During federation, a curriculum-based integration strategy progressively incorporates cross-site synthetic features into local training once performance plateaus. Furthermore, an optional interpretation module reconstructs pseudo-patches from synthetic embeddings, enhancing transparency. FedHD is architecture-agnostic, privacy-preserving, and supports personalized yet collaborative training across diverse institutions. Experiments on TCGA-IDH, CAMELYON16, and CAMELYON17 show that FedHD consistently outperforms state-of-the-art federated and distillation baselines.

Abstract:
Large Language Models (LLMs) have shown strong capability in interpreting multimodal data but remain limited in their ability to natively handle timeseries data. Addressing this limitation could enable the translation of longitudinal and wearable sensing data into actionable insights and patient-facing digital health applications. We propose OpenTSLM, a family of Time Series Language Models that integrate time-series as a native modality into pretrained LLMs, enabling natural-language prompting and reasoning over multiple time-series. We implement two OpenTSLM variants based on soft prompting (OpenTSLM-SP) and cross-attention (OpenTSLM-Flamingo). To conduct comprehensive experiments on reasoning over medical text and time-series, we introduce three chain of thought (CoT) datasets: HAR-CoT (human activity recognition), Sleep-CoT (sleep staging), and ECG-QA-CoT (electrocardiogram question answering). Across tasks, OpenTSLM models consistently outperform baselines. OpenTSLMs with time-series encoders trained from scratch achieve 69.88% in sleep staging and 65.44% in HAR, while OpenTSLM combined with time series foundation models (TSFMs) achieve 68.33% and 67.64%, compared to 9.05% and 60.44% for fine-tuned text-only baselines. Additionally, we conduct expert evaluations with cardiologists, which show that OpenTSLMs exhibit strong reasoning capabilities and temporal understanding on raw sensor data for ECG-QA. We further show that OpenTSLM-Flamingo models scale better in memory as the number and length of time-series increase. To facilitate further research, we release all code, datasets, and models as open-source resources.

Abstract:
Quantization followed by parameterefficient fine-tuning has emerged as a promising paradigm for downstream adaptation under tight GPU memory constraints. However, this sequential pipeline fails to leverage the intricate interaction between quantization bit-width and LoRA rank. Specifically, a carefully optimized quantization allocation with low quantization error does not always translate to strong fine-tuning performance, and different bit-width and rank configurations can lead to significantly varying outcomes under the same memory budget. To address this limitation, we propose AutoQRA, a joint optimization framework that simultaneously optimizes the bit-width and LoRA rank configuration for each layer during the mixed quantized fine-tuning process. To tackle the challenges posed by the large discrete search space and the high evaluation cost associated with frequent fine-tuning iterations, AutoQRA decomposes the optimization process into two stages. First, it first conducts a global multi-fidelity evolutionary search, where the initial population is warm-started by injecting layer-wise importance priors. This stage employs specific operators and a performance model to efficiently screen candidate configurations. Second, trust-region Bayesian optimization is applied to locally refine promising regions of the search space and identify optimal configurations under the given memory budget. This approach enables active compensation for quantization noise in specific layers during training. Experiments show that AutoQRA achieves performance close to full-precision fine-tuning with a memory footprint comparable to uniform 4-bit methods.

Abstract:
Generative models have been shown to "memorize" certain training data, leading to verbatim or nearverbatim generating images, which may cause privacy concerns or copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide the generation away from an original training image and towards one that is distinct from training data while remaining aligned with the prompt, guarding against reproducing training data, without hurting image generation quality. We propose a concrete instantiation of this framework, where the positive target that we steer towards is given by a novel method for (cross) attention attenuation based on (i) a novel statistical mechanism that automatically identifies the prompt positions where cross attention must be attenuated and (ii) attenuating cross-attention in these per-prompt locations. The resulting GUARD offers a surgical, dynamic per-prompt inference-time approach that, we find, is by far the most robust method in terms of consistently producing state-of-the-art results for memorization mitigation across two architectures and for both verbatim and template memorization, while also improving upon or yielding comparable results in terms of image quality.

Abstract:
Large Language Models (LLMs) have recently shown strong promise for robotic task planning, particularly through automatic planning domain generation. Planning domains are brittle under imperfect logical states and perception noise; prior approaches largely treat generated planning domains as plan utilities, overlooking their potential as scalable sources of reasoning supervision and structured reward signals. At the same time, reasoning LLMs depend on chainof-thought (CoT) supervision that is expensive to collect for robotic tasks, and reinforcement learning (RL) faces challenges on reward engineering. We propose Self-CriTeach, an LLM self-teaching and self-critiquing framework in which an LLM autonomously generates symbolic planning domains that serve a dual role: (i) enabling large-scale generation of robotic planning problem–plan pairs, and (ii) providing structured reward functions. First, the self-written domains enable large-scale generation of symbolic task plans, which are automatically transformed into extended CoT trajectories for supervised fine-tuning. Second, the self-written domains are reused as structured reward functions, providing dense feedback for reinforcement learning without manual reward engineering. This unified training pipeline yields a planning-enhanced LLM with higher planning success rates, stronger cross-task generalization, reduced inference cost, and resistance to imperfect logical states.

Abstract:
Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: SemanticAware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, "actively" perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.

Abstract:
Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal Tracing, we uncover the Refusal Trajectory—a persistent upstream signature that remains intact even when adversarial attacks (e.g., GCG) suppress terminal signals. Leveraging this, we propose SALO (Sparse Activation Localization Operator), an inferencetime detector designed to capture these latent patterns. SALO effectively recovers defense capabilities against forced-decoding attacks, improving detection rates from ～0% to >90% where methods relying on terminal states perform poorly.

Abstract:
While large language models (LLMs) have emerged as powerful decisionmakers across a wide range of single-agent and stationary environments, fewer efforts have been devoted to settings where LLMs must engage in \emphrepeated and \emphstrategic interactions with unknown or dynamic opponents. In such settings, recipes built upon \emphoffline pre-training or fine-tuning, though robust against worst-case adversaries, do not fully exploit the capability of LLMs to adapt \emphonline based on interaction feedback. Instead, we explore the more natural perspective of \emphscaling inference-time computation as a mechanism for adaptation, embedding the principles of a classical game-theoretical learning dynamic, \emphsmooth Fictitious Play (sFP), into LLM inference: (i) for belief formation, we employ an auxiliary opponent model that in-context learns to imitate the \emphtime-averaged behavior of the opponent; (ii) for best response, we advance best-of-N (BoN) sampling by simulating against the opponent model. Empirical evaluations on two distinct forms of repeated negotiation games demonstrate that our method enables significant performance improvement over online interaction compared to various baselines, offering a scalable and principled approach to strategic decision-making without any parameter updates.

Abstract:
Symbolic regression (SR) traditionally balances accuracy and complexity, implicitly assuming that simpler formulas are structurally more rational. We argue that this assumption is insufficient: existing algorithms often exploit this metric to discover accurate and compact but structurally irrational formulas that are numerically illconditioned and physically inexplicable. Inspired by the structural stability of real physical laws, we propose the Effective Information Criterion (EIC) to quantify formula rationality. EIC models formulas as information channels and measures the amplification of inherent rounding noise during recursive calculation, effectively distinguishing physically plausible structures from pathological ones without relying on ground truth. Our analysis reveals a stark structural stability gap between human-derived equations and SR-discovered results. By integrating EIC into SR workflows, we provide explicit structural guidance: for heuristic search, EIC steers algorithms toward stable regions to yield superior Pareto frontiers; for generative models, EIC-based filtering improves pre-training sample efficiency by 2–4 times and boosts generalization R^2 by 22.4%. Finally, an extensive study with 108 human experts shows that EIC aligns with human preferences in 70% of cases, validating structural stability as a critical prerequisite for human-perceived interpretability. We release our code at https://anonymous.4open.science/r/EIC-91B2.

Abstract:
Multimodal large language models (MLLMs) excel at highlevel reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce R-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.

Abstract:
Looped transformers scale computational depth independent of parameter count by repeatedly applying the same layer. However, training these models over long horizons creates significant optimization challenges. Specifically, it is difficult for looped transformers that start from noise to steer towards a potentially complex output without additional supervision. Diffusion models tackle this issue by corrupting data with varying magnitudes of noise and training the model to reverse it in a single step. However, this process misaligns training and testing behaviour. We introduce Denoising Recursion Models, a method that similarly corrupts data with noise but trains the model to reverse the corruption over multiple recursive steps. This strategy provides a tractable curriculum of intermediate states, while better aligning training with testing and incentivizing nongreedy, forward-looking generation. Through extensive experiments, we showed this approach outperformed the Tiny Recursion Model (TRM) on ARC-AGI, where it recently achieved breakthrough performance.

Abstract:
We introduce OPOCMDP, the first policy optimization algorithm for stochastic Contextual Markov Decision Process (CMDPs) under general offline function approximation. Our approach achieves a high probability regret bound of \widetildeO(H^4\sqrtT|S||A|\log(|\mathcalF||\mathcalP|)), where S and A denote the state and action spaces, H the horizon length, T the number of episodes, and \mathcalF, \mathcalP the finite function classes used to approximate the losses and dynamics, respectively. This is the first regret bound with optimal dependence on |S| and |A|, directly improving the current state-of-the-art (Qian, Hu, and Simchi-Levi, 2024). These results demonstrate that optimistic policy optimization provides a natural, computationally superior and theoretically near-optimal path for solving CMDPs.

Abstract:
We study clustered multitask learning in a semiparametric setting where tasks share a latent cluster structure in their target parameters but exhibit heterogeneous, potentially infinitedimensional nuisance components. Such heterogeneity poses a major challenge for existing multitask learning methods, which typically rely on aligned feature spaces or homogeneous task structures. To address this challenge, we propose anadaptive fused orthogonal estimatorthat integrates Neyman-orthogonal losses with data-driven pairwise fusion penalties. Our framework leverages task-specific pilot estimates to calibrate the fusion penalties and combines adaptive aggregation with orthogonalization to mitigate the impact of nuisance-parameter estimation error. Theoretically, we show that the proposed estimator achieves exact recovery of the latent clustering with high probability and attains pooled parametric convergence rates proportional to cluster size. Moreover, we establish asymptotic normality and show that, asymptotically, our estimator matches the performance of an oracle procedure that knows the true clustering in advance. Empirically, we show that the proposed method consistently outperforms strong baselines in various simulation setups. A real-world application to U.S. residential energy consumption further demonstrates the effectiveness of our approach in uncovering meaningful regional clustering in electricity price elasticity, showcasing the efficacy of our method.

Abstract:
For approximating a target distribution given only its unnormalized logdensity, stochastic gradient-based variational inference (VI) algorithms are a popular approach. For example, Wasserstein VI (WVI) and black-box VI (BBVI) perform gradient descent in measure space (Bures-Wasserstein space) and parameter space, respectively. Previously, for the Gaussian variational family, convergence guarantees for WVI have shown superiority over existing results for black-box VI with the reparametrization gradient, suggesting the measure space approach might provide some unique benefits. In this work, however, we close this gap by obtaining identical state-of-the-art iteration complexity guarantees for both. In particular, we identify that WVI's superiority stems from the specific gradient estimator it uses, which BBVI can also leverage with minor modifications. The estimator in question is usually associated with Price's theorem and utilizes second-order information (Hessians) of the target log-density. We will refer to this as Price's gradient. On the flip side, WVI can be made more widely applicable by using the reparametrization gradient, which requires only gradients of the log-density. We empirically demonstrate that the use of Price's gradient is the major source of performance improvement.

Abstract:
This paper investigates the scaling properties of autoregressive nextpixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at resolutions of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation quality measured by Fr'echet Distance. First, optimal scaling strategy is critically task-dependent. At a fixed 32x32 resolution alone, the optimal scaling properties for image classification and image generation diverge, where generation optimal setup requires the data size grow three to five times faster than for the classification optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years.

Abstract:
Finetuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the remaining. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.

Abstract:
Adversarial imitation learning (AIL) achieves highquality imitation by mitigating compounding errors in behavioral cloning (BC), but often exhibits training instability due to adversarial optimization. To avoid this issue, a class of non-adversarial Q-based imitation learning (IL) methods, represented by IQ-Learn, has emerged and is widely believed to outperform BC by leveraging online environment interactions. However, this paper revisits IQ-Learn and demonstrates that it provably reduces to BC and suffers from an imitation gap lower bound with quadratic dependence on horizon, therefore still suffering from compounding errors. Theoretical analysis reveals that, despite using online interactions, IQ-Learn uniformly suppresses the Q-values for all actions on states uncovered by demonstrations, thereby failing to generalize. To address this limitation, we introduce a primal-dual framework for distribution matching, yielding a new Q-based IL method, Dual Q-DM. The key mechanism in Dual Q-DM is incorporating Bellman constraints to propagate high Q-values from visited states to unvisited ones, thereby achieving generalization beyond demonstrations. We prove that Dual Q-DM is equivalent to AIL and can recover expert actions beyond demonstrations, thereby mitigating compounding errors. To the best of our knowledge, Dual Q-DM is the first non-adversarial IL method that is theoretically guaranteed to eliminate compounding errors. Experimental results further corroborate our theoretical results.

Abstract:
Causal machine learning (Causal ML) aims to answer "what if" questions using machine learning algorithms, making it a promising tool for highstakes decision-making. Yet, empirical evaluation practices in Causal ML remain limited. Existing benchmarks often rely on a handful of hand-crafted or semi-synthetic datasets, leading to brittle, non-generalizable conclusions. To bridge this gap, we introduce CausalProfiler, a synthetic benchmark generator for Causal ML methods. Based on a set of explicit design choices about the class of causal models, queries, and data considered, the CausalProfiler randomly samples causal models, data, queries, and ground truths constituting the synthetic causal benchmarks. In this way, Causal ML methods can be rigorously and transparently evaluated under a variety of conditions. This work offers the first random generator of synthetic causal benchmarks with coverage guarantees and transparent assumptions operating on the three levels of causal reasoning: observation, intervention, and counterfactual. We demonstrate its utility by evaluating several state-of-the-art methods under diverse conditions and assumptions, both in and out of the identification regime, illustrating the types of analyses and insights the CausalProfiler enables.

Abstract:
The impressive performance of generalist large language models (LLMs) such as GPT4 and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent drives a paradigm shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.

Abstract:
As large language models continue to scale, lowbit weight-only post-training quantization (PTQ) offers a practical solution to their memory-efficient deployment. Although block-wise PTQ is capable of matching the full-precision (FP) baseline on basic language modeling and understanding, its quality is degraded for generative tasks---especially at longer responses and extended chains of thought, which is critical in boosting task accuracy. We attribute this shortfall to two factors: (i) the omission of the unembedding layer (the LM head) in block-wise optimization and (ii) the reliance on the mean squared error (MSE) objective. Both factors cause the token probability distribution of the quantized model to misalign with that of the FP model, yielding notable accuracy drops on text generation benchmarks. To rectify the discrepancy, we introduce \emphLogit-aware Final-block Quantization (LFQ), a simple yet effective enhancement to block-wise PTQ that quantizes the final Transformer block by minimizing the cross-entropy between the logits of the FP model and those of its quantized counterpart. By aligning token probabilities at the logit level in the final block, LFQ consistently improves the accuracy of complex generation tasks over state-of-the-art block-wise PTQ across diverse model families and text generation tasks, while maintaining parity with FP baselines on language modeling and understanding.

Abstract:
Large VisionLanguage Models (LVLMs) represent a significant leap towards empathetic agents, demonstrating remarkable capabilities in emotion understanding. However, the internal mechanisms governing how LVLMs translate abstract visual stimuli into coherent emotional narratives remain largely unexplored, primarily due to the scarcity of visual counterfactuals and the diffuse nature of emotional expression. In this paper, we bridge this gap by introducing a steering-vector-based causal attribution framework tailored for descriptive emotional reasoning. To this end, we construct a specialized dataset to demystify the emotional circuits underlying the three-stage '' Adapt-Aggregate-Execute'' mechanism. Crucially, we discover a functional decoupling: visual emotional cues are aggregated in middle layers via sentiment-specific attention heads, but are subsequently translated into narrative generation in deep layers through emotion-general pathways. Guided by these insights, we regulate the emotional information routing to strengthen attention flow and amplify the semantic activation to consolidate expression. Extensive experiments on the comprehensive MER-UniBench demonstrate that our methods significantly improve performance via inference-time intervention, effectively mitigating emotional hallucinations and corroborating the causal fidelity of the discovered circuits.

Abstract:
Online advertising platforms rely on machine learning models to predict clickthrough rates (pCTR) and conversion rates (pCVR) for auction mechanisms. We introduce a novel framework to study the interaction between recommender system model quality, auction format, and au- tobidder behavior. We formalize when model improvements—defined via a refinement relation inspired by filtrations in probability theory—lead to improvements in platform-level Evaluation Cri- teria Metrics (ECM) such as revenue, welfare, or liquid welfare. Our main contributions are: (1) a formal definition of model improvement based on cluster refinement, and (2) a complete charac- terization of ECM monotonicity across different combinations of bidder types (tCPA, max-CPA), auction formats (first-price, second-price, VCG), and budget constraints. We show that first-price auctions with uniform bidding guarantee revenue monotonicity for tCPA bidders without budgets (via Jensen’s inequality), while second-price auc- tions and budget constraints can break this prop- erty. We provide full numerical counterexamples for all negative results. Our findings have practi- cal implications for advertising platforms seeking to align model improvements with business out- comes.

Abstract:
Twoplayer games such as board games have long been used as traditional benchmark for reinforcement learning. This work revisits a regularized policy optimization with reverse Kullback-Leibler divergence and entropy divergence and analyzes this combination on two-player zero-sum settings from theoretical and empirical perspectives. From a theoretical perspective, we investigate the stability of the policy update rule on two theoretical settings: game-theoretic normal-form games and finite-length games. We provide convergence guarantees and verify our theoretical results by numerical experiments on synthetic games. From an empirical perspective, we derive a practical model-free reinforcement learning algorithm based on the regularized policy optimization. We validate the efficiency in training of our algorithm through comprehensive experiments on five board games: Animal Shogi, Gardner Chess, Go, Hex, and Othello. The experimental results demonstrate that our agent achieves more efficient learning than existing methods across the environments.

Abstract:
Large diffusion vision–language models (LDVLMs) have recently demonstrated competitive performance on multimodal tasks, emerging as a promising alternative to autoregressive models. They enable parallel decoding for efficient inference and leverage bidirectional attention to capture global context. Despite these advances, their behavior under longform generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and lead to degraded visual grounding. Through analysis, we identify two underlying causes of these failures. First, repetitive generation originates from a mask token prior. Because generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment exists between the positional attention bias and the iterative unmasking process. This discrepancy suppresses the model's attention toward informative visual tokens, leading to degradation in visual grounding. Based on these insights, we propose a training-free approach that mitigates both issues. Specifically, we introduce Mask Prior Suppression and Monotonic RoPE Scaling, which mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Overall, our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

Abstract:
Parameterbased knowledge editing updates the internal knowledge of large language models (LLMs) via localized weight modifications and has attracted significant attention. However, most existing methods overlook fundamental theoretical limitations and are rarely evaluated under realistic, practice-oriented settings. In this paper, we first present a theoretical analysis based on the Representation Space Collapse Hypothesis, explaining how localized parameter edits can propagate along fragile directions in the representation space, inducing global interference and ultimately causing reasoning collapse. Building on this insight, we conduct a comprehensive empirical evaluation by systematically varying knowledge complexity, number of edits, evaluation dimensions, and baseline methods. Our results show that parameter-based editing methods consistently damage core LLM capabilities. In contrast, a simple retrieval-based baseline reliably outperforms all parameter-editing methods across all evaluated conditions. These findings highlight that preserving the fundamental capabilities of LLMs after knowledge editing should be a central concern for future research. Data and code are provided in the supplementary material.

Abstract:
Standard cells form the building blocks of digital circuits, so their delay and power critically influence chiplevel performance; yet characterization (can be understood as evaluation of cell delay and power) still relies on slow simulation sweeps, and many fast predictors ignore layout geometry, missing coupling and layout-dependent effects. The challenge is to jointly represent layout geometry and netlist topology so models capture fine-grained spatial details together with structural connectivity for accurate performance prediction. We introduce FusionCell, a dual-modality predictor that treats routed layout geometry and netlist topology as inputs and fuses them explicitly in a unified model. A DeiT encoder processes three-layer routed layouts, while a graph transformer models heterogeneous device/net graphs. The modalities are integrated through a topology-guided mechanism, where the netlist acts as a structural ``map'' to actively query relevant physical regions in the layout for joint geometric and topological reasoning. We build a 7nm dataset based on the ASAP7 PDK with over 19.5k cells spanning 149 types using automatic tools, targeting six metrics: signal rise/fall delay, transition, and power. Experimental results demonstrate that FusionCell reduces regression error (average MAPE 0.92%) and improves Spearman/Kendall ranking over baselines, while accelerating the characterization process by orders of magnitude compared to circuit simulation.

Abstract:
Reinforcement learning with verifiable rewards (RLVR) efficiently scales the reasoning ability of large language models but is bottlenecked by scarce labeled data. Reinforcement learning with intrinsic rewards (RLIR) offers a scalable alternative via selfrewarding, yet often suffers from instability and inferior performance. We trace this gap to a systemic bias in confidence-coupled self-rewarding: the model tends to over-reward high-confidence mistakes, forming a self-confirming loop. We quantify this feedback-loop bias with three metrics: reward noise magnitude (\rho_\textnoise), policy–reward coupling (\rho_\textselfbias), and over-/under-reward skew (\rho_\textsymbias). Our analyses show a compounding effect where strong coupling amplifies confidence-conditioned errors and drives a drift toward over-reward, leading to instability and a lower performance ceiling. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models with adaptive reward interpolation and disagreement-aware rollout selection to reduce coupling and suppress over-reward drift. Extensive experiments show that RLER improves by 13.6% over the best RLIR baseline and is within 3.6% of RLVR, while exhibiting stable scaling on unlabeled samples.

Abstract:
Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, have successfully guided model development but fail to explain emerging nonmonotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon–Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through extensive experiments on the Pythia and OLMo2 model suites under diverse perturbations, including Gaussian noise, quantization and supervised finetuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong R^2 scores and accurately capturing loss basins missed by prior approaches. Our results suggest that SNR-aware scaling is essential for robust and efficient model growth, providing a principled foundation for future scaling strategies.

Abstract:
We present a regressionadjustment framework designed to estimate longitudinal treatment effects in randomized experiments under static regimes. Although regression-adjustment methods are useful for variance reduction in randomized experiments through the use of pre-treatment covariates, they usually focus only on average effects, which cannot capture valuable knowledge about when effects appear and how long they continue. To address this limitation, we need to consider intermediate outcomes and evolving post-treatment covariates over time, and we represent these transitions using transition kernels. Furthermore, we establish the asymptotic normality and the semiparametric efficiency bound for our estimator, enabling more powerful statistical inference. Simulation studies and empirical analysis using A/B test data from a streaming platform in Japan demonstrate the practical advantages of our method.

Abstract:
The quantum approximate optimization algorithm (QAOA) holds promise for combinatorial optimization but is constrained by limited qubits. While divideand-conquer frameworks like QAOA^2 address scalability by partitioning graphs into subgraphs, existing methods suffer from two fundamental limitations: i) misalignment between heuristic partitioning metrics and quantum optimization goals, and ii) topology-blind parameter initialization that leads to optimization cold starts. To bridge these gaps, we propose Neural QAOA^2, an end-to-end differentiable framework that jointly generates graph partitions and initial parameters. By integrating a generative evaluative network (GEN), our method utilizes a differentiable quantum evaluator as a high-fidelity performance surrogate to provide direct gradient guidance, enabling the joint generator to learn the intrinsic mapping from graph topology to high-quality partition and parameter configurations. Extensive experiments on 183 QUBO, Ising, and MaxCut instances (21 to 1000 variables) demonstrate that our gradient-driven approach significantly outperforms heuristic baselines, ranking first on 101 instances. It exhibits zero-shot generalization across out-of-distribution graph topologies and scales.

Abstract:
Supervised finetuning is the predominant approach for adapting autoregressive vision–language models to downstream tasks. Recent work has shown that this paradigm is highly vulnerable to backdoor attacks, and that existing defenses are ineffective in open-ended generation settings. In response, we propose BYORn, a backdoor-robust fine-tuning framework motivated by the observation that poisoned target responses are often semantically implausible given the corresponding image–text inputs and a pretrained model. BYORn identifies such misaligned responses and dynamically replaces them with alternative responses generated by the model, thereby breaking the correlation between triggers and target outputs. The resulting objective gradient corresponds to the gradient of the empirical estimate of the population risk upper bound over the clean data distribution. Empirically, BYORn consistently improves robustness to backdoor attacks while preserving clean-task performance, establishing a new trade-off frontier between generalization and attack success rate. Finally, we demonstrate that BYORn remains effective against adaptive attacks specifically designed to circumvent the proposed defense.

Abstract:
Crossdomain offline reinforcement learning (RL) aims to train an agent that performs well in the target domain using a limited target domain dataset and a source domain dataset that exhibits a dynamics shift. Training directly on the original source dataset typically leads to performance collapse. Recent studies perform data filtering from the perspective of dynamics alignment or value alignment to enable efficient policy transfer. However, these studies are typically validated on single-domain or single-behavior-policy source datasets. In this work, we explore a more general setting where the source datasets may be collected from multiple source domains by diverse behavior policies, which we name heterogeneous cross-domain offline RL. We first uncover a critical yet overlooked issue in this setting: value misassignment. Empirically and theoretically, we demonstrate that value misassignment can undermine value alignment, mislead data filtering toward selecting suboptimal samples, and loosen the suboptimality gap, thereby degrading the agent’s performance. To address this issue, we propose V2A, a simple yet effective framework that integrates dynamics alignment, value alignment, and value assignment. V2A first employs temporally-consistent modality representation learning to extract dynamics modalities from the source dataset, followed by modality-aware advantage learning to rectify value alignment. Finally, it adopts a data filtering paradigm to selectively share source data for policy learning. Empirical results show that under general heterogeneous cross-domain offline RL settings, V2A significantly outperforms strong baseline methods and exhibits excellent performance across multiple tasks and datasets.

Abstract:
Many tasks in modern machine learning are observed to involve heavytailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the convergence of first-order algorithms. However, adaptive gradient methods, a famous class of modern optimizers that includes popular \mathttAdam and \mathttAdamW, often perform well even without any extra operations mentioned above. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. In this work, we take the first step toward answering this question by investigating a special case, \mathttAdaGrad, the origin of adaptive gradient methods. We provide the first provable convergence rate for \mathttAdaGrad in non-convex optimization when the tail index p satisfies \frac43

Abstract:
Stochastic gradient Langevin dynamics combined with Gibbs updates (SGLDGibbs) provides a highly scalable approach to approximate Bayesian inference in latent variable models. However, it remains unclear how to tune the algorithm's hyperparameters in a principled manner to ensure the uncertainty estimates are statistically meaningful. In this work, we address this gap in tuning guidance by developing a statistical scaling limit theory for SGLD-Gibbs. We derive a joint asymptotic limit for the global parameters and latent variables under appropriate space-time rescaling. We show that global parameters converge to a diffusion-type limit, while individual latent variables converge to a jump process reflecting their intermittent Gibbs updates. This joint jump-diffusion structure reveals how latent-variable randomness contributes to the stationary distribution of the global parameters. We leverage our results to provide explicit guidance on hyperparameter tuning for SGLD-Gibbs that ensures meaningful uncertainty quantification. Our empirical results show that SGLD-Gibbs with our tuning guidance leads to better parameter estimates and uncertainty quantification than stochastic variational inference.

Abstract:
In recent years, bilevel optimization (BLO) has attracted significant attention for its broad applications in machine learning. However, most existing works on BLO remain confined to the singletask setting and rely on the lower-level strong convexity assumption, which significantly restricts their applicability to modern machine learning problems of growing complexity. In this paper, we make the first attempt to extend BLO to the multi-task setting under a relaxed lower-level general convexity (LLGC) assumption. To this end, we reformulate the multi-task bilevel learning (MTBL) problem with LLGC into an equality constrained multi-objective optimization (ECMO) problem. However, ECMO itself is a new problem that has not yet been studied in the literature. To address this gap, we first establish a new Karush–Kuhn–Tucker (KKT)-based Pareto stationarity as the convergence criterion for ECMO algorithm design. Based on this foundation, we propose a weighted Chebyshev (WC)-penalty algorithm that achieves a finite-time convergence rate of \mathcalO(ST^-\frac12) to KKT-based Pareto stationarity in both deterministic and stochastic settings, where S denotes the number of objectives, and T is the total iterations. Moreover, by varying the preference vector over the S-dimensional simplex, our WC-penalty method systematically explores the Pareto front. Finally, solutions to the ECMO problem translate directly into solutions for the original MTBL problem, thereby closing the loop between these two foundational optimization frameworks.

Abstract:
Despite the rapid progress of VisionLanguage-Action (VLA) models, the prevailing paradigm of predicting discrete waypoints remains fundamentally misaligned with the intrinsic continuity of physical motion. This discretization imposes rigid sampling rates, lacks high-order differentiability, and introduces quantization artifacts that hinder precise, compliant interaction. We propose Neural Implicit Action Fields (NIAF), a paradigm shift that reformulates action prediction from discrete waypoints to continuous action function regression. By utilizing an MLLM as a hierarchical spectral modulator over a learnable motion prior, NIAF synthesizes infinite-resolution trajectories as continuous-time manifolds. This formulation enables Analytical Differentiability, allowing for explicit supervision of velocity, acceleration, and jerk to ensure mathematical consistency and physical plausibility. Our approach achieves state-of-the-art results on CALVIN and LIBERO benchmarks across diverse backbones. Furthermore, real-world experiments demonstrate that NIAF enables stable impedance control, bridging the gap between high-level semantic understanding and low-level dynamic execution.

Abstract:
Automated proving of polynomial inequalities is a fundamental challenge in automated mathematical reasoning, where rich algebraic structure and a rapidly growing certificate search space hinder scalability. Purely symbolic approaches provide strong guarantees but often scale poorly as the number of variables or the degree increases, due to expensive algebraic manipulations and rapidly growing intermediate expressions. In parallel, LLMguided methods have made notable progress, particularly on competition-style inequalities with a small number of variables. To address the remaining scalability challenges, we propose NSPI, a neuro-symbolic framework that combines the complementary strengths of LLMs and symbolic computation for polynomial-inequality proving. Concretely, an LLM proposes a conjecture in the form of an approximate polynomial Sum-Of-Squares (SOS) decomposition; we refine it via symbolic computation to obtain an exact polynomial SOS representation, which directly proves the target inequality, and we further certify the proof in Lean, yielding an end-to-end pipeline from heuristic discovery to machine-checked proof. Experiments on challenging benchmarks involving polynomials with up to 10 variables demonstrate the effectiveness and scalability of the proposed method.

Abstract:
Large language models (LLMs) now generate substantial production code, often for tasks with multiple valid algorithmic solutions. The hidden risk is that incidental prompt cues can steer \emphwhich algorithm is selected, even when all outputs pass the same tests. Prompt sensitivity is well studied as a tool to improve output quality, but we instead examine output policy: algorithm choice under fixed correctness. We define algorithm steering and run 55,545 controlled experiments across 11 tasks, 19 cue types (18 channels plus a memoization ablation), and 15 models. We find large, interpretable shifts in algorithmfamily distributions (up to 100 percentage points, pp), including on applied tasks such as rate limiting, yielding an ``invisible lottery'' in which accidental context alters performance, security, and maintainability.

Abstract:
Probabilistic models are typically trained using taskagnostic objectives like log-loss, which can lead to significant errors in downstream estimation. This disconnect is especially critical in Inverse Probability Weighting (IPW) for causal inference, where propensity score errors near 0 and 1 often lead to high bias and variance. We propose a principled framework for deriving task-specific strictly proper scoring rules by matching the local curvature of the downstream error metric. We apply this to the Average Treatment Effect (ATE) estimation, deriving a closed-form loss and its corresponding canonical probability mapping that can be readily integrated with any model like a neural network or a gradient boosting algorithm. Extensive evaluations on causal inference benchmarks demonstrate that our tailored objective consistently outperforms standard likelihood-based and covariate-balancing approaches.

Abstract:
Audio deepfake detection (ADD) models are critical for countering the malicious use of textto-speech (TTS) models. Evaluating and strengthening ADD models requires developing datasets that span the space of generated audio and highlight high-error regions. Existing dataset development strategies face two challenges: (i) manual collection, and (ii) inefficient discovery of blind spots in the ADD models. To address these challenges, we propose FoeGlass, the first black-box automated red-teaming method for ADDs, which effectively discovers ADD failure modes in the space of generated audio underexplored by state-of-the-art deepfake benchmarks. FoeGlass uses the in-context learning capabilities of an LLM to explore the input space of a TTS model, generating audio samples that fool the target ADD using only black-box access to all components. By using a carefully designed context based on diversity measurements, FoeGlass mitigates the common problem of mode collapse in automated red-teaming systems. Empirical evaluations on several open-source ADD and TTS models demonstrate that data generated from FoeGlass substantially improves the false negative rates over unconditional sampling baselines and recent spoofing datasets by up to 94%, while requiring no manual supervision. Furthermore, we show that the attacks generated by FoeGlass are transferable across different target ADDs, demonstrating its broad applicability and ease of use for the automated red teaming of ADD systems. Finally, fine-tuning ADD models on FoeGlass-generated samples notably enhances the robustness of the detectors (up 41%).

Abstract:
Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequencelevel rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models—a sequence-level assessor and a line-level credit allocator—from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over the base model), 24.8% on Defects4J (Python→Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs, showing competitive open-source performance with strong cross-language generalization.

Abstract:
Preference learning methods like Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today's language models and potentially severe goal misgeneralization in future systems. In this work, we provide a unified theoretical analysis of this phenomenon, characterizing the mechanisms of spurious learning, its consequences on deployment, and a provable mitigation strategy. Focusing on loglinear policies, we show that standard preference-learning objectives induce reliance on spurious features at the population level through two channels: mean spurious bias and causal-spurious correlation leakage. We then show that this reliance creates an irreducible vulnerability to distribution shift: more data from the same training distribution fails to reduce the model's dependence on spurious features. To address this, we proposetie training, a data augmentation strategy using ties (equal-utility preference pairs) to introduce data-driven regularization. We demonstrate that this approach selectively reduces spurious learning without degrading causal learning. Finally, we validate our theory on log-linear models and provide empirical evidence that both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models.

Abstract:
This work proposes a framework LGKDE that learns kernel density estimation for graphs. The key challenge in graph density estimation lies in effectively capturing both structural patterns and semantic variations while maintaining theoretical guarantees. Combining graph kernels and kernel density estimation (KDE) is a standard approach to graph density estimation, but has unsatisfactory performance due to the handcrafted and fixed features of kernels. Our method LGKDE leverages graph neural networks to represent each graph as a discrete distribution and utilizes maximum mean discrepancy to learn the graph metric for multiscale KDE, where all parameters are learned by maximizing the density of graphs relative to the density of their well-designed perturbed counterparts. The perturbations are conducted on both node features and graph spectra, which helps better characterize the boundary of normal density regions. Theoretically, we establish consistency and convergence guarantees for LGKDE, including bounds on the mean integrated squared error, robustness, and generalization. We validate LGKDE by demonstrating its effectiveness in recovering the underlying density of synthetic graph distributions and applying it to graph anomaly detection across diverse benchmark datasets. Extensive empirical evaluation shows that LGKDE demonstrates superior performance compared to state-of-the-art baselines on most benchmark datasets.

Abstract:
Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a humanlike attention distraction phenomenon, where humans under divided focus experience degraded visual clarity and produce inaccurate descriptions, while in models the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.

Abstract:
Deep forecasting models often suffer from attenuated periodic perception and entangled trend–noise representations as network depth increases. Moreover, the widely adopted channelindependent paradigm, while improving training stability, disrupts intrinsic dynamic coordination among variables, hindering the modeling of cross-variable consistency in multivariate time series. To address these issues, we propose PESD-TSF, a physics-inspired structured decomposition framework for long-term time series forecasting that jointly emphasizes interpretability and predictive accuracy. PESD-TSF introduces three key designs. First, a Multiplicative Periodic Gating mechanism incorporates continuous-time priors to dynamically modulate signal amplitudes, preserving periodic structures across deep layers. Second, a multi-scale structured encoder integrates detrended attention with hierarchical sampling to explicitly decouple long-term trends from high-frequency variations while retaining fine-grained temporal semantics. Third, to recover disrupted inter-variable dependencies, we propose Cross-Scale Collaborative Attention (CSCA) together with an RLC regularization scheme, which reconstructs global inter-variable topology in deep feature spaces and enforces physically consistent collaboration through orthogonality and consistency constraints. Extensive experiments on benchmark datasets from multiple domains demonstrate that PESD-TSF consistently achieves state-of-the-art performance, with particularly strong gains on multivariate forecasting tasks involving complex inter-variable coupling, highlighting its superior structural modeling capability and generalization.

Abstract:
Unlearning in diffusion models aims to remove undesirable data or concepts while preserving the utility of pretrained modelstwo fundamentally conflicting objectives. We propose a principled constrained optimization framework that formulates unlearning as minimizing the deviation from a pretrained model, subject to explicit separation constraints from the unlearning distributions. Specifically, we formulate three constrained optimization problems based on reverse and forward KL divergences, and likelihood constraints. The first two generalize existing approaches for concept and data unlearning, while the third offers a novel and natural formulation for unlearning. Despite the non-convexity of the KL constraints, we establish strong duality for all three problems, enabling us to explicitly characterize their optimal solutions as unlearning targets and develop primal–dual algorithms for each formulation. Experimental results demonstrate that our KL-constrained approach achieves superior retaining-unlearning trade-offs compared to weight-based baselines for concept and data unlearning, and that our likelihood-based approach matches unlearning effectiveness while better preserving retained concepts compared to baselines.

Abstract:
LLMbased multi-agent systems can fail even when planned actions are executed correctly because agents may misjudge their knowledge when evaluating plan feasibility, a phenomenon we term epistemic miscalibration in planning. Unlike execution errors, epistemic miscalibration is latent during planning, as generated plans can remain self-consistent and executable without observable errors; the miscalibration is also dynamic, as new information can alter feasibility assessments, potentially obscuring past miscalibration signals and causing them to recur over time. To address this, we propose the Epistemic Planning Calibration Agentic Workflow (EPC-AW), which assesses whether plans remain supported under varying information conditions rather than directly verifying feasibility. EPC-AW employs Information-consistency-based Plan Selection, selecting plans whose evaluations are stable across agents, together with Consistency-guided Epistemic State Refinement to adapt calibration over time by leveraging past discrepancies to guide future planning. Experiments show that EPC-AW improves system-level success by an average of 9.75%.

Abstract:
When reliable target structures are unavailable at scale or phenotypes arise from dysregulated pathways, transcriptomic perturbations provide a systemlevel functional readout for drug action. In this work, we formalize Transcriptome-based Drug Design (TBDD) as a generative inverse problem: designing drug molecules conditioned on desired transcriptomic state transitions. We analyze the inherently ill-posed nature of this task, which is further complicated by the profound domain gap between biology and chemistry and by the sparsity of transcriptomic signals. To address these challenges, we propose CURE (A CellUlar Response Engine), a multi-resolution transcriptome-guided diffusion framework. CURE features a specialized Transcriptome Perturbation Functional Feature Extractor (TFE) that (1) distills function-oriented perturbation embeddings from pre/post states, (2) aligns these signatures to dual chemical views to bridge the cross-modal gap, and (3) performs heterogeneity-aware aggregation to extract robust state-specific signals from noisy transcriptomic data. Extensive evaluations on both standard benchmarks and rigorous out-of-distribution protocols demonstrate that CURE consistently outperforms strong baselines in structural quality and functional consistency. Furthermore, we validate its practical utility via a zero-shot gene-inhibitor design task, highlighting the potential of phenotype-driven generative discovery.

Abstract:
Video generation models offer a promising imagination mechanism for robot manipulation by predicting longhorizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages multiple pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization. Code will be released.

Abstract:
Watermarking embeds statistical signatures in AIgenerated text for detection and attribution. We reveal a fundamental vulnerability: when users access multiple models (today's reality), watermarks trivially fail. Watermarks perturb output distributions away from the original, and in competitive markets, these perturbations are typically independent across providers. We theoretically prove that averaging output probability distributions recovers the unwatermarked distribution with up to a second-order error term. Empirically, simply averaging 3-5 models cancels out these perturbations. We introduce WASH (Watermark Attenuation via Statistical Hybridisation), which solves practical challenges in ensemble generation: vocabulary misalignment and tokenisation differences across heterogeneous models. Experiments across six watermarking schemes and three LLMs show that detection z-scores drop from 5-300 tobelow 2(below the detection threshold of ~4) when averaging across 3 models, while improving quality by27.5%and running6×faster than the best baseline on the long sequence generation task. Our results suggest that robust AI-text detection via watermarking requires either accepting this fundamental vulnerability or unprecedented coordination among model providers.

Abstract:
Inferencetime scaling has emerged as a critical avenue for enhancing Large Language Model performance, yet real-world deployment is bound by strict computational budgets. In this work, we formulate inference budget allocation as a global constrained optimization problem governed by economic principles. By modeling reasoning utility as an S-shaped function, we derive a theoretical optimal policy based on a global shadow price that dynamically equilibrates resource scarcity. Based on this theory, we propose Difficulty-Aware Budget Allocation (DABA), a market-based mechanism that numerically solves for the exact market-clearing price. Unlike standard methods, DABA implements a Lambert W policy to execute strategic abandonment, sacrificing insolvent tasks to redistribute critical computational resources to solvable complex queries. Extensive experiments on mathematical reasoning benchmarks demonstrate that DABA significantly improves the Pareto frontier of cost versus accuracy. In resource-scarce regimes, DABA achieves up to a 3 times improvement in global accuracy compared to uniform allocation.

Abstract:
Relational prediction tasks are fundamental in many realworld applications, where data are naturally stored in relational databases (RDBs). Relational Deep Learning (RDL) addresses this problem by modeling RDBs as graphs and applying graph neural networks (GNNs) for end-to-end learning. However, graph construction in RDBs must satisfy the full-resolution property to preserve relational semantics, forcing most existing methods to rely on fixed graph structures that limit message passing flexibility and representation learning in downstream GNNs and decrease performance on downstream tasks. In this paper, we proposeFROG, aFull-Resolution andOptimizableGraph Structure Learning framework for RDL that formulates relational structure learning as a learnable table role modeling problem, allowing tables to contribute as nodes and edges in message passing. We further design role-driven message passing mechanisms to capture relational semantics, enabling joint optimization of graph structure and GNN representations. To ensure semantic consistency, we introduce functional dependency constraints that regularize representations across table and entity levels. Extensive experiments demonstrate that our method outperforms existing approaches and reveal how table roles impact downstream tasks, offering new insights into graph construction for RDL.

Abstract:
Reconstructionbased methods are a dominant paradigm in time series anomaly detection (TSAD), however, their near-universal reliance on Mean Squared Error (MSE) loss results in statistically flawed reconstruction residuals. This fundamental weakness leads to noisy, unstable anomaly scores, hindering reliable detection. To address this, we propose Constrained Gaussian-Noise Optimization and Smoothing (COGNOS), a universal, model-agnostic enhancement framework that tackles this issue at its source. COGNOS introduces a novel Gaussian-White Noise Regularization strategy during training, which directly constrains the model's output residuals to conform to a Gaussian white noise distribution. This engineered statistical property creates the ideal precondition for our second contribution: Adaptive Residual Kalman Smoother that provably operates as a statistically robust estimator to denoise the raw anomaly scores. Extensive experiments on multiple benchmarks demonstrate that COGNOS consistently enhances the performance of state-of-the-art backbones significantly, validating the efficacy of coupling statistical regularization with adaptive filtering.

Abstract:
Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a largescale empirical study of AI grading on real, handwritten single-variable calculus work from a major U.S. public research university. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions from nearly 800 students. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated as correct or acceptable across quizzes. Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure modes, propose practical rubric- and prompt-design principles, and introduce a multi-perspective evaluation protocol for reliable, real-course deployment. Building on the dataset and evaluation framework developed here, we outline a standardized benchmark for AI grading of handwritten mathematics to support reproducible comparison and future research.

Abstract:
How can we learn the laws underlying the dynamics of stochastic systems when their trajectories are sampled sparsely in time? Existing methods either require temporally resolved highfrequency observations, or rely on geometric arguments that apply only to conservative systems, limiting the range of dynamics they can recover. Here, we present a new framework that reconciles these two perspectives by reformulating inference as a stochastic control problem. Our method uses geometry-driven path augmentation, guided by structure in the system’s invariant density to reconstruct likely trajectories and infer the underlying dynamics without assuming specific parametric models. Applied to overdamped Langevin systems, our approach accurately recovers stochastic dynamics even from severely undersampled data, outperforming existing methods in synthetic benchmarks. This work demonstrates the effectiveness of incorporating geometric inductive biases into stochastic system identification methods, with broad applications across physics, biology, and control.

Abstract:
Large language models (LLMs) have shown promising potential in materials science, enabling tasks ranging from knowledge retrieval to property prediction. Existing materials science benchmarks mainly focus on perceptual or knowledgebased tasks, largely ignoring the structure modelling tasks, a core challenge in real scientific workflows. In practice, constructing and manipulating atomic structures is one of the most creative and least automated steps in materials research. In this work, we introduce AtomWorld, a benchmark designed to evaluate the abilities of LLMs on structure modifications. The benchmark includes ten fundamental actions under four widely used modelling categories, enabling verifiable evaluation metrics. We find that Gemini 2.5 Pro generally performs the best. While the success rate decreases markedly with increasing modelling complexity, with particularly low success rates (below 12% for rotation) for operations involving complex spatial relations. Our results suggest that contemporary LLMs are better suited as copilots for materials structure modelling rather than fully unsupervised autonomous scientific agents. Beyond evaluation, AtomWorld also serves as a testbed and playground for developing future structure-aware models, including reinforcement learning and agentic approaches.

Abstract:
Diffusionbased language models(dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure modes in confidence-based non-autoregressive generation stem from a strong proximity bias—the denoising order tends to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.

Abstract:
Sparse attention reduces the quadratic complexity of full selfattention but faces two challenges: (1) an attention gap, where applying sparse attention to full-attention-trained models causes performance degradation due to train-inference distribution mismatch, and (2) a capability gap, where models trained purely with sparse attention lack complete gradient flow, preventing them from matching full-attention performance. We propose SSA (Sparse Sparse Attention), a training framework that integrates both sparse and full attention with bidirectional attention-output alignment. We prove that the approximation error scales linearly with the attention mass dropped under sparse attention, and show that SSA's alignment objective substantially reduces this quantity compared to baselines. Experiments demonstrate that SSA achieves state-of-the-art performance under both inference modes, adapts smoothly to varying sparsity budgets, and demonstrates superior long-context capabilities.

Abstract:
During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating highquality responses. Inspired by this cognitive phenomenon, we propose a novelFull-duplexLAtent andInternalReasoning method named FLAIR that conductslatentthinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user’s speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

Abstract:
Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPTOSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally construct a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.

Abstract:
We consider maxmin and min-max problems with objective functions that are possibly non-smooth, submodular with respect to the minimiser and concave with respect to the maximiser. We investigate the performance of a zeroth-order method applied to this problem. The method is based on the subgradient of the Lovász extension of the objective function with respect to the minimiser and based on Gaussian smoothing to estimate the smoothed function gradient with respect to the maximiser. In expectation sense, we prove the convergence of the algorithm to an \epsilon-saddle point in the offline case. Moreover, we show that, in the expectation sense, in the online setting, the algorithm achieves O(\sqrtN\barP_N) online duality gap, where N is the number of iterations and \barP_N is the path length of the sequence of optimal decisions. The complexity analysis and hyperparameter selection are presented for all the cases. The theoretical results are illustrated via numerical examples.

Abstract:
Finetuned LLMs can covertly encode prompt secrets into outputs via steganographic channels. Prior work demonstrated this threat but relied on trivially recoverable encodings. We formalize payload recoverability via classifier accuracy and show previous schemes achieve 100% recoverability. In response, we introduce low-recoverability steganography, replacing arbitrary mappings with embedding-space-derived ones. For Llama-8B (LoRA) and Ministral-8B (LoRA) trained on TrojanStego prompts, exact secret recovery rises from 17\rightarrow30% (+78%) and 24\rightarrow43% (+80%) respectively, while on Llama-70B (LoRA) trained on Wiki prompts, it climbs from 9\rightarrow19% (+123%), all while reducing payload recoverability. We then discuss detection. We argue that detecting fine-tuning-based steganographic attacks requires approaches beyond traditional steganalysis. Standard approaches measure distributional shift, which is an expected side-effect of fine-tuning. Instead, we propose a mechanistic interpretability approach: linear probes trained on later-layer activations detect the secret with up to 33% higher accuracy in fine-tuned models compared to base models, even for low-recoverability schemes. This suggests that malicious fine-tuning leaves actionable internal signatures amenable to interpretability-based defenses.

Abstract:
Theoretical studies show that for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and gradients. However, such a result cannot be used in practice since it assumes real parameters and exact internal operations. In contrast, real implementations only use a finite subset of reals and machine operations with roundoff errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm D^\mathttAD. We first show that given a floating-point function \phi (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network f and D^\mathttAD(\phi\circ f), respectively. We further extend this result: given \phi_1,\dots,\phi_n, D^\mathttAD(\phi_i\circ f) can simultaneously represent arbitrary gradients while f represents the target values, under mild conditions. Our results hold for practical activation functions, e.g., ReLU, ELU, GELU, Swish, Sigmoid, and tanh.

Abstract:
Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on humandefined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.

Abstract:
Mixtureof-Experts (MoE) based Large Language Models (LLMs) have achieved superior performance, yet the massive memory overhead caused by storing multiple expert networks severely hinders their practical deployment. Singular Value Decomposition (SVD)-based compression has emerged as a promising post-training technique; however, most existing methods apply uniform rank allocation or rely solely on static weight properties. This overlooks the substantial heterogeneity in expert utilization observed in MoE models, where frequent routing patterns and intrinsic information density vary significantly across experts. In this work, we propose RFID-MoE, an effective framework for MoE compression by exploiting heterogeneous Routing Frequency and Information Density. We first introduce a fused metric that combines expert activation frequency with effective rank to measure expert importance, adaptively allocating higher ranks to critical expert groups under a fixed budget. Moreover, instead of discarding compression residuals, we reconstruct them via a parameter-efficient sparse projection mechanism to recover lost information with minimal parameter overhead. Extensive experiments on representative MoE LLMs (e.g., Qwen3, DeepSeekMoE) across multiple compression ratios demonstrate that RFID-MoE consistently outperforms state-of-the-art methods like MoBE and D2-MoE. Notably, RFID-MoE achieves a perplexity of 16.92 on PTB with the Qwen3-30B model at a 60% compression ratio, reducing perplexity by over 8.0 compared to baselines, and improves zero-shot accuracy on HellaSwag by approximately 8%.

Abstract:
We propose a modular, datadriven framework for jointly learning unknown functional mappings and discovering the underlying one-parameter symmetry subgroup governing the data. Unlike conventional geometric deep learning methods that assume known symmetries, our approach identifies the relevant continuous subgroup directly from data. We consider the broad class of one-parameter subgroups, which admit a canonical geometric classification into three regimes: elliptical, hyperbolic, and parabolic. Given an assumed regime, our framework instantiates a corresponding symmetry discovery architecture with invariant and equivariant representation layers structured according to the Lie algebra of the subgroup, and learns the exact generator parameters end-to-end from data. This yields models whose invariance or equivariance is guaranteed by construction and admits formal proofs, enabling symmetry to be explicitly traced to identifiable components of the architecture. The approach is applicable to one-parameter subgroups of a wide range of matrix Lie groups, including SO(n), SL(n), and the Lorentz group. Experiments on synthetic and real-world systems—including moment of inertia prediction, double-pendulum dynamics, and high-energy Top Quark Tagging—demonstrate accurate subgroup recovery and strong predictive performance across both compact and non-compact regimes.

Abstract:
Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the longterm impact of selected data if the model is trained for a long period of time, primarily due to the prohibitive cost of full-scale LLM pretraining. In this paper, we introduce BLISS (BileveL Influence Scoring method for data Selection): a lightweight data selection method that operates entirely \emphfrom scratch, without relying on any external pretrained oracle models, while explicitly accounting for the long-term impact of selected data. BLISS leverages a small proxy model as a surrogate for the LLM and employs a score model to estimate the long-term influence of training samples if the proxy model is trained to convergence. We formulate data selection as a bilevel optimization problem, where the upper-level objective optimizes the score model to assign importance weights to training samples, ensuring that minimizing the lower-level objective (i.e., training the proxy model over the weighted training loss until convergence) leads to best validation performance. Once optimized, the trained score model predicts influence scores for the dataset, enabling efficient selection of high-quality samples for LLM pretraining. We validate BLISS by pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on selected subsets of the C4 dataset. Notably, under the 1B model setting, BLISS achieves 1.7× speedup in reaching the same performance as the state-of-the-art method, demonstrating superior performance across multiple downstream tasks.

Abstract:
Typical reinforcement learning (RL) methods for LLM reasoning waste compute on hard problems, where correct onpolicy traces are rare and policy gradients vanish. To bootstrap more efficient RL, we consider reusing old sampling FLOPs (from prior inference or RL training) in the form of off-policy traces. We introduce PrefixRL, where we condition on the prefix of successful off-policy traces and run on-policy RL to complete them, side-stepping instabilities from using off-policy data as supervision targets. PrefixRL boosts the learning signal on hard problems by modulating the difficulty of the problem through the off-policy prefix length. We prove that the PrefixRL objective is not only consistent with the standard RL objective but also more sample efficient. Empirically, we discover back-generalization: training only on prefixed problems generalizes to out-of-distribution unprefixed performance, with learned strategies often differing from those in the prefix. In our experiments, we source the off-policy traces by rejection sampling with the base model, creating a self-improvement loop. On hard reasoning problems, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data then RL), even after accounting for the compute spent on the initial rejection sampling, and increases the final reward by 3x.

Abstract:
While InfoNCE powers modern contrastive learning, its geometric mechanisms remain undercharacterized beyond the canonical alignment--uniformity decomposition. We present a measure-theoretic framework that models learning as the evolution of representation measures on a fixed embedding manifold. By establishing value and gradient consistency in the large-batch limit, we bridge the stochastic objective to explicit deterministic energy landscapes, uncovering a fundamental geometric bifurcation between the unimodal and multimodal regimes. In the unimodal setting, the intrinsic landscape is strictly convex with a unique Gibbs equilibrium; here, entropy acts merely as a tie-breaker, clarifying "uniformity" as a constrained expansion within the alignment basin. In contrast, the symmetric multimodal objective contains a persistent negative symmetric divergence term that remains even after kernel sharpening. We show that this term induces barrier-driven co-adaptation, enforcing a population-level modality gap as a structural geometric necessity rather than an initialization artifact. Our results shift the analytical lens from pointwise discrimination to population geometry, offering a principled basis for diagnosing and controlling distributional misalignment.

Abstract:
To maximize hardware utilization, modern machine learning systems typically employ large constant or manually tuned batch size schedules, relying on heuristics that are brittle and costly to tune. Existing adaptive strategies based on gradient noise scale (GNS) offer a principled alternative. However, their assumption of SGD's Euclidean geometry creates a fundamental mismatch with popular optimizers based on generalized norms, such as signSGD / Signum (\ell_\infty) and stochastic spectral descent (specSGD) / Muon (Schatten\infty). In this work, we derive non-Euclidean gradient noise scales for signSGD and spectral descent that naturally emerge from the geometry of its dual norm. We propose an efficient variance estimation procedure that leverages distributed data-parallel systems with negligible overhead. Our experiments demonstrate that adaptive batch size strategies using non-Euclidean GNS enable us to match the validation loss of constant-batch baselines while reducing training steps by up to 66% for Signum and Muon.

Abstract:
The Densest kSubgraph (DkS) is a fundamental combinatorial problem known for its theoretical hardness and breadth of applications. Recently, Lu et al. (AAAI 2025) introduced a penalty-based non-convex relaxation that achieves promising empirical performance; however, a rigorous theoretical understanding of its success remains unclear. In this work, we bridge this gap by providing a comprehensive theoretical analysis. We first establish the tightness of the relaxation, ensuring that the global maximum values of the original combinatorial problem and the relaxed problem coincide. Then we reveal the benign geometry of the optimization landscape by proving a strict dichotomy of stationary points: all integral stationary points are local maximizers, whereas all non-integral stationary points are strict saddles with explicit positive curvature. We propose a saddle-escaping Frank--Wolfe algorithm and prove that it achieves exact convergence to an integral local maximizer in a finite number of steps.

Abstract:
Large Language Models (LLMs) are pretrained on massive datasets and later instructiontuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently: SFT relies on smaller, high-quality datasets, while RL benefits more from scale, with larger amounts of feedback often outweighing label quality. Yet it remains unclear why pretraining and RL require large datasets, why SFT excels on smaller ones, and what defines high-quality SFT data. In this work, we theoretically analyze transformers trained on an in-context weight prediction task for linear regression. Our analysis reveals several key findings: (i) balanced pretraining data can induce latent capabilities later activated during post-training, and (ii) SFT learns best from a small set of examples challenging for the pretrained model, while excessively large SFT datasets may dilute informative pretraining signals. In contrast, RL is most effective on large-scale data that is not overly difficult for the pretrained model. We validate these theoretical insights with experiments on large nonlinear transformer architectures.

Abstract:
The Platonic Representation Hypothesis posits that learned representations from models trained on different modalities converge to a shared latent structure of the world. However, this hypothesis has largely been examined in vision and language, and it remains unclear whether time series participate in such convergence. We first examine this in a trimodal setting and find that independently pretrained time series, vision, and language encoders exhibit nearorthogonal geometry in the absence of explicit coupling. We then apply post-hoc alignment by training projection heads over frozen encoders using contrastive learning, and analyze the resulting representations with respect to geometry, scaling behavior, and dependence on information density and input modality characteristics. Our investigation reveals that overall alignment in contrastive representation spaces improves with model size, but this alignment is asymmetric: time series align more strongly with visual representations than with text, and images can act as effective intermediaries between time series and language. We further see that richer textual descriptions improve alignment only up to a threshold; training on denser captions does not lead to further improvement. Analogous effects are observed for visual representations. Our findings shed light on considerations for building multimodal systems involving non-conventional data modalities beyond vision and language.

Abstract:
Reinforcement Learning from Human Feedback (RLFH) or Verifiable Rewards (RLVR) are two key steps in the posttraining of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the reward and learn an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. We propose a different framing: Train the LM in a way that biases policy updates towards regions in which the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence. Gradient regularization (GR) can then be used to bias training to flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated in RLHF. We then show that Reference Resets of the KL penalty implicitly use GR to find flatter regions with higher reward accuracy. We further improve on this by proposing to use explicit GR with an efficient finite-difference estimate. Empirically, GR performs better than a KL penalty across a diverse set of RL experiments with LMs. GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.

Abstract:
In largescale LLM pretraining systems with 100\mathrmk+ GPUs, failures become the norm rather than the exception, and restart costs can dominate wall-clock training time. However, existing fault-tolerance mechanisms are largely unprepared for this restart-dominant regime. To address this challenge, we propose SPARe—Stacked Parallelism with Adaptive Reordering—a fault-tolerance framework that masks node failures during gradient synchronization by stacking redundant data shards across parallelism groups and adaptively reordering execution. SPARe achieves availability comparable to traditional replication while maintaining near-constant computation overhead of only 2～3×, even under high redundancy where traidional replication would require linearly inflating overhead. We derive closed-form expressions for endurable failure count and computation overhead, validate them via SimGrid-based discrete-event simulation, and jointly optimize redundancy and checkpointing to minimize training time. At extreme scale with up to 600\mathrmk GPUs, SPARe reduces time-to-train by 40～50% compared to traditional replication.

Abstract:
The rapid evolution of Large Language Models (LLMs) has shifted their role from passive information providers to active agents capable of executing complex workflows. However, the realization of a true "AI worker" is currently hindered by benchmarks that fail to capture the intricacy of professional environments, which demand longhorizon planning, complex tool usage, and adherence to strict access protocols. To bridge this gap, we introduce EnterpriseOps-Gym, a benchmark environment designed to evaluate agentic planning in realistic enterprise settings. EnterpriseOps-Gym provides: (i) 1,150 expert-curated tasks across eight interconnected domains (including HR, IT, Customer Service and productivity tools) that require managing persistent state and adhering to strict outcome-based verification logic; and (ii) a high-fidelity, containerized sandbox environment hosting 164 database tables and 512 functional tools. Our evaluation reveals critical limitations in state-of-the-art models: even the top-performing Claude Sonnet~4.5 achieves only 34.1% success, struggling significantly with planning consistency, error recovery, and policy constraints. Furthermore, we observe that agents frequently fail to refuse infeasible tasks, leading to unintended and potentially harmful side effects on the system. These findings indicate that current agents are not yet ready for enterprise deployment. By releasing EnterpriseOps-Gym, we provide a concrete testbed to advance the reliability of autonomous agents in professional workflows.

Abstract:
Time series is a pervasive data type across various application domains, rendering the reasonable solving of diverse time series tasks a longstanding goal. Recent advances in large language models (LLMs), especially their reasoning abilities unlocked through reinforcement learning (RL), have opened new opportunities for tackling tasks with long Chain-of-Thought (CoT) reasoning. However, leveraging LLM reasoning for time series remains infancy, hindered by the absence of carefully curated time series CoT data for training, limited data efficiency caused by underexplored data scheduling, and the lack of RL algorithms tailored for exploiting such time series CoT data. In this paper, we introduce VeriTime, a framework that tailors LLMs for time series reasoning through data synthesis, data scheduling, and RL training. First, we propose a data synthesis pipeline that constructs a time series–text multimodal dataset with process-verifiable annotations. Second, we design a data scheduling mechanism that arranges training samples according to a principled hierarchy of difficulty and task taxonomy. Third, we develop a two-stage reinforcement finetuning featuring fine-grained, multi-objective rewards that leverages verifiable process-level CoT data. Extensive experiments show that VeriTime substantially boosts LLM performance across diverse time series reasoning tasks. Notably, it enables compact 3B–4B models to achieve reasoning capabilities on par with or exceeding those of larger proprietary LLMs.

Abstract:
We study reward poisoning attacks in reinforcement learning (RL), where an adversary manipulates rewards within constrained budgets to force the target RL agent to adopt a policy that aligns with the attacker's objectives. Prior works on reward poisoning mainly focused on sufficient conditions to design a successful attacker, while only a few studies discussed the infeasibility of targeted attacks. This paper provides the first precise necessity and sufficiency characterization of the attackability of a linear MDP under reward poisoning attacks. Our characterization draws a bright line between the vulnerable RL instances, and the intrinsically robust ones which cannot be attacked without large costs even running vanilla nonrobust RL algorithms. Our theory extends beyond linear MDPs---by approximating deep RL environments as linear MDPs, we show that our theoretical framework effectively distinguishes the attackability and efficiently attacks the vulnerable ones, demonstrating both the theoretical and practical significance of our characterization.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. To overcome this, we proposeGolden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiplechoice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesizeGooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting dataGooseReason-Cybersets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.

Abstract:
Large Language Models (LLMs) can propose naturallanguage rules, circumventing the reliance on a predefined predicate space in traditional rule learning. However, existing LLM-based methods often neglect the global interactions among rules, and the potential of using fine-grained rule importance scores to calibrate neuro-symbolic reasoning remains underexplored. To address this gap, we introduce RLIE, a framework that integrates LLMs with probabilistic modeling to learn weighted rule sets in four stages: (1) Rule generation: proposing and filtering candidate rules via LLMs; (2) Logistic regression: learning sparse, calibrated weights for global rule selection; (3) Iterative refinement: revising the rule set with error-driven hard examples; and (4) Evaluation: validating the learned system via comparative inference paradigms. Across multiple real-world datasets and LLM backbones, our learned weighted rules achieve superior stability and accuracy, whereas rule-injection prompting yields mixed results and often degrades performance. These results suggest LLMs excel at semantic rule discovery but are less reliable at controlled probabilistic aggregation. Our findings highlight both the promise and the limits of LLMs for inductive reasoning, motivating a principled integration with classic probabilistic rule combination for reliable neuro-symbolic reasoning.

Abstract:
Irregular multivariate time series (IMTS) forecasting is challenging due to nonuniform sampling and variable asynchronicity. These irregularities violate the equidistant assumptions of standard models, hindering local temporal modeling and rendering classical frequency-domain methods ineffective for capturing global periodic structures. To address this challenge, we propose TFMixer, a joint time–frequency modeling framework for IMTS forecasting. Specifically, TFMixer incorporates a Global Frequency Module that employs a learnable Non-Uniform Discrete Fourier Transform (NUDFT) to directly extract spectral representations from irregular timestamps. In parallel, the Local Time Module introduces a query-based patch attention mechanism to adaptively aggregate informative temporal segments and alleviate information density imbalance. Finally, TFMixer fuses the time-domain and frequency-domain representations to generate forecasts and further leverages inverse NUDFT for explicit seasonal extrapolation. Extensive experiments on real-world IMTS benchmarks demonstrate the effectiveness and robustness of TFMixer under irregular sampling and missing data.

Abstract:
Foundation modeldriven agents often struggle with long-horizon planning due to the transient nature of purely prompting-based reasoning. While existing skill induction methods mitigate this by distilling experience into state-blind parameterized scripts, they fail to capture the conditional logic required for robust execution in dynamic environments. In this paper, we propose Neuro-Symbolic Skill Induction (NSI), a framework that lifts interaction traces into modular, logic-grounded programs. By synthesizing explicit control flows and dynamic variable binding, NSI empowers agents to discover when and why to act. This paradigm enables the efficient generalization, allowing agents to induce skills from few-shot examples and flexibly adapt to unseen goals. Experiments on a series of agentic tasks demonstrate that NSI consistently outperforms state-of-the-art baselines, empowering agents to self-evolve into architects of cognitive skills.

Abstract:
Mixtureof-Experts (MoE) have shown strong potential in scaling language models efficiently by activating only a small subset of experts per input. However, their deployment remains limited due to the high memory overhead associated with storing all expert parameters, particularly as the number of experts increases. To address this challenge, prior works have explored expert dropping and merging strategies; however, they often suffer from notable performance drop especially at high compression ratios due to their reliance on coarse-grained tensor- or expert-level operations. In this paper, we introduce PuzzleMoE, the first MoE merging method to enable fine-grained element-wise merging while achieving both high accuracy and inference speed, via two key innovations: First, PuzzleMoE performs sparse expert merging by identifying element-wise weight redundancy and specialization. It introduces a dual-mask approach to capture both shared and expert-specific salient parameters. Second, to avoid the overhead of storing masks and signs, we introduce a bit-packed encoding scheme that reuses underutilized exponent bits, enabling efficient MoE inference on GPUs. Extensive experiments demonstrate that PuzzleMoE outperforms prior MoE compression methods by up to 16.7% on MMLU at 50% compression ratio, and achieves up to 1.80× end-to-end inference throughput gain.

Abstract:
We study an algorithmindependent, worst-case lower bound for the Gaussian process (GP) bandit problem in the frequentist setting, where the reward function is fixed and has a bounded norm in the known reproducing kernel Hilbert space (RKHS). Specifically, we focus on the squared exponential (SE) kernel, one of the most widely used kernel functions in GP bandits. One of the remaining open questions for this problem is the gap in the dimension-dependent logarithmic factors between upper and lower bounds. This paper partially resolves this open question under a hyperspherical input domain. We show that any algorithm suffers \Omega(\sqrtT (\ln T)^d (\ln \ln T)^-d) cumulative regret, where T and d represent the total number of steps and the dimension of the hyperspherical domain, respectively. Regarding the simple regret, we show that any algorithm requires \Omega(\epsilon^-2(\ln \frac1\epsilon)^d (\ln \ln \frac1\epsilon)^-d) time steps to find an \epsilon-optimal point. We also provide the improved O((\ln T)^d+1(\ln \ln T)^-d) upper bound on the maximum information gain for the SE kernel. Our results guarantee the optimality of the existing best algorithm up to dimension-independent logarithmic factors under a hyperspherical input domain.

Abstract:
Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the Disentangled Transformer, and prove that an Llayer model can compute connectivity in graphs with diameters up to 3^L, implementing an algorithm equivalent to computing powers of the adjacency matrix. By analyzing training dynamics, we prove that whether the model learns this strategy hinges on whether most training instances are within this model capacity. Within-capacity graphs (diameter \leq 3^L) drive the learning of the algorithmic solution while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically show that restricting training data to stay within a model's capacity makes both standard and Disentangled Transformers learn the exact algorithm.

Abstract:
Pretrained perception models excel in generic image domains but degrade significantly in novel environments like indoor scenes. The conventional remedy is fine-tuning on downstream data which incurs catastrophic forgetting of prior knowledge and demands costly, scene-specific annotations. We propose a paradigm shift through Sea^2 (See, Act, Adapt): rather than adapting the perception modules themselves, we adapt how they are deployed through an intelligent pose-control agent. Sea^2 keeps all perception modules frozen, requiring no downstream labels during training, and uses only scalar perceptual feedback to navigate the agent toward informative viewpoints. Specially, we transform a vision-language model (VLM) into a low-level pose controller through a two-stage training pipeline: first fine-tuning it on rule-based exploration trajectories that systematically probe indoor scenes, and then refining the policy via unsupervised reinforcement learning that constructs rewards from the perception module’s outputs and confidence. Unlike prior active perception methods that couple exploration with specific models or collect data for retraining them, Sea^2 directly leverages off-the-shelf perception models for various tasks without the need for retraining. We conducted experiments on three visual perception tasks, including visual grounding, segmentation and 3D box estimation, with performance improvements of 13.54%, 15.92% and 27.68% respectively on dataset ReplicaCAD.

Abstract:
The rapid adoption of large language models (LLMs) has created significant challenges for efficient inference at scale. Unlike traditional workloads, LLM inference is constrained by both computation and the memory overhead of key–value (KV) caching, which accelerates decoding but quickly exhausts GPU memory. In this paper, we introduce the first queueingtheoretic framework that explicitly incorporates both computation and GPU memory constraints into the analysis of LLM inference. Based on this framework, we derive rigorous stability and instability conditions that determine whether an LLM inference service can sustain incoming demand without unbounded queue growth. This result offers a powerful tool for system deployment, potentially addressing the core challenge of GPU provisioning. By combining an estimated request arrival rate with our derived stable service rate, operators can calculate the necessary cluster size to avoid both costly over-purchasing and performance-violating under-provisioning. We further validate our theoretical predictions through extensive experiments in real GPU production environments. Our results show that the predicted stability conditions are highly accurate, with deviations typically within 10%.

Abstract:
Bimanual dexterous manipulation relies on integrating multimodal inputs to perform complex realworld tasks. To address the challenges of effectively combining these modalities, we propose DECO, a decoupled multimodal diffusion transformer that disentangles vision, proprioception, and tactile signals through specialized conditioning pathways, enabling structured and controllable integration of multimodal inputs, with a lightweight adapter for parameter-efficient injection of additional signals. Alongside DECO, we release DECO-50 dataset for bimanual dexterous manipulation with tactile sensing, consisting of 50 hours of data and over 5M frames, collected via teleoperation on real dual-arm robots. We train DECO on DECO-50 and conduct extensive real-world evaluation with over 2,000 robot rollouts. Experimental results show that DECO achieves the best performance across all tasks, with a 72.25% average success rate and a 21% improvement over the baseline. Moreover, the tactile adapter brings an additional 10.25% average success rate across all tasks and a 20% gain on complex contact-rich tasks while tuning less than 10% of the model parameters.

Abstract:
Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Existing stateof-the-art UQ approaches for free-form generation rely heavily on sampling, which incurs high computational cost and variance. In this work, we propose the first gradient-based UQ method for free-form generation, SemGrad, which is sampling-free and computationally efficient. Unlike prior gradient-based methods developed for classification tasks that operates in parameter space, we propose to consider gradients in semantic space. Our method builds on the key intuition that a confident LLM should maintain stable output distributions under semantically equivalent input perturbations. We interpret the stability as the gradients in semantic space and introduce a Semantic Preservation Score (SPS) to identify embeddings that best capture semantics, with respect to which gradients are computed. We further propose HybridGrad, which combines the strengths of SemGrad and parameter gradients. Experiments demonstrate that both of our methods provide efficient and effective uncertainty estimates, achieving superior performance than state-of-the-art methods, particularly in settings with multiple valid responses.

Abstract:
The design of Analog and MixedSignal (AMS) integrated circuits remains heavily reliant on expert knowledge, with transistor sizing a major bottleneck due to nonlinear behavior, high-dimensional design spaces, and strict performance constraints. Existing Electronic Design Automation (EDA) methods typically frame sizing as static black-box optimization, resulting in inefficient and less robust solutions. Although Large Language Models (LLMs) exhibit strong reasoning abilities, they are not suited for precise numerical optimization in AMS sizing. To address this gap, we propose \textscAutoSizer, a reflective LLM-driven meta-optimization framework that unifies circuit understanding, adaptive search-space construction, and optimization orchestration in a closed loop. It employs a two-loop optimization framework, with an inner loop for circuit sizing and an outer loop that analyzes optimization dynamics and constraints to iteratively refine the search space from simulation feedback. We further introduce \textscAMS-SizingBench, an open benchmark comprising 24 diverse AMS circuits in SKY130 CMOS technology, designed to evaluate adaptive optimization policies under realistic simulator-based constraints. \textscAutoSizer experimentally achieves higher solution quality, faster convergence, and higher success rate across varying circuit difficulties, outperforming both traditional optimization methods and existing LLM-based agents.

Abstract:
Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RMbased preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state-of-the-art, we find that issues persist despite prior work with respect to length, sycophancy, and overconfidence. We also discover new issues related to bias toward model-specific “styles” and answer-order. We categorize RM failures by complexity and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed mechanistic reward shaping reduces targeted biases without degrading reward quality and while using minimal labeled data. The method is extensible to new biases, model-internal, and generalizes out-of-distribution.

Abstract:
We introduce a formal active learning methodology for guiding the placement of Lagrangian observers to infer timedependent vector fields -- a key task in oceanography, marine science, and ocean engineering -- using a physics-informed spatio-temporal Gaussian process surrogate model. The majority of existing placement campaigns either follow standard `space-filling' designs or relatively ad-hoc expert opinions. A key challenge to applying principled active learning in this setting is that Lagrangian observers are continuously advected through the vector field, so they make measurements at different locations and times. It is, therefore, important to consider the likely future trajectories of placed observers to account for the utility of candidate placement locations. To this end, we present BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories. We observe noticeable benefits of BALLAST-aided sequential observer placement strategies on both synthetic and high-fidelity ocean current models. In addition, we developed a novel GP inference method -- the Vanilla SPDE Exchange (VaSE) -- to boost the GP posterior sampling efficiency, which is also of independent interest.

Abstract:
Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in longform recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-cognition ALM processing only when adaptive energy fluctuations signal perceptual salience, triggering higher-level reasoning. On XD-Violence, NAACA improves AudioQwen’s average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.

Abstract:
Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyberphysical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real world environments, they often suffer from performance degradation or safety violations because of the inevitable Sim2Real gap. Existing zero-shot approaches, such as robust safe RL and domain randomization, mitigate this issue but typically at the cost of degraded performance or residual safety risks when experiencing unmodeled system dynamics. To address these limitations, we propose a novel reinforcement learning framework that enables safe and efficient policy transfer via probabilistic latent embeddings and dynamic policy adaptation. We consider a family of Constrained Markov Decision Processes (CMDPs) under different environment contexts. By leveraging probabilistic latent context adaptation in meta-RL, the proposed framework infers the latent representation of the environment from simulated experiences. Furthermore, it incorporates a distributional RL formulation, which allows risk levels of the deployed policy to be adjusted dynamically at inference time, based on the estimation accuracy of the latent context variable. This strategy promotes safety at the early deployment stage and improves efficiency through fast policy adaptation under the Sim2Real gap.

Abstract:
Existing diffusion models have made significant progress in generating realistic images. However, their direct adaptation to remote sensing imagery often disregards intrinsic physical laws. This oversight frequently leads to spectral distortion and radiometric inconsistency, severely limiting the scientific utility of generated data. To address this issue, this paper introduces AnyBandDiff, a novel spectral-prior-guided diffusion framework tailored for robust spectral reconstruction. Specifically, we design a Masked Conditional Diffusion backbone integrated with a dual stochastic masking strategy, empowering the model to recover complete spectral information from arbitrary band subsets. Subsequently, to ensure radiometric fidelity, a Physics-Guided Sampling mechanism is proposed, leveraging gradients from a differentiable physical model to explicitly steer the denoising trajectory toward the manifold of physically plausible solutions. Furthermore, a Multi-Scale Physical Loss is formulated to enforce rigorous constraints across pixel, region, and global levels in a joint manner. Extensive experiments confirm the effectiveness of AnyBand-Diff in generating reliable imagery and achieving accurate spectral reconstruction, contributing to the advancement of physics-aware generative methods for Earth observation.

Abstract:
Gated Linear Units (GLU) and their variants are widely adopted in modern opensource large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU has limited impact in reducing the stochastic error on various models, including ViT and GPT-2, suggesting that its primary benefit lies in accelerating optimization rather than reducing the stochastic error.

Abstract:
Reinforcement learning enhances the reasoning capabilities of large language models but often involves high computational costs due to rolloutintensive optimization. Online prompt selection presents a plausible solution by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly, exact evaluations or construct prompt-specific predictive models lacking generalization across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference towards prompt difficulty using a lightweight generative model trained on the shared optimization history. Intermediate-difficulty prioritization and history-anchored diversity are incorporated into the batch acquisition principle to select informative prompt batches. The small predictive model also generalizes at test-time for efficient computational allocation. Experiments across varied reasoning benchmarks indicate GPS's substantial improvements in training efficiency, final performance, and test-time efficiency over superior baseline methods. The code is available at https://anonymous.4open.science/r/GPS-ICML.

Abstract:
Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish whether the preference stems from quality or bias, and the resulting reward model inherits this limitation. Optimizing such rewards through reinforcement learning or bestof-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from simple keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking behaviors. We propose a detection method, while mitigation remains challenging. Existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability.

Abstract:
While LLMbased agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most of the existing conversation-level collaborative studies lack grounded interaction and behavioral execution, motivating the need for cooperative game environments that enable contextualized and immersive collaboration. To this end, this paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. CollabBench features a Diverse Player Profile Simulation pipeline to model varied players behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action via agentic rollouts, optimized with a hybrid reward balancing task efficiency and affective adaptation. We further extend classic environments to CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. Experiments with efficiency and affective metrics show that our trained models outperform base models, achieving 19.5% higher efficiency and 24.4% improved affective performance. Further analysis reveals key collaborative limitations of existing models and offers insights for future collaborative training.

Abstract:
Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing endto-end reasoning within a unified vision–language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint optimization and attention mechanisms, leading to systematically weakened faithfulness to visual evidence during reasoning. In this work, we argue that a central challenge is how and when visual evidence is introduced into the reasoning process. Motivated by this insight, we propose CSMR, a multimodal reasoning framework in which a language model controls the reasoning process by deciding when to invoke an independent visual perception module to acquire task-relevant visual evidence. Experiments across multiple multimodal reasoning benchmarks show that CSMR consistently outperforms representative baseline methods in accuracy under a zero-shot setting. Further experimental analysis confirms that these advantages primarily arise from the proposed cognitive scheduling mechanism.

Abstract:
Deep Neural Networks (DNNs) remain fundamentally vulnerable to backdoor attacks. Traditional datafree defenses largely operate under the paradigm of internal diagnosis methods like model repairing or input robustness, yet these approaches are often fragile under advanced attacks as they remain entangled with the victim model’s corrupted parameters. We propose a paradigm shift to data-free External Semantic Auditing, using universal Vision-Language Models (VLMs) as independent auditors to decouple defense from the compromised model. We introduce PRISM (Prototype Refinement & Inspection via Statistical Monitoring), which transforms generic VLMs into domain-adaptive gatekeepers purely via online test-time adaptation. PRISM bridges the domain gap through a Hybrid VLM Teacher that refines prototypes from the test stream and an Adaptive Router that calibrates thresholds via statistical monitoring. Evaluation across 17 datasets and 11 attack types confirms PRISM achieves state-of-the-art performance (suppressing Attack Success Rate to < 1% on CIFAR-10), proving that robust defense is achievable without touching the model weights or accessing a single training sample.

Abstract:
We study online inverse linear optimization, also known as contextual recommendation, where a learner sequentially infers an agent’s hidden objective vector from observed optimal actions over feasible sets that change over time. The learner aims to recommend actions that perform well under the agent’s true objective, and the performance is measured by the regret, defined as the cumulative gap between the agent’s optimal values and those achieved by the learner's recommended actions. Prior work has established a regret bound of O(d\log T), as well as a finite but exponentially large bound of \exp(O(d\log d)), where d is the dimension of the optimization problem and T is the time horizon, while a regret lower bound of \Omega(d) is known (Gollapudi et al. 2021; Sakaue et al. 2025). Whether a finite regret bound polynomial in d is achievable or not has remained an open question. We partially resolve this by showing that when the feasible sets are Mconvex—a broad class that includes matroids—a finite regret bound of O(d\log d) is possible. We achieve this by combining a structural characterization of optimal solutions on M-convex sets with a geometric volume argument. Moreover, we extend our approach to adversarially corrupted feedback in up to C rounds. We obtain a regret bound of O((C+1)d\log d) without prior knowledge of C, by monitoring directed graphs induced by the observed feedback to detect corruptions adaptively.

Abstract:
Gene Regulatory Network (GRN) inference is essential for understanding complex cellular mechanisms, rendered tractable through singlecell transcriptomic data. With the emergence of single-cell Foundation Models (scFMs), enhanced transcriptomic encoding is widely expected to revolutionize GRN inference. However, we observe that their performance remains far from satisfactory. The primary reason is that the standard reconstruction-based pre-training objectives often fail to explicitly capture latent regulatory signals. To bridge this gap, we first introduce a GRN generalization benchmark designed to evaluate regulatory predictions on unseen genes and datasets, which relies on the zero-shot capabilities of scFMs and is inherently challenging for traditional methods. Furthermore, to unlock the regulatory knowledge within the foundation models, we propose two novel methods, Virtual Value Perturbation and Gradient Trajectory, to distill implicit regulatory information from scFMs into highly generalizable inter-gene features. Extensive experiments demonstrate that our approach significantly outperforms existing methods, establishing a new paradigm for leveraging the potential of scFMs in universal GRN inference.

Abstract:
Generative models have a persistent limitation: their tendency to memorize training data can create legal liabilities and erode creative diversity. Understanding which samples are memorized in whole or in part, and under what conditions, therefore remains an important open problem. Here we answer the question "Are atypical or rare samples memorized first?" in the negative. We train diffusion models on strings generated according to the production rules of the Random Hierarchy Model (RHM), and find that samples composed ofcommon substringsare preferentially memorized. This holds true even if the training data consists of entirely unique samples, indicating that deduplication at the data point level does not provide a meaningful privacy guarantee. Correspondingly we predict, then observe, delayed memorization for fattailed datasets (i.e., those with more atypical samples). This effect is amplified when fat-tails are introduced into high-level production rules. These together suggest thatdataset diversity, particularly at higher levels of abstraction, plays an important role in staving off memorization. Finally, we identify an intermediate regime of partial memorization in which common substrings are learned first and subsequently overproduced during generation. If training is stopped in this regime, models will exhibit the reversion-to-the-mean blandness often derided as "slop".

Abstract:
Gaussian processes (GPs) provide a principled Bayesian framework for uncertainty estimation, but their computational complexity severely limits scalability to large datasets. We propose SIKAGP, which accelerates GP inference using sparse inducing kernel approximations based on a dyadic ordered template basis, incurring only O(\log M) complexity dependence on the number of inducing points. Our approach constructs compact and expressive kernel representations from sparsely activated bases, enabling efficient tensorized GPU computation and seamless integration with modern large-scale models. SIKA-GP can be naturally embedded into Bayesian neural networks (BNNs) with sparse activations, yielding significant speedups in both training and inference without sacrificing predictive performance. The method naturally extends to deep feature learning, addressing the scalability challenges introduced by deep architectures and high-dimensional feature representations. Empirical results on vision and transformer-based language benchmarks demonstrate that our approach consistently delivers fast and accurate GP models, providing a principled path toward scalable kernel learning.

Abstract:
Realworld federated systems seldom operate on static data: input distributions drift while privacy rules forbid raw data sharing. We study Federated Domain-Incremental Learning (FDIL), where (i) clients are heterogeneous, (ii) tasks arrive sequentially with shifting domains, and (iii) the label space remains fixed. Two theoretical pillars remain missing for FDIL under partial participation: a guarantee of backward knowledge transfer (BKT) and a convergence rate that holds uniformly across the task sequence. We introduce SPECIAL (Server-Proximal Efficient Continual Aggregation for Learning), a simple, memory-free FDIL algorithm that adds a single server-side ``anchor'' to FedAvg: in each round, the server aggregates updates from a uniformly sampled subset of clients and then blends the result with the previous global model via a lightweight proximal step. This anchor curbs cumulative drift without replay buffers, synthetic data, or task-specific heads, leaving communication cost and model size unchanged. Our theory shows that SPECIAL (i) preserves earlier tasks: a BKT bound caps any increase in earlier-task loss by a drift-controlled term that shrinks with more rounds, local epochs, and participating clients; and (ii) achieves task-uniform, communication-efficient convergence for non-convex FDIL with partial participation: \mathcalO\!\big(\sqrtE/(NT)\big) in expected gradient norm, with E local epochs, T rounds, and N participating clients, while explicitly separating optimization variance from inter-task drift. Experiments on standard FDIL benchmarks corroborate the theory.

Abstract:
Continuoustime dynamic graphs (CTDGs) provide a richer framework to capture fine-grained temporal patterns in evolving relational data. Long-range information propagation is a key challenge while learning representations, wherein it is important to retain and update information over long temporal horizons. Existing approaches restrict models to capture one-hop or local temporal neighborhoods and fail to capture multi-hop or global structural patterns. To mitigate this, we derive a parameter-efficient state-space modeling framework for continuous-time dynamic graphs \texttt(CTDG-SSM) from first principles. We first introduce continuous-time Topology-Aware higher order polynomial projection operator (\textttCTT-HiPPO), a novel memory-based reformulation of \textttHiPPO to jointly encode temporal dynamics and graph structure. The solution from \textttCTT-HiPPO are obtained by projecting the classical HiPPO solution through a polynomial of the Laplacian matrix, yielding topology-aware memory updates that admit an equivalent state-space formulation for CTDGs (\textttCTDG-SSM). Then a computationally efficient discrete formulation is obtained using the zero-order hold approach for model implementation. Across benchmarks on dynamic link prediction, dynamic node classification, and sequence classification, \textttCTDG-SSM achieves state-of-the-art performance. Notably, it achieves large performance gains on datasets that require long range temporal (LRT) and spatial reasoning.

Abstract:
While posttraining has successfully improved large language models across a variety of domains from open-ended text generation to mathematics, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited and new high-quality data is expensive to collect. More fundamentally, true intelligence goes far beyond tasks that are easily verified. Therefore, there is a need for self-improvement frameworks that allow models to improve without external oversight. We proposeMutual Information-based Preference Optimization (MIPO), a contrastive data augmentation method that constructs preference pairs by generating a positive response conditioned on the correct prompt and a negative response conditioned on a random or incomplete prompt; then train with Direct Policy Optimization. We show that this connects to maximizing pointwise mutual information between prompts and model responses under the base policy. Empirical results with the Llama- (1, 3B) and Qwen- (1.5, 3, 7B) Instruct models show that MIPO achieves 4-38% improvements on personalization tasks from real-user datasets (PRISM, Community Alignment). Surprisingly, MIPO can be more generally applied to a suite of benchmark tasks (e.g., math and multiple-choice answering), yielding 3% and 18% improvements for smaller 1B models,without any additional data or labels.

Abstract:
Distributed stochastic optimization intertwines (i) stochastic gradient noise, (ii) communication compression, and (iii) adaptive/normalized updates. While each factor has been studied in isolation, their joint effect under realistic assumptions remains poorly understood. In this work, we develop a unified theoretical framework for Distributed Compressed SGD (DCSGD) and its sign variant Distributed SignSGD (DSignSGD) under the recently introduced (L_0, L_1)smoothness condition. From a conceptual perspective, we show that the first- and second-order modified equations from the literature do not accurately model the discrete-time step-size/stability restrictions, especially under (L_0,L_1)-smoothness. From a technical perspective, we propose new first-order SDEs by carefully incorporating curvature-dependent terms into their drift: This helps capture the fine-grained relationship between learning rate restrictions, gradient noise, compression, and the geometry of the loss landscape. Importantly, we do so under general gradient noise assumptions, including heavy-tailed and affine-variance regimes, which extend beyond the classical bounded-variance setting. Our results suggest that normalizing the updates of DCSGD emerges as a natural condition for stability, with the degree of normalization precisely determined by the gradient noise structure, the landscape’s regularity, and the compression rate. In contrast, DSignSGD converges even under heavy-tailed noise with standard learning rate schedules. Together, these findings offer both new theoretical insights and perspectives, and practical guidance.

Abstract:
Generative posterior sampling using diffusion models has emerged as a dominant paradigm for solving inverse problems in imaging, which usually consists of three main components: data consistency (DC) guidance, classifierfree guidance (CFG) and stochasticity. While prior arts have focused on how to develop each or all components, less attention has given to how to schedule them, leading to heuristically fixed or partially adjusted suboptimal schedules. In this work, we argue that the interactions among all three components in terms of scheduling are crucial for significantly improved performance in solving inverse problems in imaging. Our analysis shows that aggressive CFG early in sampling conflict with DC guidance, while stochasticity brings the trajectory back to higher-probability regions. Based on these findings, we propose Triadic Dynamics Aware Posterior Sampling (TriPS), which reformulates posterior sampling as a time-varying control problem and optimizes schedules following a triadic trend of decreasing DC and stochasticity scales alongside increasing CFG scale. TriPS achieves this through two strategies: template-based search over functional priors for reliable baseline schedules, and Group Relative Policy Optimization (GRPO)-based reinforcement learning for more flexible temporal curves. Experiments demonstrate TriPS outperforms state-of-the-art baselines in data fidelity and perceptual realism.

Abstract:
Incontext learning (ICL) excels at new tasks from minimal examples, yet we still lack a mechanistic explanation of how few-shot prompts shape a model’s function vector (FV)--a causal activation direction that drives task behavior on the ICL query. Across tasks and models, an n-shot FV is well-approximated by a linear combination of example-level sub-FVs, suggesting additive and composable contributions from individual demonstrations. Beyond additivity, we show that models contextualize individual examples' representations based on prior examples to adaptively reweight which demonstrations dominate the FV: attention shifts toward examples that are more informative and less ambiguous under the context. Finally, a causal decomposition separates Query–Key routing from Value updates, finding that contextualization’s most consistent contributions to FV quality arise from Query–Key alignment--particularly in ambiguous settings--while Value-mediated effects are more heterogeneous. Together, these results unify additive superposition with context-dependent attention reweighting into a mechanistic, testable account of how few-shot prompts implement tasks.

Abstract:
We present the first theoretical convergence analysis of machine learning training under fully homomorphic encryption (FHE), combined with a differentially private (DP) training algorithm tailored to encrypted computation. Our approach improves computational efficiency over standard differentially private gradient descent (DPGD) while achieving comparable utility. In particular, we prove convergence of approximate gradient descent using polynomial approximations of activation and loss functions, which are required for FHE compatibility. To preserve privacy in downstream tasks, we integrate differential privacy without relying on costly per-sample gradient clipping, enabling scalable encrypted learning. We also provide data-independent hyperparameter selection and theoretically grounded strategies for polynomial approximation which can be of independent interest. Together, these contributions advance the feasibility of efficient, private, and secure machine learning on sensitive data.

Abstract:
Contrastive learning has emerged as a powerful paradigm for selfsupervised representation learning, yet the precise conditions under which it recovers meaningful latent structure remain incompletely understood. We develop a measure-theoretic framework that formalizes the diversity condition, a requirement on the sampling mechanism that is necessary for recovering the latent space up to orthogonal transformation. We prove that when this condition is violated, as commonly occurs in practical settings where augmentations preserve semantic content, the optimal encoder no longer preserves geometric structure and linear identifiability is lost. Crucially, we demonstrate that the contrastive loss alone is insufficient for latent space reconstruction: encoder inductive bias emerges as a critical component that compensates for violations of the diversity condition. Our experiments on synthetic datasets and CIFAR-10 confirm these theoretical predictions, showing that architectural constraints become essential precisely when sampling diversity is limited. These findings have direct implications for the design of data augmentation strategies and encoder architectures in self-supervised contrastive learning systems.

Abstract:
Recent theoretical analyses of multipathway Deep Linear Networks, typically grounded in Gradient Flow, predict a "winner-takes-all" specialization in which path symmetry breaks and each feature concentrates in a single pathway. In this work, we show that discrete Gradient Descent with a large step size reproduces the initial, depth-driven symmetry breaking but ultimately overrides this tendency due to its catapults at the Edge of Stability. In this regime, GD exhibits an implicit preference for low-curvature minima. Since we prove that splitting singular values across pathways minimizes sharpness, large-step GD—driven by its implicit preference for such flat minima—forces a subsequent re-balancing phase: iterates escape sharp, sparse configurations for stable, balanced solutions. Together, these results clarify how architectural depth shapes pathway competition while explaining why GD with large step size ultimately favors shared representations rather than permanent pathway monopolization.

Abstract:
We study the robustness of stochastic mirror descent (SMD) under heavytailed noise, focusing on whether the method retains its convergence guarantees when run with infinite-variance stochastic gradient input. To address this question in a principled manner, we begin by introducing a continuous-time model of SMD as a stochastic differential equation (SDE) driven by a centered Lévy noise process with finite p-th order moments, 1 < p \leq 2. This scheme---which we call the Lévy mirror flow (LMF)---arises naturally as the scaling limit of SMD in the presence of heavy-tailed noise. In particular, when p < 2---the heavy noise regime---the trajectories of LMF generically exhibit jump discontinuities of arbitrary magnitude which, if frequent enough, lead to infinite variance. Nonetheless, despite this highly singular behavior, we show that LMF attains \epsilon-optimality within \mathcalO(\epsilon^-p/(p-1)) time in the convex case, and within \tilde\mathcalO(\epsilon^-1/(p-1)) time for (relatively) strongly convex objectives. These guarantees provide a transparent characterization of the impact of frequent long jumps on the convergence of the process, and percolate to a series of matching discrete-time guarantees for several variants of SMD under heavy-tailed noise.

Abstract:
Speculative reasoning has recently been proposed as a means to accelerate reasoningintensive generation in large multimodal models, but its effectiveness is often constrained by misalignment between speculative drafts and target-verified reasoning. In this work, we introduce DREAM-R, a framework that substantially improves the performance of speculative reasoning. At its core, DREAM-R employs Speculative Alignment Policy Optimization (SAPO), a reinforcement-learning objective that trains draft models to generate reasoning steps that are both faithful to target trajectories and concise. We further propose a Threshold-based Verification Mechanism (TBVM) that uses a ratio-based criterion to provide stable and interpretable acceptance of speculative steps only when positive evidence clearly dominates, thereby preventing error propagation. Building on these components, we develop a Fully Parallel Speculative Reasoning (FPSR) framework that parallelizes draft generation, target-side reasoning, and verification across multi-step reasoning, enabling early stopping and clean fallback. Experiments on reasoning-heavy benchmarks demonstrate up to 2.49× speedup while preserving target-model accuracy, yielding substantial efficiency gains without compromising reasoning quality.

Abstract:
Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on largescale classification datasets, followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens to hundreds of times more than training on VQA datasets. Recently, Contrastive Vision-Language Models (CVLMs) have shown strong generalization across visual tasks and promising potential for quality assessment. In this work, we propose Q-CLIP, the first fully CVLMs-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts to guide the CVLMs in perceiving subtle quality variations. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance. Extensive experiments demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets. Code is provided in the supplementary material.

Abstract:
The dynamics of gradientbased training in neural networks often exhibit nontrivial structures; hence, understanding them remains a central challenge in theoretical machine learning. In particular, a concept offeatureunlearning, in which a neural network progressively loses previously learned features over long training, has gained attention as a way to explain overfitting. In this study, we consider the infinite-width limit of a two-layer neural network updated with a large-batch stochastic gradient, then derive differential equations with different time scales, revealing the mechanism and conditions for feature unlearning to occur. Specifically, we utilize thefast-slow dynamics: while an alignment of first-layer weights develops rapidly, the second-layer weights develop slowly. The direction of a flow on a critical manifold, determined by the slow dynamics, decides whether feature unlearning occurs. We give numerical validation of the result, and derive theoretical grounding and scaling laws of the feature unlearning. Our results yield the following insights: (i) the strength of the primary nonlinear term in data induces the feature unlearning, and (ii) an initial scale of the second-layer weights mitigates the feature unlearning. Technically, our analysis utilizes Tensor Programs and the singular perturbation theory.

Abstract:
Diffusion models have significantly advanced the field of talking head generation (THG). However, slow inference speeds and prevalent nonautoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, a pioneering diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through a spatiotemporal variational autoencoder with a high compression ratio. Additionally, to enable semi-autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles into key-value caching for maintaining identity consistency and temporal coherence during long-term streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) strategy is proposed to mitigate error accumulation and enhance temporal consistency in streaming generation, leveraging a non-streaming teacher with an asynchronous noise schedule to supervise the streaming student. REST bridges the gap between autoregressive and diffusion-based approaches, achieving a breakthrough in efficiency for applications requiring real-time THG. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.

Abstract:
Visionlanguage-action (VLA) models have advanced the field of embodied manipulation by harnessing broad world knowledge and strong generalization. However, current VLA models still face several key challenges, including limited reasoning capability, lack of status monitoring, and difficulty in self-correction. In this paper, we introduce Sentinel-VLA, a metacognitive VLA model equipped with an active ``sentinel'' module to monitor real-time execution status. Only when necessary, such as during initial planning or upon detecting an error, the model triggers a dynamic reasoning or formulate error recovery solutions. This on-demand reasoning mechanism ensures robust decision-making while minimizing computational overhead. Notably, all training data (spanning 44 tasks and over 2.6 million transitions) is automatically generated and annotated through our designed pipeline. We also propose the Self-Evolving Continual Learning (SECL) algorithm, which allows Sentinel-VLA to identify its capability boundaries and automatically collect data for expansion, paired with Orthogonal Continual Adapter (OC-Adapter) to constrain parameter updates to an orthogonal space, thereby preventing catastrophic forgetting. Real-world experiments demonstrate that Sentinel-VLA boosts the task success rate by over 30% compared to the SOTA model, PI0. We will open-source all the code, weights, and data generation pipeline.

Abstract:
We study the population loss landscape of twolayer ReLU networks of the form \sum_k=1^K \mathrmReLU(w_k^\top x) in a realisable teacher–student setting with Gaussian covariates. We show that local minima admit an exact low-dimensional representation in terms of \emphsummary statistics, yielding a sharp and interpretable characterisation of the landscape. We further establish a direct link with one-pass SGD: local minima correspond to attractive fixed points of the dynamics in summary statistics space. This perspective reveals a hierarchical structure of minima: they are typically isolated in the well-specified regime, but become connected by flat directions as network width increases. In this overparameterised regime, global minima become increasingly accessible, attracting the dynamics and reducing convergence to spurious solutions. Overall, our results reveal intrinsic limitations of common simplifying assumptions, which may miss essential features of the loss landscape even in minimal neural network models.

Abstract:
Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and finegrained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we proposeSpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposedMultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench—surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.

Authors: Anka Reuel, Avijit Ghosh, Jenny Chim, Andrew Tran, Yanan Long, Jennifer Mickel, Usman Gohar, Srishti Yadav, Pawan Sasanka Ammanamanchi, Mowafak Allaham, Hossein A. Rahmani, Mubashara Akhtar, Felix Friedrich, Robert Scholz, Michael Riegler, Jan Batzner, Eliya Habba, Arushi Saxena, Anastassia Kornilova, Kevin Wei, Prajna Soni, Yohan Mathew, Kevin Klyman, Jeba Sania, Subramanyam Sahoo, Olivia Bruvik, Pouya Sadeghi, Sujata Goswami, Angelina Wang, Yacine Jernite, Zeerak Talat, Stella Biderman, Mykel Kochenderfer, Sanmi Koyejo, Irene Solaiman

Abstract:
Foundation models are increasingly central to highstakes AI systems, and governance frameworks now depend on evaluations to assess their risks and capabilities. Although general capability evaluations are widespread, social impact assessments covering bias, fairness, privacy, environmental costs, and labor remain uneven. To characterize this landscape, we conduct the first comprehensive analysis of social impact evaluation reporting, examining 186 first-party release reports and 248 third-party evaluation sources, supplemented by developer interviews. We find a stark division of labor: first-party reporting is sparse, often superficial, and declining in areas like environmental impact and bias, while third-party evaluators provide broader, more rigorous coverage of bias, harmful content, and performance disparities. However, only developers can authoritatively report on data provenance, content moderation labor, costs, and infrastructure, yet interviews reveal these disclosures are deprioritized unless tied to product adoption or compliance. Current practices leave major gaps in assessing societal impacts, underscoring the need for policies that mandate developer transparency, strengthen independent evaluation ecosystems, and create shared infrastructure for aggregating third-party evaluations.

Abstract:
Diffusion models have excelled at generative tasks for both continuous and tokenbased domains, but their application to discrete ordinal data remains underdeveloped. We present \emphCountsDiff, a diffusion framework designed to natively model distributions on the natural numbers. CountsDiff extends the Blackout diffusion framework by simplifying its formulation through a direct parameterization in terms of a survival probability schedule and an explicit loss weighting. This introduces flexibility through design parameters with direct analogues in existing diffusion modeling frameworks. Beyond this reparameterization, CountsDiff introduces features from modern diffusion models, previously absent in counts-based domains, including continuous-time training, classifier-free guidance, and churn/remasking reverse dynamics that allow non-monotone reverse trajectories. We propose an initial instantiation of CountsDiff and validate it on natural image datasets (CIFAR-10, CelebA), demonstrating the benefits of the proposed design space and that the framework scales to complex, high-dimensional data domains. We then highlight biological count assays as a natural use case, evaluating CountsDiff on single-cell RNA-seq imputation in a fetal cell and heart cell atlas. Remarkably, we find that even this simple instantiation matches or surpasses the performance of a state-of-the-art discrete generative model and leading RNA-seq imputation methods, while leaving substantial headroom for further gains through optimized design choices in future work.

Abstract:
Reinforcement learning has emerged as a principled posttraining paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.

Abstract:
Existing autobidding algorithms in digital advertising often treat the value of an ad opportunity as the revenue obtained when an ad is shown and/or clicked, and bid accordingly. This can lead to wasteful spending because the true value is the marginal gain from paid exposure: even without winning a sponsored slot, an advertiser may still earn revenue via an organic search result (e.g., on Google or Amazon). Motivated by recent work, we model ad value as a treatment effect—the outcome difference between winning and losing the auction—and study online learning for bidding in second-price (Vickrey) auctions under this causal perspective. We develop algorithms that attain rate-optimal regret under several feedback models. A key ingredient exploits the information revealed by the second-price payment rule, which strictly improves regret relative to analogous learning problems in first-price auctions.

Abstract:
Masked Diffusion Language Models have recently emerged as a powerful generative paradigm, yet their generalization properties remain understudied compared to their autoregressive counterparts. In this work, we investigate these properties within the setting of the k-parity problem (computing the XOR sum of k relevant bits), where neural networks typically exhibit grokking—a prolonged plateau of chance-level performance followed by sudden generalization. We theoretically decompose the Masked Diffusion (MD) objective into a Signal regime which drives feature learning, and a Noise regime which serves as an implicit regularizer. By training nanoGPT using MD objective on the k-parity problem, we demonstrate that MD objective fundamentally alters the learning landscape, enabling rapid and simultaneous generalization without experiencing grokking. Furthermore, we leverage our theoretical insights to optimize the distribution of the mask probability in the MD objective. Our method significantly improves perplexity for 50M-parameter models and achieves superior results across both pre-training from scratch and supervised fine-tuning. Specifically, we observe performance gains peaking at 8.8% and 5.8%, respectively, on 8B-parameter models, confirming the scalability and effectiveness of our framework in large-scale masked diffusion language model regimes.

Abstract:
Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute longhorizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.

Abstract:
OpenEnded Deep Research (OEDR) pushes LLM agents beyond short-form QA toward long-horizon workflows that iteratively search, connect, and synthesize evidence into structured reports. However, existing OEDR agents largely follow either linear "search-then-generate" accumulation or outline-centric planning. The former suffers from lost-in-the-middle failures as evidence grows, while the latter relies on the LLM to implicitly infer knowledge gaps from the outline alone, providing weak supervision for identifying missing relations and triggering targeted exploration. We present DualGraph memory, an architecture that separates what the agent knows from how it writes. DualGraph maintains two co-evolving graphs: an Outline Graph (OG), and a Knowledge Graph (KG), a semantic memory that stores fine-grained knowledge units, including core entities, concepts, and their relations. By analyzing the KG topology together with structural signals from the OG, DualGraph generates targeted search queries, enabling more efficient and comprehensive iterative knowledge-driven exploration and refinement. Across DeepResearch Bench, DeepResearchGym, and DeepConsult, DualGraph consistently outperforms state-of-the-art baselines in report depth, breadth, and factual grounding; for example, it reaches a 53.08 RACE score on DeepResearch Bench with GPT-5. Moreover, ablation studies confirm the central role of the dual-graph design. DualGraph code is available at https://anonymous.4open.science/r/DualGraph-2536.

Abstract:
Recent progress in geospatial foundation models highlights the importance of learning generalpurpose representations for real-world locations, particularly points-of-interest (POIs) where human activity concentrates. Existing approaches, however, focus primarily on place identity derived from static textual metadata, or learn representations tied to trajectory context, which capture movement regularities rather than how places are actually used (i.e., POI's function). We argue that POI function is a missing but essential signal for general POI representations. We introduce Mobility-Embedded POIs (ME-POIs), a framework that augments POI embeddings derived, from language models with large-scale human mobility data to learn POI-centric, context-independent representations grounded in real-world usage. ME-POIs encodes individual visits as temporally contextualized embeddings and aligns them with learnable POI representations via contrastive learning to capture usage patterns across users and time. To address long-tail sparsity, we propose a novel mechanism that propagates temporal visit patterns from nearby, frequently visited POIs across multiple spatial scales. We evaluate ME-POIs on five newly proposed map enrichment tasks, testing its ability to capture both the identity and function of POIs. Across all tasks, augmenting text-based embeddings with ME-POIs consistently outperforms both text-only and mobility-only baselines. Notably, ME-POIs trained on mobility data alone can surpass text-only models on certain tasks, highlighting that POI function is a critical component of accurate and generalizable POI representations.

Abstract:
Diffusion models generate highquality samples but can also memorize training data, raising serious privacy concerns. Understanding the mechanisms governing when memorization versus generalization occurs remains an active area of research. In particular, it is unclear where along the noise schedule memorization is induced, how data geometry influences it, and how phenomena at different noise scales interact. We introduce a geometric framework that partitions the noise schedule into three regimes based on the coverage properties of training data by Gaussian shells and the concentration behavior of the posterior, which we argue are two fundamental objects governing memorization and generalization in diffusion models. This perspective reveals that memorization risk is highly non-uniform across noise levels. We further identify a danger zone at medium noise levels where memorization is most pronounced. In contrast, both the small and large noise regimes resist memorization, but through fundamentally different mechanisms: small noise avoids memorization due to limited training coverage, while large noise exhibits low posterior concentration and admits a provably near linear Gaussian denoising behavior. For the medium noise regime, we identify geometric conditions through which we propose a geometry-informed targeted intervention that mitigates memorization.

Abstract:
Scaling testtime compute with multi-path chain-of-thought can improve reasoning accuracy, but its gains hinge on an effective exploration–exploitation trade-off. Existing methods handle this trade-off in rigid ways: tree-structured search hard-codes exploration via brittle expansion rules that disrupt post-trained reasoning, while parallel reasoning over-explores redundant hypothesis paths and relies on a weak answer selection strategy. Driven by the insight that the optimal balance isphase-dependentand that correct vs. incorrect paths oftendiverge only at late stages, we reconceptualize test-time scaling as a dynamicexpand–reducecontrol problem over a pool of hypothesis paths. We introduceHyPER, atraining-free online control policyfor MoE multi-path decoding that reallocates compute under a fixed budget using lightweight path statistics. HyPER features (i) anonline controllerthat shifts from exploration to exploitation as the hypothesis pool evolves, (ii) an MoE-based token-level refinement primitive for efficientgeneration-time exploitationwithout full-path resampling, and (iii) a length- and confidence-aware aggregation rule to bridge the existence–selection gap for reliableanswer-time exploitation. Extensive experimental results across four MoE models and diverse benchmarks demonstrate HyPER consistently achieves the accuracy–compute Pareto frontier, outperforming prior-art methods by 8-10% while reducing token consumption by 25-40%.

Abstract:
The bestarm identification (BAI) problem is one of the most fundamental problems in interactive machine learning, which has two flavors: the fixed-budget setting (FB) and the fixed-confidence setting (FC). For K-armed bandits with the unique best arm, the optimal sample complexities for both settings have been settled down, and they match up to logarithmic factors. This prompts an interesting research question about the generic, potentially structured BAI problems: Is FB harder than FC or the other way around? In this paper, we show that FB is no harder than FC up to logarithmic factors. We do this constructively: we propose a novel algorithm called FC2FB (fixed confidence to fixed budget), which is a meta algorithm that takes in an FC algorithm \mathcalA and turns it into an FB algorithm. We prove that this FC2FB enjoys a sample complexity that matches, up to logarithmic factors, that of the sample complexity of \mathcalA. This means that the optimal FC sample complexity is an upper bound of the optimal FB sample complexity up to logarithmic factors. Our result not only reveals a fundamental relationship between FB and FC, but also has a significant implication: FC2FB combined with existing state-of-the-art FC algorithms, leads to improved sample complexity for a number of FB problems.

Abstract:
Recent work has found that neural networks with stronger generalization tend to exhibit higher representational alignment with one another across architectures and training paradigms. In this work, we show that models with stronger generalization also align more strongly with human neural activity. Moreover, generalization performance, modelmodel alignment, and model--brain alignment are all significantly correlated with each other. We further show that these relationships can be explained by a single geometric property of learned representations: the local intrinsic dimension of embeddings. Lower local dimension is consistently associated with stronger model--model alignment, stronger model--brain alignment, and better generalization, whereas global dimension measures fail to capture these effects. Finally, we find that increasing model capacity and training data scale systematically reduces local intrinsic dimension, providing a geometric account of the benefits of scaling. Together, our results identify local intrinsic dimension as a unifying descriptor of representational convergence in artificial and biological systems.

Abstract:
Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energybased models (EBMs) represent another class of models, which have historically been less prevalent in LLM development, yet naturally characterize the optimal policy in post-training alignment. In this paper, we present a unified view of these two model classes. Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building upon this bijection, we derive the equivalence between supervised learning of ARMs and EBMs. Furthermore, we analyze the distillation of EBMs into ARMs by providing theoretical error bounds. Our results provide insights into the ability of ARMs to plan ahead, despite being based on the next-token prediction paradigm.

Abstract:
PreLayer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth. As layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation, enhancing efficiency. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 15.8% faster training and an average of 4.2% higher token generation throughput compared to RMSNorm., while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: \urlhttps://anonymous.4open.science/r/BHyT

Abstract:
Multimodal reasoning for ultrahigh-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition: the model necessities localizing tiny task-relevant regions in massive pixel spaces. While Agentic Reinforcement Learning with Verifiable Rewards (RLVR) using zoom-in tools offers a path forward, we find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. In this paper, we investigate the interplay between post-training paradigms: comparing Cold-start Supervised Fine-Tuning (SFT), RLVR, and Agentic RLVR on the UHR RS benchmark. Our controlled studies yield a counter-intuitive finding: high-quality Earth-science text-only QA is a primary driver of UHR visual reasoning gains. Despite lacking images, domain-specific text injects the concepts, mechanistic explanations, and decision rules necessary to guide visual evidence retrieval. Based on this, we propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures; and (2) "pre-warming'' on the same hard UHR image–text examples during SFT to stabilize and amplify subsequent tool-based RL. This approach achieves a 60.04% Pass@1 on XLRS-Bench, significantly outperforming larger general-purpose models (e.g., GPT-5.2, Gemini 3.0 Pro, Intern-S1) and establishing a new state-of-the-art. We provide both the automated data pipeline and the rigorous ablation studies that validate this "Text-Before-Vision'' paradigm. Datasets and code will be released.

Abstract:
In this paper, we generate conceptual engineering designs of electric vertical takeoff and landing (eVTOL) aircraft. We follow the paradigm of simulation-based inference (SBI), whereby we look to learn a posterior distribution over the full eVTOL design space. To learn this distribution, we must sample over discrete aircraft configurations (topologies) and their corresponding set of continuous parameters. Therefore, we introduce a hierarchical probabilistic model consisting of two diffusion models. The first model leverages recent work on Riemannian Diffusion Language Modeling (RDLM) and Unified World Models (UWMs) to enable us to sample topologies from a discrete and continuous space. For the second model we introduce a masked diffusion approach to sample the corresponding parameters conditioned on the topology. We show our approach successfully rediscovers known trends and captures governing physical laws in aircraft design.

Abstract:
Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexitybased metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.

Abstract:
In realworld deployments of large language models (LLMs), balancing inference quality and computational cost has become a central challenge. Existing approaches tackle this trade-off along two largely independent dimensions: model routing, which switches among models of different scales to match request complexity, and test-time scaling (TTS), which adjusts inference-time compute within a fixed model for fine-grained control. However, this decoupled design introduces inherent limitations. Model routing yields coarse-grained, discrete performance changes due to the sparse set of model scales, while single-model TTS often encounters capacity ceilings and exhibits diminishing returns as compute increases. Moreover, treating the two mechanisms separately restricts adaptability in dynamic inference environments. To overcome these limitations, we introduceUnified Inference Scaling (UIS), which unifies model routing and TTS in a single optimization space. Building on this formulation, we propose UniScale, an online framework that models adaptive UIS as a contextual multi-armed bandit problem and learns inference policies via LinUCB. The framework incorporates efficiency-aware learning and cost modeling to ensure stable and scalable optimization over high-dimensional action spaces. Evaluation shows that UniScale effectively exploits the synergy in the UIS space to deliver a fine-grained and consistently better quality-cost trade-off across diverse, dynamic inference scenarios.

Abstract:
Biological and neuromorphic recurrent neural networks (RNNs) are subject to spatial and temporal locality constraints on the information that can plausibly be used during learning. A common strategy to satisfy these constraints is to modify gradient descent by neglecting nonlocal terms to varying degrees, as in random feedback local online (RFLO) learning and truncated backpropagation through time (tBPTT). However, the learning dynamics of these algorithms, and how they compare with BPTT, remain poorly understood. Here, we apply dynamical systems theory to data-aligned linear RNNs to compare stationary solutions, stability properties, and convergence rates, finding a close relationship between BPTT and one-step tBPTT—in contrast with qualitatively distinct behavior for RFLO. We further observe that the solutions learned by RFLO are restricted to low-rank perturbations of initial parameters, a result which holds beyond the data-aligned setting. Our work provides fundamental insights into how locality constraints shape RNN learning dynamics, with implications for neuroscientific models of learning and alternative optimization approaches for state-space models.

Abstract:
Large language models (LLMs) display strong comprehensive abilities, yet the internal mechanisms that support these behaviors remain insufficiently understood. In this work, we show that across a wide range of openweight Transformers, a subset of neurons remains consistently highly activated during inference across tasks of multiple capability dimensions. By probing along the cross-task activation strength, an extremely sparse subset is isolated, whose removal causes a collapse in model behavior, which we term keystone neurons. Our analysis reveals that keystone neurons are a stable and intrinsic neuron subset of the model that is largely established during pretraining. The parameters associated with these neurons are tightly calibrated during the training process, and their precise values are critical for the capabilities of the model. Building on these insights, we propose a supervised fine-tuning approach that updates only keystone neurons, achieving task gains comparable to or even better than full-parameter fine-tuning while better preserving performance in other capability dimensions, despite modifying a much smaller number of parameters. Our code is available at https://anonymous.4open.science/r/keystone-48CE.

Abstract:
A central challenge in largescale decision-making under incomplete information is estimating reliable probabilities. Recent approaches leverage Large Language Models (LLMs) to generate explanatory factors and elicit coarse-grained probability estimates. Typically, an LLM performs forward abduction to propose factors, each paired with two mutually exclusive attributes, and a Naïve Bayes model is trained over factor combinations to refine the final probabilities. However, the induced factor space is often sparse, leading to frequent ''unknown'' outcomes when the system cannot map a query context to any supported factor configuration. Simply expanding the factor set to increase coverage is ineffective: it amplifies statistical noise and introduces spurious correlations that violate the conditional-independence assumption, ultimately degrading stability and reliability.To address these limitations, we propose Anchor, an inference framework that orchestrates aggregated Bayesian inference over a hierarchically structured factor space. Anchor first constructs a dense and organized factor space via iterative generation and hierarchical clustering. It then performs context-aware mapping through hierarchical retrieval and refinement, substantially reducing ''unknown'' predictions. Finally, Anchor augments Naïve Bayes with a Causal Bayesian Network to capture latent dependencies among factors, relaxing the strict independence assumption. Experiments show that Anchor markedly reduces ''unknown'' predictions and produces more reliable probability estimates than direct LLM baselines, achieving state-of-the-art performance while significantly reducing time and token overhead.

Abstract:
Spoken Language Models (SLMs) revolutionize speech synthesis by bypassing traditional linguistic frontends, yet they remain limited by the digital resource disparities across languages. We investigate these challenges within the Southeast Asian linguistic landscape, using the phonetically complex Thai and data-scarce Lao as representative cases for low-resource SLMs. Scaling experiments reveal that reliance on synthetic data triggers a Stability-Expressivity Gap, characterized by a non-monotonic degradation we term Synthetic Erosion. To bridge this gap, we propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by exploiting prosody-timbre separation. For regimes where authentic references are exceptionally limited, Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering. Our methods achieve state-of-the-art results, including the first zero-shot voice cloning capability for Lao, establishing a scalable pathway for high-fidelity synthesis across the global linguistic long-tail. Audio Samples are available at: \urlhttps://anonymous.4open.science/api/repo/multilantts-demo-EEF6/file/index.html?v=2de23271.

Abstract:
By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or highly frequent EHRs often yield excessively long token sequences that result in high computational costs and even reduced performance. Existing solutions either add modules for compression or remove less important tokens, which introduce additional inference latency or risk losing clinical information. To achieve lossless compression of token sequences without additional cost or loss of performance, we propose Medical TokenPair Encoding (MedTPE), a layered method that extends standard tokenisation for EHR sequences. MedTPE merges frequently co-occurring medical token pairs into composite tokens, providing lossless compression while preserving the computational complexity through a dependency-aware replacement strategy. Only the embeddings of the newly introduced tokens of merely 0.5-1.0% of the LLM’s parameters are fine-tuned via self-supervised learning. Experiments on real-world datasets for two clinical scenarios demonstrate that MedTPE reduces input token length by up to 31% and inference latency by 34-63%, while maintaining or even improving both predictive performance and output format compliance across three LLMs and four clinical prediction tasks. Furthermore, MedTPE demonstrates robustness across different input context lengths and generalisability to scientific and financial domains.

Abstract:
Extensive research has highlighted the severe threats posed by backdoor attacks to deep reinforcement learning (DRL). However, prior studies primarily focus on vanilla scenarios, while plasticity interventions have emerged as indispensable builtin components of modern DRL agents. Despite their effectiveness in mitigating plasticity loss, the impact of these interventions on DRL backdoor vulnerabilities remains underexplored, and this lack of systematic investigation poses risks in practical DRL deployments. To bridge this gap, we empirically study 14,664 cases integrating representative interventions and attack scenarios. We find that only one intervention (i.e., SAM) exacerbates backdoor threats, while other interventions mitigate them. Pathological analysis identifies that, the exacerbation is attributed to backdoor gradient amplification, while the mitigation stems from activation pathway disruption and representation space compression. From these findings, we derive two novel insights: (1) a conceptual framework SCC for robust backdoor injection that deconstructs the mechanistic interplay between interventions and backdoors in DRL, and (2) abnormal loss landscape sharpness as a key indicator for DRL backdoor detection.

Abstract:
In this paper, we tackle a fundamental problem in highdimensional statistics, namely, learning Single Index Models (SIMs) robustly in the presence of heavy-tailed noise and an adversary that can corrupt a constant fraction of both covariates and responses. Prior research on efficient robust recovery only focuses on monotonic link functions or only limit themselves to Phase Retrieval. Provable efficient robust recovery guarantees for generic nonlinear link functions have remained elusive. In this paper, we obtain the first near-linear time, optimal-sample-complexity robust recovery algorithm for a wide class of nonlinear non-monotonic link functions. Critical to our result is an improved understanding of the squared-loss landscape: we identify a sufficient condition under which a broad class of non linear link functions admit a dimension-independent constant-radius convex basin around the ground truth, establishing statistical identifiability beyond previously known cases. We also leverage second-order Stein's identities to identify a structural condition, that we term Expected Squared Convexity (ESC), that acts as a sufficient condition for spectral initialization techniques to obtain an estimator within the convex basin with error O(\epsilon^1/4), even under heavy-tailed noise and strong adversarial contamination. This robust initialization technique can be combined with a robust gradient descent phase to break the spectral error barrier, achieving an improved estimation error of O(\sigma\sqrt\epsilon). Our non-convex optimization framework gives the first efficient sample and time complexity robust recovery results for activation functions such as GeLU and Swish that act as building blocks of modern deep-learning architectures.

Abstract:
Reliable automatic seizure detection from longterm electroencephalogram recordings (EEG) remains an unsolved challenge, as current models often fail to generalize across patients or clinical settings. Manual EEG review still is the standard of care, highlighting the need for robust models and standardized evaluation. The current literature often reports high efficacy, yet these models frequently fail when deployed to unseen patient populations. To rigorously assess this generalization gap, we conducted a large-scale empirical study evaluating 28 state-of-the-art algorithmic architectures, ranging from classical feature engineering to modern Deep Learning. These algorithms were collected by organizing competition. A strictly held-out private dataset of continuous EEG recordings from 65 subjects, totaling 4'360 hours of data, was utilized to evaluate algorithm performance. Expert neurophysiologists annotated these recordings, establishing the ground truth for seizure events. Algorithms were evaluated using event-based metrics from the SzCORE framework, including sensitivity, precision, F1-score, and false positive rate per day. Results revealed significant performance variability among state-of-the-art approaches, with the top F1 score of 32% (sensitivity 37%, precision 29%), highlighting the persistent difficulty of this task for current machine learning methodologies. Our analysis uncovered a discordance between peak performance and population-level stability. The algorithms achieving the highest aggregate F1-scores did not achieve the most consistent ranking across subjects, indicating high performance variance and susceptibility to failure on outlier patients. This independent evaluation also exposed a notable gap between self-reported efficacies and hold-out performance, underscoring the critical need for standardized, rigorous benchmarking in developing clinically viable ML models. A comparison with previous challenges and commercial systems indicates that the best algorithm in this study surpassed prior methods. Critically, the evaluation infrastructure transitions into a continuously open benchmarking platform, fostering reproducible research and accelerating the development of robust seizure detection algorithms by allowing ongoing submissions and integration of additional private datasets. Clinical centers can also adopt this platform to evaluate seizure detection algorithms on their EEG data using a standardized, reproducible framework.

Abstract:
In this paper, we present empirical and theoretical evidence against a central but largely implicit assumption in circuit and sheaf discovery (CSD), which we term theFunctional Anisotropy Hypothesis: the idea that functions in large language models (LLMs) are localised to a unique or nearunique internal mechanism. We show that a single LLM task can instead be supported by multiple, structurally distinct circuits or sheaves that are simultaneously faithful, sparse, and complete. To systematically uncover such competing mechanisms, we introduce Overlap-Aware Sheaf Repulsion, a method that augments the CSD objective with an explicit penalty on structural overlap across multiple discovery runs, enabling the discovery of circuits or sheaves with strong task performance but minimal shared structure across a plethora of common CSD benchmarks. We find that this phenomenon becomes increasingly pronounced as the number of discovered sheaves grows and persists robustly across major CSD methods. We further identify an ultra-sparse three-edge sheaf and show that none of its edges is individually indispensable, undermining even weakened notions of canonical or essential components. To explain these findings, we propose aDistributive Dense Circuit Hypothesisand provide a theoretical analysis demonstrating that non-unique, low-overlap circuit explanations arise naturally from high-dimensional superposition under mild assumptions. Together, our results suggest that mechanistic explanations in LLMs are inherently non-canonical and call for a rethinking of how CSD results should be interpreted and evaluated.

Abstract:
TSAD is a critical task, but developing models that generalize to unseen data in a zeroshot manner remains a major challenge. Prevailing foundation models for TSAD predominantly rely on reconstruction-based objectives, which suffer from a fundamental objective mismatch and representation conflict: they tend to memorize static patterns from training data, struggling to identify subtle anomalies while often misinterpreting complex normal patterns in unseen domains. To overcome these limitations, we introduce TimeRCD, a novel foundation model for TSAD built upon a new pre-training paradigm: Relative Context Discrepancy (RCD). Instead of reconstructing inputs based on fixed priors, TimeRCD is explicitly trained to adaptively identify anomalies by contrasting the query with its surrounding context. This relational approach, implemented with a standard Transformer architecture, enables the model to infer normality on-the-fly and capture contextual shifts indicative of anomalies that reconstruction-based methods often miss. To empower this paradigm, we develop a large-scale, diverse synthetic corpus with context-dependent anomaly labels, providing the rich supervisory signal necessary for effective pre-training. Extensive experiments demonstrate that TimeRCD significantly outperforms existing general-purpose and anomaly-specific foundation models in zero-shot TSAD across diverse datasets. Our results validate the superiority of the RCD paradigm and establish a new, effective path toward building robust and generalizable foundation models for time series anomaly detection. The code is available in \urlhttps://anonymous.4open.science/r/TimeRCD-5BE1/

Abstract:
Reliable retrievalaugmented generation (RAG) systems depend fundamentally on the retriever’s ability to find relevant information. We show that neural retrievers used in RAG systems have blind spots, which we define as the failure to retrieve entities that are relevant to the query, but have low similarity to the query embedding. We investigate the training-induced biases that cause such blind-spot entities to be mapped to inaccessible parts of the embedding space, resulting in low retrievability. Using a large-scale dataset constructed from Wikidata relations and first paragraphs of Wikipedia, and our proposed Retrieval Probability Score (RPS), we show that blind spot risk in standard retrievers (e.g., Contriever, ReasonIR) can be predicted pre-index from entity embedding geometry, avoiding expensive retrieval evaluations. To address these blind spots, we introduce ARGUS, a pipeline that enables the retrievability of high-risk (low-RPS) entities through targeted document augmentation from a knowledge base (KB), first paragraphs of Wikipedia, in our case. Extensive experiments on BRIGHT, ImpliRet, and RAR-b show that ARGUS achieves consistent improvements across all evaluated retrievers (averaging +3.4 nDCG@5 and +4.5 nDCG@10 absolute points), with substantially larger gains in challenging subsets. These results establish that preemptively remedying blind spots is critical for building robust and trustworthy RAG systems (Code and data will be released upon acceptance.).

Abstract:
Imitation learning with diffusion models has advanced robotic control by capturing the multimodal action distributions. However, existing methods typically treat observations only as high-level conditions to the denoising network, rather than integrating them into the stochastic dynamics of the diffusion process itself. As a result, the sampling is forced to begin from random noise, weakening the coupling between perception and control and often yielding suboptimal performance. We propose BridgePolicy, a generative visuomotor policy that directly integrates observations into the stochastic dynamics via a diffusion-bridge formulation. By constructing an observation-informed trajectory, BridgePolicy enables sampling to start from a rich and informative prior rather than random noise, substantially improving precision and reliability in control. A key difficulty is that diffusion bridge normally connects distributions of matched dimensionality, while robotic observations are heterogeneous and not naturally aligned with actions. To overcome this, we introduce a multi-modal fusion module and a semantic aligner to unify the visual and state inputs and align the observations with action representations, making diffusion bridge applicable to heterogeneous robot data. Extensive experiments across 52 simulation tasks on three benchmarks and 5 real-world tasks demonstrate that BridgePolicy consistently outperforms state-of-the-art generative policies.

Abstract:
While planand-infill decoding in Masked Diffusion Models (MDMs) shows promise for mathematical and code reasoning, performance remains highly sensitive to slot infilling order, often yielding substantial output variance. We introduce DiffuSearch, a framework that formulates slot selection as decision making and optimises infilling orders through Monte Carlo Tree Search (MCTS). DiffuSearch uses look-ahead simulations to evaluate partial completions before commitment, systematically exploring the combinatorial space of generation orders. Experiments show an average improvement of 3.2% over autoregressive baselines and 8.0% over baseline plan-and-infill, with notable gains of 19.5% on MBPP and 4.9% on MATH500. Our analysis reveals that while DiffuSearch predominantly follows sequential ordering, incorporating non-sequential generation is essential for maximising performance. We observe that larger exploration constants, rather than increased simulations, are necessary to overcome model confidence biases and discover effective orderings. These findings establish MCTS-based planning as an effective approach for enhancing generation quality in MDMs.

Abstract:
Devicecloud collaboration holds promise for deploying large language models (LLMs), leveraging lightweight on-device models for efficiency while relying on powerful cloud models for superior reasoning. A central challenge in this setting is determining, for each incoming query, whether it should be processed locally or offloaded to the cloud. Existing approaches typically rely on external routers, which often struggle to determine difficulty from the prompt itself, especially for tasks involving complex reasoning. Motivated by this limitation, we propose enabling on-device LLMs to decide internally whether to invoke cloud assistance at inference time, with this capability instilled through reinforcement learning based post-training. Casting on-device LLM post-training as a reward maximization problem, we design hierarchical rewards to encourage local problem solving and judicious cloud offloading. To solve the resulting problem, we develop an algorithm featuring a group-level policy gradient that stabilizes optimization, together with adaptive prompt filtering that provides complementary learning signals to mitigate policy collapse (i.e., exclusive local execution or exclusive cloud offloading). Extensive experiments on on-device-scale LLaMA and Qwen models across multiple reasoning benchmarks show that our method consistently outperforms baselines and significantly narrows the gap to full cloud LLMs.

Abstract:
Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing trainingfree sparse attention methods in video generation still face two unresolved limitations:ignoring layer heterogeneity in attention pruningandignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight thatthe attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we proposeSVOO, a training-freeSparse attention framework for fastVideo generation viaOffline layer-wise sparsity profiling andOnline bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to 1.93× speedup while maintaining a PSNR of up to 29 dB on Wan2.1.

Abstract:
Modern imaging techniques heavily rely on Bayesian statistical models to address difficult image reconstruction and restoration tasks. This paper addresses the objective evaluation of such models in settings where ground truth is unavailable, with a focus on model selection and misspecification diagnosis. Existing unsupervised model evaluation methods are often unsuitable for computational imaging due to their high computational cost and incompatibility with modern image priors defined implicitly via machine learning models. We herein propose a general methodology for unsupervised model selection and misspecification detection in Bayesian imaging sciences, based on a novel combination of Bayesian crossvalidation and data fission, a randomized measurement splitting technique. The approach is compatible with any Bayesian imaging sampler, including diffusion and plug-and-play samplers. We demonstrate the methodology through experiments involving various scoring rules and types of model misspecification, where we achieve excellent selection and detection accuracy with a low computational cost.

Abstract:
Deploying Large Language Models to datascarce programming domains poses significant challenges, particularly for kernel synthesis on emerging Domain-Specific Architectures where a "Data Wall" limits available training data. While models excel on data-rich platforms like CUDA, they suffer catastrophic performance drops on data-scarce ecosystems such as NPU programming. To overcome this cold-start barrier without expensive fine-tuning, we introduce Evokernel, a self-evolving agentic framework that automates the lifecycle of kernel synthesis from initial drafting to continual refining. Our method addresses this by formulating the synthesis process as a memory-based reinforcement learning task. Through a novel value-driven retrieval mechanism, it learns stage-specific Q-values that prioritize experiences based on their contribution to the current objective—whether bootstrapping a feasible draft or iteratively refining latency. Furthermore, by enabling cross-task memory sharing, the agent generalizes insights from simple to complex operators. By building an NPU variant of KernelBench and evaluating on it, \ourmethod improves frontier models' correctness from 11.0% to 83.0% and achieves a median speedup of 3.60x over initial drafts through iterative refinement. This demonstrates that value-guided experience accumulation allows general-purpose models to master the kernel synthesis task on niche hardware ecosystems.

Abstract:
Highdimensionality of data is often regarded as a fundamental statistical impediment in Machine Learning and AI. The purpose of this paper is to clarify, on the contrary, when and how high-dimensionality may be beneficial. In the setting of a general random function model of data we delineate between three notions of dimensionality: effective dimension p_\mathrmeff, measuring total variability across feature directions; correlation rank r, measuring functional complexity across samples; and latent intrinsic dimension d of manifold structure hidden in data. Via a generalized Hanson-Wright inequality, we show that increasing p_\mathrmeff drives a blessing of dimensionality phenomenon, whereby data dot-products concentrate about their expectations. In turn, we show that, under mild continuity assumptions (ensuring that features bring additional information as dimension grows), persistence diagrams recover latent homology when p_\mathrmeff \in \omega (\log n) as n\to\infty. Informed by our theory, we revisit the ground-breaking neuroscience discovery of toroidal structure in grid-cell activity made by Gardner et al. (2022): our findings provide the first empirical evidence that this structure is isometric to a flat torus model of physical space, suggesting that grid cell activity conveys a geometrically faithful representation of the real world.

Abstract:
TestTime Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. Through extensive experiments across open-source models (e.g. Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), constraining diversity consistently signifies the rate at which TTS produces unsafe results. The effect is often stronger than that produced by prompts directly with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3-mini and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard), are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode.

Abstract:
Reasoning hallucinations in large language models (LLMs) often appear as fluent yet unsupported conclusions that violate either the given context or underlying factual knowledge. Although such failures are widely observed, the mechanisms by which decoderonly Transformers produce them remain poorly understood. We model next-token prediction as a graph search process over an underlying graph, where entities correspond to nodes and learned transitions form edges. From this perspective, contextual reasoning is a constrained search over a sampled subgraph (intrinsic reasoning), while context-free queries rely on memorized structures in the underlying graph (extrinsic reasoning). We show that reasoning hallucinations arise from two fundamental mechanisms: path reuse, where memorized knowledge overrides contextual constraints during early training, and path compression, where frequently traversed multi-step paths collapse into shortcut edges in later training. Together, these mechanisms provide a unified explanation for reasoning hallucinations in LLMs and connected to well-known behaviors observed in downstream applications.

Abstract:
The minimal norm weight perturbations of DNNs required to achieve a specified change in output are derived and the factors determining its size are discussed. These singlelayer exact formulae are contrasted with more generic multi-layer Lipschitz constant based robustness guarantees; both are observed to be of the same order which indicates similar efficacy in their guarantees. These results are applied to precision-modification-activated backdoor attacks, establishing provable compression thresholds below which such attacks cannot succeed, and show empirically that low-rank compression can reliably activate latent backdoors while preserving full-precision accuracy. These expressions reveal how back-propagated margins govern layer-wise sensitivity and provide certifiable guarantees on the smallest parameter updates consistent with a desired output shift.

Abstract:
Lloyd's kmeans algorithm is one of the most widely used clustering methods. We prove that in high-dimensional, high-noise settings, the algorithm exhibits catastrophic failure: with high probability, essentially every partition of the data is a fixed point. Consequently, Lloyd's algorithm simply returns its initial partition — even when the underlying clusters are trivially recoverable by other methods. In contrast, we prove that Hartigan's k-means algorithm does not exhibit this pathology. Our results show the stark difference between these algorithms and offer a theoretical explanation for the empirical difficulties often observed with k-means in high dimensions.

Abstract:
Large language model pretraining typically exhibits a two-phase trajectory: a fast initial loss drop followed by a prolonged slow improvement. We identify an underlying spectral phenomenon, Stability of Singular Distribution (SoSD), where the trace-normalized singular value spectrum stabilizes early, even as parameter matrices continue to evolve. We demonstrate that synchronization between SoSD and the slow-descent regime is widely observed across diverse architectures (GPT-2, LLaMA) and settings, including various schedules (Step-wise, WSD, Cosine Decay), weight decays, and optimizers (AdamW, Muon). By analyzing a simplified Transformer, we prove that growing weight norms inevitably precipitate an early SoSD threshold, after which the rate of loss decrease becomes theoretically bounded by the variation in the singular distribution. We further interpret strategies like WSD and Muon through their ability to modulate the SoSD scale, offering a spectral lens for understanding efficient pre-training dynamics.

Abstract:
Neural networks trained with gradientbased methods exhibit a strong simplicity bias, learning simpler statistical features of their data before moving to more complex features. In this work, we study this bias from a Fourier perspective, motivated by the approximate translation-invariance and the characteristic power spectra of natural images. We first show experimentally that simple neural networks trained on image classification tasks first rely on amplitude information -- related to pair-wise correlations between pixels -- before exploiting phase information, which encodes edges and higher-order correlations. To explain this phenomenon, we introduce a synthetic data model for translation-invariant inputs that allows precise control over the amplitudes and phases while remaining tractable. We rigorously establish that for isotropic and high-dimensional inputs, classifying them by relying only phase information is a genuinely hard task: online stochastic gradient descent cannot distinguish the structured inputs from noise within n \ll N^3 steps, but needs at least n \gg N^3 \log^2N steps. In contrast, we prove that for non-isotropic inputs with power-law spectra, the existence of a dominant principal subspace can dramatically accelerate the speed of learning, even if the Fourier amplitudes are shared among classes and do not help with classification. Simulations with two-layer networks trained on textures, and with deep convolutional networks on ImageNet confirm this non-trivial interaction between amplitudes and phases, providing mechanistic insight into how deep neural networks can learn natural image distributions efficiently.

Abstract:
Evaluating mathematical reasoning in LLMs is constrained by limited benchmark sizes and inherent model stochasticity, yielding highvariance accuracy estimates and unstable rankings across platforms. On difficult problems, an LLM may fail to produce a correct final answer, yet still provide reliable pairwise comparison signals indicating which of two candidate solutions is better. We leverage this observation to design a statistically efficient evaluation framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains. Treating these comparison signals as control variates, we develop a semiparametric estimator based on the efficient influence function (EIF) for the setting where auxiliary reasoning chains are observed. This yields a one-step estimator that achieves the semiparametric efficiency bound, guarantees strict variance reduction over naive sample averaging, and admits asymptotic normality for principled uncertainty quantification. Across simulations, our one-step estimator substantially improves ranking accuracy, with gains increasing as model output noise grows. Experiments on GPQA Diamond, AIME 2025, and GSM8K further demonstrate more precise performance estimation and more reliable model rankings, especially in small-sample regimes where conventional evaluation is pretty unstable.

Abstract:
Looped Language Models (LoopLMs) perform multistep latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed—standard objectives such as Group Relative Policy Optimization (GRPO) only assign credit to the final latent state, creating a fundamental mismatch with the model's internal computation. To resolve this, we introduceRLTT (Reward Latent Thought Trajectories), a reinforcement learning framework which distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. Across extensive experiments with Ouro-2.6B-Thinking under identical training and inference conditions,RLTT yields substantial improvements over GRPO on challenging mathematical reasoning benchmarks, improving accuracy by +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs.

Abstract:
Bridging the sim2real gap between computationally inexpensive models and complex physical systems remains a central challenge in machine learning applications to engineering problems, particularly in multiscale settings where reduced-order models typically capture only dominant dynamics. In this work, we present Cheap2Rich, a multi-scale data assimilation framework that reconstructs high-fidelity state spaces from sparse sensor histories by combining a fast low-fidelity prior with learned, interpretable discrepancy corrections. We demonstrate the performance on rotating detonation engines (RDEs), a challenging class of systems that couple detonation-front propagation with injector-driven unsteadiness, mixing, and stiff chemistry across disparate scales. Our approach successfully reconstructs high-fidelity RDE states from sparse measurements while isolating physically meaningful discrepancy dynamics associated with injector-driven effects. The results highlight a general multi-fidelity framework for data assimilation and system identification in complex multi-scale systems, enabling rapid design exploration and real-time monitoring and control while providing interpretable discrepancy dynamics. Anonymous code is available at: anonymous.4open.science/r/Cheap2Rich-4C71.

Abstract:
As Large Language Models (LLMs) increasingly assist secure software development, their ability to meet the rigorous demands of Rust program verification remains unclear. Existing evaluations treat Rust verification as a black box, assessing models only by binary pass or fail outcomes for proof hints. This obscures whether models truly understand the logical deductions required for verifying nontrivial Rust code. To bridge this gap, we introduce VCoTLift, a framework that lifts low-level solver reasoning into high-level, human-readable verification steps. By exposing solver-level reasoning as an explicit Verification Chain-of-Thought, VCoT-Lift provides a concrete ground truth for fine-grained evaluation. Leveraging VCoT-Lift, we introduce VCoT-Bench, a comprehensive benchmark of 1,988 VCoT completion tasks for rigorously evaluating LLMs’ understanding of the entire verification process. VCoT-Bench measures performance along three orthogonal dimensions: robustness to varying degrees of missing proofs, competence across different proof types, and sensitivity to the proof locations. Evaluation of ten state-of-the-art models reveals severe fragility, indicating that current LLMs fall well short of the reasoning capabilities exhibited by automated theorem provers.

Abstract:
Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). Recent studies employ autoencoders to achieve this by reconstructing textual CoT from latent tokens, thus encoding CoT semantics. However, treating textual CoT as the reconstruction target forces latent tokens to preserve surfacelevel linguistic features (e.g., word choice and syntax), introducing a strong linguistic inductive bias that prioritizes linguistic form over reasoning structure and limits logical abstraction. Thus, we propose ImgCoT that replaces the reconstruction target from textual CoT to the visual CoT obtained by rendering CoT into images. This substitutes linguistic bias with spatial inductive bias, i.e., a tendency to model spatial layouts of the reasoning steps in visual CoT, enabling latent tokens to better capture global reasoning structure. Moreover, although visual latent tokens encode abstract reasoning structure, they may blur reasoning details. We thus propose a loose ImgCoT, a hybrid reasoning that augments visual latent tokens with a few key textual reasoning steps, selected based on low token log-likelihood. This design allows LLMs to retain both global reasoning structure and fine-grained reasoning details with fewer tokens than the complete CoT. Extensive experiments across multiple datasets and LLMs demonstrate the effectiveness of the two versions of ImgCoT.

Abstract:
In GFlowNets and variational inference, it has been shown that the mean square error between target and model log probabilities is an effective, low variance, surrogate loss for training generative models. This loss has the property that when evaluated \emphonpolicy its gradients correspond to those of the KL divergence, while \emphoff-policy it remains a valid loss with the same global minimiser. In this work, we demonstrate that this construction can be extended to the whole family of f-divergences, leading to a family of losses whose on-policy gradients are that of the corresponding f-divergence, but retain the same global minimiser off-policy. Specifically, we show that the on-policy gradients lead to a one to one correspondence between translation invariant loss functions on the target and model log probabilities, and f-divergences. This equivalence allows us to design new surrogate loss functions for tuning a wide class of generative models that inherit the properties of the corresponding f-divergence, such as being more mode covering, whilst being applicable to off-policy data. We apply our losses on a range of tasks, including classic synthetic examples, SynFlowNets for molecule discovery, and asynchronous large language model (LLM) tuning, demonstrating that our models retain their predicted properties on- and off-policy and can be applied to a wide class of generative models.

Abstract:
ZerothOrder (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning and black-box optimization. However, existing methods face a stark trade-off: they are either sample-inefficient (e.g., standard finite differences) or suffer from high variance due to randomized estimation (e.g., random subspace methods). In this work, we propose Coherent Coordinate Descent (CoCD), a deterministic, sample-efficient, and budget-aware ZO optimizer. Theoretically, we formalize the notion of gradient coherence and demonstrate that CoCD is equivalent to Block Cyclic Coordinate Descent (BCCD) with ``warm starts,'' effectively converting historical (stale) gradients from a liability into a computational asset. This mechanism enables O(1) query complexity per step while maintaining global descent directions. Furthermore, we derive error bounds revealing a counter-intuitive insight: larger finite-difference step sizes can induce an implicit smoothing effect on the optimization landscape by reducing the effective smoothness constant, thereby improving convergence stability. Experiments on MLP and CNN architectures (up to 20k parameters) demonstrate that CoCD significantly outperforms BCCD in terms of sample efficiency and convergence loss/accuracy, and exhibits superior stability over randomized ZO methods. Our results suggest that deterministic, structure-aware updates offer a superior alternative to randomization for lightweight ZO optimization.

Abstract:
Reasoning is a core capability of language models (LMs), yet it remains unclear how much model capacity is necessary to support reasoning during pretraining. In this work, we study the minimal parameter budget required for implicit reasoning, defined as the ability to infer new facts from learned knowledge without explicit chainof-thought supervision. To isolate this phenomenon, we pretrain LMs from scratch in a controlled synthetic environment that mimics the structure and distribution of real-world knowledge graphs, and evaluate their ability to complete missing edges via multi-hop inference. From both a theoretical and an empirical perspective, we identify a scaling law linking this optimal parameter budget to a graph search entropy measure. Across a wide range of model sizes, training steps, and graph complexities, we show that an optimally sized language model can reliably reason over approximately 0.008 bits of information per parameter at most. Our results characterize the minimal sufficient capacity for implicit reasoning during pretraining. Our findings provide principled guidance for matching model size to data complexity and offer new insights into the scaling behavior of reasoning in large language models.

Abstract:
Many realworld datasets contain hidden structure that cannot be detected by simple linear correlations between input features. For example, latent factors may influence the data in a coordinated way, even though their effect is invisible to covariance-based methods such as PCA. In practice, nonlinear neural networks often succeed in extracting such hidden structure in unsupervised and self-supervised learning. However, constructing a minimal high-dimensional model where this advantage can be rigorously analyzed has remained an open theoretical challenge. We introduce a tractable high-dimensional spiked model with two latent factors: one visible to covariance, and one statistically dependent yet uncorrelated, appearing only in higher-order moments. PCA and linear autoencoders fail to recover the latter, while a minimal nonlinear autoencoder provably extracts both. We analyze both the population risk, and empirical risk minimization. Our model also provides a tractable example where self-supervised test loss is poorly aligned with representation quality: nonlinear autoencoders recover latent structure that linear methods miss, even though their reconstruction loss is higher.

Abstract:
From generating headlines to fabricating news, the Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chainof-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures: stability, geometry, and energy to quantify how specific attention heads respond or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that the generation risk rise significantly when the thinking mode is activated, where the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and provides a new understanding perspective for mitigating latent reasoning risks.

Abstract:
Many reinforcement learning (RL) problems admit multiple terminal solutions of comparable quality, where the goal is not to identify a single optimum but to represent a diverse set of highquality outcomes. Nevertheless, policies trained by standard expected-return maximization routinely collapse onto a small subset of outcomes, a phenomenon commonly attributed to insufficient exploration or weak regularization. We show that this explanation is incomplete: outcome-level mode collapse is a structural consequence of the expected-return objective itself. Under idealized learning dynamics, the log-probability ratio between any two outcomes evolves linearly in their reward difference, implying exponential ratio divergence and inevitable collapse—independent of the exploration strategy, entropy regularization, or optimization algorithm. We identify the source of this pathology as the probability multiplier inside the expectation and propose a minimal correction: inverse probability scaling, which removes outcome-frequency amplification from the learning signal, fundamentally changes the learning dynamics, and provably yields reward-proportional terminal distributions, preventing collapse in multimodal settings. We instantiate this principle in Group Relative Policy Optimization (GRPO) as a drop-in modification, IPS-GRPO, requiring no auxiliary models or architectural changes. Across different reasoning and molecular generation tasks, IPS-GRPO consistently reduces outcome-level mode collapse while matching or exceeding baseline performance, suggesting that correcting the objective rather than adding exploration heuristics is key to reliable multimodal policy optimization.

Abstract:
GPTstyle language models are sensitive to single-token changes at generation points where the predicted probability distribution is spread across multiple tokens. Viewing this sensitivity as a geometric property, we derive an \mathfrakso(n)-valued 1-form that depends only on the geometry of the token embeddings. Despite this purely geometric origin, we show that its curvature is semantically meaningful: on chess reasoning tasks, the curvature couples to the world model of an off-the-shelf instruction-tuned model, with transformations clustering by board region and respecting piece importance. Our findings suggest that token space geometry directly reflects how models internally represent problems.

Abstract:
Reinforcement learning has been widely applied to diffusion and flow models for visual tasks such as textto-image generation. However, these tasks remain challenging because diffusion models have intractable likelihoods, which creates a barrier for directly applying popular policy-gradient type methods. Existing approaches primarily focus on crafting new objectives built on already heavily engineered LLM objectives, using ad hoc estimators for likelihood, without a thorough investigation into how such estimation affects overall algorithmic performance. In this work, we provide a systematic analysis of the RL design space by disentangling three factors: i) policy-gradient objectives, ii) likelihood estimators, and iii) rollout sampling schemes. We show that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional. We validate our findings across multiple reward benchmarks using SD 3.5 Medium, and observe consistent trends across all tasks. Our method improves the GenEval score from 0.24 to 0.95 in 90 GPU hours, which is 4.6× more efficient than FlowGRPO and 2× more efficient than the SOTA method DiffusionNFT without reward hacking.

Abstract:
Fake Image Detection (FID), aiming at unified detection across four image forensic subdomains, is critical in realworld forensic scenarios. Compared with ensemble approaches, monolithic FID models are theoretically more promising, but to date, consistently yield inferior performance in practice. In this work, by discovering the "heterogeneous phenomenon'', which is the intrinsic distinctness of artifacts across subdomains, we diagnose the cause of this underperformance for the first time: the collapse of the artifact feature space driven by such phenomenon. The core challenge for developing a practical monolithic FID model thus boils down to the "unified-yet-discriminative" reconstruction of the artifact feature space. To address this paradoxical challenge, we hypothesize that high-level semantics can serve as a structural prior for the reconstruction, and further propose Semantic-Induced Constrained Adaptation (SICA), the first monolithic FID paradigm. Extensive experiments on our OpenMMSec dataset demonstrate that SICA outperforms 15 state-of-the-art methods and reconstructs the target unified-yet-discriminative artifact feature space in a near-orthogonal manner, thus firmly validating our hypothesis. The code and dataset will be made publicly available.

Abstract:
Recent leaderboardbased evaluations of large language models aggregate user feedback by fitting a Bradley--Terry model to pairwise comparisons, producing a single global ranking based on a latent quality score. While appealing for its simplicity, this approach is incompatible with heterogeneous preferences: when LLMs are used across diverse tasks and use cases, users who favor fundamentally different model behaviors can be systematically misrepresented when collapsed into a single quality score. To address this issue, we study pluralistic leaderboards that aim to remain stable with respect to heterogeneous user populations. Drawing on ideas from social choice theory, we adapt the notion of local stability, which requires that no model outside the top-k positions is collectively preferred to the top-k set by more than O(1/k) fraction of users. Building on techniques from the social choice literature, we design an alternative leaderboard mechanism that satisfies local stability while eliciting only \widetildeO(k) pairwise comparisons per user, where k is the size of the prefix for which stability is guaranteed. Using data from LMArena, we show that standard Bradley--Terry aggregation can violate local stability in practice, whereas our method provides substantially stronger stability guarantees.

Abstract:
PhysicsInformed Neural Networks (PINNs) embed physical laws into deep learning models. However, conventional PINNs often suffer from failure modes leading to inaccurate solutions. We trace these failure modes to two structural pathologies: gradient shattering, where gradients degrade with depth and provide little training signal, and flow mismatch, where training pushes predictions along trajectories that diverge from the PDE solution path. We introduce ResPINNs, which reformulate PINNs as residual flows, networks that iteratively refine their own predictions through explicit corrective steps, in the spirit of classical iterative solvers. Our analysis shows that this design mitigates both pathologies by keeping updates aligned with descent and by preserving informative gradients across depth. Extensive experiments on PDE benchmarks confirm that ResPINNs achieve higher accuracy with substantially fewer parameters than conventional architectures.

Abstract:
We show that Chainof-Thought (CoT) enables a fixed single-layer transformer to efficiently approximate the training process of an N-layer feed-forward network in-context. Since FFNs are universal approximators, this result provides strong theoretical evidence for the expressive power of CoT. Specifically, we improve the computational cost of the prior best in-context result [Wu et al., ICML 2025] by O(N). Building on the insight of the recursive nature of CoT, we reuse the single-layer transformer autoregressively instead of stacking the same transformer blocks to perform multiple In-Context Gradient Descent (ICGD) updates. The key novelty is a dynamic-masking scheme: at each CoT step, the attention heads are forced to see only the tokens needed to compute the result of the current forward or backpropagation update. This selective reuse contrasts with earlier ICGD proofs for neural network optimization, which must carry all information through every layer simply because a later gradient update might need it. Our numerical validations backup our theory.

Abstract:
The choice of training objective is central to diffusionbased generative modeling in terms of both sample quality and distribution coverage. While standard maximum likelihood training provides a principled objective with strong theoretical grounding, empirical studies indicate that previous training objectives in diffusion models often face an inverse correlation between likelihood optimization and perceptual evaluations. We propose the Rényi diffusion model, a unified generative framework that formulates training objectives using Rényi divergence. This yields a generalized score matching objective providing explicit control over the trade-off between sample quality and distribution coverage. Experiments demonstrate improved balance between density estimation and sample generation performances across multiple datasets without modifying model architectures or sampling procedures.

Abstract:
Conformal prediction converts point predictions into setvalued predictions with coverage guarantees under exchangeability between calibration and deployment data. We studyconformal calibration transfer, where this requirement fails because labeled calibration is available only in a source space, while prediction sets are needed in a target space linked to the source throughunlabeled pairedobservations (e.g., paired modalities or sensor changes). We propose Transported Conformal Calibration (TCC): we transport labeled source calibration into the target space using the paired data, and then correct residual post-transport mismatch using only unlabeled target inputs. We instantiate this correction with two complementary methods:TCC-KS, which uses a label-free uncertainty surrogate to detect mismatch and adjust calibration conservatively, andweighted-TCC, which reweights transported calibration toward the target domain for improved efficiency when weights are stable. We provide finite-sample target-domain coverage guarantees that adapt to an observable measure of mismatch. Across CIFAR-100-C, Tiny-ImageNet-C, and SEN12MS, we show reliable target-domain coverage transfer without labeled target calibration data, with label-free diagnostics that predict when correction is needed.

Abstract:
The Traveling Salesperson Problem (TSP) has long served as a benchmark for evaluating the strength of optimization techniques in the classical theory of algorithms. In recent efforts to apply ML to algorithmic problems, TSP has also become a natural testbed for the development of MLbased techniques. A common approach is to train a neural network to output a heatmap estimating the likelihood of each edge to be part of the optimal tour; however, converting such a heatmap into an actual tour remains a non-trivial and often computationally intensive step. In this work, we propose algorithms for transforming heatmaps into tours with theoretical guarantees linking the achieved approximation ratio to the quality of the provided heatmap. In the spirit of \emphalgorithms with predictions, our results can be described as (1+2\eta/OPT)-approximation algorithms, where \eta denotes the L1 distance between the prediction (heatmap) and an optimal solution (tour). Since the previous works lack such explicit guarantees, we compare our approach against them experimentally.

Abstract:
The widespread availability of finetuned LoRA modules for open pre-trained models has led to an interest in methods that can adaptively merge LoRAs to improve performance. These methods typically include some way of selecting LoRAs from a pool and tune merging coefficients based on a task-specific dataset. While adaptive merging methods have demonstrated improvements in some settings, no past work has attempted to recycle LoRAs found ``in the wild'' on model repositories like the Hugging Face Hub. To address this gap, we consider recycling from a pool of nearly 1,000 user-contributed LoRAs trained from the Llama 3.1 8B-Instruct language model. Our empirical study includes a range of adaptive and non-adaptive merging methods in addition to a new method designed via a wide search over the methodological design space. We demonstrate that adaptive merging methods can improve performance over the base model but provide limited benefit over training a new LoRA on the same data used to set merging coefficients. We additionally find not only that the specific choice of LoRAs to merge has little importance, but that using LoRAs with randomly initialized parameter values yields similar performance. This raises the possibility that adaptive merging from recycled LoRAs primarily works via some kind of regularization effect, rather than by enabling positive cross-task transfer. To better understand why past work has proven successful, we confirm that positive transfer is indeed possible when there are highly relevant LoRAs in the pool. We release the model checkpoints and code online.

Abstract:
Lowrank adaptation (LoRA) is one of the most widely used parameter-efficient fine-tuning (PEFT) methods for adapting pre-trained large language models (LLMs) to downstream tasks. Although LoRA significantly reduces the number of trainable parameters and lowers fine-tuning costs, its performance is often limited by the inherent low-rank assumption. In this paper, we revisit the notion of rank for LoRA update matrices and show that the standard matrix rank fails to capture duplicated directions and redundancy in the update subspace. Motivated by this analysis, we argue that the Kruskal rank offers a more informative criterion for characterizing update diversity. We therefore propose Low Kruskal Rank Adaptation (LoKRA), a new PEFT algorithm with provable theoretical guarantees that mitigates the limitations of LoRA. We further introduce LoKRA^+, an enhanced variant that provides a tighter theoretical lower bound on the Kruskal rank and yields stronger empirical performance. Experiments on multiple LLMs show that our approach consistently outperforms LoRA and other baselines, establishing state-of-the-art performance across a range of benchmarks.

Abstract:
A common assumption in strategic classification is that the classifier is made public knowledge. However, it remains unclear if, and why, a system would choose to commit to full disclosure. We study a setting in which regulation requires the system to share some, but not all, of the information. This entails a learning task in which the goal is to jointly learn a classifier and the uncertainty surrounding it. Towards this, we adopt from robust mechanism design the notion of ambiguity, which in our setting permits the learner to reveal a set or range of possible classifiers, and choose one to realize. We investigate how ambiguity affects the learning task, propose efficient algorithms for computing bestresponses and training, and empirically explore strategic learning and its outcomes in this novel setting and using our approach.

Abstract:
Deep models are vulnerable to performance degradation caused by various factors, such as imbalanced samples, inaccurate labels, and backdoor attacks. However, existing optimization methods that address these issues are typically designed in a scenarioor architecture-specific manner, and each optimization often requires costly training. To this end, inspired by image denoising, we propose parameter purification as a new paradigm for model performance optimization. Parameter purification attributes performance degradation to the contamination of model parameters and aims to recover clean parameters from corrupted ones in a manner analogous to image denoising. To purify parameters with massive scale and complex structure, we further introduce a novel parameter manifold purification method. In this framework, high-dimensional and complex parameters are first viewed as manifolds embedded in a high-dimensional space, and are then partitioned into nested local parameter-cluster manifolds via a proposed parameter clustering strategy. Meanwhile, to remove parameter redundancy while preserving global parameter information, we propose an implicit manifold auto-encoder along with a parameter-cluster discrepancy loss to learn low-dimensional representations of parameter-cluster manifolds. Finally, an implicit conditional diffusion model is applied to denoise the low-dimensional parameter manifolds, progressively restoring clean parameters. Extensive experiments under three representative scenarios that cause model performance degradation demonstrate that parameter manifold purification can accurately and completely purify corrupted parameters of unseen models, analogous to denoising unseen images, and rapidly improve model performance.

Abstract:
The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with subquadratic cost. To improve representation capacity under long contexts, recent approaches organize memory in a multi-state manner. However, existing multi-state linear attention methods rely on fixed state merging policies that cannot adapt to dynamically varying token importance, irreversibly obscuring critical tokens and causing severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states to control memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate on 10 datasets from three different aspects. Experimental results demonstrate the superiority of DLA over state-of-the-art.

Abstract:
Flow Matching shows great promise in offline reinforcement learning (RL), yet optimizing these iterative policies via Backpropagation Through Time (BPTT) is unstable. While prevailing paradigms circumvent this by distilling multistep flows into single-step approximations, such methods may limit the benefits of iterative refinement. To avoid these sacrifices, we propose Direct Flow Q-Learning (DFQL), a streamlined framework that attains superior results by optimizing flow matching policies without BPTT or distillation. DFQL derives a surrogate objective that directly injects terminal Q-value gradients as a guidance term into each step velocity field, ensuring stable optimization while preserving iterative expressive capacity. Across 73 challenging tasks in OGBench and D4RL, DFQL achieves state-of-the-art results. Additionally, DFQL extends seamlessly to the offline-to-online setting, delivering substantial performance gains without further modification.

Abstract:
Connectionist models and symbolic models have long embodied two divergent paradigms: the former excel at differentiable representation learning yet struggle with transparency, while the latter deliver explicit rulebased reasoning but resist gradient-based optimization. We introduce Arboreal Neural Networks (ArbNN), a neural–symbolic framework that unifies these paradigms both computationally and conceptually. At the design level, ArbNN departs fundamentally from prior neuralized-tree models through a depth-aware routing mechanism and a topology-informed softmax aggregation, which together enable one-shot full-path gradient propagation and consequently achieving rapid and well-conditioned optimization dynamics and high parallel inference efficiency. At the conceptual level, ArbNN reveals that decision-tree branching and self-attention routing are two realizations of the same conditional computation primitive. We prove a structural isomorphism between a decision tree and a single-query attention head, enabling a differentiable architecture that faithfully preserves symbolic decision logic. The defining property of ArbNN is Bidirectional Fidelity, ensuring that the neural module can be compiled from—and losslessly decompiled back into—a symbolic tree, yielding both ordering consistency in ranking behavior and explicit, auditable interpretability via reconstructed if–else rules. ArbNN further supports GBDT-based initialization, allowing it to inherit strong inductive biases and integrate seamlessly with existing production workflows. Empirically, ArbNN achieves state-of-the-art performance on various public tabular benchmarks and delivers consistent gains under temporal distribution shift in large-scale industrial credit-risk systems. To support realistic evaluation, we additionally contribute TabCredit, a feature-rich, temporally partitioned dataset built from millions of real-world loan applications. Together, these results demonstrate that ArbNN forms a unified, reversible, and practically deployable bridge between symbolic reasoning and neural computation for high-stakes tabular domains.

Abstract:
Tabular data synthesis is critical for privacypreserving data sharing and augmentation, yet existing diffusion models rely on implicit attention mechanisms to capture inter-column relationships. We introduce Geometry-Aware Tabular Diffusion, which augments diffusion models with explicit pairwise geometric features - angles and lengths - computed directly from column value differences. Our method achieves state-of-the-art performance on standard benchmarks while using 3.5 times fewer parameters on average (up to 25 times for classification tasks) than transformer-based approaches. On ten datasets, we win on 8/10 for Shape (marginal fidelity) with 27% error reduction, 7/10 for Trend (correlation preservation) with 20% error reduction, and 9/10 for downstream utility (F1/RMSE). These results demonstrate that explicit relational structure can substitute for model capacity, enabling state-of-the-art tabular synthesis with simple, efficient architectures.

Abstract:
Influence functions are a standard tool for data debugging and unlearning, but they become impractical for highquery subset workloads such as large-K cross-validation, repeated resampling, or interactive what-if analysis as each subset query typically requires an expensive inverse-curvature solve. We introduce Clustered Influence Functions (CiF), which turns subset influence into an amortized subset oracle. We build a compact cache once by clustering training gradients, solve a damped Generalised Gauss-Newton system only for cluster means, and answer new subset queries by a linear recombination using cluster membership counts. This yields per-query cost of O(Cp) linear in the cache size C, and the number of model parameters p. We further provide a diagnostic error bound that decomposes approximation error into a clustering scatter term and a solver residual term, making the accuracy-compute tradeoff explicit through the cache budget and solver tolerance. Evaluations across MNIST, CIFAR-10, show that CiF matches per-query influence rankings while significantly reducing the total runtime in high-Q regimes, enabling influence-based workflows that are otherwise computationally prohibitive.

Abstract:
Neural operators have emerged as powerful surrogates for solving PDEs, significantly accelerating scientific computation. While transformerbased architectures offer unmatched flexibility for irregular domains, they suffer from a fundamental efficiency gap: standard attention mechanisms assign uniform interaction budgets to all token pairs, neglecting the physical reality that far-field interactions are often compressible. To address this mismatch, we draw inspiration from classical fast solvers that exploit hierarchical near-far decompositions. We further observe that embedding such tree-structured hierarchies in Euclidean space incurs inherent distortion, whereas hyperbolic space naturally accommodates exponential branching. Consequently, we propose Hyperbolic Neural Operator (HNO), which leverages intrinsic hyperbolic geometry to instantiate a continuous Gibbs kernel based on stabilized geodesic distances on the Lorentz hyperboloid. This design imposes a geometric inductive bias that naturally enjoys robust multi-scale routing akin to the Fast Multipole Method (FMM), yet within a unified, learnable attention mechanism. Empirically, HNO achieves state-of-the-art accuracy on six PDE benchmarks and two large-scale unstructured CFD tasks, reducing the mean relative \ell_2 error by up to 40% compared to leading baselines. Codes are attached and will be available online.

Abstract:
Clustering in a big data setting is an intensively studied problem, with coresets emerging as one of the important paradigms in this line of work. Given a cost function \textcost(P,S) mapping input points P and a solution S to an objective value, a coreset is a typically weighted subset \Omega\subseteq P such that \textcost(\Omega,S)\approx \textcost(P,S). For example, the Euclidean kmeans problem, arguably the most widely studied problem in this line of work, admits a coreset of size \tildeO(k\varepsilon^-2\min(\sqrtk,\varepsilon^-2)) points while preserving the k-means cost for \emphany candidate solution up to a (1\pm \varepsilon) factor [CLSSS NeurIPS 2022]. While this bound is reasonably small, most empirical work on coresets suggest that smaller coreset sizes are sufficient. In this paper, we offer an explanation of this phenomenon. We show that a coreset size of \tildeO(k \varepsilon^-3) is sufficient that retains the approximation guarantee up to a (1+\varepsilon) factor of any approximation algorithm used to compute a solution. These \emphapproximation preserving coresets have a weaker guarantee than that of strong coresets, which apply to all solutions, while having stronger guarantees than weak coresets which only apply to the optimum solution. Thus, in some sense, worst case solutions inducing large strong coresets are solutions that most reasonable algorithms will not consider. We further extend the notion of approximation preserving coresets to \empharbitrary metrics, showing that the approximation guarantee can be retained up to a factor 4+\varepsilon with a coreset of size \tildeO(k\varepsilon^-2). We complement this result by showing that a very small distortion on the approximation factor cannot admit coresets of this size. Our implementation with popular approximation algorithms such as k-means++ and local search confirm our theoretical findings also in practice.

Abstract:
Synthesizing realistic and diverse anomalous samples from limited data is vital for robust model generalization. However, existing methods struggle to reconcile fidelity and diversity, often hampered by distribution misalignment and overfitting, respectively. To mitigate this, we introduce Anomaly Preference Optimization (APO), a novel paradigm that reformulates anomaly generation as a preference learning problem. Central to our approach is an implicit preference alignment mechanism that leverages real anomalies as positive references, deriving optimization signals directly from denoising trajectory deviations without requiring costly human annotation. Furthermore, we propose a TimeAware Capacity Allocation module that dynamically distributes model capacity along the diffusion timeline— prioritizing structural diversity during highnoise phases while enhancing fine-grained fidelity in low-noise stages. During inference, a hierarchical sampling strategy modulates the coherencealignment trade-off, enabling precise control over generation. Extensive experiments demonstrate that significantly outperforms existing baselines, achieving state-of-the-art performance in both realism and diversity.

Abstract:
Artificial Neural Networks (ANNs) are powerful tools for complex decisionmaking tasks. While existing activation mechanisms often promote sparsity through thresholding, they lack explicit awareness of feature channel relevance, causing networks to continuously suffer from interference by noisy channels. Such irrelevant activation signals can propagate through the network and adversely affect the final decision. Inspired by observations that channel relevance can be reflected in both intrinsic activity levels and extrinsic decision weights, and that there is strong consensus between these two aspects, we propose AIEC (Activation with Intrinsic-Extrinsic Consensus), a novel activation mechanism that has the ability to identify and suppress irrelevant feature channels during training. With a basic threshold activation, AIEC leverages an intrinsic Activation-Counting Unit that tracks channel activation statistics, an extrinsic Decision-Making Unit that learns channel decision weights, and a Consensus Gatekeeping Unit that suppresses irrelevant channels based on the agreement between intrinsic and extrinsic channel relevance assessments. Extensive experiments demonstrate that AIEC can effectively suppress irrelevant channels and encourage sparser representations. Furthermore, AIEC is compatible with a wide range of mainstream ANN architectures and achieves superior performance compared to existing activation mechanisms across multiple tasks and domains.

Abstract:
Physicsinformed neural networks (PINNs) enforce physical laws by minimizing partial differential equation (PDE) residuals and auxiliary constraints. Standard training relies on a mean-squared error (MSE) objective, which implicitly assumes independent Gaussian residuals with a fixed global variance. We show theoretically and empirically that residuals encountered during PINN training are heterogeneous and heavy-tailed, revealing a systematic mismatch with this assumption. As a consequence, a small number of large residuals can disproportionately dominate both the loss and gradient, leading to poorly balanced optimization dynamics. Motivated by this mismatch, we adopt a Student-t residual model to explicitly capture heavy-tailed behavior. An equivalent hierarchical representation yields an expectation–maximization (EM) algorithm that alternates between estimating residual-dependent weights and optimizing network parameters via a weighted MSE objective, allowing existing PINN solvers to be reused in the M-step. The resulting training dynamics bound the influence of extreme residuals and admit almost sure convergence guarantees under standard stochastic optimization assumptions. Experiments across a diverse suite of challenging PDE benchmarks demonstrate consistently improved solution accuracy and robustness compared to standard PINN training.

Abstract:
Social interactions are characterized by both adversarial and cooperative aspects. Communications between agents may also involve adversarially motivated actors. Messages may pass through intermediaries of malign intents before reaching the intended receiver. These actors may modify the message to induce misunderstanding from the receiver while preserving the overall characteristics of the message. This form of misinformation is prevalent in realworld communications and may affect the dynamics under which communication protocols are developed. However, this aspect of social interaction is relatively underexplored in many studies of the emergent communication field, which aims to understand the environmental factors behind the emergence of languages' characteristics. This work explores how misinformation affects language emergence with a focus on compositionality. We design a communication game containing a malign intermediary between the sender and receiver. We find that risks of malign misrepresentation promote the emergence of compositional languages in simulations of communicative agents. Furthermore, we observe that adaptability of malign intermediaries is a crucial factor in forming a pressure towards compositionality and that partial misinformation where the intermediary targets only a subset of attributes can also induce compositionality.

Abstract:
Existing 3D Visual Question Answering (3DVQA) methods rely on generative paradigms, producing ambiguous descriptions that hinder deterministic decision-making. We introduce 3D Scene Assertion Verification, a task requiring models to verify natural language assertions in 3D scenes with strict binary judgments. To enable rigorous evaluation, we present 3DSAV, the first large-scale diagnostic benchmark comprising 22.5k samples tailored for this objective. We observe that current 3D-VQA models struggle in this setting due to a lack of specialized reasoning mechanisms. To address this, we propose DualLPSS. This framework uses a dual-stage routing mechanism to enable type-aware cross-modal fusion and scene-guided assertion focusing. Extensive experiments show that DualLPSS achieves state-of-the-art performance on 3DSAV, distinguishing itself by correctly handling complex logical assertions where baselines fail. The code and dataset will be made publicly available.

Abstract:
Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite its behavioral cloning expressiveness, FMbased policies are inherently limited by their lack of environmental interaction and exploration. This leads to poor generalization in unseen scenarios beyond the expert demonstrations, underscoring the necessity of online interaction with environment. Unfortunately, optimizing FM policies via online interaction is challenging and inefficient due to instability in gradient computation and high inference costs. To address these issues, we propose to let a "student'' policy with simple MLP structure to explore the environment and be online updated via RL algorithm with a reward model. This reward model is associated with a "teacher'' FM model, containing rich information of expert data distribution. Furthermore, the same "teacher'' FM model is utilized to regularize the "student'' policy's behavior to stabilize policy learning. Due to the student’s simple architecture, we avoid the gradient instability of FM policies and enable efficient online exploration, while still leveraging the expressiveness of the teacher FM model. Extensive experiments shows that our approach significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data.

Abstract:
Autoregressive models generate sequences monotonically, where any sampled token, even if erroneous or suboptimal, becomes a permanent condition for all subsequent steps. This structural limitation means that autoregressive models cannot revisit or revise earlier decisions, i.e., a capability essential for complex generation tasks where exploration and correction are necessary. To this end, we propose N-MARS, a Non-Monotonic AutoregRessive Sequence modeling framework that enables models to generate, evaluate, and revise tokens within a single forward pass, effectively allowing exploration before commitment. We operationalize this framework through a learned erase token that retracts the previous token, enabling on-the-fly revision within standard autoregressive decoding. To train the model, we introduce a sequence augmentation method that constructs error-correction trajectories from model-generated deviations paired with ground-truth references. We then propose masked supervised fine-tuning (mSFT) that exposes the model to errors as context for learning when to revise, without optimizing their likelihood. Finally, we refine the model with group relative policy optimization (GRPO) that incentivizes judicious usage, i.e., rewarding effective corrections while penalizing unsuccessful ones. We conduct comprehensive theoretical and empirical analysis to validate the effectiveness of N-MARS, demonstrating a robust foundation for non-monotonic sequence modeling.

Abstract:
Compositional diffusion planners enable robotic decisionmaking beyond the horizon of training trajectories. Yet, current approaches often rely on the heuristic stitching of local predictions. We demonstrate that this induces a non-conservative vector field that does not mathematically correspond to any valid global trajectory log-density function. We propose Energy-based Compositional Diffuser (ECD), a framework that formulates the global trajectory as the minimizer of the sum of local bridge potentials. This energy-based perspective guarantees a conservative update field by construction and reveals a critical endpoint reaction term that is missing in heuristic stitching methods. To enable efficient inference, we further introduce a Markov-based score approximation that computes the reaction term though a single block-tridiagonal solve, maintaining time complexity linear in the planning horizon. Empirically, ECD achieves state-of-the-art success rates on a range of OGBench stitching tasks, while nearly matching the inference speed of heuristic stitching methods.

Abstract:
The statetransition (decay) matrix governs how fixed-size memory is updated and used, making it a core design in linear attention models. Prior work exploits rank-1 approximations to reduce the cost of constructing decay matrices, but this low-rank constraint also limits the expressive capacity. We therefore formulate decay-matrix design as an open optimization problem: maximizing expressiveness while introducing minimal additional cost. Inspired by the multi-head mechanism, we propose Head-in-Head, which introduces an additional mask matrix to structure memory partitioning and interactions within a single linear-attention head. This simple, generic, and efficient design: \romannumeral1) enables a rank-r approximation of the decay matrix with only a few extra parameters and \romannumeral2) strengthens intra-head information interaction. We further develop mask normalization and a chunk-wise parallelization scheme to support efficient parallel training. Extensive experiments on synthetic benchmarks and language modeling tasks, together with visual analyses, show that Head-in-Head consistently improves baseline performance by enriching information diversity and strengthening intra-head interactions.

Abstract:
Distributional learning problems optimize discrepancies between probability measures, including optimal transport or Sinkhorn divergence, yet are typically optimized using Euclidean firstorder methods in parameter space. We show this mismatch is structural rather than algorithmic. We introduce geometry-misalignment, a local condition number that measures distortion between Euclidean geometry and the intrinsic geometry induced by a distributional objective. For a broad class of problems, we establish lower bounds demonstrating that Euclidean first-order methods incur an unavoidable convergence slowdown proportional to misalignment, even under intrinsic strong convexity and smoothness. We further prove geometry-aware preconditioned methods attain matching upper bounds independent of misalignment, yielding a sharp separation between Euclidean optimization and geometry-aware optimization. Beyond convergence rates, we show geometry-misalignment induces an optimization-dependent excess risk term under finite budgets, directly linking optimization geometry with statistical efficiency. We develop a geometry-calibrated optimization framework that estimates misalignment and selectively activates geometry-aware updates when necessary. Experiments on distribution matching for domain adaptation validate the theory, with improvements concentrated in high-misalignment regimes and negligible overhead.

Abstract:
We introduce a recursive AlphaZero style MonteCarlo tree search algorithm, "RMCTS". It first generates the search tree using prior policies, and then recursively re-estimates action values by using the regularized optimal posterior policies from ``Monte--Carlo tree search as regularized policy optimization'' (Grill et al., 2020) at each node of the search tree, starting from the leaves and working back up to the root. We find that RMCTS matches or exceeds the quality of AlphaZero's MCTS-UCB in a tiny fraction of the time.

Abstract:
Label Distribution Learning (LDL) is an effective learning paradigm for predicting entire conditional label distributions, improving the trustworthiness of predictions in risksensitive tasks. Although previous LDL methods achieve satisfactory performance on conventional evaluation metrics, they generally overlook the divisiveness within label distributions, i.e., the propensity of label distribution to exhibit dissension between semantically opposing labels, which is an essential indicator of the practical decision risk. Therefore, we propose a divisiveness‑consistent label distribution learning framework to quantify and preserve the divisiveness information. First, we formalize a divisiveness measure that satisfies the axiomatic property of polarity monotonicity to quantify the divisiveness information. Second, we theoretically demonstrate the inconsistency between conventional loss functions and divisiveness error. Besides, in order to address the adversarial gradient problem arising from directly minimizing the divisiveness error, we propose a pairwise divisiveness loss as an unbiased estimator of the original divisiveness error. Experiments confirm the effectiveness of the proposed method.

Abstract:
Generating objects with specific symmetries is essential in various realworld scenarios, with the creation of patterns exhibiting planar group symmetries being a representative task. However, adapting existing 2D continuous representations to enforce symmetry remains a significant challenge, as the transformation of non-reflective group elements may disrupt continuity. To overcome this limitation, we propose a novel symmetric continuous representation framework for arbitrary planar groups. Our method transforms any underlying 2D continuous representation into a symmetric one while strictly preserving continuity. We provide the mathematical formulation of this representation, demonstrate its universal approximation capability for symmetric functions, and detail the construction methodology. We validate our approach through three distinct generation tasks: general symmetric pattern design, connectivity-preserving paper-cutting design, and mechanically-constrained geometric stylized patterns. Experimental results confirm that our representation enables precise and effective symmetry control in pattern generation.

Abstract:
Recent bytelevel language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their practical deployment is limited by slow inference. In this work, we enhance the Byte Latent Transformer (BLT) using new training and inference techniques. First, we introduceBLT Diffusion (BLT-D), a new model and our fastest BLT variant. BLT-D is trained with an auxiliary block-wise diffusion objective over byte blocks alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially improving decoding efficiency. Second, we propose two extensions inspired by speculative decoding that trade some speed for improved quality:BLT Self-speculation (BLT-S), a faster generation method for BLT in which it speculates bytes beyond its normal patch boundaries and verifies its own generations; andBLT Diffusion+Verification (BLT-DV), which enhances BLT-D by adding an autoregressive verification step after diffusion-based generation. Each approach offers its own unique advantages, and together, they overcome key barriers to large-scale deployment of byte-level LMs.

Abstract:
Modern Hopfield Networks (MHNs) have achieved widespread success across various domains but are confined to Euclidean/Hilbert spaces, failing to preserve the hierarchical structure of data due to geometric constraints—arbitrary tree structures cannot be embedded with low distortion, while hyperbolic spaces can naturally accommodate hierarchical structures through exponential volume growth. To address this issue, we propose Hyperbolic Associative Memory Networks (HAMNs), the first framework to embed modern associative memory into hyperbolic space: we map query and memory vectors from Euclidean space to a constant negative curvature manifold via exponential maps, define a regularized energy function based on the Minkowski inner product, and adopt curvatureaware Riemannian optimization combined with exponential map updates to achieve stable on-manifold retrieval. We put forward a hierarchy-sensitivity hypothesis—HAMNs outperform Euclidean MHNs on data with deep hierarchies but exhibit comparable performance on data with weak or shallow hierarchies, which is validated by depth-controlled experiments and cross-level metrics. As a plug-and-play, model-agnostic module, HAMNs are suitable for the storage and retrieval of representations in task architectures requiring hierarchical understanding, instantiated with the Poincaré ball in experiments, and also applicable to any hyperbolic model with constant negative curvature.

Abstract:
Steplevel caching offers a promising avenue for accelerating diffusion models by exploiting temporal redundancy. However, existing strategies predominantly rely on heuristic, threshold-based metrics to trigger cache updates. This reactive paradigm is inherently myopic as it optimizes only for local feature consistency, and yields unpredictable deployment latency. In this work, we propose BudCache, a budget-constrained optimization framework that inverts this standard: instead of letting error thresholds dictate the cost, we enforce a strict computational budget and globally search for the caching policy that maximizes generation fidelity. To tackle the combinatorial complexity of step selection, we employ a hybrid strategy combining Simulated Annealing with deterministic Hill Climbing. This approach efficiently escapes local optima to locate globally optimized cache masks within minutes, incurring zero inference overhead. Crucially, to address the trajectory drift induced by aggressive caching, we introduce a cache-aware schedule alignment mechanism. By refining the time discretization via a lightweight, data-free distillation, we significantly enhance performance in low-NFE regimes. Extensive experiments on FLUX.1-dev and Wan2.1 demonstrate that BudCache consistently outperforms heuristic baselines, achieving superior perceptual quality under rigid latency constraints.

Abstract:
In multiview clustering (MVC), conventional anchor learning based models implicitly assume a uniform distribution of anchors across clusters, which could lead to inferior representation, especially when clusters vary significantly in size, as larger clusters require more anchors so as to adequately capture their intrinsic structural complexity. To alleviate this, we design a method termed FCFMVC that explicitly encourages proportional anchor allocation. To be specific, we transfer anchor allocation to discrete sample-cluster learning via bipartite graph bridge, and then backpropagate cluster state consisting of size and dispersion degree to guide anchor assignment. This allows the model to integrate cluster cardinality awareness and structural compactness directly into anchor distribution. On the other hand, we regard anchors as pseudo-samples, introduce an anchor-cluster indicator matrix on each view, and directly constrain the number of anchors assigned to each cluster within a tolerance margin. These two paths are further coupled through anchor-sample label alignment, and collaboratively facilitate anchor generation from fine-grained (anchor-level) to coarse-grained (cluster-level) structures. Besides, the entire optimization operation with linear time and space cost makes FCFMVC well-scalable to large-scale tasks. Experiments on datasets with diverse scales confirm the effectiveness of our FCFMVC.

Abstract:
Transformers pretrained via next token prediction learn to factor their world into parts, representing these factors in orthogonal subspaces of the residual stream. We formalize two representational hypotheses: (1) a representation in the product space of all factors, whose dimension grows exponentially with the number of parts, or (2) a factored representation in orthogonal subspaces, whose dimension grows linearly. Both track contextinduced uncertainty over the latent parts, but the factored representation sacrifices fidelity when factors are not conditionally independent. We derive precise predictions about the geometric structure of activations for each, including the number of subspaces, their dimensionality, and the arrangement of context embeddings within them. We test between these hypotheses on transformers trained on synthetic processes with known latent structure. When factors are conditionally independent, models learn factored representations; when noise or dependencies break this structure, models gradually expand their effective dimensionality over training to recover fidelity. This provides a principled explanation for why transformers decompose the world into parts, and suggests that interpretable low dimensional structure may persist even in models trained on complex data.

Abstract:
Bayesassisted conformal prediction combines the strengths of Bayesian modelling with exact, distribution--free frequentist coverage guarantees. While validity holds even under model misspecification, the size of the prediction sets can degrade significantly when the prior is poorly aligned with the observed data. We address this limitation by introducing RoBAS: a novel Bayes--assisted nonconformity score which is motivated by a hierarchical Bayesian working model with heavy--tailed priors, and which we implement in practice via a computationally tractable empirical Bayes instantiation. Our proposed method is adaptive to the quality of the available working information in the prior. When reliable prior information is available and can be effectively encoded, we achieve set sizes lower than that of other sets with the same coverage. On the other hand, when such information is weak or inaccurate, our nonconformity scores revert to the Distance--To--Average score, a robust baseline that is well--suited to settings where accurate prior information is not available. We evaluate our method on tabular and image regression tasks in the setting where there exists distribution shift between the training and calibration/test data. We find that our approach is competitive with widely used nonconformity scores in the absence of distribution shift, while providing significant gains in the more challenging setting of distribution shift.

Abstract:
Energybased models (EBMs) provide a flexible framework for generative models with strong distribution modeling capabilities. Nevertheless, their broader adoption has been limited by the difficulty of stable and efficient training. In this paper, we propose a unified and efficient latent-guided cooperative EBM that leverages informative target latent variables to guide the joint energy in capturing both data distribution and semantic structure, along with a cooperative generator designed for effective MCMC initialization. Our joint space optimization only requires MCMC sampling in the data space, and allows the energy to learn semantic data–latent relationships directly from real data. Experiments show our method improves generation quality and training stability with fewer resources, and performs effectively across multiple downstream tasks.

Abstract:
Hyperbolic metric learning is highly effective in embedding hierarchical data structures. However, past work has predominantly focused on the conformal Poincaré model, leaving other geometries like the Klein model largely underexplored. In addition, the curved geodesics of the Poincaré model present a fundamental geometric misalignment with the linear projections dominating the feature transformation steps in the modern neural network backbones. In this paper, we investigate the Klein model, a projective model of hyperbolic geometry whose straight-line geodesics offer a structurally aligned alternative in linear encoders, for hyperbolic metric learning. By formalizing a framework based on Einstein gyrovector operations, we derive a numerically stable metric learning approach that mitigates the inherent optimization challenges of the Klein model. Extensive experiments on multiple image datasets for fine-grained image classification task show that the Klein model not only serves as a viable alternative to the Poincaré model but also achieves highly competitive performance by leveraging its unique geometric properties, without increasing parameter complexity. Our empirical findings establish the Klein model as an efficient geometric prior for hyperbolic metric learning.

Abstract:
Modelbased reinforcement learning (MBRL) improves sample efficiency by leveraging learned dynamics models, but often suffers from unstable training due to dynamics model learning mismatch: models are trained on data from historical policies while being queried under the continually updated current policy. This mismatch can cause policy-relevant local model error to remain large even as global prediction error decreases, leading to oscillatory updates. We present a finite-horizon performance analysis that decomposes the policy performance gap into global model error, policy-induced distribution shift, and historical policy mixture effects, showing that minimizing global error alone is insufficient for stable optimization. Motivated by this analysis, we propose Prioritized Model Experience Replay (PMER), a lightweight replay mechanism that prioritizes high-error transitions during dynamics model training. PMER implicitly emphasizes policy-relevant regions without explicit policy distance estimation and integrates seamlessly into Dyna-style MBRL frameworks. Experiments on MuJoCo benchmarks demonstrate improved stability, faster convergence, and higher sample efficiency.

Abstract:
Hyperbolic geometry has recently emerged as a powerful representation space for multimodal learning, as it naturally captures hierarchical semantic structure across modalities. Despite this progress, how such representations behave under continual learning poses fundamentally different challenges that remain underexplored. This work provides a geometric perspective on this problem and establishes a theoretical foundation for representation preservation in hyperbolic space, showing that preventing forgetting requires crossmodal invariance under a shared hyperbolic isometry. Guided by these insights, a principled continual learning framework is derived that preserves essential geometric structure while allowing effective adaptation to new tasks. Experiments on continual multimodal benchmarks corroborate the effectiveness of the proposed approach.

Abstract:
We study the problem of generating structurally diverse graphs on N unlabeled vertices. Given a space of such graphs S_N, a metric d, and a target cardinality k, the objective is to construct a set \mathcalG \subset S_N that maximizes pairwise diversity under d. While neural generative models may appear appealing as a solution, standard approaches require samples from a target distribution that does not exist for dispersion problems. As a result, prior work is limited to bruteforce combinatorial or iterative search methods. We instead treat diversity as an explicit optimization objective, an approach we term Neural Graph Dispersion. An ensemble of generators is optimized under a repulsive potential, producing diverse graphs by sampling along optimization trajectories as they disperse over (S_N,d). Moreover, this approach allows us to generate an initial diverse graph set and, when desired, refine it under bespoke graph distances with minimal overhead. Extensive experiments show our method produces highly diverse graphs while scaling efficiently with respect to N and k.

Abstract:
Flow map matching (FMM) enables oneand few-step sampling for diffusion-style generation, yet its performance is often hindered by the mismatch between ground-truth training transitions and model-induced flow maps. We propose Contrastive Flow Map Matching (CFMM), a principled framework that explicitly aligns FMM training with practical sampling. Our approach is grounded in a theoretical upper bound on the reverse KL divergence, which decomposes the distributional gap into a marginal mismatch over intermediate states and a conditional mismatch in endpoint reconstruction. This analysis motivates two complementary objectives: average-velocity regression for marginal alignment and a sampling-aligned InfoNCE contrastive loss for conditional refinement. CFMM is a training-only plug-in for pre-trained FMMs, incurs no inference-time overhead, and supports training FMMs from scratch. Experiments on CIFAR-10, ImageNet, and LSUN across multiple FMM baselines demonstrate consistent improvements in fidelity and perceptual quality with only modest additional training cost.

Abstract:
Modern deep neural networks often contain far more parameters than needed to fit their training data, yet they achieve impressive generalization. A common explanation of this success is the implicit bias of stochastic gradient descent (SGD). An alternative volume hypothesis posits that, within low trainingloss regions, loss-landscape basins leading to strong generalization occupy much larger regions of weight space than basins that generalize poorly, and therefore SGD is simply more likely to land in the former. Recent experimental explorations of this idea present seemingly contradictory results. While in one set of experiments randomly sampling the network weights until achieving zero training error yielded poor generalization, molecular-dynamics density estimates supported the volume hypothesis. We observe that these experiments were performed at different dataset size regimes, and explore an intermediate regime using the Replica Exchange Wang–Landau algorithm to estimate the joint density of states over training and test accuracies in binary networks. Across several architectures and datasets, we show that the generalization advantage of SGD over random sampling training diminishes as the training data size grows, suggesting a resolution of the paradox.

Abstract:
We study the stability properties of feature bagging, an ensemble technique that improves robustness by training each learner on a randomly selected subset of features. We introduce feature stability (FS), a notion that quantifies the sensitivity of an algorithm’s output to the removal of a single feature. This notion complements classical instance stability (IS) and together provides a more comprehensive framework for evaluating algorithmic stability. Within this framework, we analyze feature bagging in both a parametric linear model and an assumptionfree setting inspired by recursive feature subsampling in random forests. In both cases, we establish formal stability guarantees showing that feature bagging strictly outperforms its non-bagged counterpart, with larger gains achieved at smaller subsampling ratios, and that only a modest number of bagging rounds is sufficient to attain near-optimal stability.

Abstract:
When gradient information is unavailable, blackbox optimization (BBO) methods provide a practical alternative. While Evolution Strategies (ES), Consensus-Based Optimization (CBO), Optimization via Integration (OVI), and related methods have each been studied independently, their connections remain underexplored. We unify these approaches within a common theoretical framework, revealing that they differ primarily in two design choices: fitness aggregation (controlling sharpness preference) and consensus scope (controlling modality). Leveraging these insights, we introduce hybrid optimizers that interpolate between existing methods. Our ES-OVI hybrid allows explicit control over the preference for flat minima, enabling a trade-off between performance and robustness in continuous control tasks. Our CBO-OVI hybrids combine the high-dimensional efficiency of parametric methods with the multimodal capabilities of particle-based approaches, achieving competitive results on language model merging under limited evaluation budgets. We validate our methods on standard BBO benchmarks and high-dimensional locomotion tasks, demonstrating that the hybrid methods can outperform their constituent algorithms.

Abstract:
Bayesian predictive inference provides a principled framework for uncertainty quantification, data efficiency, and robust generalization. However, exact inference is often intractable, and scalable approximations may remain computationally expensive or require restrictive modeling assumptions that degrade predictive performance. PriorData Fitted and in-context learning networks have recently emerged as an amortized alternative by learning to map datasets directly to predictive distributions, but existing approaches are tightly coupled to the support of the training prior and lack explicit mechanisms for adapting to new priors at test time, resulting in limited robustness under distribution shift. We introduce a multi-task in-context learning framework for amortized hierarchical Bayesian predictive inference that explicitly represents prior information as a prefix of in-context datasets. A transformer trained on sequences of prior and target tasks learns to adapt its predictions across families of priors. On a suite of evaluations with increasing difficulty, including out-of-meta-distribution heavy-tailed priors and priors with high-dimensional latent structures, our method matches oracle Bayesian predictors while being orders of magnitude faster.

Abstract:
An artificial intelligence must have a model of its environment that iscausal, supporting reasoning about interventions and counterfactuals, and alsocombinatorial, supporting generalization to unseen combinations of objects. In this work, we formally study when and how such a model can be learned. We developrelational structural causal models, extending structural causal models (Pearl 2009) to settings where objects and their relations vary. First, we show how answers to not only causal but also observational queries about unseen combinations of objects can not be identified without further assumptions. To enable such identificationincluding in the presence of unobserved confounding---we definerelational causal graphsand derive symbolic identification criteria. Finally, we proposerelational neural causal models, a provably correct approach that outperforms non-relational baselines on simulated traffic scenes with varying cars, signals, and pedestrians.

Abstract:
Transformer architectures have dramatically advanced representation learning and inference in deep models through selfattention mechanisms. In parallel, associative memory (AM) frameworks map representations onto energy landscapes, offering interpretable retrieval mechanisms. However, their continuous-time inference dynamics lack the biological plausibility of classical Continuous Attractor Neural Networks (CANNs). To bridge this gap, we proposeControlled Dynamics Attractor Transformer (CDAT), which couples a mixture von Mises–Fisher (Mo–vMF) attention energy with a Hopfield refinement energy, while augmenting energy descent with a CANN-inspired excitation–inhibition modulation.CDATinstantiates a topology-constrained dynamical system whose couplings encode relational structure among tokens, thereby linking attractor-style dynamics to modern energy-based attention. We further provide a constructive dissipation analysis to formally establish their controlled inference dynamics. Benefiting from these robust and structured dynamics,CDATachieves state-of-the-art performance across multiple benchmarks in graph anomaly detection and graph classification.

Abstract:
Reward models are central to Reinforcement Learning from Human Feedback (RLHF), especially for openended tasks where evaluation is inherently multi-dimensional. Recent Generative Reward Models (GRMs) improve interpretability by producing natural-language rationales and task-specific evaluation principles. However, most existing GRMs generate principles after reading the actor's response, i.e., Q+R \rightarrow P. We show that this coupling induces Principle Drift: when the actor performs reward hacking (e.g., verbosity, self-aggrandizement, or hallucinated self-justifications), the reward model may shift its criteria to rationalize the response, yielding inflated scores that in turn reinforce hacking during RL. We propose IP-GRM (Independent Principle GRM), a two-stage framework that first generates principles solely from the question (Q \rightarrow P) and then evaluates the response conditioned on (Q, R, P). This decoupling keeps criteria invariant to response content, producing more objective and stable reward signals. For efficient training, we further introduce a Principle Cache strategy that reuses principles within a group, improving GRPO throughput by 23.66% while maintaining strict intra-group consistency. In GRPO training on creative writing, IP-GRM suppresses reward hacking and improves WritingBench and CreativeWriting-v3 by up to +4.6 and +7.1 points based on Qwen3-8B, achieving state-of-the-art performance among open-source models.

Abstract:
Predicting outputs that are located in nonEuclidean spaces, such as probability distributions, networks, and symmetric positive-definite matrices, is becoming increasingly important in modern data analysis, particularly when inputs are high-dimensional. We propose DeSI (Deep Single-Index Fréchet Regression), a semiparametric framework for regression with metric space-valued outputs and multivariate inputs that assumes a single-index structure for the conditional Fréchet mean. DeSI estimates an interpretable index direction, which quantifies the relative importance of inputs, using a deep neural network, and performs Fréchet regression along the resulting one-dimensional index in the target metric space. This structure mitigates the curse of dimensionality while retaining interpretability, which stands in contrast to standard deep neural networks. We establish theoretical guarantees for DeSI, including consistency and convergence rates, and demonstrate its strong predictive performance through simulations on distributions, networks, and symmetric positive-definite matrices, as well as an application to compositional mood data from New Jersey.

Abstract:
Label ambiguity/polysemy is pervasive in supervised learning, motivating a variety of representations beyond the traditional singlelabel setting. While label distribution (LD) provides a probabilistic description and has attracted increasing attention, we reveal its inherent limitations, including inconsistency with raw data, distortion of inter-sample order, and limited applicability. To address these issues, we introduce generalized label distribution (GLD), a unified representation that can perfectly recover raw data while preserving inter-sample order consistency, transform into existing forms of label representations without information loss, and capture out-of-distribution samples as well as negative label correlations. We further develop GLD learning algorithms and demonstrate their effectiveness through both theoretical analysis and extensive experiments.

Abstract:
Image dehazing, an important image restoration problem, aims to recover clear scene content from images degraded by atmospheric haze. Existing dehazing methods rely on observing the distribution of hazy images during training: supervised approaches typically depend on synthetic datasets, leading to poor generalization in realworld scenarios; unsupervised methods are constrained by the limited diversity of observed haze conditions due to the difficulty of collecting real hazy images, and fail to generalize to unseen haze types. To address these challenges, we propose the first fully zero-shot dehazing framework that is trained without any hazy images. The framework is built upon a set of representations that remain invariant across clean and hazy images to bridge the two domains, which is both theoretically derived and experimentally validated. Consequently, we formulate dehazing as a conditional generative modeling problem and train a diffusion model solely with the invariant representations of the abundant and readily available clean images. During testing, the same representations extracted from hazy images serve as the conditional input to guide the diffusion process toward the clean image distribution. Quantitative analyses verify the effectiveness of the proposed representations, and extensive experiments across various real-world hazy datasets demonstrate our framework’s remarkable generalization ability, significantly outperforming existing methods. Our code will be available after the review process.

Abstract:
Conventional autoregressive models achieve causality through nexttoken prediction, but suffer from prohibitive latency and typically under-perform non-causal alternatives such as masked prediction and diffusion. We propose self-token prediction, which enables predicting a flexible number of tokens per step, and introduce AGARIC, the first spatially causal image generator built on this paradigm. AGARIC delivers markedly faster inference and consistently outperforms prior autoregressive baselines, matching the performance of state-of-the-art non-causal models. Our findings point to self-token prediction as a key step toward unified and efficient multimodal autoregressive modeling.

Abstract:
Importance sampling is a Monte Carlo method which designs estimators of expectations under a target distribution using weighted samples from a proposal distribution. When the target distribution is complex, such as multimodal distributions in highdimensional spaces, the efficiency of importance sampling critically depends on the choice of the proposal distribution. In this paper, we propose a novel adaptive scheme for the construction of efficient proposal distributions. Our algorithm promotes efficient exploration of the target distribution by combining global sampling mechanisms with a delayed weighting procedure. The proposed weighting mechanism plays a key role by enabling rapid resampling in regions where the proposal distribution is poorly adapted to the target. Our sampling algorithm is shown to be geometrically convergent under mild assumptions and is illustrated through various numerical experiments.

Abstract:
Salient Object Ranking (SOR) aims to study how humans visually explore complex scenes by predicting an ordered sequence of objects that attracts our attention. Existing SOR approaches typically model this ranking deterministically, assuming a single, fixed ranking sequence of attention. However, such deterministic SOR fails to capture the true nature of human attention. We observe that human attention shifts exhibit variability and stochasticity, showing that the next object of fixation is not a definitive choice but rather a probability distribution. Yet, existing SOR methods and evaluation metrics do not account for this inherent randomness. To address this fundamental problem, we propose ProbSOR, a novel Probabilistic Salient Object Ranking model that explicitly learns the uncertainty of attention shifts by incorporating Group Relative Policy Optimization (GRPO). We leverage a VisionLanguage Model (VLM) as the foundation for ProbSOR to identify salient objects and infer their ranked order, utilizing a segmentation decoder for precise object extraction. We also propose a new metric tailored to ProbSOR, as existing SOR metrics only support deterministic rankings. Further, we construct a ProbSOR dataset comprising 15,000 probabilistic SOR samples, to support both model training and evaluation. Extensive experiments demonstrate that ProbSOR achieves strong performances in salient object ranking under both our proposed and traditional benchmarks.

Abstract:
We present a Liealgebraic approach to model Koopman dynamics that integrates algebraic structure with computational scalability. The proposed formulation constrains the neural generators to evolve within prescribed Lie subalgebras and constructs finite-time flows through a neural Magnus expansion construction, thereby maintaining consistency with the associated Lie-group composition over each time segment. To address the computational burden inherent in sequential propagation, we exploit the associativity of Lie-group compositions and construct segmentwise propagators via a prefix-scan algorithm, which reduces the depth of temporal composition from linear to logarithmic. Consequently, the framework enables accurate long-horizon prediction while improving computational efficiency, and provides a principled foundation for scalable Koopman operator learning for nonlinear systems.

Abstract:
Mixtureof-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose \phi-balancing, a principled framework that directly targets population-level expert balance by minimizing a Schur-convex potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, \phi-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.

Abstract:
A digital twin (DT) is a virtual model of a realworld system that can assist decision-making by simulating scenarios induced by different policies. However, the typical design process of machine learning-based DTs does not optimise for this objective. We prove that, when model capacity is limited, typical DT training paradigms, which minimise one-step transition errors, can produce suboptimal models for ranking sets of policies. We further show that this holds empirically, even with expressive model classes. To address this, we introduce DT^2, a decision-targeted DT training paradigm. DT^2 uses off-policy evaluation methods to estimate values of candidate policies on offline data, and encourages the DT to generate rollouts that preserve pairwise policy rankings derived from these proxy ground-truths with an architecture-agnostic loss function. We empirically demonstrate the efficacy of our method across a range of settings and architectures, showing that DT^2 consistently improves policy ranking and reduces decision regret relative to conventional DT training, both for policies used during training and for unseen policies, while maintaining a good level of raw simulation fidelity.

Abstract:
Traditional comorbidity scores (e.g., Charlson and Elixhauser) are widely used for risk adjustment and patient stratification, but they have two key limitations: they are largely mortalitycentric and do not align well with other outcomes, and their linear, rule-based structure cannot capture nonlinear, outcome-specific risk relationships. We propose a Machine-Learned Comorbidity Index (MLCI) that maps diagnosis codes to a single scalar by maximizing the normalized Hilbert–Schmidt Independence Criterion (nHSIC) between the score and multiple clinical outcomes. MLCI captures nonlinear risk–outcome dependence and is supported by a novel theory that characterizes when a unified, informative patient ordering can be achieved across outcomes. Empirical results on multiple benchmark electronic health record (EHR) datasets show that MLCI outperforms strong single-index baselines across multiple evaluation metrics.

Abstract:
Induction heads are attention circuits believed to underlie incontext learning in transformers, yet a precise characterization of the estimators they implement remains elusive. We study transformers trained on order-k Markov chains and prove that a two-layer disentangled transformer implements a soft context-matching estimator that aggregates contributions from all partial context matches, weighted exponentially by their degree of overlap. This mechanism admits two complementary smoothing interpretations. First, prepending a beginning-of-sequence (BOS) token induces additive pseudo-counts, recovering Dirichlet-style smoothing. Second, a finite attention temperature enables interpolation across context orders, analogous to Jelinek–Mercer smoothing but with data-dependent weights that adapt to each sequence's local structure. Experiments on trained transformers confirm that learned attention patterns match our theoretical construction and approach Bayes-optimal performance, where hard counting fails. Our results bridge mechanistic interpretability of induction heads with classical statistical smoothing, revealing that transformers learn to regularize in-context estimation rather than simply count.

Abstract:
Conformal prediction converts any blackbox predictor into one with finite-sample, distribution-free coverage guarantees, outputting prediction sets T(x) that contain the true label with probability at least 1-\alpha. To construct these prediction sets, conformal prediction relies on a randomly sampled ``calibration set'' of labeled examples. In many applications, however, this labeled calibration set is costly to collect, creating a tradeoff between upfront labeling cost and downstream utility of the conformal predictor. In this work, we study conformal prediction with costly label queries, where unlabeled examples arrive i.i.d. and labels can be queried one at a time. After m queries, we form a conformal predictor; the upfront cost of this predictor is the calibration set size m, and its efficiency is the expected prediction set size \mathbbE|T_m(X)|. We design an online stopping rule \hatm that automatically balances the upfront cost against conformal efficiency while preserving the original conformal guarantee. Theoretically, we show that under mild regularity assumptions, the expected total cost of our stopping rule matches the best fixed calibration size in hindsight. Experimentally, we find that our stopping rule reduces cost compared to standard choices of m from the literature by 41.4% \pm 2.3%. Finally, as a concrete application we demonstrate a reduction from CP to the probably approximately correct labeling problem of Candès et al. (2025), under which our stopping rule minimizes the total labeling cost.

Abstract:
Existing time series foundation models (TSFMs), often based on transformer variants, lack adaptability to different sampling rates, struggle with generalization across varying context and target lengths and are computationally inefficient. We introduce FlowState, a novel TSFM architecture that achieves samplingrate–equivariant forecasting through a unified design that pairs a state space model (SSM) encoder with a functional basis decoder (FBD). This design enables continuous-time modeling and dynamic time-scale adjustment, allowing FlowState to inherently generalize across all possible temporal resolutions, and dynamically adjust the forecasting horizons without retraining. We further propose an efficient pretraining strategy that improves robustness and accelerates training. Despite being one of the smallest TSFMs, FlowState achieves state-of-the-art results on the widely used GIFT-Eval benchmark, while demonstrating superior adaptability to unseen sampling rates. Our detailed analyses confirm the effectiveness of its components, and we demonstrate its unique ability to adapt to varying input sampling rates.

Abstract:
Effective demonstration selection is crucial for maximizing large language model (LLM) performance in fewshot in-context learning. Due to influences such as recency bias, the effectiveness of demonstrations depends heavily on their context relationship to the specific query, and on the ordering in which they are presented, making demonstration selection a complex combinatorial problem. To address these two challenges, we introduce AutoSelect, a novel framework that formulates demonstration selection as an auto-regressive sequential decision process. At each step, AutoSelect embeds the query and previously selected demonstrations into matrix representations to preserve structural information, and a trainable policy model sequentially selects the next best exemplar. To navigate the factorial space of demonstration permutations, our framework formulates a Kullback-Leibler (KL) regularized optimization problem, from which an optimal policy induces an optimal Plackett-Luce (PL) ranking over all possible demonstration sequences. We prove that minimizing a tractable policy-level Cross-Entropy (CE) loss provably bounds the worst-case discrepancy between our policy's induced PL ranking and the optimal one, enabling tractable prioritization of high-quality sequences. Empirically, AutoSelect outperforms existing heuristic and learning-based methods across nine diverse datasets, achieving up to an 11% improvement over the strongest baseline. Our results are further supported by analytical studies and a case study, highlighting AutoSelect's key properties, as well as its transferability and generalizability.

Abstract:
Federated bilevel optimization is widely used for nested learning problems across distributed clients, such as federated hyperparameter tuning and metalearning under privacy and communication constraints. Most existing formulations assume fixed client data distributions, which can be violated by performativity, where deployed decisions reshape client behavior and data collection, inducing client-specific, decision-dependent distribution shift. We study federated bilevel performative prediction, where both upper-level (UL) and lower-level (LL) objectives are evaluated under client-dependent, decision-dependent distributions. We formalize the federated bilevel performatively stable (FBPS) point under a decoupled-risk perspective and provide sufficient conditions for its existence and uniqueness. We then develop two federated methods to compute the FBPS solution: FBi-RRM, which converges linearly under a contraction condition, and FBi-SGD, a communication-efficient stochastic method based on federated hypergradient estimation with convergence guarantees under diminishing step sizes when sensitivities are sufficiently small. Experiments on strategic regression and meta strategic classification validate the predicted stability thresholds and demonstrate improved meta-generalization over non-performative baselines.

Abstract:
Event sequences from complex systems, such as clinical workflows, are often sparse and incomplete. As a result, downstream models are trained on data that only partially captures the underlying dynamics. Synthetic sequence generation can augment real data by filling in missing structure and improving coverage of rare patterns, but generated trajectories must remain realistic, satisfy domain constraints, and allow control. We propose the ForwardChaining Temporal Point Process (FC-TPP), a framework for constraint-aware and controllable sequence generation in continuous time. FC-TPP maintains an explicit latent symbolic state encoding high-level predicates, which evolves through a differentiable multi-hop forward-chaining operator. Logical rules update the latent state based on recent events, while a temporal point process decoder generates future event times and types conditioned on this evolving state. By tying the generative dynamics to multi-hop reasoning in latent space, FC-TPP incorporates symbolic structure throughout generation rather than relying directly on raw event histories. Experiments on synthetic data and four semi-synthetic/real-world benchmarks—LogiCity, MIMIC-IV, EPIC-100, and IKEA ASM—show that FC-TPP achieves higher generation quality under limited and incomplete data, with stronger constraint adherence and greater controllability than purely neural and prior neuro-symbolic baselines.

Abstract:
Reasoning problems such as Sudoku and ARCAGI remain challenging for neural networks. Recurrent Reasoning Models (RRMs), including Hierarchical Reasoning Models (HRM) and Tiny Recursive Models (TRM), offer a compact alternative to large language models, but currently handle symbol symmetries only implicitly via costly data augmentation. We introduce symbol-equivariant recurrent reasoning models (SE-RRMs), which enforce permutation equivariance at the architectural level through symbol-equivariant layers, guaranteeing identical solutions under symbol or color permutations. SE-RRMs outperform prior RRMs on 9×9 Sudoku and generalize from just training on 9×9 to smaller 4×4 and larger 16×16 and 25×25 instances, to which existing RRMs cannot extrapolate. On ARC-AGI-1 and ARC-AGI-2, SE-RRMs achieve competitive performance with substantially less data augmentation, demonstrating that explicitly encoding symmetry improves the robustness and scalability of neural reasoning.

Abstract:
Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can overcompute on easy steps and under-compute on hard ones. We studydynamic budget allocationfor autoregressive decoding: learning how much computation to spendper tokenfrom within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discreteefficiency actionat each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, while we sample multiple compute schedules (i.e., “counterfactual” schedules that vary only the efficiency actions for the same token path) and compare their likelihoods under the same supervision. Our reward trades off language‑model quality against soft penalties that encourage episode‑average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference‑efficiency optimization. SOL discovers a better quality-efficiency pareto-front across all our experiments and improves MMLU accuracy by upto 7.3% over uniform budget allocation strategies.

Abstract:
Visuotactile sensors have been widely adopted in robotic manipulation. However, inherent heterogeneity in sensor designs hinders the learning of unified tactile representations in cross-sensor scenarios. Existing methods that focus on reconstruction or task-specific supervision often fail to capture the common information between different tactile sensors, particularly in the presence of substantial sensor variations, resulting in limited generalization to unseen sensors. To address this, we propose Cross-Tactile Sensor Representation Learning (CTSRL), a unified framework for sensor-agnostic tactile representation learning. CTSRL introduces a Cross-Sensor Modulator (CSM) to eliminate sensor-specific biases and adopts a two-stage learning paradigm: (1) leveraging aligned synthetic data for cross-sensor self-supervised learning to extract shared latent representations across sensor domains; and (2) integrating real-world multimodal tactile data to bridge the sim-to-real semantic gap through cross-modal alignment, thereby enriching representations with fine-grained semantic attributes. Experimental results show that our method demonstrates strong multi-sensor generalization, significantly improving sensor-agnostic representation learning.

Abstract:
Codebookbased generators built on masked language model (MLM) transformers have become highly effective in text and vision, yet remain underused for tabular data. This is because codebooks typically act as information bottlenecks, whereas tabular generation requires them to generalize. We address this gap with Q-Tab, a codebook-based tabular generator based on lookup-free quantization (LFQ) with residual corruption. The resulting corruption kernel induces a moving Nadaraya–Watson–style kernel regression over a large discrete code space, which turns codebook learning into a moving-target problem. We derive necessary conditions for the learnability of such moving codebooks and show how the residual LFQ construction aligns with these conditions. Q-Tab achieves state-of-the-art downstream predictive utility and missing-value imputation, while matching the distributional fidelity of diffusion-based generators, notably without any post-hoc temperature tuning.

Abstract:
We analyze the hardmax limit of selfattention dynamics for token embeddings in the zero-temperature regime (\beta \to +\infty) and relate it to finite-\beta behavior. In this limit, the update rule can be viewed as a Frank-Wolfe step for a quadratic objective over the convex hull of the current tokens. When the key-query matrix is negative semidefinite, the dynamics converge with the standard sublinear rate \mathcalO(t^-1) on the quadratic energy, whereas in the positive semidefinite case, extending the hardmax rule to the convex hull induces a Voronoi structure: vertices are stationary, interior points remain in their initial cells, and each token moves along a straight line toward its cell's vertex with exponential convergence under a step-size bounded away from zero. We additionally establish well-posedness of the associated ODE limit in this regime. For finite \beta, we model self-attention as a Markov chain and prove dynamic metastability: interior tokens reach near-vertex configurations in a constant number of steps and remain trapped for times exponential in \beta with high probability, before eventual collapse to some point within the initial convex hull. Thus, hardmax dynamics accurately approximate the finite-\beta process over exponentially long time horizons.

Abstract:
Direct Alignment Algorithms (DAAs) such as DPO have become a common way to posttrain and align LLMs with human preferences. However, DAAs have been observed to over-optimize their implicit reward model and decrease the likelihood of preferred responses. This results in a decrease in the total likelihood assigned to responses seen in the preference dataset, potentially resulting in undesirable behavior. To counteract this undesired side-effect of DAAs, we examine the effect of using objectives that add a regularization term to maintain the total length-normalized probabilities of the chosen and rejected responses. To better understand over-optimization, we investigate how response likelihood changes are distributed over the tokens with and without regularization. We find that a significant portion of the likelihood changes are due to a small set of outlier tokens, which explains how DAAs improve generation quality despite decreasing the likelihoods of chosen responses. We apply the proposed regularization to reference-based (DPO) and reference-free (SimPO) methods and find (1) improved trade-offs between generation quality and general benchmark capability and (2) improvements in reward modeling across datasets. For example, on Llama-3.1-8B-Instruct, we see both a >20% relative increase in AlpacaEval2 scores and >9% relative performance gains on general benchmarks. Additionally, we find that the added regularization term effectively mitigates the amount of displacement within preferred responses overall, and for the outlier tokens specifically, by utilizing low-likelihood tokens.

Abstract:
Concept Embedding Models (CEMs) advance interpretable AI by extending Concept Bottleneck Models (CBMs) through semantic concept embeddings, providing an important solution in highstakes domains such as medical diagnosis where accuracy and interpretability are critical. However, a fundamental limitation persists: existing CEMs inherently assume concept independence, critically overlooking the highly complex dependencies among concepts. To address this, we propose an Evidential Copula Concept Embedding Model (EC-CEM) that redefines the joint distribution over concepts, capturing inter-concept dependencies while maintaining a flexible structure that decouples the marginal concept distributions from their dependency structure. In particular, EC-CEM relaxes the concept independence assumption and uniquely integrates Copula theory with evidential deep learning to define a joint distribution over concepts. The proposed EC-CEM also develops two training objectives that aim at classification and concept modeling simultaneously. We provide theoretical justification via variational inference and demonstrate empirical superiority through extensive experiments.

Abstract:
Dynamic 3D content representation is crucial for generating moving 3D objects and scenes. Existing 4D variational autoencoders (VAEs) are mainly based on projected 2D pointmaps, which are only incomplete and viewdependent observations that do not model the native 4D positional relations between points. This often leads to projection-induced distortions and irreversible token dislocation. In this paper, we introduce a novel 4D VAE that operates directly in native 4D space, that is dynamic colored voxel space, without 2D projection. This preserves explicit spatio-temporal coordinates throughout the learned encoder and decoder, enabling both partial and complete 4D content encoding. To support a flexible temporal compression ratio, we also design a novel spatio-temporal window attention module that performs attention within local 4D windows. Additionally, we propose a differentiable voxel rendering loss based on sparse voxel rasterization to improve the geometry and color reconstruction quality. On 4D reconstruction tasks, our approach improves reconstruction fidelity over pointmap VAEs and flow-based VAEs while learning a more structurally consistent latent space. We further demonstrate the generative potential of our method by training a video-conditioned 4D diffusion model.

Abstract:
Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard BradleyTerry instantiation of DPO is limited in modeling common departures from transitivity in human preferences. To address this, recent work has introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs. Our investigation, however, reveals a critical instability in SPPO: the optimization is prone to policy degeneration when the preference oracle assigns overly confident wins to semantically indistinguishable responses. To mitigate this, we propose S-SPPO, a dual-space semantic calibration framework comprising: i) Supervision Calibration via semantic gating, which anneals win rate targets toward the maximum-entropy baseline as semantic overlap increases; and ii) Representation Calibration via latent repulsion to enforce geometric diversity to prevent manifold collapse and maintain latent diversity between chosen and rejected samples. Theoretically, we show that the calibration preserves the constant-sum game structure, facilitating convergence to a Nash Equilibrium. Empirically, S-SPPO avoids the performance degradation seen in prior methods, achieving 52.19% win rate and 47.46% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, without using additional human-annotated preferences during training.

Abstract:
Weightless Neural Networks (WNNs) based on interconnected Lookup Tables (LUTs) have attracted attention for inference in extremely compact models, but achieving competitive accuracy under such tight resource budgets remains challenging. To address these issues, we introduce the Convolutional LearnableGroup Weightless Neural Network (CLGN). CLGN constructs convolutional layers using LUTs and incorporates a learnable GroupSum connection, thereby enhancing the accuracy of WNNs while maintaining low implementation resource consumption. Moreover, we propose a hierarchical training strategy to improve the training efficiency. We evaluate CLGN in two edge computing scenarios: (1) FPGA, where we evaluate accuracy, latency, throughput, power consumption, LUTs usage, and parameter size; and (2) Microprocessor, where we evaluate latency and memory usage. Compared with the state-of-the-art solutions, the proposed CLGN achieves superior accuracy while maintaining lower implementation resource consumption.

Abstract:
Randomized smoothing has emerged as a scalable technique for certifying the adversarial robustness of classifiers. However, its application to regression remains underexplored and faces unique challenges. Existing regression certificates rely on probabilistic acceptance regions and fail to exploit the local geometry of the function. In this work, we present a novel framework for certified robust regression that addresses these limitations. We derive a prediction-centered certificate that guarantees the stability of the smoothed model’s prediction and ensures practical computability at test time. We investigate several alternatives for constructing these certificates by explicitly incorporating means, variances, and gradients. In particular we demonstrate on the MNIST rotation task that utilizing gradient information yields significantly tighter robustness certificates compared to the current state-of-the-art, \alpha-smoothing.

Abstract:
Spiking Neural Networks (SNNs) demonstrate superior energy efficiency over conventional Artificial Neural Networks (ANNs). Recent advances in Transformerbased SNNs have shown encouraging performance by seamlessly integrating spike-driven computation with Transformer architectures. Positional information plays a crucial role in sequential modeling tasks. However, existing positional encoding methods designed for ANNs are fundamentally incompatible with SNNs, as they interfere with the spike-driven computation paradigm, highlighting the need for SNN-specific solutions. We propose Spiking Positional Encoding (SPE), a novel positional encoding method specifically designed for Spiking Transformers that effectively captures relative positional information. Its key component is the Positional Encoding Leaky Integrate-and-Fire (PE-LIF) neuron layer, which encodes positional information directly into neuron thresholds. Through continuous spike firing and membrane potential reset processes, this positional information is implicitly embedded into the emitted spike trains while maintaining compatibility with the spike-driven computation paradigm. Comprehensive experiments across thirteen datasets, including the GLUE and other widely-adopted Natural Language Processing benchmarks, demonstrate that SPE consistently outperforms existing positional encoding methods. SPE provides a tailored positional encoding solution for Spiking Transformers, bridging the performance gap between ANNs and SNNs, thus advancing neuromorphic computing applications in sequential modeling tasks.

Abstract:
We consider the computational problem of compacting Tsetlin Machine (TM) classifiers by reducing the number of propositional clauses while preserving predictive accuracy. TMs trained with limited clause capacity often perform poorly because stochastic optimization cannot reliably find the few precise clauses needed in a vast configuration space. Highquality compact subsets also exist for larger TMs. The main challenge is to extract them. Heuristic pruning does not work for TMs because clauses interact through Boolean logic. A clause is unimportant in isolation, but it becomes critical when others are removed. We formalize compression as the Minimum Discriminating Clause Set (MDCS) problem: find the smallest subset of clauses that preserves the trained model's separation of training samples. We prove that MDCS is NP-hard. We solve MDCS via weighted partial Maximum Satisfiability (MaxSAT). A partition-and-merge strategy allows us to scale to 100K samples. MaxSAT compression significantly outperforms direct training on all but one of 13 datasets at matched capacity, with improvements up to 26 percentage points and a median of 6 percentage points.

Abstract:
Thompson Sampling is one of the most widely used and studied bandit algorithms, known for its simple structure, low regret performance, and solid theoretical guarantees. Yet, in stark contrast to most other families of bandit algorithms, the exact mechanism through which posterior sampling (as introduced by Thompson) is able to "properly" balance exploration and exploitation, remains a mystery. In this paper, we show that the core insight to address this question stems from recasting Thompson Sampling as an online optimization algorithm. To distill this, we introduce a time invariant notion of regret that summarizes cumulative regret across horizons (through a regret bound), leading to a time invariant Bellmanoptimal policy. It turns out that Thompson Sampling admits an online optimization form that mimics the structure of the Bellman-optimal policy, where greediness is regularized by a measure of residual uncertainty. When viewed through this new lens of online optimization, Thompson Sampling can be understood and improved in a principled manner, by comparing it against the Bellman-optimal benchmark.

Abstract:
Operator splitting algorithms are a cornerstone of modern firstorder optimization, decomposing complex problems into simpler subproblems solved via proximal operators. However, most functions lack closed-form proximal operators, which has long restricted these methods to a narrow set of problems. Hamilton-Jacobi-based proximal operator (HJ-Prox) is a recent derivative-free Monte Carlo technique based on Hamilton-Jacobi PDE theory, that approximates proximal operators numerically. In this work, we introduce a unified framework for operator splitting via HJ-Prox, which allows for deployment of operator splitting even when functions are not proximable. We prove that replacing exact proximal steps with HJ-Prox in algorithms such as proximal point, proximal gradient descent, Douglas–Rachford splitting, Davis–Yin splitting, and primal–dual hybrid gradient preserves convergence guarantees under mild assumptions. Numerical experiments demonstrate HJ-Prox is competitive and effective on a wide variety of statistical learning tasks.

Abstract:
Backpropagation (BP) remains the dominant training paradigm for deep neural networks, yet its reliance on global gradient propagation fundamentally induces update locking problem, enforcing strong interlayer dependencies in parameter updates. To address this limitation, we propose Depth-progressive Monotonic Learning (DMoL), a training scheme that assigns layer-wise local belief objectives and incrementally refines them across network depth, enabling unlocked parameter updates. As a result, DMoL supports dynamic modification of network depth during training, adapting to available compute and device resources while maintaining stable optimization. We provide theoretical guarantees that layer-wise local belief objectives improve monotonically with increasing depth and converge exponentially. Empirically, DMoL consistently matches or outperforms BP across diverse tasks, yielding a 4.3% accuracy gain on CIFAR-100, mitigating over-smoothing in deep graph neural networks (+37.5% on Cora), and reducing the final loss by over 35% in diffusion model training, highlighting its robustness and flexibility as an alternative to BP. The code is publicly available at: https://anonymous.4open.science/r/DMoL.

Abstract:
The evaluation of large language models (LLMs) is increasingly performed by other LLMs, a setup commonly known as "LLMas-a-judge", or autograders. While autograders offer a scalable alternative to human evaluation, they are not free from biases (e.g., favouring longer outputs or generations from their own model family). Here we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to address their primary research questions (e.g., LLM capability or risk assessment), while simultaneously identifying, quantifying and mitigating various biases in their autograders. Our approach can be applied to various evaluation formats (e.g., absolute scores or pairwise preferences) and augments traditional metrics (e.g., inter-rater agreement) by providing precise uncertainty estimates and clarifying sources of disagreement between graders. This framework also enables efficient counterfactual simulations without costly re-evaluation (e.g., assessing agreement after removing systematic biases). We demonstrate these capabilities through simulated examples, with all methods available in an open-source software package. Overall, we introduce a novel framework for autograder evaluation which allows researchers to detect, quantify and correct for various biases in a systematic way.

Abstract:
Most semisupervised learning frameworks rely on a single teacher that transfers zero-order supervision through pseudo-labels, constraining the student to imitate categorical outputs without perceiving the loss geometry. This design often leads to unstable optimization and limited generalization under scarce labels. We propose TTN (Two-Teachers Newton-guided Learning), a dual-teacher framework that integrates complementary supervision from MAE and DINOv3 and optimizes the student through a Newton step update. The two teachers provide multi-scale structural and semantic cues whose pseudo-labels and local Hessians are fused by confidence weighting, forming a unified second-order supervision signal. The student updates parameters preconditioned by the fused curvature, enabling stable convergence and geometry-consistent learning. TTN consistently improves over existing single-teacher and consistency-based semi-supervised learning methods on ImageNet, CIFAR-10, SVHN, and STL-10, demonstrating that combining multi-view self-supervised teachers with curvature-guided optimization yields robust and efficient semi-supervised learning.

Abstract:
This paper approaches the fundamental challenge of accelerating the inherently autoregressive nature of gradient descent (GD) like SGD and Adam through a dynamic system perspective. Specifically, we introduce a unified framework that recasts the autoregressive GD process as solving a system of triangular nonlinear equations (TNEs), thereby enabling stepparallel training, where gradients for different GD steps are computed concurrently without sequential dependencies. Within this generic framework, we establish that: (1) the TNE system admits a unique solution corresponding precisely to the autoregressive GD iterative trajectory; (2) solving the TNEs system guarantees convergence to the GD iterative trajectory in at most the equal iterations. Building on these insights, we present PASO, the first step-parallel optimizer for accelerating a broad class of GD-based optimizers like SGD and Adam. Extensive experiments (e.g., Llama-3.2-1B and diffusion model) validate that PASO achieves up to 21× reduction in GD steps and 4.5× speedup in wall-clock time, with no model quality loss. Source code is available at: \urlhttps://anonymous.4open.science/r/PASO-0AF9.

Abstract:
Multimodal learning aims to integrate information from heterogeneous data sources to improve representation quality and downstream task performance. A key challenge lies in aligning modalityspecific representations while suppressing modality-dependent noise and redundancy. The Information Bottleneck (IB) principle provides a principled framework for learning task-relevant representations. Existing multimodal IB methods primarily apply the IB principle to fused multimodal representation and rely on restrictive distributional assumptions, such as Gaussian latent priors induced by variational autoencoders, which may not hold in practice. In this paper, we propose Information Bottleneck–based Multimodal Alignment (IBMA), a novel multimodal learning framework that enforces the IB principle for both the fused multimodal representation and modality-specific representations. IBMA introduces modality-specific representation alignment that guides each modality-specific encoder to learn informative and task-relevant representations aligned with the complementary modality, thereby enhancing cross-modal semantic consistency. Moreover, we derive a novel, efficient, and distribution-free variational upper bound for the IB loss that avoids unrealistic assumptions on latent feature distributions and is readily optimized using standard stochastic gradient descent. Extensive experiments demonstrate that IBMA achieves superior performance compared to existing multimodal learning methods, validating the effectiveness of modality-specific representation alignment. The code for IBMA is available at~\urlhttps://anonymous.4open.science/r/IBMA/.

Abstract:
Bayesian inference provides principled uncertainty quantification but is often limited by challenges of prior and likelihood elicitation. The martingale posterior (MGP) (Fong et al, 2023) offers an alternative by replacing these requirements with a predictive rule. Additionally MGP focuses inference on parameters defined through a loss function. This framework is especially resonant in the era of foundation transformers; practitioners increasingly leverage models like TabPFN for their stateof-the-art capabilities, yet often require epistemic uncertainty for a scientific estimand \theta that need not parametrize the model’s implicit latent model. The MGP provides the mechanism to recover these posterior distributions. We introduce TabMGP, an MGP built on TabPFN for tabular data. TabMGP produces credible sets with near-nominal coverage and often outperforms both handcrafted MGP constructions and standard Bayesian baselines.

Abstract:
Modern language models reason within bounded attention size, a physical constraint that poses a fundamental barrier to longhorizon reasoning. We identify recursion as a core principle for overcoming this barrier, and propose recursive models as a minimal realization, where the model can recursively invoke itself to solve subtasks in sequences that are contextually isolated. We prove that any computable problem admits a recursive decomposition where subtasks require only exponentially smaller active context than standard autoregressive models, and this approach strictly surpasses any single-context management approaches such as summarization. We further show that modern agentic systems are naturally suited for realizing recursion in a generalized way where arbitrary processing of contexts and workflows is allowed, and prove they can achieve the same power as recursive models, yet none can surpass it. Experimentally, we train a 3B model to learn recursive reasoning and evaluate on SAT, finding that it significantly outperforms frontier LLMs.

Abstract:
Dominant approaches for modelling Partial Differential Equations (PDEs) rely on deterministic predictions, yet many physical systems of interest are inherently chaotic and uncertain. While training probabilistic models from scratch is possible, it is computationally expensive and fails to leverage the significant resources already invested in highperforming deterministic backbones. In this work, we adopt a training-efficient strategy to transform pre-trained deterministic models into probabilistic ones via retrofitting with a proper scoring rule: the Continuous Ranked Probability Score (CRPS). Crucially, this approach is architecture-agnostic: it applies the same adaptation mechanism across distinct model backbones with minimal code modifications. The method proves highly effective across different scales of pre-training: for models trained on single dynamical systems, we achieve 20–54% reductions in rollout CRPS and up to 30% improvements in variance-normalised RMSE (VRMSE) relative to compute-matched deterministic fine-tuning. We further validate our approach on a PDE foundation model, trained on multiple systems and retrofitted on the dataset of interest, to show that our probabilistic adaptation yields an improvement of up to 40% in CRPS and up to 15% in VRMSE compared to deterministic fine-tuning. Validated across diverse architectures and dynamics, our results show that probabilistic PDE modelling need not require retraining from scratch, but can be unlocked from existing deterministic backbones with modest additional training cost.

Abstract:
In recent years, multiview outlier detection (MVOD) has gained increasing attention, with the primary objective of recovering the underlying structure of normal data from outlier-contaminated multi-view datasets. However, this objective is hindered by two fundamental challenges:(i) outlier propagation, (ii) scale discrepancy. To address these issues, we propose RNAMOD (Reliable Neighborhood-Aware Multi-View Outlier Detection), which introduces the concept of reliability and constructs a reliable neighborhood structure to avoid outlier propagation. We introduce a leave-one-out directional consensus mechanism to align cross-view neighborhood structures while preventing scale discrepancy by aligning geometric directions that remain invariant to scaling. Extensive experiments on six benchmark datasets demonstrate that RNAMOD consistently outperforms state-of-the-art methods.

Abstract:
Large language models (LLMs) achieve remarkable generative performance, yet their output quality is dependent on the decoding strategy. While samplingbased methods (e.g., top-k, nucleus) and search-and-select based methods (e.g., beam search, best-of-n, majority voting) can improve upon greedy decoding, both approaches suffer from limitations: sampling commits to a single path, while search often expends excessive computation regardless of task complexity. We introduce Entropy-informed DEcodiNg (EDEN), a plug-and-play, model-agnostic decoding framework that adaptively allocates computation based on the model’s own uncertainty, approximating higher width beam search with fewer generations required. At each generation step, EDEN estimates the entropy of the output token distribution and adjusts the branching factor monotonically with the entropy, expanding more candidates in high-entropy regions and following a greedier path in low-entropy regions, improving sample efficiency. Experiments across complex tasks, including mathematical reasoning, code generation, and scientific questions, demonstrate that EDEN consistently improves output quality over existing decoding strategies, achieving better trade-offs between accuracy and token generations than fixed beam search approaches. By treating next token selection as a noisy maximisation problem, we prove that branching factors monotone in entropy are guaranteed to find better (i.e. more probable) continuations than any fixed branching factor within the same total computation budget, motivating the dynamic branching.

Abstract:
Multimodal SelfSupervised Learning (SSL) has achieved remarkable success by learning representations from multiple views of data. However, prevalent methods rely on the redundancy assumption—that different views share substantial task-relevant information. We argue that this assumption fails in complex, real-world settings characterized by heterogeneity (e.g., variable-length healthcare or behavioral data), where enforcing strict alignment can discard unique, modality-specific signals and inadvertently amplify bias. In this work, we propose FairSSL, a framework that leverages data heterogeneity as a resource for fairness rather than a hindrance. Unlike standard contrastive approaches, FairSSL uses a subject-aware Variance-Invariance-Covariance Regularization objective, where alignment is enforced across segments drawn from the same subject. We introduce a segment-based pooling strategy to handle variable-length modalities, and we regularize representations to encourage (i) sufficient within-subject variability, (ii) cross-modal and cross-subject invariance, and (iii) representation decorrelation. Theoretical analysis shows that our objective bounds the score gap between protected groups. Empirically, FairSSL significantly outperforms existing baselines on heterogeneous multimodal datasets, improving fairness without sacrificing downstream predictive performance.

Abstract:
LLM agents powered by retrieval and RAG are increasingly prevalent across research and applications. Embedding models play a critical role in these systems, particularly in embeddingbased retrieval. However, current benchmarks for embeddings, such as MTEB, remain focused on general-purpose scenarios, which fail to align well with the diverse and evolving needs of agentic applications. To close this gap, we introduce Agent-Oriented Embedding Benchmark (AOEB), a comprehensive evaluation suite dedicated to agent-centric retrieval for embedding models. AOEB is characterized by two key features: (1) Multi-Task, covering five essential capabilities for retrieval in LLM agents, including code, tool, reasoning, and memory retrieval; and (2) Multi-Modal, providing evaluation with both textual and visual data for each task category. We evaluate representative embedding models on AOEB and observe that they exhibit distinct strengths across different agent-oriented retrieval tasks. By curating AOEB, we aim to promote a move toward more practically oriented directions within the embedding community and foster further progress.

Abstract:
Posttraining quantization at the 2-bit level enables low-cost deployment and inference acceleration for large language models (LLMs). Scalar quantization (SQ) and vector quantization (VQ) are two primary quantization methods, however, the former suffers from significant performance degradation, and the latter incurs computational and storage overhead. We propose UniSVQ, a unified 2-bit quantization framework that bridges scalar and vector quantization by parameterizing codewords as an affine transform of integer lattices. This structure preserves compatibility with optimized integer kernels while retaining much of VQ's flexibility. We further introduce a data-driven block-wise fine-tuning strategy to directly minimize quantization reconstruction error. Extensive experiments across multiple LLM families and zero-shot benchmarks demonstrate that UniSVQ consistently outperforms state-of-the-art SQ methods and achieves performance comparable to advanced VQ methods, while providing higher inference throughput.

Abstract:
Code repair is an important capability for language models (LMs): given a buggy program and unit tests, an LM must produce a fixed program that passes the tests. We aim to scale supervision for code repair by having an LM generate bugfix tasks with unconstrained edits, using unit tests as the only verifier. We propose generator-fixer self-play, in which a single model is trained with reinforcement learning to alternate between generating bugs and fixing them. As the fixer improves, the generator adapts to produce increasingly difficult bugs, yielding an automatic curriculum. However, because unit tests certify correctness but not realism, we find that the generator can drift from bugs encountered in practice, improving repair on self-generated bugs while degrading on real-world bugs. We propose Anchored Self Play (ASP), which anchors self-play with a small reference set by (i) adding a code-embedding similarity reward to guide generation and (ii) mixing reference bugs into fixer training to prevent drift. To reflect LM-assisted programming, where bugs come from humans, LMs, and human edits of LM code, we introduce BugSourceBench, a code repair benchmark spanning human-authored bugs, human-edited buggy LM code, and errors in LM-generated code. Across bug sources, ASP achieves the best fix rates, improving average fix rate by +25% (relative) / +7.2 pp (absolute) over standard self-play, with gains on both LM-error bugs (+100% relative / +11 pp absolute) and human-authored bugs (+7.1% relative / +3.4 pp absolute).

Abstract:
Recent research has established empirical scaling laws to predict model performance on multidomain data mixtures. However, a theoretical understanding of these model loss behaviors remains limited. In this work, we propose a unified framework to explain the underlying mechanics of data mixing. Our approach extends theoretical perspectives originally developed for standard neural scaling laws (e.g., Kaplan and Chinchilla) to the multi-domain setting. Based on the distributional assumption that domains overlap on fundamental skills while diverging on specialized skills, we identify two key factors that decide the domain loss of models trained on different data mixtures: Capacity Competition, where the allocation of finite model capacity couples domain losses globally, and Noise Reduction, where optimal weights shift toward harder-to-learn domains to minimize variance. Experiments demonstrate that our framework fits the loss landscape with lower Mean Relative Error than existing empirical baselines and accurately predicts optimal training mixtures. Crucially, our model achieves these results using significantly fewer parameters.

Abstract:
Generative models such as diffusion models and transformers are powerful tools for learning complex data distributions and generating new samples. However, their blackbox nature limits interpretability, and the learned distributions may violate side knowledge arising from domain expertise. We represent such side knowledge as probability distributions over noisy functions of the modeled objects and seek to minimally adjust the generative model to satisfy such constraints. Our approach is to optimize the dual of the corresponding constrained optimization problem, encoding the infinite-dimensional dual variable using a neural network. We introduce a simple and efficient score-based method for fitting the parameters of this neural network, and for simulating from the resulting adjusted distribution. We evaluate our approach on a number of synthetic tasks, as well on two real-world problems: a regularized nonparametric maximum likelihood estimation problem, and the incorporation of class-level fairness constraints into image diffusion models.

Abstract:
Incontext learning (ICL) is a practical way to adapt large models, yet under strict context limits it remains unclear how to spend scarce tokens without being misled by noisy, redundant, or conflicting demonstrations. We address this gap by targeting token-budgeted context construction: how to select and compress demonstrations so the prompt carries maximal task-relevant signal with minimal predictive distortion. We propose RDCO, a deterministic, training-free optimizer that scores demonstrations by marginal task information per token, penalizes redundancy and prefix-conditioned conflicts, and compacts the selected context under a bounded predictive divergence constraint to control drift. Across a 10-dataset ICL suite spanning classification and structured generation, RDCO achieves the best average performance (63.26 Acc. on classification and 56.26 EM on generation) and improves the overall average by +2.20 points over the strongest baseline under the same budget. Our results suggest that viewing prompts as finite-capacity messages yields a principled and effective path to more reliable and token-efficient ICL.

Abstract:
LLMbased multi-agent embodied planning remains impractical due to prohibitively high execution latency. We identify failed actions as the dominant bottleneck, stemming from two core challenges: inaccurate state tracking under partial observability and inefficient coordination that produces redundant or conflicting actions. We introduce Mosaic, a runtime-efficient multi-agent planning framework that addresses both challenges. Mosaic maintains accurate yet lightweight state tracking through agent-centric semantic memory that stores objects in relative coordinates, enabling geometric transformations and coordination. It ensures efficient coordination through Integer Linear Programming that allocates actions at every planning step, enforcing physical feasibility and inter-agent coordination constraints. Across AI2-THOR and search-and-rescue benchmarks, Mosaic achieves 27–32% faster execution, 30–33% fewer LLM calls, 25–31% fewer steps, and 4–10% points higher success rates. These results demonstrate that efficient memory and constraint-guided coordination are critical for scalable, low-latency multi-agent planning.

Abstract:
We consider the question of how to employ nexttoken prediction algorithms in adversarial online decision making environments. Specifically, if we train a next-token prediction model on a distribution \mathcalD over sequences of opponent actions, when is it the case that the induced online decision making algorithm (by approximately best responding to the model's predictions) has low adversarial regret (i.e., when is \mathcalD a \emphlow-regret distribution)? For unbounded context windows (where the prediction made by the model can depend on all the actions taken by the adversary thus far), we show that although not every distribution \mathcalD is a low-regret distribution, every distribution \mathcalD is exponentially close (in TV distance) to one low-regret distribution, and hence sublinear regret can always be achieved at negligible cost to the accuracy of the original next-token prediction model. In contrast to this, for bounded context windows (where the prediction made by the model can depend only on the past w actions taken by the adversary, as may be the case in modern transformer architectures), we show that there are some distributions \mathcalD of opponent play that are \Theta(1)-far from any low-regret distribution \mathcalD' (even when w = \Omega(T) and such distributions exist). Finally, we complement these results by showing that the unbounded context robustification procedure can be implemented by layers of a standard transformer architecture, and provide empirical evidence that transformer models can be efficiently trained to represent these new low-regret distributions.

Abstract:
Federated finetuning of Large Language Models faces severe statistical heterogeneity. However, existing model-level defenses often overlook the root cause: intrinsic data distribution mismatches. In this work, we first establish Federated Self-Distillation (FedSD) as a fundamental and potent strategy. By projecting client representations into a smoothed ``model-understanding space,'' FedSD alone serves as a universal booster, demonstrating superior performance over conventional algorithms. Despite its success, we identify a subtle trade-off termed the Rewrite Paradox---unconstrained self-distillation can inadvertently increase hallucinations and redundancy. To refine this paradigm, we further propose FedSDR (Federated Self-Distillation with Rectification), the ultimate reinforced framework. It augments FedSD with a dual-stream mechanism: a local LoRA-S (Smoothing) branch to implicitly absorb heterogeneity via distilled data, and a parallel global LoRA-R (Rectification) branch anchored to raw data to enforce factual correctness. By selectively aggregating only LoRA-R, FedSDR yields a globally aligned and faithful model. Extensive experiments verify its superior performance.

Abstract:
Image clustering is a fundamental unsupervised task in computer vision. Recent studies have explored incorporating external linguistic information to facilitate visual feature learning and thereby enhance clustering performance. Nevertheless, these methods typically rely on fixed vocabularies (e.g., WordNet) to generate language counterparts, leading to intermodal semantic misalignment due to granularity discrepancy between visual and textual semantics. Moreover, they often overlook the issue of intra-modal semantic redundancy caused by task-irrelevant knowledge. To address these challenges, we propose a new Multi-grAnularity lanGuage-informed Image Clustering method, dubbed MAGIC. To reduce semantic misalignment, we first prompt the vision-language models to generate multi-granularity language descriptions that capture rich image semantics, which are then integrated for effective multi-modal alignment. To alleviate semantic redundancy, we design modality-specific semantic adapters that adaptively refine and compress the semantically dense features into clustering-friendly representations under task guidance. A consensus representation is obtained by fusing the refined visual and textual features, which acts as a teacher to guide image clustering through a robust contrastive learning framework. Extensive experiments on benchmarks demonstrate that MAGIC outperforms state-of-the-art methods.

Abstract:
Identifying temporal causal structure is fundamental to understanding complex systems. Neural Granger causality has emerged as a powerful paradigm for this task, leveraging the expressiveness of neural networks to model intricate nonlinear dynamics. Although complex architectures excel at predictive modeling, existing methods typically rely on simple local measures for causal discovery, which extract only partial information from the learned model and may miss global dependencies. To address this issue, we reformulate Granger causality as a feature attribution problem and propose the InformationTheoretic Shapley value (Info-Shap) to measure global feature importance. We first establish the theoretical equivalence between zero Info-Shap and Granger non-causality. On top of this, we construct two novel regularizers to suppress spurious relationships and mitigate overfitting. These regularizers are model-agnostic and can be seamlessly integrated into the training of any differentiable neural network. Through extensive experiments on synthetic and realistic datasets, we demonstrate that our method robustly recovers the underlying causal relationships, providing a flexible tool for causal discovery in high-dimensional nonlinear time series.

Abstract:
Many realworld tasks are non-verifiable—there is no objective ground truth, and quality must be judged subjectively—making reward design for RL difficult. Existing approaches based on scalar rubric scores or single comparisons are often noisy, poorly calibrated, or provide sparse learning signals. We introduce Tournament Style RL (TSRL), which constructs rewards from rubric-guided pairwise judgments against a fixed set of anchor responses, using win-rate as the reward for policy optimization. This aggregation of comparisons against anchor responses yields a signal that is more robust to the judge noise by stabilizing the reference frame, reducing the variance in reward. We test across four non-verifiable tasks and two backbone LLMs, and find that TSRL improves average win-rate by +43.8 points over the base model and +22.8 points over the strongest baseline. TSRL scales with the number of anchors, remains robust under weak or partially corrupted judges, the results are supported by blinded human preference studies.

Abstract:
Graph learning has been increasingly deployed in critical and sensitive domains, raising pressing demands for trustworthinessrobustness, fairness, and beyond. However, these properties are often undermined by various perturbations, which induce distributional uncertainty and compromise the trustworthiness of graph learning. To address this, we propose DICT, a novel framework that models distributional uncertainty to achieve trustworthy graph learning. Specifically, DICT formulates a unified optimization objective that captures perturbation-induced distributional shifts in graph topology, node features, and labels, and minimizes the worst-case risk over the uncertainty set. However, directly optimizing this objective in its primal form leads to an infinite-dimensional problem. To make this problem tractable, we integrate strong duality and local Lipschitz continuity of the loss, reformulating the objective as a finite-dimensional min-max problem. We focus on robustness and fairness as primary instantiations of DICT because they are not only critical in real-world applications, but also provide transferable modeling principles for broader trustworthiness objectives. By formulating fairness in the form of an uncertainty set, DICT pioneers unified robustness and fairness within a single optimization framework. Extensive experiments across diverse benchmarks and backbones demonstrate that DICT consistently improves both robustness and fairness, validating the effectiveness and adaptability of the DICT framework.

Abstract:
We introduce a new contextenriched, multimodal time series forecasting benchmark TimesX. TimesX contains a wide selection of high-quality real-world time series with diverse domains and textual contexts obtained from an automated data generation pipeline, which helps address three main issues of existing multimodal forecasting benchmarks: (1) poor generalization due to the small scale and synthetic nature of benchmark data, (2) very limited types of textual contexts in the benchmarks, and (3) an inability to mitigate data leakage in evaluation. We conduct a thorough empirical study of zero-shot multimodal forecasting approaches on TimesX. Our results suggest that many approaches that perform well on existing benchmarks may fail on TimesX. In contrast, simple ensemble methods that leverage rich textual context accompanying time-series can outperform strong baselines on the TimesX benchmark.

Abstract:
We present a fully dynamic data structure that supports edge and node updates and cluster membership queries for the Normalised Cut problem with strong theoretical guarantees. Furthermore, our data structure outperforms the state of the art significantly on real world datasets. At the heart of our data structure is the novel notion of Justin-Time Sampling Trees. The worst-case edge update time of our data structure is O(\log n) where n is the number of nodes in the current graph. Let d_\max be the maximum degree of the current graph, let T_NC(n',k') be the running time of an \alpha-approximation algorithm for the Normalised Cut problem on n' vertices and k' clusters, and let \textvol(Y) be the sum of the unweighted degrees of all nodes in a set Y. The worst-case query time of our data structure to label all nodes in Y is O\left(kd_max^2 \log(n) + \gamma(n,\epsilon,k,Y)\right), with approximation ratio \alpha \frac1+\epsilon1-\epsilon, where \gamma(n,\epsilon,k,Y) is defined to be \log(n)\log^\star(n)\epsilon^-4k^2 + T_NC(\epsilon^-4k^2,k) + \epsilon^-8k^4 +vol(Y). Assuming d_\max is polylogarithmic, as is the case with many sparse real-world graphs, our method achieves the best known trade-off between query time and update time.

Abstract:
Diffusion inversion, which maps images back to the Gaussian latent space of a diffusion model, is a critical task for image reconstruction and editing. While DDIM enables fast deterministic inversion, it inherently introduces deviations that accumulate into noticeable inversion errors. Existing methods often address this by solving a fixedpoint problem but largely overlook how the selection of the diffusion timestep in the noise scheduler influences inversion fidelity. In this work, we reveal that the deviation scale in diffusion inversion is strongly dependent on the timestep size, and exhibits a parabolic trend, with larger errors concentrated at both small and large timesteps. Based on this finding, we propose a simple yet effective nonuniform timestep scheduler that integrates a global rescaling with a local dynamic programming based rescheduling, enabling a strategic allocation of computational effort that minimizes the overall inversion error and preserves higher inversion accuracy. Our method serves as an off-the-shelf enhancement for existing inversion techniques and requires no extra parameters or computational overhead. Through extensive experiments, we verify that integrating our scheduler consistently boosts the performance of existing inversion methods, achieving superior results in image reconstruction and editing.

Abstract:
Generative analysis often models multidomain observations as nonlinear mixtures of domain-invariant content variables and domain-specific style variables. Identifying both factors from unpaired domains enables tasks such as domain transfer and counterfactual data generation. Prior work establishes identifiability under (block-wise) statistical independence between content and style, or via sparse Jacobian assumptions on the nonlinear mixing function, but such conditions can be restrictive and may not hold in practice. In this work, we introduce differential independence, a weaker structural condition requiring that infinitesimal variations in content and style induce orthogonal directions on the data manifold, thereby enabling identifiability even when content and style are dependent and the Jacobian is dense. We operationalize this condition through a blockwise orthogonality constraint on the Jacobian subspaces associated with content and style. To support high-dimensional generative models, we design a stochastic regularizer based on numerical Jacobian approximation, enabling scalable training in settings such as high-resolution image generation. Experiments across multiple datasets corroborate the identifiability analysis and demonstrate practical benefits on counterfactual generation and domain translation tasks.

Abstract:
Posttraining (via supervised fine-tuning) improves instruction-following, but often induces semantic mode collapse by biasing models toward low-entropy fine-tuning data at the expense of the high-entropy pre-training distribution. Crucially, we find this trade-off worsens with scale. To close this semantic diversity gap, we propose annotation-anchored training, a principled method that enables models to adopt the preference-following behaviors of post-training without sacrificing the inherent diversity of pre-training. Our approach is simple: we pre-train on documents paired with semantic annotations, inducing a rich annotation distribution that reflects the full breadth of pre-training data, and we preserve this distribution during post-training. This lets us sample diverse annotations at inference time and use them as anchors to guide generation, effectively transferring pre-training's semantic richness into post-trained models. We find that models trained with annotation-anchored training can attain 6× less diversity collapse than models trained with SFT, and improve with scale.

Abstract:
Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chainof-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduceMinimal Sufficient CoT(MSC), defined as the shortest prefix of a CoT trajectory which is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we proposeSufficiency-guided Continuous Adaptive Reasoning(SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage 1,MSC-Aligned Fine-Tuning(MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage 2,Sufficiency-Aware Policy Optimization(SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.

Abstract:
The construction of the brain functional network often follows the handcrafted Correlation Coefficients of blood-oxygen-level-dependent (BOLD) time series without any learnable components. Meanwhile, most efforts are made to the models, such as graph neural networks, that make predictions with the constructed brain network as input. Unfortunately, the fixed brain network may lose critical information during construction and lead to difficulty in performance improvement, even with deliberately designed graph models. From this perspective, the current situation is similar to the machine learning community, i.e., hand-crafted features and learnable predictors, before the advent of representation learning. In fact, the brain network can be regarded as a graph-structured learnable representation of the brain. By drawing on representation learning, this paper presents the Brain Representation (BRep) learning problem. To this end, the widely used linear and nonlinear correlations are enhanced to be high-order, parametric, and learnable. The flexible brain representation makes the following predictor simple and leads the framework to possess an end-to-end characteristic. The framework is implemented by combining the parametric correlation and a TopK sparsification. Theoretical analysis guarantees the model's universal approximation to any U/V-statistics. Extensive evaluations demonstrate that the proposed BRep possesses superior performance, high efficiency, and interpretability. The source code is publicly available at https://anonymous.4open.science/r/BRep-demo-1A3E/

Abstract:
We study the problem of finding stationary points of nonconvex functions when access to the objective is provided only through a comparison oracle that, given two points, outputs which has the larger function value. For a twice differentiable f\colon\mathbb R^n\to\mathbb R with Lipschitz gradient and Hessian, we develop an algorithm that outputs an \epsilon-stationary point using \widetilde O(n^2/\epsilon^1.5) queries. Our approach uses a subroutine that estimates the normalized Hessian to accuracy \delta using \widetilde O(n^2\log(1/\delta)) queries. We further study this problem with a quantum comparison oracle model where queries can be made in superpositions, and develop the first quantum algorithm that finds an \epsilon-stationary point, which takes \widetilde O(n/\epsilon^1.5) queries.

Abstract:
Symbolic Regression (SR) is the task of finding a closedform mathematical expression that optimizes some objective. Solving this task is NP-hard. However, SR software routinely discovers accurate, interpretable models without exhaustively searching function space. Motivated by this disconnect between worst-case theory and practical success, we study SR through the lens ofparameterized complexity theory. In particular, we reanalyze tractability with respect to practically relevant parameters like expression depth, tree size, and number of primitives used. We show that SR is actually fixed-parameter tractable (FPT) under a parametrization over expression depth or tree size, formalizing an explanation for why the bounded-complexity search of popular SR algorithms succeeds. However, SR becomes W[1]-hard when parameterized by the number of variables or primitives used, identifying selection as a source of intractability. We further find lower bounds under the exponential time hypothesis, prove approximation hardness, and rule out polynomial kernels.

Abstract:
Recent efforts to improve the reasoning abilities of Large Language Models (LLMs) have focused on integrating formal logic solvers within neurosymbolic frameworks. A key challenge is that formal solvers lack commonsense world knowledge, preventing them from making reasoning steps that humans find obvious. Prior methods address this by using LLMs to supply missing commonsense assumptions, but these approaches implicitly assume universal agreement on such commonsense facts. In reality, commonsense beliefs vary across individuals. We propose a probabilistic framework for abductive commonsense reasoning that explicitly models this variation, aiming to determine whether most people would judge a statement as true or false. We introduce Probabilistic Abductive CommonSense (PACS), a novel algorithm that uses an LLM and a formal solver to sample proofs as observations of individuals’ distinct commonsense beliefs, and aggregates conclusions across these samples. Empirically, PACS outperforms chainof-thought reasoning, prior neurosymbolic methods, and search-based approaches across multiple benchmarks.

Abstract:
Gradient Inversion Attack (GIA) poses a significant threat to federated learning, enabling adversaries to reconstruct private training data from the information shared during training. Prior research has predominantly focused on the vanilla SGD, where the server or an eavesdropper can directly observe true gradients. In practical deployments, however, models may be trained with adaptive optimizers (e.g., Adam, RMSProp, and AdaGrad), for which the observable signal is not raw gradients but momentumbased parameter updates. This setting remains underexplored and undermines traditional gradient-matching strategies, which struggle to recover labels and images from non-gradient updates. To address this gap, this paper explores attacks tailored to modern adaptive optimizers. We present an analytical rule for recovering labels from optimizer updates and propose an update-matching objective that optimizes dummy inputs to reproduce the observed updates. The proposed approach is general and can be directly applied to various optimizers such as Adam, AdaGrad, and RMSProp. Furthermore, we find that, despite being introduced for adaptive optimizers, the proposed objective function also yields stronger attacks in the standard SGD setting. Experiments on datasets such as ImageNet and PACS highlight the effectiveness of our method over existing gradient matching techniques.

Abstract:
Spherical signals provide a natural representation for omnidirectional perception and often benefit from equivariance to 3D rotations. Recent spherical vision transformers implement local selfattention on spherical grids, but most retain only partial \mathrmSO(3) equivariance and rely on location-dependent local positional embeddings (LPEs). Such LPEs can degrade robustness to camera tilt or object reorientation and introduce additional memory and computational overhead. We propose Spherical \mathrmSO(3)-Equivariant Local Attention (SoLA), an LPE-free local attention mechanism for spherical signals. SoLA achieves full \mathrmSO(3) equivariance through a distance-preserving positional modulation that couples query/key features with each token’s unit direction. Specifically, the modulation lifts queries and keys using an outer-product with the 4D direction dependent vector. The induced similarity of the modulated queries and keys depends on content affinity and great-circle distance while remaining invariant to global \mathrmSO(3) rotations. The same formulation admits a softmax-free linear variant that computes local attention via key-value aggregation without per-query neighbor materialization. We integrate SoLA into a U-shaped spherical transformer for 360^\circ depth estimation and semantic segmentation, demonstrating substantially improved robustness to arbitrary 3D rotations compared to prior spherical transformers with similar computational costs.

Abstract:
Subgraph counting is a fundamental problem in graph analysis. Motivated by practical scenarios where graph analytics are performed on subgraphs induced by selected vertices rather than on the entire graph -- and by growing privacy concerns, we initiate the study ofdifferentially private range subgraph counting (DPRSC). The goal is to privately count occurrences of a fixed pattern graph within induced subgraphs defined by multi-dimensional attribute ranges. Unlike classical point counting, subgraph counting is inherently nonlinear and exhibits high sensitivity: a single edge modification can affect many subgraph occurrences. We present the first efficient algorithms for DPRSC with small additive error. Our approach introduces a subgraph projection that reduces DPRSC to weighted orthogonal range counting, enabling the use of range trees and local sensitivity estimation to achieve accurate private query answering. We complement our algorithms with matching lower bounds, obtained by reducing reconstruction attacks to DPRSC and leveraging discrepancy theory. In particular, we show that any differentially private algorithm for DPRSC must incur additive error exponential in the dimension. Empirical evaluations demonstrate that our algorithms significantly outperform baseline methods in accuracy while maintaining strong privacy guarantees.

Abstract:
Radio Frequency Fingerprint Identification (RFFI) is a foundational pillar of physicallayer security, providing unclonable identity authentication and lightweight defense mechanisms for zero-trust wireless networks. Its practical deployment, however, often occurs in a source-free open-world (SF-OW) setting, characterized by a continuous influx of unregistered devices and privacy constraints that preclude the retention of historical data. In this paper, we formalize SF-OW RFFI task, which manifests a severe stability-plasticity dilemma: intrinsic signal similarity confuses new classes, while source absence precipitates catastrophic forgetting. To address this, we propose Incremental Orthogonal ETF (IO-ETF), a novel neural collapse-inspired framework utilizing output geometry to actively induce parameter separation and isolation. We further devise a Triple-Level Geometric Alignment (TLGA) strategy via semantic optimal transport, manifold progressive anchoring, and reliable subspace retention to stably align unlabeled streams to this geometric skeleton. Experiments on benchmarks demonstrate a superior trade-off between old-class retention and new-class discovery, offering a promising solution for secure access in dynamic networks.

Abstract:
Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for finetuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We study this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that fine-tuning with a mismatched optimizer disrupts pretrained knowledge, and show that constraining updates with Low-Rank Adaptation (LoRA) mitigates this issue. Across language and vision tasks, LoRA with Muon matches or outperforms LoRA with Adam when fine-tuning Adam-pretrained models. Furthermore, in settings with pronounced mismatch, this benefit diminishes when LoRA updates approach full fine-tuning. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is availablehere.

Abstract:
Language models famously improve under a smooth scaling law, but some specific capabilities exhibit sudden breakthroughs in performance. Advocates of "emergence" view breakthroughs as unlocked capabilities, but others attribute them to metric thresholding effects. We propose that breakthroughs are instead driven by continuous changes in theprobability distributionof training outcomes when performance is bimodally distributed across random seeds. we show that different random seeds can produceeithersmoothoremergent scaling trends in synthetic length generalization tasks, multiple choice question answering, and grammatical generalization. We reveal that sharp breakthroughs in metrics are produced by underlying continuous changes in their distribution across seeds.

Abstract:
Sparse tangent portfolio optimization aims to learn an interpretable, lowcardinality portfolio in the tangency direction of the mean–variance frontier, yet the associated cardinality-constrained formulation is NP-hard and standard predict-then-optimize pipelines often misalign forecasting accuracy with downstream portfolio quality. We propose an end-to-end decision-focused learning framework that reformulates Sharpe-ratio maximization as a Disciplined Parametrized Programming (DPP)-compliant convex programming layer and replaces discrete selection with a smooth top-k operator enforcing an exact sum-to-k sparsity budget. This enables gradient flow through prediction, asset selection, and re-optimization, allowing the predictive model to directly optimize the portfolio performance. Across five major equity markets, our method consistently delivers higher out-of-sample Sharpe ratios than historical and prediction-focused baselines while producing meaningful sparse selections.

Abstract:
Largescale, high-quality dynamic 3D (4D) assets are essential for learning physically grounded representations, but remain costly to capture and annotate at scale. This limits the viability of supervised 4D learning and motivates zero-shot text-to-4D generation leveraging pretrained diffusion priors. To model complex dynamics, prior methods typically adopt implicit 3D representations (e.g., NeRFs or 3DGS) for their deformation capacity. However, their implicit nature provides limited control over surface topology, which hinders high-fidelity geometry and makes temporally coherent surface reconstruction challenging. To address these limitations, we explore zero-shot text-to-4D mesh generation. However, a structural mismatch arises when combining diffusion-based guidance with topology-constrained meshes: the guidance is noisy and spatially inconsistent, while meshes impose severe topological constraints, making direct vertex-level deformation unstable. In this paper, we introduce TextMesh4D, the first zero-shot framework for text-to-4D that directly generates dynamic meshes by addressing the above challenge at two complementary levels. Geometrically, we shift deformation modeling from vertices to faces via a Jacobian Deformation Field (JDF), enabling topology-aware surface reconstruction through an integrability-enforcing integration formulation. Semantically, we propose a Local-Global Semantic Regularizer (LGSR) that preserves identity over time by jointly constraining local deformation plausibility and global shape consistency. Extensive experiments demonstrate state-of-the-art temporal consistency, structural fidelity, and visual quality, while remaining efficient on a single 24GB GPU. The code will be released to facilitate future research.

Abstract:
Removing noise is difficult, but adding noise is easy. In this work, we show how to eliminate meanshift noisy components from PCA by deliberately introducing knockoﬀ mean-shift perturbation. Standard PCA is highly sensitive to shifts in the sample mean: a small fraction of samples from a shifted distribution can cause large deviations in the leading principal components. In high-dimensional regimes, existing Robust PCA approaches cannot handle the mean-shift contamination structure inherent in the mixture model. Using tools from Random Matrix Theory, we prove that the mean-shift spikes are spectrally separable from the stable eigenvalues of the original covariance. Furthermore, the original eigenspace remains asymptotically invariant to the contamination, independent of the mixture weight. Exploiting this spectral stability, we propose a simple, two-stage PCA algorithm by adding knockoﬀ mean that identifies and removes the mean-shift component using only standard PCA operations. We release an implementation for main code at https://anonymous.4open.science/r/ms-pca-0E47/.

Abstract:
Irregular Multivariate Time Series (IMTS) arise naturally in many realworld domains, yet their irregular sampling patterns pose significant challenges for effective modeling. Existing approaches for IMTS fall into two categories: architecture-based and data-based methods. Architecture-based methods require specialized modeling for IMTS, limiting reuse of established Multivariate Time Series (MTS) models, data-based methods convert IMTS into regular time series through imputation or interpolation, often introducing artificial values that distort temporal dynamics. In this work, we propose a novel input-embedding-based approach for modeling the IMTS. Our method preserves the original MTS backbone and operates directly on IMTS. We introduce QuITE (Query-based Irregular Time-series Embedding), a simple yet effective, backbone-agnostic embedding module that enables MTS models to directly process IMTS. QuITE leverages a set of learnable query tokens to aggregate irregular observations via a single self-attention layer, producing fixed-dimensional latent representations. Extensive experiments on real-world benchmarks demonstrate that QuITE consistently improves the performance of existing MTS models, achieving average relative performance gains up to 45.9% across diverse datasets and backbone architectures.

Abstract:
The integration of Large Language Models (LLMs) into ad auctions for dynamic creative enhancement presents a paradigm shift, yet introduces significant computational costs disrupting traditional mechanism design. This paper provides a comprehensive gametheoretic and algorithmic framework for such LLM-augmented auctions. We model the system as a dynamic Stackelberg game where the platform (leader) strategically invests in creative enhancement to maximize net revenue, while autobidding agents (followers) respond to enhanced ad qualities under budget constraints. To endogenize inference costs, we propose the Platform-Investment Mechanism (PIM). We develop the Online Dual-Descent Bidding with Regularization (ODDB-R) algorithm for agents to learn optimal bidding strategies in this non-stationary environment. For the platform, we formulate the investment problem as a continuous control task and solve it using a Two-Timescale Stackelberg Learning with Proximal Policy Optimization (TTSL-PPO) algorithm, which provably converges to a Stackelberg Stationary Point. Extensive experiments on large-scale real-world datasets and state-of-the-art LLMs demonstrate that our framework significantly outperforms heuristic baselines in revenue, social welfare, and user engagement.

Abstract:
Flaw reporting for deployed AI systems is fundamental to identifying system failures and improving AI safety. Yet the AI reporting ecosystem is fragmented: researchers who identify flaws often do not know what or where to report, and groups who receive reports rarely share them with other relevant stakeholders. As a result, goodfaith reporters duplicate effort by submitting many different forms, and recipients lack standardized, triage-ready information. We audit 12 reporting systems published by AI developers, cybersecurity groups, and AI flaw aggregators, identifying five recurring design challenges spanning discoverability, scope, information collection, coordination, and guidance for strict-liability cases. Building on this analysis and feedback from 49 experts across 32 organizations representing developers, security researchers, and ecosystem coordinators, we introduce FLARE-AI, an open-source AI flaw reporting system designed for interoperability with existing systems. FLARE-AI streamlines flaw report creation by collecting triage-relevant information through conditional logic and early classification, then enables optional dissemination of standardized, machine-readable reports to multiple developers, coordinators, and incident registries from a single submission. By lowering barriers to reporting AI flaws and improving interoperability across stakeholders, FLARE-AI helps break down silos and accelerate remediation across the AI ecosystem.

Abstract:
Efficient transfer learning of vision–language models (VLMs) commonly suffers from a Base–New Tradeoff (BNT): improving performance on unseen (new) classes often degrades accuracy on known (base) classes. Addressing how to boost recognition of unseen classes without sacrificing known-class performance remains a central challenge. Existing work often simplistically attributes the BNT to overfitting on known classes. We observe an interesting phenomenon: VLMs frequently exhibit asymmetric confusion on certain downstream data, i.e., samples of class A are systematically mispredicted as class B, while the reverse confusion (B → A) rarely occurs. For known classes, this kind of bias can be mitigated by tuning using a cross-entropy loss, but for unseen classes, such pretraining-induced bias persists and harms generalization. Motivated by this, we propose NeRP, a plug-and-play prompting correction strategy that improves discrimination on unseen classes without modifying model parameters. NeRP leverages neutral text prompts and reference images to measure class-wise prior preferences along the pre-trained inter-class geometry, and combines them with the sample likelihood to obtain the model’s surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, we perform a local flip between easily confusable class pairs, thereby correcting prior-dominated mispredictions. Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving known-class prediction performance.

Abstract:
This paper addresses constrained smooth saddlepoint problems in settings where projection onto the feasible sets is computationally expensive. We bridge the gap between projection-based and projection-free optimization by introducing a unified dual dynamic smoothing framework that enables the design of efficient single-loop algorithms. Within this framework, we establish convergence results for nonconvex-concave and nonconvex-strongly concave settings. Furthermore, we show that this framework is naturally applicable to convex-concave problems, providing a unified analysis across varying payoff structures. We propose and analyze three algorithmic variants based on the application of a linear minimization oracle over the minimization variable, the maximization variable, or both. Notably, our analysis yields anytime convergence guarantees without requiring a pre-specified iteration horizon. These results significantly narrow the performance gap between projection-free and projection-based methods for minimax optimization.

Abstract:
We study generalized linear bandits with memory, a nonstationary setting in which rewards depend on past actions through a finite memory matrix. Building on prior work for linear models Clerici et al.,(2024), we show that the previously known \tilde\mathcalO(T^3/4) regret stems from a loose analysis based on cyclic proxy policies, and we refine the analysis to recover a \tilde\mathcalO(\sqrtT) regret rate in the linear case. We then extend this improvement to generalized linear models and propose a block-wise algorithm based on shrinkage-based confidence bounds. Our algorithm achieves a regret bound of \tilde\mathcalO(\sqrtmT+ d\sqrtT+\sqrt\kappa d^2 m^1/4 T^1/4 + \kappa d^2), where d denotes the feature dimension, m the memory length, and \kappa a curvature parameter of the link function, thereby attaining a \sqrtT rate despite nonlinear rewards and memory effects. To the best of our knowledge, this analysis provides a unified treatment of memory-induced non-stationarity and nonlinear link functions, while ensuring that the leading regret term is independent of the curvature of the link function. We conduct numerical experiments that are consistent with our theoretical findings.

Abstract:
SharpnessAware Minimization (SAM) empirically boosts generalization by seeking parameters that minimize the worst-case loss in a small neighborhood, yet existing theory explains its behavior under either strong convexity or small perturbation radius. We revisit SAM through the bilevel minimax problem \min_\theta\max_\|\Delta\|\le\rhol(\theta+\Delta) and derive a (\theta,\Delta) gradient flow ODE whose equilibria coincide with the problem’s optimality conditions. A Lyapunov argument—free of convexity assumptions—quantifies how the optimality gap depends on the radius~\rho and local curvature. Discretizing the flow yields a \emphMulti-step SAM algorithm that recovers classical SAM as \rho\!\to\!0. Moreover, our analysis and the resulting algorithm remain valid even for large \rho, providing principled guidance for aggressive neighborhood exploration. Experiments on synthetic objectives and CIFAR-10 validate the predicted gains from multiple inner updates, bridging the gap between SAM’s minimax intuition and its practical implementation.

Abstract:
Highorder relations involving multiple interacting entities are commonly encountered, particularly in biomedical domains. Existing relational learning methods typically learn static entity representations and assume symmetric relation inference, which can be inadequate for capturing context-dependent entity functions and the inherent asymmetry of high-order relations. In this paper, we propose Contextualized High-order Contrastive Learning (CONTEXTOR), a general and plug-and-play framework that formulates high-order relation inference as a dynamic query–response process. Specifically, CONTEXTOR decomposes each high-order relation into multiple incomplete query tuples and their corresponding response entities. Given a query tuple, we contextualize candidate response entity representations via an asymmetric conditional modulation, and align queries with their corresponding contextualized responses through multi-fold contrastive learning. Extensive experiments on benchmark datasets spanning multiple biomedical tasks demonstrate that CONTEXTOR consistently outperforms existing methods across diverse evaluation settings. Code is available at https://anonymous.4open.science/r/CONTEXTOR-94EE.

Abstract:
Mechanistic interpretability aims to understand how neural networks generalize beyond their training data by reverseengineering their internal structures. We introduce patterning as the dual problem: given a desired form of generalization, determine what training data produces it. Our approach is based on susceptibilities, which measure how posterior expectation values of observables respond to infinitesimal shifts in the data distribution. Inverting this linear response relationship yields the data intervention that steers the model toward a target internal configuration. We demonstrate patterning in a small language model, showing that re-weighting training data along principal susceptibility directions can accelerate or delay the formation of structure, such as the induction circuit. In a synthetic parentheses balancing task where multiple algorithms achieve perfect training accuracy, we show that patterning can select which algorithm the model learns by targeting the local learning coefficient of each solution. These results establish that the same mathematical framework used to read internal structure can be inverted to write it.

Abstract:
We propose a novel algorithm that accelerates optimal transport by exploiting intrinsic symmetries induced by finite group actions. The core of our approach is to recover the orbit decomposition and the associated algebraic structure directly from the cost matrix—without requiring prior knowledge of the group—and to reduce the original transport problem to a substantially smaller problem on the orbit space. This reduction preserves optimality while achieving a significant drop in computational complexity. We develop efficient solvers for two central classes of optimal transport: linear OT and entropy‑regularized OT. Experiments on synthetic data and real‑world image datasets confirm the efficiency and robustness of the method. To our knowledge, this work is the first to systematically incorporate symmetry groups into optimal transport, providing both a theoretical framework and a practical pathway to computational acceleration.

Abstract:
Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful oversight of systems that may exceed human capabilities? While scalable oversight is widely studied, existing approaches often rely on complex assumptions, remain largely heuristic, or lack practical methods for sequential settings with statistical guarantees. We introduce Calibrated Collective Oversight (CCO), which aggregates diverse auxiliary scoring functions into a penalty that measures deviation from a conservative baseline. Inspired by Attainable Utility Preservation, CCO enables collective conservatism: when multiple oversight signals register concern, the agent defers. CCO calibrates this conservatism online using Conformal Decision Theory, ensuring that undesirable outcomes remain below a userspecified target \alpha with finite-time bounds and no distributional assumptions. Experiments on SWE-bench demonstrate that weaker overseers successfully constrain an adversarially misaligned stronger agent. Similarly, on MACHIAVELLI, CCO achieves substantial reductions in ethical violations while preserving reward. In both settings, empirical violation rates closely match the specified targets. Our work demonstrates that combining penalty-based conservatism with online calibration yields practical oversight with statistical guarantees suited for agentic deployment.

Abstract:
Many machine learning problems are geometric at their core, relying on metric representations of data for tasks such as clustering, prototype selection, nearestneighbor search, and graph-based learning. Furthermore, data is constantly evolving and it is routinely transformed through dimensionality reduction, random projections, feature embeddings, compression, or privacy-preserving mechanisms. These transformations are designed to preserve geometry approximately. As a result, they preserve objective values for many geometric optimization problems, but they fail to guarantee that algorithmic outcomes remain consistent. In this work, we study \emphresilient data summaries for geometric optimization. Building on the notion of \emph\gamma-resilient algorithms from Ahmadian, we introduce \gamma-resilient coresets. A \gamma-resilient (k,\varepsilon)-coreset is a compact, weighted summary that guarantees a (1+\varepsilon) approximation to the objective and enforces stability at the level of assignments. We complement our positive result with a lower bound showing that to obtain a tight approximation for resilient clustering it is necessary to use a bi-criteria solution.

Abstract:
In this paper, we study the rentor-buy problem (also called the Bahncard problem) in the learning-augmented setting. In this problem, a traveler must complete a sequence of trips that are revealed online over time, each of which has an associated cost with it. The traveler has the option of buying a discount card at a fixed cost that gives a discount on trip costs for a fixed time after buying the card. The goal is to minimize the overall cost of all the trips, including the money spent on buying discount cards. For this problem, it is well-known that the best deterministic algorithm has a competitive ratio of 2. In this paper, we ask whether we can do better if the traveler has asampleof trips available offline, e.g., obtained from an ML model based on historical data. We show that even a sparse sample of the input can significantly improve the competitive ratio of the algorithm from 2 to 3/2, and further to close to 1 under some additional conditions. We also verify our theoretical bounds via numerical simulations, which reveal that our proposed algorithm obtains nearly optimal solutions for a variety of natural input classes.

Abstract:
RetrievalAugmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding that often break in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components, a retrieval predictor, a context monitor, and a query generator, by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.

Abstract:
Minimum Bayes Risk (MBR) decoding yields more robust and higherquality text generation than maximum a posteriori (MAP) decoding by selecting hypotheses that maximize expected utility over sampled pseudo-references. However, there exists discrepancy in the design: hypothesis selection calculates expected utility scores conditioned on given pseudo-references, while commonly used evaluation metrics, e.g., BLEU and COMET, are asymmetric. Therefore, it is important to consider both hypothesis-to-reference and reference-to-hypothesis directional effects. In this study, we introduce a noisy channel decomposition of MBR decoding that naturally incorporates bidirectional effects to account for these asymmetries. We decompose MBR decoding into four interacting components: hypothesis-to-reference likelihood, reference-to-hypothesis likelihood, hypothesis prior, and reference prior. This decomposition provides a unified interpretation of existing MBR variants and enables metric- and task-specific interpretability by isolating the contribution of each channel. Furthermore, our comprehensive analysis demonstrates that appropriate channel weighting consistently yields performance gains over original MBR decoding across tasks and utility functions.

Abstract:
Manifold clustering has demonstrated strong capability in capturing complex data structures and has been widely studied in cluster analysis. However, many existing methods mainly focus on combining Kmeans with manifold learning, while overlooking the consistency between data structures and clustering labels, and often suffer from high computational cost when handling large scale data. To address these issues, we propose a manifold balanced clustering method based on anchor induced distance(LMBC), grounded in the relationship between K-means clustering and manifold learning. Specifically, the LMBC uses label information to guide the construction of the manifold structure, thereby ensuring consistency between data structures and clustering labels. To enable large scale clustering, we introduce an anchor induced distance representation that models manifold structure in a compact anchor space, significantly reducing computational complexity while preserving essential structural information. Furthermore, to naturally maintain class balance during clustering, we maximize the Schatten-p norm of the label representation and provide theoretical analysis to support its effectiveness. Experimental results on several benchmark datasets demonstrate the effectiveness and scalability of the proposed method.

Abstract:
Outof-distribution (OOD) generalization has attracted increasing research attention in recent years, owing to its promising empirical results in real-world applications. However, theoretical studies on OOD generalization remain limited, particularly with respect to lower bounds on the generalization error. To better understand how source data contributes to improved OOD generalization performance, we take an initial step toward establishing a lower bound on the OOD generalization error, and subsequently investigate upper bounds from the perspective of statistical learning theory. Interestingly, we find that under some conditions, simply minimizing the average empirical risk over the source domains can yield a nearly optimal error rate (up to a logarithmic factor) without requiring knowledge or estimation of distributional parameters or the discrepancy between source and target domains. This finding offers an explanation for the surprising phenomenon observed in DomainBed, where carefully designed OOD generalization algorithms fail to outperform the simple empirical risk minimization (ERM) algorithm. Our results also imply a no-free-lunch theorem and provide an optimistic bound for OOD generalization.

Abstract:
Diffusion models are typically trained with objectives that focus on local denoising targets at individual time steps (or adjacent pairs), which do not enforce consistency between predictions along the denoising trajectory. This lack of crosstime consistency can degrade performance, especially for few-step samplers. We introduce a temporal difference (TD) objective that penalizes inconsistency of the model’s \emphmulti-step progress along the denoising path. By reformulating the diffusion process as a Markov reward process and casting denoising as a policy evaluation problem in reinforcement learning, we derive a unified TD approach that applies to both discrete- and continuous-time diffusion formulations. We further propose a principled sample-based reweighting method that stabilizes training. Empirically, we show that using our TD training can significantly improve sample quality measured by FID, with stronger advantages when the number of sampling steps is small, highlighting its practical utility under low-computation-budget scenarios. We provide ablation studies to justify our design choices, including pairwise loss reweighting, regularization weight, and one-step stride. Overall, our TD approach can be a general drop-in that enforces cross-time consistency and improves generation quality across different diffusion generative models.

Abstract:
We study layered models, including feedforward networks and transformers, by limiting each layer to a width of d = 3 neurons, i.e., a representation space of \mathbbR^3. This allows us to examine how a neural network changes lowdimensional topological invariants like links and knots, as well as more sophisticated measures like Milnor's \mu-invariant, through the layers. Note that one may simplify or even trivialize just about any topological structure by simply increasing dimension; for example, any knot is equivalent to an unknot in \mathbbR^4. By restricting to \mathbbR^3, we not only isolate the effects of activation and depth from that of width, we work in a space that lends itself to easy visualization. We provide full mathematical proofs and empirical experiments to justify the following insights: When measured by their power to effect topological changes, ResNets are as powerful as transformers; both are strictly more powerful than feedforward neural networks, which are in turn more powerful than invertible models like flow-based models; but using a non-monotone activation would allow the feedforward networks to become as powerful as ResNets and transformers. These results suggest that low-dimensional topology can be an important tool to guide future designs of AI architectures. We then generalize our results from d = 3 to arbitrary d > 3.

Abstract:
While selfconsistency methods have emerged as a promising approach to enhance the correctness of large language model (LLM) outputs by aggregating multiple stochastic samples, they suffer from two critical limitations, resulting in high computation cost. First, they evaluate output consistency monolithically, failing to efficiently combine partially correct answers across multiple samples. Second, they use static stopping criteria that cannot adapt to varying task complexities and model capabilities, resulting in suboptimal computational efficiency. In this work, we present Task-and-Model-Aware Fractal-Consistency (TMAFC), a novel self-consistency framework that addresses these limitations through two key innovations: (1) Fractal-Consistency, which evaluates the output consistency at the granularity of output components to effectively combine partial correct answers across samples, and (2) Adaptive Stopping Criteria Calibration (ASCC), which dynamically adjusts sampling stopping criteria based on real-time assessment of both task difficulty and LLM capability. Through extensive experiments on diverse question-answering benchmarks, we demonstrate that TMAFC achieves superior efficiency-accuracy trade-offs, reducing sample cost by up to 55% while maintaining competitive accuracy compared to state-of-the-art baselines.

Abstract:
We study the active learning problem of fixedconfidence top-k identification from noisy pairwise comparisons under latent-utility models. The objective is to identify the top-k items with probability at least 1-\delta while using as few comparisons as possible by adaptively selecting which pairs to compare. While pure exploration with dueling bandits has been studied, an algorithm achieving asymptotic optimality has not yet been established. We characterize the structure of the information-theoretic lower bound on sample complexity, revealing a structured saddle-point problem. This structure enables a primal--dual algorithm that learns the optimal comparison allocation online while being computationally efficient. We then construct an adaptive comparison-allocation strategy that tracks the optimal solution and prove that the resulting procedure is asymptotically optimal.

Abstract:
Large language models (LLMs) can generate syntactically valid optimization programs, yet often struggle to reliably choose an effective modeling strategy, leading to incorrect formulations and inefficient solver behavior. We proposeSAGE, a strategyaware framework that makesModeling Strategyexplicit in both data construction and post-training. SAGE builds a solver-verified multi-strategy dataset and trains a student model with supervised fine-tuning followed by Segment-Weighted GRPO using a composite reward over format compliance, correctness, and solver efficiency. Across eight benchmarks spanning synthetic and real-world settings, SAGE improves average pass@1 from 72.7 to 80.3 over the strongest open-source baseline. With multiple generations, SAGE discovers more distinct correct formulations and improves component-level diversity at pass@16 by 19-29%. At the largest scale, SAGE produces more compact constraint systems with 14.2% fewer constraints than the baseline, consistent with solver-efficient modeling. Overall, these results show that makingModeling Strategyexplicit improves automated optimization modeling. Code is available at https://anonymous.4open.science/r/SAGE-F25B/.

Abstract:
Large language models (LLMs) have been globally adopted in various scenarios, making robust multilingual safety alignment a prerequisite for their reliable deployment across diverse languages. Despite recent advances, LLMs exhibit a substantial safety gap between highand low-resource languages: models that can consistently refuse harmful requests in high-resource languages often fail to do so in low-resource languages. In this work, we reveal that such safety failures stem from insufficient representation-space separability between harmful and harmless prompts in low-resource languages. Through geometric analyses, we find that, compared to English, harmful prompts are significantly less separated from the manifold of harmless prompts, and that the resulting cross-lingual spatial margin gap is strongly correlated with attack success rates. Capitalizing on these insights, we propose Multilingual Spatial Margin Gap-based Optimization (SMO), a novel training strategy that exploits the well-aligned safety geometry of a dominant language (e.g., English) to enhance safety alignment in other languages. SMO explicitly leverages the spatial margin gap between English and target languages as an example-wise supervision signal, enabling effective cross-lingual transfer of safety capabilities while preserving the dominant language’s original performance. Experiments conducted on LLaMA-3.1-8B-Instruct and Qwen2.5-7B-Instruct demonstrate that SMO is capable of substantially reducing attack success rates in low-resource languages to near zero, often reaching zero, while maintaining strong general multilingual performance. Warning: This paper contains content that may be harmful.

Abstract:
Prior work on judgmental forecasting with large language models (LLMs) has focused on broad, mixed‑topic question banks and shallow context (e.g., short news snippets). We study a specialized, high‑stakes financial setting: forecasting M&A outcomes for merger arbitrage. Using rich textual evidence, with context engineering informed by veteran mergerarb specialists, we construct an LLM‑based forecasting system and finetune the model using outcome-conditioned gold reasoning traces. The system outputs probabilistic forecasts over closing at announced terms, higher bid, and deal termination. On an out-of-sample set of more than 400 large deals spanning 42 countries, our finetuned system outperforms a variety of frontier models and market-based baselines, using a Brier score weighted by the P&L impact of each deal.

Abstract:
VisionLanguage-Action (VLA) models provide a versatile foundation for general robotic manipulation, yet they exhibit significant brittleness when confronted with novel task distributions. While In-Context Imitation Learning (ICIL) offers a training-free alternative, existing frameworks suffer from anadaptation bottleneckthat hinders the effective translation of expert context to actions. This failure originates from superficial retrieval mechanisms and an entrenchedbehavioral inertiathat anchors the policy to its pre-trained priors. To address these limitations, we present RA-VLA, a retrieval-augmented VLA framework that unifies behavior-aligned context retrieval with a grounded execution pipeline. By enforcing strict adherence to functional cues within a scalable architecture, our framework facilitates seamless task adaptation while preserving inference efficiency. Evaluations across the LIBERO benchmark and a real-world UR5e environment demonstrate that RA-VLA achieves superior success rates and computational efficiency, providing a robust framework for training-free robotic adaptation.

Abstract:
Despite its importance to applications in protein design, predicting protein properties like binding affinity and thermostability from sparse experimental data remains a significant challenge. Accordingly, we introduce a class of sequence kernels that exploit evolutionary substitution matrices as well as local linearity and demonstrate that the resulting Gaussian processes provide dataefficient models of protein property landscapes, frequently outperforming alternatives that rely on foundation model embeddings. Furthermore--by learning what are in effect structure-aware substitution matrices--we show that our kernels can readily incorporate structural information from foundation models. We demonstrate that these structure-conditioned kernels are well suited to multi-task learning across multiple protein property landscapes and can decisively outperform local supervised learning methods.

Abstract:
We study experimental designs that balance nonlinear functions of covariates, extending classical methods that primarily target linear balance. Building on the GramSchmidt Walk (GSW) framework of Harshaw et al (2024) for linear covariate balancing, we introduce a design that directly controls imbalance in nonlinear structure, including polynomial and more general smooth function classes. Like GSW, the proposed design retains sufficient robustness against model misspecification. Our implementation operates directly on a Gram matrix, avoiding the expensive step of explicitly constructing the nonlinear covariate expansions. We further accelerate the nonlinear design via a low-rank approximation of the Gram matrix, achieving runtimes comparable to the GSW of Harshaw et al (2024) while preserving nonlinear covariate balance and robustness.

Abstract:
Federated learning has emerged as the foremost approach for decentralized model training with privacy preserving. The global class imbalance and crossclient data heterogeneity naturally coexist, and the mismatch between local and global imbalances exacerbates the performance degradation of the aggregated model. The agnosticism of global class distribution poses significant challenges for data-level methods, especially under extreme conditions with severe class absence across clients. In this paper, we propose FedReLa, a novel data-level approach that tackles the coexistence of data heterogeneity and class imbalance in federated learning. By re-labeling samples with a feature-dependent label re-allocator, FedReLa corrects biased global decision boundaries without requiring knowledge of the global class distribution. This modular, model-agnostic approach can be integrated with algorithmic methods to deliver consistent improvements without additional communication overhead. Through extensive experiments, our method significantly improves the accuracy of minority classes and the overall accuracy on stepwise-imbalanced and long-tailed datasets, outperforming the previous state of the art.

Abstract:
This paper introduces EPS3D, a new endto-end feed-forward framework for open-vocabulary 3D panoptic segmentation. Unlike existing methods relying on additional preprocessing, we design an end-to-end architecture, with a distillation-based training strategy on diverse 3D scenes to predict 3D-aware semantic and instance features from multi-view images, improving 3D consistency and avoiding error accumulation. We further propose a mutual enhancement module to enforce inherent semantic-instance consistency. By aligning semantics within instances (Ins2Sem) and refining instance features with semantic guidance (Sem2Ins), we achieve more coherent 3D scene understanding. Ultimately, EPS3D outperforms SOTA baselines on two benchmarks (e.g., +13% mIoU for semantics on Replica) with high efficiency (e.g., 1s per scene), supporting tasks like robotic manipulation and 3D scene editing.

Abstract:
Multiple kernel learning~(MKL), which borrows ideas from ensemble learning, aims to achieve improved generalization performance by treating individual kernels as base learners and combining them appropriately. However, existing MKL methods often lack comprehensive consideration of diversity among base kernels, which has been demonstrated to play an essential role in ensemble learning. Moreover, the traditional kernels are predefined functions and equally treat all the input features, which ignore the diversity of feature, and yield suboptimal performance. In this paper, we formally define kernel diversity, and propose a novel datadriven class kernel named Recursive Feature Machine~(RFM) kernel, which is able to learn the feature importance directly from the different datasets. Moreover, a novel kernel selection method is proposed that explicitly optimizes both kernel diversity and quality. The resulting binary quadratic programming problem is NP-hard. Therefore, it is reformulated as a linear program and accelerated via sketching techniques, and a theoretical analysis of the estimation error is provided based on covering number bounds. Extensive empirical studies demonstrate that the proposed method outperforms state-of-the-art MKL approaches.

Abstract:
Realworld super-resolution (SR) is challenging due to complex degradations. HYPIR, a recent state-of-the-art diffusion-based restoration model, struggles to deal with this task in a single step. Although a naive two-step cascade improves the results, over-saturation, limited fine-grained details, and high inference latency remain. To address these limitations, we present HYPIR++. It removes the degradation removal encoder and noise augmentation to better preserve fidelity cues from low-quality inputs. To enhance fine-grained detail restoration and local structure fidelity, HYPIR++ introduces a tailored latent ConvNeXt and a latent patch discriminator, enabling adversarial learning directly in the latent space. In addition, HYPIR++ improves inference efficiency by reducing the text sequence length and replacing full attention with sparse neighbor attention, allowing direct processing of high-resolution images without block-based tiling. Extensive experiments demonstrate that HYPIR++ achieves superior perceptual quality and a 1.71× speedup over HYPIR, establishing a new state-of-the-art for real-world SR.

Abstract:
Multiview clustering aims to obtain a consensus clustering by integrating complementary and consistent information from multiple views. However, two critical challenges still exist in variational methods: (1) view heterogeneity and noise often make fusion unreliable; (2) ambiguous posteriors and misassigned boundary samples impact the clustering performance. To address these issues, we propose Dual-stage Contrastive Learning-enhanced Multi-view Variational Clustering (DCL-MVC), which integrates contrastive learning into both the fusion and representation stages. Firstly, at the fusion stage, we introduce a fusion-then-attention mechanism to capture cross-view interactions and learn view-level attention weights for building a unified and reliable fused representation, and further introduce instance-level contrastive learning to enforce cross-view alignment at the instance level. Secondly, we focus on boundary samples with uncertain posteriors and refine their cluster assignments by using cluster-center contrastive loss to enlarge inter-cluster margins, while leveraging prototypical contrastive learning with a confidence-aware curriculum to promote intra-cluster compactness at the representation stage. Extensive experiments on six real-world datasets demonstrate consistent improvements over strong baselines and validate the contribution of each component.

Abstract:
Weakto-strong generalization (W2SG) is the phenomenon in which a powerful student model, trained on labels produced by a weaker teacher, ultimately outperforms the teacher on the target task. In this work, we theoretically investigate how W2SG can arise via a generalized bias–variance decomposition under Bregman divergence. We show that the expected population risk gap between the student and the teacher is characterized by the expected misfit between the two models. Unlike earlier misfit-based analyses, our theory removes several restrictive assumptions, e.g., it does not require the student hypothesis class to be convex. Our results indicate that W2SG is more likely when the student effectively approximates the teacher's posterior mean. Specializing to squared loss, we provide a sufficient condition (illustrated through a concrete example) under which the student converges to its posterior mean teacher; in particular, increasing the student model size can ensure this convergence. For cross-entropy loss, our analysis further suggests that lowering the entropy of the student's predictive distribution can promote W2SG. We also find that the reverse cross-entropy, unlike the standard forward cross-entropy, is less sensitive to the teacher's predictive uncertainty. Finally, we verify these theoretical insights empirically and demonstrate that incorporating reverse cross-entropy consistently improves student performance.

Abstract:
Realworld datasets often follow a long-tailed distribution, making generalization to tail classes difficult. We revisit this problem through the lens of shortcut learning, where models prefer the easiest predictive cues (e.g., background or textures) over object-centric semantics, especially under scarce and biased supervision. We find that this tendency is amplified for tail classes: limited examples often share similar contexts, making non-semantic signals highly correlated and thus tempting shortcuts, whereas head classes with diverse appearances and environments encourage more stable object-focused representations. Motivated by this observation, we propose Shortcut-Resistant CAM Distillation (SRCD), a plug-and-play framework that transfers object-focused explanations from head to tail classes. SRCD operates in the Class Activation Map (CAM) space, where a CAM provides a class-specific spatial evidence map for a prediction. SRCD aggregates CAMs from a small set of head-class candidates into a shortcut-resistant teacher using an energy-model weighting based on coherence (Laplacian smoothness) and concentration (Hoyer sparsity), and distills it to the tail-class CAM. We provide a theoretical analysis that quantifies shortcut reliance as shortcut-region evidence mass in CAM space and shows that SRCD suppresses tail shortcuts. Extensive experiments on long-tailed benchmarks consistently improve strong baselines.

Abstract:
Programmatic policies are often reported to generalize better than neural policies in reinforcement learning (RL) benchmarks. We revisit some of these claims and show that much of the observed gap arises from uncontrolled experimental factors rather than intrinsic representational reasons. Reevaluating three core benchmarks used in influential papers---TORCS, Karel, and Parking---we find that neural policies, when trained with a few modifications, such as sparse observations and cautious intrinsic reward functions, can match or exceed the out-of-distribution (OOD) generalization of programmatic policies. We argue that a representation enables OOD generalization if (i) the policy space it induces includes a generalizing policy and (ii) the search algorithm can find it. The neural and programmatic policies in prior work are comparable in OOD generalization because the domain-specific languages used induce policy spaces similar to those of neural networks, and our modifications help the gradient search find generalizing solutions. By disentangling representational factors from experimental confounds, we advance our understanding of what makes a representation succeed or fail at OOD generalization.

Abstract:
In this paper, we propose a discrete rotoreflection group equivariant vision transformer with convolutional attention. Roto-reflection equivariant networks preserve the rotational, flip and positional symmetry in feature maps, making them useful for tasks where orientation of the inputs is relevant to the model outputs. In image classification and object detection, most of the studies on roto-reflection equivariant models have focused on using convolutional neural networks rather than vision transformers. In this paper, we examine the challenges involved in achieving equivariance in vision transformers, and we propose a simpler way to implement a discretized roto-reflection group equivariant vision transformer. The experimental results demonstrate that our approach outperforms the existing approaches for developing discrete roto-reflection group equivariant neural networks for image classification.

Abstract:
Human spatial understanding arises from jointly perceiving geometry and semantics, enabling consistent object identification and localization across viewpoints and time. Current video segmentation models depend on explicit object appearance memory banks for instance tracking, yet they remain vulnerable to large viewpoint changes and longterm occlusions. Leveraging the spatial consistency afforded by modern feed-forward 3D reconstruction models, we propose the Geometry Grounded Tracking Anything Model (G^2TAM), a unified framework for promptable instance tracking in 3D using only unordered RGB images or videos. G^2TAM employs spatially aligned geometric representations as implicit memory, ensuring stable instance identity and localization across frames and views. At its core is a cross-modal spatial encoder that integrates visual and textual prompts into a shared geometric space, enabling end-to-end spatial reconstruction and instance-consistent mask prediction. To support training and evaluation, we construct InsTrack, a large-scale dataset with a dedicated validation split for benchmarking. Extensive experiments show that G^2TAM delivers strong cross-view consistency, promptable instance spatial tracking, video object segmentation, and spatial reconstruction, establishing a foundation for interactive, geometry-grounded spatial reasoning.

Abstract:
Datadriven modeling in real-world regression tasks often suffers from limited training samples, high collection costs, and noisy observations. Inspired by the impact of data augmentation in vision and language, we propose a novel Counterfactual Residual Data Augmentation (CRDA) technique for tabular regression. Our key insight is that once a regressor has modeled the systematic component of the data, the remaining noise can be viewed as an invariant residual that remains stable under small perturbations of carefully selected features. We exploit this residual invariance to generate new, yet realistic, training samples, effectively expanding the dataset without requiring additional real data. Our method is model-agnostic and readily applicable to various types of regressors. In experiments across datasets from a variety of benchmark repositories, on average, CRDA reduces an MLP Regressor's MSE by 22.9% and an XGBoost Regressor's MSE by 6.4%. When compared to existing state-of-the-art data generators and augmentation techniques, CRDA consistently outperforms in MSE reduction. By adding principled counterfactual variations to the training data, our method offers a simple and efficient remedy for noise-prone, small-sample regression settings.

Abstract:
Despite the proliferation of proactive defenses against deepfakes, the lack of a unified evaluation protocol precludes fair comparison and masks critical vulnerabilities. To bridge this gap, we present the first comprehensive benchmark that systematically assesses disruption, robustness, and transferability encompassing pixel, perceptual, and identity metrics. Our extensive analysis reveals that fidelity and identity metrics capture orthogonal performance axes, often leading to conflicting interpretations when relied upon individually. Furthermore, we identify a fundamental tradeoff where peak white-box performance signals overfitting, and we introduce a calibrated evaluation to correct generator-induced identity bias. By exposing these blind spots, we establish a rigorous standard to guide the development of genuinely generalizable protections.

Abstract:
Video inpainting aims to restore missing regions while preserving spatial and temporal coherence. Diffusionbased methods achieve strong per-frame reconstruction, but their sampling implicitly generates temporally coupled latent trajectories whose long-horizon stability is not explicitly modeled, leading to a trade-off between temporal consistency and structural detail. We revisit video inpainting from the perspective of temporal trajectory stability, viewing temporal inconsistency as instability along time-indexed denoising trajectories rather than an output-level error. Based on this view, we propose an inference-time trajectory stabilization framework that monitors motion-aligned deviation and triggers risk-aware correction only when instability accumulates. It combines sparsely sampled trajectory anchors as stability references with neighborhood-consistent propagation to regulate trajectory evolution while preserving local generative freedom. Implemented as a lightweight control layer in the sampling loop, it selectively contracts unstable trajectories toward motion-consistent manifolds instead of enforcing uniform temporal constraints. Experiments show consistent improvements in temporal coherence and structural fidelity.

Abstract:
Modern recommender systems are typically trained offline and deployed with parameters held fixed between periodic refreshes, yet user behavior can evolve substantially during deployment. This can cause ranking utility to degrade over time and makes it difficult to provide formal guarantees about recommendation quality. We proposeCARE, an adaptive calibration framework that wraps an arbitrary backbone recommender and outputs variablesize recommendation sets with finite-sample performance guarantees over interaction streams. CARE combines (i) a loss-based monitoring module that localizes behavioral changes and triggers threshold recalibration, and (ii) an online aggregation rule that promotes compact recommendation sets by dynamically reweighting candidate set predictors. We provide theoretical results establishing finite-sample guarantees for utility-based risk control and bounds on the expected set size relative to the best constituent predictor. Experiments across multiple datasets and backbone models demonstrate that CARE improves robustness and maintains compact recommendation sets while preserving the desired statistical guarantees. The code and implementation are available in https://anonymous.4open.science/r/CARE-FCBD.

Abstract:
Despite the advancing reasoning capabilities of large language models (LLMs), many reasoning evaluations are increasingly compromised by data contamination, which induces unreliablecontaminated reasoningon leaked inputs. While this phenomenon is widely observed, its underlying mechanism remains poorly understood, hindering the ability to distinguish generalization from memorization and to develop effective solutions. In this work, we first identify a distinctive signal of contaminated reasoning, namely the mutual information decay between representations and gradients. Our mechanistic analysis reveals that contaminated models exhibit pronounced eigenspectrum concentration in their representations, leading to a lowdimensional computation regime. Under leaked inputs, this mechanism weakens the linear coupling between representations and gradients, manifested as a structural decay of the singular values in the whitened space. We show that this narrowing geometry mathematically implies a reduction in mutual information, and further demonstrate the practical utility of our analysis by successfully restoring the reasoning behavior of contaminated models, achieving an 11.03% improvement in average consistency with the base model over the strongest baseline.

Abstract:
Efficiently training recurrent neural networks to capture temporal structure is fundamental to machine intelligence, yet scaling gradientbased learning remains an open challenge. The standard global paradigm, backpropagation through time (BPTT), suffers from memory costs that scale linearly with sequence length and susceptibility to gradient instability. Conversely, biologically inspired local learning rules, while efficient, often introduce severe bias, failing to match the performance of global optimization. Inspired by the criticality observed in biological neural circuits, we introduce Criticality-driven Online Local Alignment (COLA). By leveraging the long-range spatiotemporal correlations inherent to the critical regime, COLA enables a strictly local learning rule to approximate the efficacy of global error propagation, thereby combining the advantages of online plasticity with the precision of gradient descent. Theoretically, for an RNN with H hidden units, the method reduces learning complexity to a O(H) auxiliary state with constant activation memory, independent of sequence length. Empirically, COLA matches BPTT on standard benchmarks and demonstrates superior robustness on stability-sensitive tasks. We support these results with a rigorous analysis on the approximation error, providing a theoretical foundation for reliable, scalable online learning.

Abstract:
Recent advances in generative models, such as diffusion models, have raised concerns related to privacy, copyright infringement, and data curation. Prior work has shown that training data can be reconstructed from such models, but existing attacks typically rely on substantial computational resources, access to the training set, or carefully engineered prompts. In this work, we present a lowresource reconstruction attack that operates through seemingly benign prompts and requires little to no access to the training data. Our attack targetstemplate-memorized images (TMI), where recurring layouts and visual structures are memorized during training. We show that such memorization manifests under potentially realistic usage. This raises a possibility of unintentional reconstruction by naive users that don't carry explicit adversarial intent. For example, we observe that a simple prompt such as "blue Unisex T-Shirt" can reproduce visual content depicting a real individual. Beyond extraction, we observe novel phenomena occurring in TMI (e.g., interpolation), raising questions about the novelty of generated content and the effectiveness of established methods for detecting memorized content.

Abstract:
Invariant learning can fail even when the invariant structure is statistically identifiable. We show an inherent computational barrier: under the Planted Clique hypothesis, there exist samplable linearGaussian multi-environment instances with a one-dimensional invariant subspace (k=1) that are learnable with polynomial samples by exhaustive search but intractable for any polynomial-time algorithm, via an average-case reduction from a supervised sparse primitive. We further quantify environment diversity by a separation parameter \gamma, which controls identifiability and the curvature of invariance objectives. Under sufficient diversity, the minimax risk is \mathbbE[dist(\hatV,V_\mathrminv)^2]=\Theta(k(d-k)/(n|\mathcalE|)), and under label-induced shifts a phase transition occurs at n^\propto k(d-k)/(|\mathcalE|\gamma^2). Synthetic and real datasets validate the predicted gaps and transitions and motivate simple diversity diagnostics.

Abstract:
This paper studies the problem of graph outof-distribution (OOD) detection, which aims to identify anomaly graphs out of a graph dataset. Prior efforts usually focus on the utilization of topological structures with unsupervised graph learning to foster typical pattern recognition, which overlooks the semantic structure preserved in contextually affine neighborhoods. Towards this end, we propose a novel approach named Contextual Affinity Exploration with Twin Concordance (CLINIC) for graph OOD detection. The core of CLINIC is to explore and exploit the contextual affinity of the graph data samples for discriminative graph representations. In particular, our CLINIC first builds a contextual affinity graph to depict the semantic structure in the hidden space. More importantly, we introduce high-order affinity to enhance geometric understanding of the structure by utilizing a meta-graph neural network. To enhance representation discriminability with high robustness, we introduce twin concordance learning, which not only minimizes the difference of affinity distributions across different views, but also encourages the consistency between contextually affinitive neighbors. Finally, we introduce a compression strategy to expand the decision boundary for enhanced separation between in-distribution and out-of-distribution graphs. Extensive experimental results demonstrate the superiority of our CLINIC across ten real datasets in comparison to various baselines.

Abstract:
Visionlanguage models (VLMs) have progressed rapidly with large-scale high-quality data and adaptation strategies, yet remain brittle under real-world corruptions, where both visual recognition and language-grounded reasoning degrade. Beyond cascaded image restoration, a natural alternative is parameter-efficient adaptation that aligns corrupted features with clean references; however, Euclidean alignment alone is not semantics-preserving and can even harm downstream reasoning. We attribute this to a semantic misalignment gap, where features become geometrically closer while drifting off the in-distribution support on which multimodal reasoning is calibrated. To address this, we propose Manifold-Adversarial Adapters (MAA), lightweight layer-wise modules for a frozen vision encoder that explicitly steer corrupted features back onto the clean in-distribution manifold rather than merely shrinking feature-space distance. MAA combines paired feature self-distillation with a token-level adversarial manifold constraint to prevent off-manifold semantic shortcuts. At inference, only the adapters are retained, enabling single-stage robustness with negligible overhead and avoiding the latency and semantic drift of restoration pipelines. Across benchmarks and corruption settings, MAA consistently improves performance over strong baselines.

Abstract:
Mixtureof-Experts (MoEs) have become a central component of many state-of-the-art open-source and proprietary large language models. Despite their widespread adoption, it remains unclear how close existing MoE architectures are to optimal with respect to inference cost, as measured by accuracy per floating-point operation and per parameter. In this work, we revisit MoE design from a hardware-software co-design perspective, grounded in empirical and theoretical considerations. We characterize key performance bottlenecks across diverse deployment regimes, spanning offline high-throughput execution and online, latency-critical inference. Guided by these insights, we introduce LatentMoE, a new model architecture resulting from systematic design exploration and optimized for maximized accuracy per unit of compute. Empirical design space exploration at scales of up to 95B parameters and over a 1T-token training horizon, together with supporting theoretical analysis, show that LatentMoE consistently outperforms standard MoE architectures in terms of accuracy per FLOP and per parameter.

Abstract:
In this paper, we identify a semantic decomposition in robot action sequences, separating tasklevel motion intent from execution-level refinements. By analyzing actions in the spectral domain using the discrete cosine transform (DCT), we observe that low-frequency components capture global motion trajectories, while high-frequency components encode precise timing, alignment, and contact behaviors. Motivated by this structure, we propose Causal Spectral Policy (CSP), which models action generation as a causal coarse-to-fine process: coarse motion is predicted from observation and language, and fine corrections are generated conditionally on the realized trajectory. Across simulation and real-world evaluations, CSP consistently outperforms strong baselines on precision-sensitive manipulation tasks. Additionally, we propose human-inspired teleoperation noise injection as a data augmentation method under which our approach demonstrates strong robustness to noisy demonstrations.

Abstract:
While traditional network analysis focuses on singlelayer networks, real-world systems often form multilayer networks with multiple relationship types. However, existing methods typically fail to capture complex inter-layer dependencies by treating layers independently or aggregating them. To address this, we propose T-GINEE (Tensor-Based Generalized Multilayer-graph Estimating Equation), a statistical regularization framework combining tensor-based generalized estimating equations with task-specific loss to model cross-network correlations explicitly. Key innovations include: (1) CP tensor decomposition capturing structural dependencies via shared latent factors; (2) a generalized estimating equation framework modeling inter-layer correlations through working covariance matrices; and (3) a flexible link function accommodating characteristics like sparsity. Our theoretical analysis establishes consistency and asymptotic normality under mild conditions. Extensive experiments on synthetic and real-world datasets validate T-GINEE’s effectiveness for multilayer network analysis. Our code is available in the supplementary materials to ensure reproducibility.

Abstract:
Adaptive control of nonlinear systems under unknown disturbances requires learning algorithms aligned with the downstream control objective. While controloriented meta-learning addresses the mismatch between regression-based identification and tracking performance, existing methods rely on Euclidean or static algebraic geometries that fail to capture the distributional structure of system uncertainties. We propose a framework that lifts adaptation into Wasserstein space, measuring parameter estimation errors as the optimal transport cost between estimated and true system behaviors. By constructing a Wasserstein Bregman divergence over representative task distributions, we use meta-learning to jointly optimize nonlinear feature representations, control gains, and transport geometry. This adaptation law learns an adaptation geometry that captures structural properties of the underlying physical system, implementing a physically grounded, data-driven attention mechanism. Closed-loop tracking simulations demonstrate that our controller achieves optimal performance on both fully-actuated and underactuated nonlinear planar rotorcraft, maintaining robustness under significant distributional shifts between training and testing conditions.

Abstract:
Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models (LLMs) to elicit the aligned and expected behaviors. Despite the empirical success, it has never been examined whether one could locate the faithful concepts for intervention. In this work, we explore the question in safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both indistribution adversarial prompts and the out-of-distribution (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is infeasible in the general non-linear setting. To tackle the issue, we propose \textttConcept Concentration (\textttCOCA). Instead of identifying the faithful locations to intervene, \textttCOCA refactors the training data with an explicit reasoning process, which first identifies the potential unsafe concepts and then decides the responses. Essentially, \textttCOCA simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that \textttCOCA significantly reduces both in-distribution and OOD jailbreak success rates, and meanwhile maintaining strong performance on regular tasks such as math and code generation.

Abstract:
We present a general framework for synthesizing search spaces of parametric functions, along with strategies for traversing these spaces to find optima. We formalize an algebraic theory for the categorical model of parametric functions in finite combinatory logic with predicates (FCLP). Based on a componentoriented synthesis framework for FCLP we automate composition from given components and search for parametric functions. Components are language-agnostic and may be instantiated as any implementation of parametric functions, e.g., as PyTorch modules. A proof-of-concept implementation demonstrates how to represent more specific concepts, such as neural architecture search and hyperparameter optimization, within the framework.

Abstract:
We propose the PseudoMallows distribution over the set of all permutations of n items, to approximate the posterior distribution of the Bayesian Mallows model. The Bayesian Mallows model has been successfully used for recommender systems to learn personal preferences from highly incomplete users data. However current inference algorithms do not scale, preventing its use in real-time applications. The Pseudo-Mallows distribution is a product of univariate discrete Mallows-like distributions, where the quality of the approximation depends on the order of the n items in the factorization sequence. In a variational setting, we optimize the variational order parameter by minimising a marginalized KL-divergence, conjecturing a certain form of the optimal variational order that depends on the data, and proposing an approximation algorithm for this discrete optimization. Empirical evidence and some theory support our conjecture. We demonstrate on clicking data that variational inference via the Pseudo-Mallows distribution allows much faster probabilistic preference learning compared to alternative MCMC-based options.

Abstract:
The increasing prevalence of largescale hypergraphs poses significant computational challenges for hypergraph neural network (HNN) training. To address this, hypergraph condensation (HGC) distills large real hypergraphs into compact yet informative synthetic ones, going beyond traditional graph condensation (GC) methods limited to modeling pairwise relations. However, existing HGC methods rely on decoupled training architectures, where structure generators are pre-trained on the original hypergraph but not jointly optimized with condensed features during refinement, resulting in misaligned structures that degrade downstream utility. Moreover, trajectory-based optimization incurs substantial computational overhead in refinement, limiting condensation efficiency. To tackle these issues, we propose Anchor-guided HyperGraph Condensation with Dual-level Discrimination (AHGCDD), which consists of three key components: (1) a node initialization module based on Heat Kernel PageRank (HKPR) to encode structural knowledge into feature semantics; (2) an anchor-guided hyperedge synthesis scheme based on feature-level association for joint optimization of condensed features and structure; (3) a theoretically grounded dual-level discrimination objective for utility-preserving condensation without redundant HNN training. Extensive experiments demonstrate the superior effectiveness and efficiency of AHGCDD.

Abstract:
Longcontext language modeling is commonly framed as a scalability challenge of token-level attention, yet local-to-global information structuring remains largely implicit in existing approaches. Drawing on cognitive theories of discourse comprehension, we propose HiCI (Hierarchical Construction--Integration), a hierarchical attention module that constructs segment-level representations, integrates them into a shared global context, and broadcasts both to condition segment-level attention. We validate HiCI through parameter-efficient adaptation of LLaMA-2 with only ～5.5% additional parameters, extending context from 4K to 100K tokens (7B) and 64K tokens (13B). Across language modeling, retrieval, and instruction-following benchmarks, HiCI yields consistent improvements over strong baselines, including matching proprietary models on topic retrieval and surpassing GPT-3.5-Turbo-16K on code comprehension. These results demonstrate the effectiveness of explicit hierarchical structuring as an inductive bias for long-context modeling.

Abstract:
We study the error rate of LLMs on tasks like arithmetic that require a deterministic output, and repetitive processing of tokens drawn from a small set of alternatives. We argue that incorrect predictions arise when small errors in the attention mechanism accumulate to cross a threshold, and use this insight to derive a quantitative twoparameter relationship between the accuracy and the complexity of the task. The two parameters vary with the prompt and the model; they can be interpreted in terms of an elementary noise rate, and the number of plausible erroneous tokens that can be predicted. Our analysis is inspired by an "effective field theory'' perspective: the LLM's many raw parameters can be reorganized into just two parameters that govern the error rate. We perform extensive empirical tests, using Gemini 2.5 Flash, Gemini 2.5 Pro and DeepSeek R1, and find excellent agreement between the predicted and observed accuracy for a variety of tasks, although we also identify deviations in some cases. Our model provides an alternative to suggestions that errors made by LLMs on long repetitive tasks indicate the "collapse of reasoning'', or an inability to express "compositional'' functions. Finally, we show how to construct prompts to reduce the error rate.

Abstract:
Debate, where AI agents argue opposing positions, has emerged as a key approach to scalable oversight. However, debate faces a fundamental tension: models are incentivized to be persuasive to the judge, which may not always align with epistemic honesty. In this work, we propose an alternative paradigm:disagreement resolution, which reframes the interaction mechanism from adversarial debate to collaborative truth seeking. Drawing on principles from human mediation and conflict resolution, where mediators facilitate dialogue to help disputing parties reach consensus rather than adjudicating between them, we design an automated pipeline that adapts these strategies to AI oversight. Unlike standard debate where models argue for fixed positions, our pipeline directs models to collaboratively identify points of disagreement, examine the evidence for conflicting claims, and converge toward consensus or isolate the specific ``crux'' of their disagreement. We find that Disagreement Resolution consistently helps nonexpert models identify the truth, achieving 62.1% judging accuracy compared to 49.2% for standard debate. Our results provide encouraging empirical evidence for rethinking the scalable oversight protocol from adversarial persuasion to collaborative truth-seeking.

Abstract:
LLMs cannot reliably recognize their parametric knowledge boundaries and often hallucinate answers to outsideof-boundary questions. In this paper, we introduce MASH (Modeling Abstention via Selective Help-seeking), a training framework that readily extracts abstentions from LLMs. Our key idea is that any external help-seeking by an LLM, i.e. search tool use, can serve as a proxy for abstention if the external help (search) is appropriately penalized while also rewarding answer accuracy. MASH operationalizes this idea using reinforcement learning with a pay-per-search reward. We run experiments on three knowledge-intensive QA datasets. Our results show that MASH substantially improves upon the selective help-seeking performance of prior efficient search approaches; on multi-hop datasets, it improves answer accuracy by 7.6%. Furthermore, MASH demonstrates strong off-the-shelf abstention performance, showcasing behavior competitive with prior abstention methods that additionally require predetermining model knowledge boundaries to construct training data. Overall, we show that MASH training effectively aligns search tool use with parametric knowledge, which can be successfully leveraged for making abstention decisions and efficient search tool use.

Abstract:
Code transformation is a foundational capability in the software development process, where its effectiveness relies on constructing a highquality code representation to characterize the input code semantics and guide the transformation. Existing approaches treat code transformation as an end-to-end learning task, leaving the construction of the representation needed for semantic reasoning implicit in model weights or relying on expensive compiler-level abstractions. We present SemRep, a framework that improves code transformation through generative code representation learning. Our key insight is to employ the semantics-preserving transformations as the intermediate representation, which will be used to train the model as a generative task, and also guide the subsequent instruction-specific code transformations. Across general code editing and CUDA kernel optimization, SemRep outperforms the strong closed-weight baselines by 6.9% and 43% in correctness, 13.9% in generalization and 6.7% in robustness. Combined with an evolutionary coding agent, SemRep finds optimizations that 685B larger-weight baselines fail to discover while achieving the same performance with 25% less inference compute.

Abstract:
In openended domains, natural language instructions are often underspecified, mapping to multiple valid yet functionally distinct latent intents. While Large Language Models (LLMs) excel at generation, their ability to resolve such task ambiguity through interaction is currently hampered by semantic blindness—a tendency to squander interaction budgets on distinguishing trivial syntactic variants rather than fundamental intent differences. To address this, we propose Topological Active Inference (TAI), a geometric framework that recasts disambiguation as a process of intent-manifold contraction. TAI first leverages Persistent Homology to recover the topological skeleton of the solution space, theoretically guaranteeing the separation of semantic signal from syntactic noise. Subsequently, it synthesizes clarifying questions as separating hyperplanes designed to efficiently bisect the probability mass of the intent manifold. We introduce Topological Expected Information Gain (TEIG) for question selection and prove that maximizing TEIG reduces query complexity from linear \mathcalO(N) to logarithmic \mathcalO(\log K), where K is the number of latent intents. Extensive experiments demonstrate that TAI recovers user intent with significantly fewer turns, achieving state-of-the-art disambiguation efficiency.

Abstract:
Diffusion models have achieved remarkable success in graph generation, but enforcing hard constraints on generated graphs remains challenging, limiting their deployment in constraintcritical applications. Existing approaches either fail to guarantee strict constraint satisfaction or are limited to narrow constraint types, lacking the flexibility to handle diverse constraint specifications. To address this challenge, we exploit the discrete structure of graphs, which allows hard constraints to be formulated as symbolic reasoning problems. Building on this insight, we propose NSPSG, a framework that integrates unconstrained diffusion models with discrete projection operators. NSPSG employ an SMT (Satisfiability Modulo Theories)-based projector to ensure that the generated graphs strictly satisfy constraints while remaining within the training data distribution. To further accelerate generation, we employ a supervised auto-regressive neural projector to approximate the symbolic reasoning process. Across heterogeneous constraints and various graph generation datasets, NSPSG achieves 99%-100% validity rates, demonstrating state-of-the-art performance. Notably, for a complex non-linear constraint, it improves data validity by up to 43% and reaches 99% validity while maintaining comparable generation efficiency.

Abstract:
Understanding cellular phenotypes and how they respond to perturbations is critical for disease biology and therapeutic design. Singlecell RNA sequencing enables characterization at cellular resolution, yet the combinatorial space of conditions makes exhaustive experimental mapping infeasible. We introduce single-cell Concept Bottleneck Generative Models (scCBGM), a framework for interpretable and precise counterfactual editing of individual cells. scCBGM adapts concept bottleneck architectures for single-cell data through decoder skip connections and a cross-covariance penalty that promotes disentanglement without dimensional constraints. We extend the framework to flow matching models, enabling concept-guided editing in both encoding-decoding and generation regimes. To enable rigorous evaluation, we develop a synthetic benchmark with ground-truth counterfactuals. Across multiple real datasets, scCBGM demonstrates superior accuracy in zero-shot generalization and cell-level counterfactual prediction compared to state-of-the-art methods, while providing interpretable control over biological concepts.

Abstract:
Recent work shows that flowmatching networks can be effective for value function estimation in reinforcement learning, but it remains unclear why they work well or whether flow-matching Q-functions differ fundamentally from standard critics. We show that their success is not explained by distributional RL: explicitly modeling return distributions often degrades performance. Instead, we argue that flow-matching Q-functions are effective because they couple a learned velocity field with an integration procedure that is used both during training and to read out Q-values at inference time. This coupling enables robust value prediction through \emphtest-time recovery from imperfect intermediate estimates where errors dampen out as more integration steps are performed. This mechanism is absent in monolithic critics. Beyond test-time recovery, training with the integration procedure induces more \emphplastic representations, allowing critics to represent non-stationary future TD targets without overwriting previous features. We formalize these effects and validate them empirically, showing that flow-matching critics outperform monolithic critics by over 2× in performance and achieve 5–10× higher sample efficiency in high-UTD regimes.

Abstract:
Modern optimizers combine gradients from the current minibatch with historical optimization state, such as momentum or adaptive moments. While effective for stability, this interaction can produce update directions that increase the loss of individual samples in the current batch. We formalize this effect asharmand cast the computation of an update as an optimization problem that explicitly minimizes the harmful impact of past optimization state on current data. To make this optimization problem tractable, we first reduce its dimensionality from the number of parameters to the batch size, and further show that restricting the optimization to the last layer provides an effective and efficient proxy. The resulting subproblem can be solved with a small number of GPU-friendly iterations and integrated seamlessly into SGD with momentum and AdamW. Experiments on image classification benchmarks show reduced per-sample interference and improved generalization with moderate overhead.

Abstract:
MetaReinforcement Learning (Meta-RL) focuses on training policies using data collected from a variety of diverse environments. This approach enables the policy to adapt to new settings with only a few training steps. While many Meta-RL methods have demonstrated success, they often rely on the assumption that unobserved confounders can be excluded \empha priori. This paper investigates robust Meta-RL in sequential decision-making, given confounded observational data collected across multiple heterogeneous environments. We introduce a novel augmentation procedure for standard Meta-RL algorithms (e.g., MAML), which employs partial identification methods to generate posterior counterfactual trajectories from candidate environments that align with the confounded observations. These counterfactual trajectories are then used to find a policy initialization that produces strong generalization performance in the target domain. Theoretical analysis reveals that our causal Meta-RL approach is guaranteed to yield a solution that minimizes generalization loss in future inference tasks.

Abstract:
Visual outcomes are increasingly central to multimodal large language models, making reliable and finegrained verification essential for scaling generalist foundation models. In this work, we investigatemultimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we trainOmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enablesM1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.

Abstract:
Masked Diffusion Models (MDMs) have emerged as a promising alternative to autoregressive models in language modeling, offering the advantages of parallel decoding and bidirectional context processing within a simple yet effective framework. Specifically, their explicit distinction between masked tokens and data allows their simple framework and effective conditional generation. However, MDMs typically require many sampling iterations due to factorization errors stemming from simultaneous token updates. We observe that a theoretical lower bound of the factorization error exists, which standard MDMs cannot reduce due to their use of a deterministic singlestate mask. In this paper, we propose the Infinite Masked Diffusion Model (IMDM), which introduces a stochastic infinite-state mask to mitigate the theoretical bound while directly inheriting the benefits of MDMs, including the compatibility with pre-trained weights. We empirically demonstrate that MDM fails to perform few-step generation even in a simple synthetic task due to the factorization error bound, whereas IMDM can find an efficient solution for the same task. Finally, when equipped with appropriate distillation methods, IMDM surpasses existing few-step distillation methods on LM1B and OpenWebText datasets.

Abstract:
Semi‑supervised statistical inference plays a key role in biomedical research, where labeled data often have higher quality but are limited due to costly clinical annotation. Yet, existing semi‑supervised statistical inference methods rely heavily on structured variables and strictly matched covariates between labeled and unlabeled datasets limitations ill‑suited for the heterogeneity and unstructured nature of real-world biomedical data. Modern biomedical studies increasingly collect unstructured data (clinical notes, patient audio and video recordings), with inconsistent protocols across datasets causing covariate misalignment (for instance, detailed medication histories may be recorded in one study but not another). Recent advances in pre‑trained multimodal large language models (LLMs), which excel at handling unstructured data, present an attractive potential solution. To transform this potential into rigorous semi-supervised statistical inference methods for biomedical research, two key challenges must be addressed: (1) How can we reliably integrate LLMs to enhance semi-supervised inference efficiency without compromising statistical validity? (2) How can those efficiency gains persist despite mismatched covariates between labeled and unlabeled datasets? In this paper, we tackle these challenges by systematicallycalibrating pseudo-labelsprovided LLMs with a novelprediction-invariance identificationstrategy. Our resulting semi‑supervised inference framework improves parameter estimation efficiency while maintaining full statistical validity, as demonstrated through our theoretical results and illustrated in a case study for identifying key biomarkers in Alzheimer’s disease detection with speech data.

Abstract:
Anomaly detection (AD) methods often ignore causal dependencies and treat anomalies as outliers, which is brittle when anomalies are primarily mechanism violations rather than extreme values. We propose CausalAno, a causalaware detector that trains a causal GAN on normal data and leverages its discriminator to learn mechanism-consistent representations. We score test samples by fitting a Gaussian model in this feature space and computing the Mahalanobis distance, measuring deviation from the normal causal manifold. We demonstrate the effectiveness of CausalAno with extensive experiments on 28 tabular datasets (18 continuous-only and 10 mixed-type), comparing against 16 SOTA baselines. Our results show consistent improvements across both mixed-type and numerical-only settings. Our ablation studies further confirm that the gains come from the causal factorization in the causal GAN rather than a generic GAN backbone. Our CausalAno offers a practical and effective solution for real-world applications where anomalies often arise from dependency-breaking behaviors.

Abstract:
AIdriven disease characterization in histopathology promises to assist in clinical decision making, but its performance is limited by the scarcity of detailed annotations. In contrast, single-cell gene expression provides expressive and interpretable labels that compensate this scarcity, but assays are costly and rarely acquired in clinical workflows. To overcome this gap, we propose to bridge these data sources using a trimodal contrastive learning framework that aligns histopathology images, gene expression profiles, and natural-language descriptions. Our training data combines atlas-scale datasets of (i) spatially-resolved gene expression paired with histopathology images, and (ii) single-cell gene expression with curated annotations. Together, these data induce an alignment between image and text modalities, which we leverage for zero-shot image annotation tasks, such as the identification of immune cells. We present a sufficient condition under which this transfer can succeed and assess the performance of our approach against established baselines. We predict cell types at 15.4% improved relative AUROC over leading pathology vision language models. Our method also exhibits significant gains across diverse prediction tasks in low-data regimes, when combining training data from all three modality pairs. Our work thus establishestransitive representation learningas an effective strategy to enhance histopathology interpretation.

Abstract:
Multistate mechanisms underlie many of the complex functions observed in natural proteins. The ability to rationally design multistate proteins would have transformative implications for many areas of biotechnology, yet lies beyond the capabilities of existing deep learning frameworks for protein design. To address this gap, we introduce SwitchCraft, a versatile and programmatic framework for designing stateswitching proteins based on backpropagation through compositional design constraints parameterized by structure prediction models. In silico evaluations demonstrate success on a wide range of state-switching functional primitives, from allosteric regulation of motifs to discrimination of bound ligand identities. Using these primitives, we demonstrate an in silico strategy for de novo design of fluorescent biosensors to arbitrary small molecule analytes. These results position SwitchCraft at the inception of a powerful paradigm for higher-order functional protein design.

Abstract:
While neuroscience experiments have repeatedly demonstrated the involvement of large populations of neurons in even simple behaviors, these studies have just as often reported that the collective dynamics of neural activity are approximately lowdimensional. As a result, methods for identifying low-dimensional latent representations of time series data have become increasingly prominent in neuroscience. However, most existing methods either ignore temporal structure or model time evolution using latent dynamical systems approaches. In the first case, dynamics may be distorted or even scrambled in the latent space, while in the second, many possible latent dynamics may give rise to the same data. Here, we address these challenges using a novel flow-matching approach in which data are generated by a pair of flow fields, one governing time evolution, the other a mapping between data and a low-dimensional latent space. Importantly, the dimension-reducing flow is trained to minimize distortions of the temporal dynamics, learning an identifiable low-dimensional representation that preserves temporal relations in the original data. Additionally, we constrain our latent spaces to have low-dimensional support in a soft, parameterized manner, taking inspiration from ideas on nested dropout. Across both neural and behavioral data, we show that this dual flow approach produces both more interpretable dynamics and higher-quality reconstructions than competing models, including in noise-dominated data sets where conventional approaches fail.

Abstract:
The prevailing inference framework for diffusion models formulates generation fundamentally as a problem of numerical integration. This perspective casts the model as an accurate estimator, neglecting the inherent statistical uncertainty of the denoising process. In this work, we propose inferencetime Forward-process Alignment for Diffusion models (DiFA), a training-free inference framework that reformulates diffusion sampling as a sequential state estimation problem. Instead of discarding historical predictions, DiFA treats the inference trajectory as a sequence of correlated observations with varying variances. We derive a principled fading-memory Kalman filter strategy that synthesizes historical predictions to minimize estimation variance. Crucially, to counteract the over-smoothing typically associated with variance reduction, we introduce a deviation boosting mechanism that adaptively restores high-frequency details. Empirically, DiFA yields significant improvements in FID, IS, and FD-DINOv2 scores on CIFAR-10 and ImageNet, demonstrating that aligning inference with the forward statistical structure substantially improves generative fidelity.

Abstract:
Lowprecision formats such as BF16 and FP8 can greatly improve training efficiency, but fully low-precision training often degrades accuracy under standard optimizers. We identify a key cause: additive updates can vanish under coarse mantissa resolution. We introduce M+Adam, an optimizer for stable low-precision training that operates on a mantissa--exponent decomposition of weights and carries out Adam-Madam updates in parallel. Madam is a multiplicative analogue of Adam, where instead of applying additive updates, it performs updates multiplicatively, which is naturally suited for updating exponents. Building on this idea, \method applies additive updates to the mantissa and multiplicative updates to the exponent in parallel. We demonstrate the complementary failure modes of purely additive or multiplicative updates under quantization and thus, our method that combines both can overcome all these failure modes. We establish a monotone descent guarantee under standard smoothness assumptions for our method. Under a challenging setting where both the weight and compute are in FP8, M+Adam substantially outperforms AdamW (e.g., by 10.51% perplexity at 350M). Moreover, M+Adam enables stable BF16 training without stochastic rounding and consistently outperforms AdamW across 60M-350M models and 1-8× Chinchilla budgets.

Abstract:
Recent studies have shown that reinforcement learning with KLregularized objectives can enjoy faster rates of convergence or logarithmic regret, in contrast to the classical \sqrtT-type regret in the unregularized setting. However, the statistical efficiency of online learning with respect to KL-regularized objectives remains far from completely characterized, even when specialized to multi-armed bandits (MABs). We address this problem for MABs via a sharp analysis of KL-UCB (Zhao et al., 2025b) using a novel peeling argument, which yields a \tildeO(\eta K\log^2T) upper bound: the first high-probability regret bound with linear dependence on K. Here, T is the time horizon, K is the number of arms, \eta^-1 is the regularization intensity, and \tildeO hides all logarithmic factors except those involving \log T. The near-tightness of our analysis is certified by the first non-constant lower bound \Omega(\eta K \log T), which follows from subtle hard-instance constructions and a tailored decomposition of the Bayes prior. Moreover, in the low-regularization regime (i.e., large \eta), we show that the KL-regularized regret for MABs is \eta-independent and scales as \tilde\Theta(\sqrtKT). Overall, our results provide a thorough understanding of KL-regularized MABs across all regimes of \eta and yield nearly optimal bounds in terms of K, \eta, and T.

Abstract:
Language Models (LMs) have shown remarkable potential as roleplaying chatbots, delivering consistent, stylized interactions when given a specification of a character or user persona. However, applying these capabilities to real-world applications (e.g., ecosystems with numerous NPCs interacting simultaneously) exposes a critical inefficiency due to the excessive computational cost. In this paper, we question the necessity of dedicating a full, generalist model to a single persona, hypothesizing that a specific character identity relies on only a fraction of the model’s total capacity. We observe that naïvely pruning LMs often severely degrades the role-playing performance for a specific persona; it does not distinguish between redundant knowledge and essential character traits. We proposePersona-Pruner, a framework that sculpts a lightweight role-playing model by isolating persona-specific sub-networks from a single description. Our experiments consistently show that Persona-Pruner preserves role-playing performance substantially more effectively than existing state-of-the-art LLM pruning techniques, reducing the performance drop from the dense model by up to 93.8% over the strongest baseline on RoleBench in LLM-as-a-judge score, while still maintaining general LLM capabilities.

Abstract:
Flowmatching has emerged as a leading framework for high-fidelity text-to-image generation. However, its alignment with human preferences through RL is often hindered by substantial computational overhead. In this paper, we introduce Flow-TTRL, the first test-time reinforcement learning framework that achieves alignment on the fly. Our approach reinterprets intermediate latent representations as an implicit policy and utilizes SDE-based rollouts to explore high-reward trajectories within the learned vector field. Specifically, we propose a two-stage optimization strategy: Proximal Reward Difference Prediction (PRDP) ensures structural stability in high-noise regimes through pairwise reward regression, while Group Relative Policy Optimization (GRPO) refines fine-grained aesthetic details by maximizing relative advantages within sampled candidate groups. Experimental results show that Flow-TTRL significantly boosts aesthetic quality, text-image alignment, and human preference across diverse backbones. On the GenEval benchmark, Flow-TTRL elevates the accuracy of SD 3.5-Medium from 63% to 87% and Flux.1 Dev from 66% to 83%. Furthermore, our framework achieves an average gain of 15% to 20% across T2I-CompBench metrics, delivering performance comparable to state-of-the-art RL-based fine-tuning methods without the need for additional fine-tuning.

Abstract:
Neural population geometry shapes downstream inference. Recent findings in neurobiology suggest that a hyperbolic structure underlies population activity. However, a theoretical framework for this phenomenon is still lacking. Here, we propose a plausible construction of hippocampal tuning curves that statistically induce hyperbolic geometry. Next, we establish a connection between neural decoding and associative memory by demonstrating that the Modern Hopfield Network update rule computes the optimal squared loss estimator under hyperbolic geometry. Furthermore, we introduce a novel associative memory model defined in hyperbolic space that yields significantly larger capacity than existing models. Our results suggest that animals encode spatial information as a latent hyperbolic cognitive map, which enhances both memory capacity and decoding accuracy.

Abstract:
We develop a theory of generalization and scaling for Mixtureof-Experts (MoE) Transformers that cleanly separates active per-input capacity from routing combinatorics. Conditioning on fixed routing patterns and union-bounding across them, we obtain a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific overhead. Combining this with a standard ERM argument for squared loss we provided a generalization bound under a d-dimensional manifold model (d is the intrinsic dimension of the training data) and C^\beta targets, showing that approximation and estimation trade off in the same way as in dense networks once active parameters are counted appropriately. We further prove a constructive approximation theorem for MoE architectures, demonstrating that accuracy can be improved either by scaling active capacity or by increasing the number of available experts, with the better of the two mechanisms prevailing. From these results we derive neural scaling laws, covering model scaling, data scaling and compute–optimal tradeoffs. The theory highlights that enlarging the expert pool at fixed sparsity influences performance only through a mild logarithmic routing term, whereas increasing active capacity per input drives the main gains in generalization and approximation. These insights provide principled guidance for the design of efficient sparse Transformer systems and clarify the fundamental tradeoffs underlying their empirical scaling behavior.

Abstract:
The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture of Experts (MoEs) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute intensive. Conversion of pretrained dense models into sparse MoEs has emerged as an alternative solution; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition Feed-Forward Networks (FFNs) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of static heuristics, we model neuron assignment as a balanced transport problem, utilizing differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we utilize Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.

Abstract:
Learning effective value heuristics for combinatorial search is difficult, as prior methods rely on surrogate supervision or costly downstream search to assess progress. We introduce a trajectoryaware probabilistic framework that models uncertainty in cost-to-go labels instead of treating them as fixed targets. Heuristic learning is cast as inference over state trajectories using an HMM-style model (Rabiner, 1989), where estimated depth-change dynamics define transitions and forward–backward inference yields soft supervision. To evaluate heuristic quality without search, we propose a large-scale local ranking metric that measures a model’s ability to order neighboring states. On the Rubik’s Cube, our approach consistently improves local ranking accuracy and downstream search performance under matched computational budgets.

Abstract:
Unsupervised dense retrievers offer scalability by learning semantic similarity from unlabeled documents via contrastive learning, but they struggle to capture the temporal relevance, retrieving semantically related but temporally misaligned documentsan important aspect when a document collection spans multiple time periods (e.g., retrieving from related document spanning 2018-2025 given a query "Who is the president in 2019?'' introduces temporal ambiguity). Existing methods rely on supervised training with explicit timestamps, which are not always feasible.We propose TPOUR (Temporal Preference Optimization for Unsupervised Retriever), which integrates our novel training methodTemporal Retrieval Preference Optimization(TRPO). TRPO reinterprets preference learning in the temporal dimension, guiding the retriever to favor temporally aligned documents. TPOUR further generalizes to unseen time periods via interpolation in a learned time embedding, enabling continuous temporal alignment. Experiments on temporal QA with a mixed-timestamp document collection show that TPOUR outperforms both unsupervised and supervised baselines. Compared to Nomic Embed v2 MoE, TPOUR Contriever improves nDCG@5 by +7.13 (+23.5%) on explicit and +7.76 (+25.5%) on implicit queries on average.

Abstract:
Hypergraphs are graph generalizations with key applications in domains such as healthcare, where strict data privacy requirements apply, or bioinformatics, where testing new compounds is costly. However, due to their combinatorial nature, hypergraph representations are often either intractable, or introduce major information loss. For this reason, research into hypergraph synthesis is limited, and stateof-the-art approaches yield poor generation quality in terms of overall structural patterns and graph-level validity. To address such shortcomings, we introduce SuperHype, an exact and tractable hypergraph diffusion model. The core of SuperHype is the projection of graph-superposition, a novel representation that embeds a hypergraph into a multilayer graph enabling a tractable representation with no loss of generalization. To generate new samples from such representations, we introduce a Graph-Superposition Transformer that treats the superposition as an interconnected sequence of layers. Moreover, we enhance the model’s performance with hypergraph specific auxiliary features and triplet aggregation of indirect node interactions. Our evaluation on five datasets shows that SuperHype generally reproduces local and global connectivity patterns with superior fidelity than state-of-the-art baselines.

Abstract:
Logitbased watermarking is a widely used mechanism for identifying LLM generated content, yet its effectiveness is governed by a fundamental trade-off between detectability and semantic distortion. Existing analyses provide limited guidance for principled hyperparameter selection, leaving practical deployments reliant on heuristic tuning. In this work, we develop a power-calibrated statistical framework that establishes explicit quantitative relationships between watermark hyperparameters, detection power, and distortion. This characterization transforms watermark design into a guided optimization problem. Building on these results, we derive practical parameter selection procedures that achieve optimal trade-offs under constraints. Extensive experiments across multiple language models and datasets validate the theory and demonstrate that the proposed framework consistently identifies Pareto-optimal points.

Abstract:
Recent advanced diffusion methods typically derive strong generative priors by scaling diffusion transformers. However, scaling fails to generalize when adapted for realtime compression scenarios that demand lightweight models. In this paper, we explore the design of real-time and lightweight diffusion codecs by addressing two pivotal questions. First, does diffusion pre-training benefit lightweight diffusion codecs? Through systematic analysis, we find that generation-oriented pre-training is less effective at small model scales whereas compression-oriented pre-training yields consistently better performance. Second, are transformers essential? We find that while global attention is crucial for standard generation, lightweight convolutions suffice for compression-oriented diffusion when paired with distillation. Guided by these findings, we establish a one-step lightweight convolution diffusion codec that achieves real-time 60 FPS encoding and 42 FPS decoding at 1080p. Further enhanced by distillation and adversarial learning, the proposed codec reduces bitrate by 85% at a comparable FID to MS-ILLM, bridging the gap between generative compression and practical real-time deployment.

Abstract:
Causal discovery is essential for advancing datadriven fields such as scientific AI and data analysis, yet existing approaches face significant time- and space-efficiency bottlenecks when scaling to large graphs. To address this challenge, we present CauScale, a neural architecture designed for efficient causal discovery that scales inference to graphs with up to 1000 nodes. CauScale improves time efficiency via a reduction unit that compresses data embeddings and improves space efficiency by adopting tied attention weights to avoid maintaining axis-specific attention maps. To keep high causal discovery accuracy, CauScale adopts a two-stream design: a data stream extracts relational evidence from high-dimensional observations, while a graph stream integrates statistical graph priors and preserves key structural signals. CauScale successfully scales to 500-node graphs during training, where prior work fails due to space limitations. Across testing data with varying graph scales and causal mechanisms, \sys achieves 99.6% mAP on in-distribution data and 84.4% on out-of-distribution data, while delivering 4×–13,000× inference speedups over prior methods.

Abstract:
We propose a scalable method for training prediction (machine learning) models in the predictthen-optimize paradigm, where model outputs serve as coefficients for a subsequent linear optimization task. Directly minimizing the empirical decision regret is intractable for linear programming and combinatorial optimization since the decision mapping is piecewise constant, and the gradients are zero almost everywhere. While existing methods address this by smoothing the differentiation process, they suffer from scalability issues, since a computationally expensive solver call is required for every gradient evaluation. To address this, we propose a decision-focused learning pipeline based on a measure transformation principle, which yields a new surrogate loss that is completely optimization-solver-free during training. We establish theoretical guarantees, including Fisher consistency and excess risk bounds. Empirically, our method achieves decision quality competitive with state-of-the-art methods while reducing training time by orders of magnitude.

Abstract:
AIgenerated image detectors achieve high accuracy on in-distribution data but often fail on unseen generators. A key obstacle to understanding this failure is the black-box nature of current detectors: they do not reveal which evidence drives their decisions. We propose \textscForensicConcept, a framework that extracts explicit forensic concepts from detectors and enables their transfer across backbones. Our method localizes decision-critical patches via Transformer attribution, clusters them into a compact concept codebook, and uses a concept-aligned projection to produce auditable evidence readouts. Motivated by prior studies showing that DINO representations can guide diffusion generation and exhibit concept-level correspondence with diffusion features, we introduce a generation-trace reference based on CleanDIFT diffusion features and quantify backbone-trace alignment via neighborhood-structure consistency (CKNNA). We further propose concept codebook injection to transfer diffusion-derived concepts into target backbones. Experiments on GenImage, GAN-family, and Chameleon benchmarks show consistent improvements over prior methods. We also find that CKNNA alignment predicts transfer effectiveness, providing a principled explanation for why some backbones yield more transferable forensic evidence than others.

Abstract:
Spiking Neural Networks (SNNs) offer a promising avenue toward energyefficient language modeling by replacing multiply-accumulate operations with sparse, event-driven computation. However, constructing fully spiking language models reveals two fundamental challenges: (1) gradient degradation from dead neurons caused by diminishing input magnitudes in deep networks, and (2) reduced token selectivity due to the absence of softmax's competitive winner-takes-all mechanism. These limitations create a substantial performance gap that has hindered the practical deployment of spiking language models. To address these challenges, we introduce SpikingLM, a framework that bridges the efficiency of SNNs with the capabilities of modern language models through two key innovations. First, we propose Distribution-aware Scaling, which rescales linear outputs to an optimal range that prevents gradient vanishing. These parameters are fused into preceding linear layers at inference, incurring zero additional overhead. Second, we introduce Spike2Max, a hardware-efficient attention mechanism that restores winner-takes-all dynamics through base-2 exponentiation and max-subtraction. By exploiting the integer-valued nature of spike coincidence counts, Spike2Max replaces floating-point exponentials with bit-shift operations, reducing attention energy consumption by over 95% compared to softmax. Extensive experiments demonstrate that SpikingLM achieves a 57.9% reduction in energy consumption while delivering state-of-the-art performance on GLUE among spiking language models.

Abstract:
We propose a mathematically principled PDE gradient flow framework for distributionally robust optimization (DRO). Exploiting the recent advances in the intersection of Monte Carlo sampling and statistical optimal transport, we show that our theoretical framework can be implemented as practical algorithms for sampling from worstcase distributions and, consequently, DRO. While numerous previous works have relied on dual reformulation techniques, we contribute a sound and complete gradient flow view based on SDEs or PDEs that can be used to construct new algorithms for general, potentially non-convex, losses. Without loss of generality, we solve a class of Wasserstein and entropy-regularized DRO problems using the recently-discovered Wasserstein Fisher-Rao and Stein variational gradient flows. Notably, we also show some simple reductions of our framework recover exactly previously proposed popular DRO methods, and provide new insights into their theoretical limits and optimization dynamics of DRO. Numerical studies based on stochastic gradient descent on machine learning tasks provide empirical backing for our theoretical findings.

Abstract:
Recent advancements in machine learning have largely been driven by foundation models (FMs) trained on large, diverse datasets, enabling them to generalize effectively to new, related tasks. However, extending this paradigm to reinforcement learning (RL), where an agent interacts with an environment to select actions, remains a significant challenge. Most existing approaches train FMs directly on sets of control tasks, but developing diverse RL environments and scaling training across them can be costly and complex. In this study, we explore a simpler alternative approach based on a classical reduction from RL to regression. We demonstrate that a foundation model pretrained for regression tasks, when used as an in-context regression (ICR) model, can be directly applied to RL problems. Building on this insight, we introduce a gradient-free method, ICR-RL, that requires no additional training and leverages an ICR foundation model to tackle RL tasks. We evaluate our approach by applying the ICR model with the recently proposed TabPFN, which is trained on a wide range of regression tasks. Experiments conducted on the Gymnasium classic-control benchmark indicate that ICR-RL matches or outperforms state-of-the-art methods, including DQN and PPO. These results show that ICR foundation models can effectively solve RL tasks without fine-tuning, demonstrating their potential as a foundation for RL-oriented models

Abstract:
Interpretable machine learning requires models that are accurate and structurally faithful to the data. Existing explainability methods rely heavily on additive representations (e.g., GAMs, SHAP, functional ANOVA), which can suffer from signal cancellation and extrapolation in presence of strong interactions. We propose Tensor Separation Learning (TSL), a regression model that learns a sum of separable (rank1) tensor products via an orthogonal greedy algorithm. By enforcing separability, TSL avoids the information loss inherent in additive projections caused by marginalizing higher-order interactions. The learned TSL model can be fully reconstructed from first-order partial dependence functions of its fitted factors. We establish approximation-rate guarantees for functions with bounded mixed p -th order partial derivatives and demonstrate that TSL competes with black-box models on regression benchmarks. Crucially, TSL improves interpretability by factorizing interactions, allowing users to explicitly disentangle the magnitude of an effect from its direction directly via the fitted factors.

Abstract:
We study a variant of the stochastic multiarmed bandit problem in which the learner aims to identify and play an arbitrary arm whose mean reward exceeds a known satisficing threshold S, rather than optimizing against the best arm. Prior work has shown that when such a satisficing arm exists, time-independent bounds on the satisficing regret are achievable, but these guarantees deteriorate when an arm lies close to the threshold. We focus on instances in which the excess gap \Delta_ (gap between the best arm and the threshold) is small relative to the suboptimality gaps \Delta_i, a regime that exposes this limitation. To capture this challenge, we introduce a refined notion of regret and propose a new algorithm, uncertain-UCB, which achieves satisficing pseudo-regret of O \left(\sum_i: \Delta_i > \Delta_ \frac\log(K/\Delta_)\Delta_i\right), while recovering standard pseudo-regret bounds when no arm exceeds the threshold. Further, we establish a near-matching lower bound in the small excess-gap regime, showing that any algorithm incurs at least \Omega \left(\sum_i: \Delta_i > \Delta_ \frac\log \left(\left(\sum_i:\Delta_i > \Delta_ \Delta_/\Delta_i\right)^-1\right)\Delta_i\right) satisficing pseudo-regret.

Abstract:
Visionlanguage models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification determines if a model’s prediction changes when transformations are applied to its input. While most certification frameworks study geometric or pixel-level transformations over inputs, this work proposes a novel framework that enables certifying VLM robustness under semantic-level transformations. Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation. Our framework is the first to certify VLM robustness under semantic-level variations without requiring additional data for each variation, making it practical to apply. Experiments on both synthetic and real-world data show that our framework enables certifying robustness under diverse semantic variations across scenarios.

Abstract:
Linear attention reformulates sequence modeling as recurrent state evolution, enabling efficient lineartime inference. Under the key–value associative paradigm, existing approaches restrict the role of the query to the readout operation, decoupling it from state evolution. We show that query-conditioned state readout induces a structured value prediction over accumulated memory that complements key-based retrieval. Based on this insight, we propose Q-Delta, a query-aware delta rule that integrates mixed key--query prediction errors into state evolution, enabling jointly corrective dynamics while preserving delta-rule efficiency. We establish stability guarantees for the resulting dynamics and derive a hardware-efficient chunkwise-parallel formulation with a custom Triton implementation. Empirical results demonstrate stable optimization, competitive throughput, and consistent improvements over strong baselines on language modeling and long-context retrieval tasks.

Abstract:
Explicit surface reconstruction aims to recover highfidelity meshes directly from point clouds. While existing methods achieve strong performance on scene-level data, they often rely on test-time optimization, resulting in a prohibitive runtime of several minutes. To address this bottleneck, we propose FastSESR, a two-stage framework for efficient scene-level explicit surface reconstruction. In the first stage, a lightweight triangular candidate network (TCN) captures local connections via an edge-factorized parameterization, enabling effective extraction of surface triangles from uniformly sampled points. In the second stage, an offset optimization network amortizes offset refinement into a small, fixed number of learnable update steps guided by TCN, producing geometries that are more suitable for triangulation. Experiments on multiple scene-level datasets show that FastSESR accelerates surface reconstruction by at least 20 × over prior methods while maintaining competitive reconstruction quality. Moreover, evaluations on shape-level benchmarks indicate good generalization performance. Our code is available at https://anonymous.4open.science/r/FastSR-84C1.

Abstract:
Calibration aligns a model's predictive uncertainty with the frequencies of its empirical outcomes and is important toward understanding and trusting reported probabilities. Recent work shows that enforcing calibration at the level of individual predictors can substantially improve ensemble performance, with mixtureof-calibrated-experts (MoCE) models showing strong empirical improvements in particular. However, the conditions under which calibration helps MoCE are not well understood. In this work, we study MoCE models from a distributional robustness perspective, focusing on how routing mechanisms interact with expert-level calibration. We show that for hard routing, expert calibration is sufficient to ensure calibration of the overall model under a broad class of distribution shifts but is insufficient for calibrating a soft-routed model. We characterize the conditions that must hold for a soft-routed MoCE to be calibrated, and we discuss how reframing calibration as a distributionally robust objective helps recover robustness guarantees for soft-routed mixtures.

Abstract:
The physical world is fundamentally compositional, yet empirical data are often fragmented across decentralized silos that cannot be aggregated due to privacy, legal, or economic constraints. Such scenarios pose a fundamental challenge for generative modeling: learning models that collectively cover the union of these sources while enabling compositional generalization when the factors required for composition are distributed across isolated data sources. We introduce Decentralized Compositional Flow Matching (DCFM), a framework for learning generative models from decentralized private data without exchanging raw samples. DCFM enforces structural constraints that induce conditional independence across the global set of generative factors. As a result, DCFM allows novel combinations to emerge through interactions across peers, even when no single data source contains sufficient information to support composition on its own. Empirically, DCFM substantially outperforms federated learning and mixtureof-experts baselines across conditional image generation, robotic spatial planning and medical attribute co-occurrence modeling.

Abstract:
ParameterEfficient Fine-Tuning (PEFT) strategies such as Low-Rank Adaptation (LoRA) are effective solutions for fine-tuning large-scale pre-trained models; however, their memory requirements scales with the size of the model, \mathcalO(dr), where d is the model's hidden dimension and r is the rank. Our proposal, FrameFT, models the parameter update \Delta W with a sparse coefficient matrix in a Fusion Frame basis. Fusion Frames can be generated algorithmically and shared across model layers, enabling highly efficient updates. Only the sparse coefficients of the basis expansion are stored/optimized, strongly reducing the memory footprint and parameter count. The sparse structure of the coefficient matrix in FrameFT and the sparsity in the Fusion Frames, give sizable compute benefits. Our technical analysis shows that FrameFT allows obtaining formal convergence results. We evaluate our method across a suite of supervised fine-tuning benchmarks, primarily focusing on language tasks, but also report applicability to vision models. Our empirical evaluations show that FrameFT achieves performance on par with or exceeding state-of-the-art PEFT techniques, but needs far fewer trainable parameters and less memory.

Abstract:
Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ) and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space, make it harder to capture correlations between features leading to inefficiency in representation learning, codebook utilization and token rate. In this paper we introduce TwoDimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tiling and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, with low token rates and high codebook utilization while maintaining state of the art reconstruction quality. Specifically, Q2D2 achieves competitive to superior performance in various objective and subjective reconstruction metrics, across extensive experiments in speech domain compared to state of the art models. Comprehensive ablation studies further confirm the effectiveness of our design choices.

Abstract:
Machine unlearning is becoming increasingly critical in the context of data privacy regulations, particularly for recommender systems that are directly trained on user interaction data. The goal of this work is to remove designated interactions and their downstream influence while preserving recommendation quality, and to do so without incurring the substantial computational cost of full retraining. Existing approaches exhibit inherent tradeoffs, including limited unlearning completeness, poor scalability, degradation in recommendation utility, or substantial computational and memory overhead. In this paper, we propose Obliviate, an efficient two-stage unlearning framework for recommender systems that achieves strong unlearning completeness while maintaining high utility. In the first stage, we introduce a Low-Rank Unlearning Adapter (LUA), which employs a lightweight Hessian proxy to enable curvature-aware and parameter-efficient unlearning through localized low-rank adapter modules. In the second stage, we present Locality-Aware Calibration (LAC), a lightweight refinement procedure that updates only the adapter parameters using a compact witness set, enforcing unlearning via ranking-based objectives while preserving utility through knowledge distillation. Extensive empirical evaluations demonstrate that Obliviate, achieves strong forgetting with minimal loss in recommendation quality and at significantly reduced computational cost, offering a practical and scalable solution for large-scale recommender systems.

Abstract:
Molecular dynamics (MD) is a key tool for simulating the dynamical behavior of atomic systems. However, MD is inherently serial, which makes it difficult to increase singlesystem throughput with concurrent compute. To address this, we introduceLangevinSpeculativeDynamics (LSD), a distributed and model-agnostic speculative sampler for accelerating MDwithout adding relative error. Inspired by speculative methods in language and diffusion modeling, LSD uses a draft model to propose fast simulation steps and verifies them in parallel with a slower target model, applying a transport map from the draft to the target distribution. We extend speculative sampling to second-order Langevin dynamics, derive the achievable speedup as a function of physical parameters, show that LSD generalizes across different systems and draft-target combinations with a 3-9x speedup, and confirm theoretically and empirically that LSD samples trajectories from the same distribution as its target model.

Abstract:
Programming adaptive behaviors at the cellular level is a longstanding goal that raises the question of how probabilistic computation can be implemented in biochemical systems. Chemical reaction networks (CRNs) provide such a substrate and have been shown to realize probabilistic models, including hidden Markov models and factor graphs, with dynamics reproducing Bayesian inference and belief propagation. However, encoding these algorithms typically requires prohibitively large reaction networks, and classical CRN reduction techniques don't apply. By embedding CRNs into factor graphs in a structure- and dynamics-preserving manner, we leverage recent factor-graph reduction results to obtain significantly smaller CRNs, a gain we demonstrate numerically.

Abstract:
Preferencebased reinforcement learning (PbRL) learns policies from human trajectory-level comparisons, avoiding explicit reward design and expert demonstrations. Existing methods typically train utility functions on trajectory or segment-level preferences while relying on per-step utility estimates during policy optimization. This training and inference mismatch induces a distribution shift that severely degrades temporal credit assignment and limits policy learning. We analyze this issue and propose Preference Learning with Advantage-Weighted Segments(PAWS), a segment-based preference learning method that performs policy updates directly using segment-level advantage functions. By aligning utility training with policy optimization, PAWS preserves trajectory-level preference information and avoids unreliable per-step learning signals. Experiments on simulated robotic manipulation and locomotion tasks demonstrate that PAWS consistently outperforms existing PbRL approaches, highlighting the importance of distribution-consistent preference learning.

Abstract:
Large Language Models (LLMs) often struggle with complex logical reasoning. Existing approaches typically rely on either purely neural reasoning in natural language or offloading to formal solvers via symbolic representations. However, both paradigms face significant limitations: while LLMs exhibit strong semantic intuition they are prone to hallucinations, whereas symbolic solvers offer rigorous derivation but remain highly sensitive to minor syntactic errors. To combine the strengths of these two paradigms while mitigating their respective limitations, we introduceLogicSAGE(LogicinformedSocraticAgent forGuidedEnhancement), a dual-process framework that integrates a robust neural reasoner (System 1) with a rigorous symbolic validator (System 2). Specifically, our framework employs a Socratic Error Correction mechanism that treats solver feedback not as terminal failures but as pedagogical signals, engaging in a dialectic loop to iteratively refine logic programs and resolve semantic ambiguities. Extensive experiments on five benchmarks show that LogicSAGE (8B) achieves a state-of-the-art 92.36% average accuracy, significantly outperforming GPT-4 baselines, which establishes that architectural innovation can supersede model scale in faithful reasoning.

Abstract:
Vision Transformer (ViT) tokenizers offer a scalable alternative to convolutional auto-encoders, yet current architectures have two key limitations: their performance degrades when images vary in aspect ratio or resolution, and their reliance on adversarial losses makes them harder to train at scale. To address this, we introduce ViTok-v2, a ViT tokenizer building on ViTok. We add native resolution support via NaFlex with 2D RoPE and stabilize training by replacing the standard LPIPS-plus-discriminator objective with our novel DINO perceptual loss. We scale our model to 5B parameters, training the largest ViT-based image compression auto- encoder to date and demonstrate continued im- provements with scale. In downstream generation experiments with flow matching models, we find that smaller generators perform best with aggres- sive channel compression while larger generators effectively leverage higher channel counts. ViTok- v2 matches state-of-the-art reconstruction at 256p and outperforms across benchmarks at 512p and higher resolutons, while remaining compatible with any pipeline requiring flexible aspect ratios.

Abstract:
Regional refinement for interactive segmentation is of great necessity to ensure the fidelity of segmented pixels nearby userprompted locations, which specifies a local window (\ie, focus view) for the latest click after a global prediction, where local pixels are revisited and optimized using numerous refining structures. Previous methods either utilize a two-stage pipeline to estimate the focus view or manually preset a fixed scope for all clicks, while the former is time-consuming, the latter fails to capture the correlation among click position, object geometry, and focus intensity. In this paper, we inherit the core idea of FCFI \citewei2023focused and dedicate a one-stage framework characterized with Elaborate Focus Prior (EFPNet). Concretely, EFPNet outputs an erroneous mask \wrt historical feedback and newly-placed click in an end-to-end manner, which deduces precise focus region according to its max-connected component, followed with feedback correction considering image, feature and mask affinity. We further design a clicked-with-focus mechanism for efficient feedback integration. Extensive studies on four benchmarks have revealed outstanding performance of EFPNet for both efficacy and computational overhead.

Abstract:
In realworld Few-Shot Learning (FSL), support sets are quickly constructed and inevitably contain noisy samples. With limited examples per class, even a single noisy instance can distort class distributions, cause prototype drift, and reduce generalization. Existing methods mostly assume clean data or require large-scale statistics, which are impractical in FSL’s data-scarce setting. We find that clean samples in semantic feature space lie in low-rank subspaces, while noisy samples cause rank anomalies disrupting this structure. To address this, we propose a differentiable low-rank approximation that estimates the intrinsic rank of the support set and detects anomalous noisy samples. Building on this, a rank-guided diffusion process generates high-quality replacements under low-rank constraints, reconstructing a clean, consistent support set for improved robustness.This low-rank guided approach effectively mitigates prototype drift and significantly reduces errors under noise levels up to 40% across MiniImageNet, TieredImageNet, and other noisy benchmarks, demonstrating the power of low-rank geometry for noise detection and correction in FSL.

Abstract:
Most reinforcement learning (RL) baselines maximize future cumulative rewards with a fixed single discount factor, which limits their performance in complex sequential decisionmaking tasks due to a failure to balance short-term objectives and long-term planning. To address this issue, this paper focuses on a multi-timescale critic framework, where each component corresponds to a Q-value with a distinct discount factor. Two key improvements are proposed: (1) A Neural Reward Decoder reconstructs the reward sequence from multi-scale Q-values, with value and reward reconstruction losses enhancing Q-value estimation consistency; (2) A cross-attention-based Q-weight predictor adaptively adjusts Q-value weights via current observations to generate the final Q-value for policy optimization. Extensive experiments on DMControl and CARLA benchmarks demonstrate that our method significantly outperforms state-of-the-art (SOTA) baselines. Furthermore, we validate the framework's generalizability by integrating it with both off-policy (SAC, DrQ-v2) and on-policy (PPO) algorithms, achieving consistent performance gains. The code is available in the supplementary material.

Abstract:
Many learning problems require uncovering a hidden ordering that reveals structure in unordered data, such as monotonicity in sorting or spatial continuity in jigsaw reconstruction. In these settings, permutations can be learned as latent operators by optimizing objectives defined directly on the reordered output, often without access to groundtruth orderings. Differentiable relaxations such as Gumbel–Sinkhorn make this approach practical by approximating permutation matrices with doubly stochastic matrices. However, learning from structure without supervision induces a non-uniform uncertainty: some assignments become confident early, while others remain ambiguous. Existing methods control this process using a single global temperature, forcing all assignments to sharpen or diffuse simultaneously and leading to instability at scale. We introduce an entropy-adaptive formulation of Gumbel–Sinkhorn that locally modulates temperature based on assignment uncertainty. This allows confident assignments to discretize early while preserving exploration where uncertainty remains. Across sorting and jigsaw reconstruction tasks and in routing-style settings, adaptive entropy control improves training stability and final permutation quality relative to fixed-temperature baselines, particularly as problem size and assignment ambiguity increase.

Abstract:
Existing dimensionality reduction methods all perform dimensionality reduction by preserving some invariant in the space before and after dimensionality reduction. This paper proposes a new dimensionality reduction invariant: preserving the invariant of the pointdistributions similarity. We also design a linear and efficient method to achieve dimensionality reduction while preserving this invariant. We theoretically prove the feasibility of our method for dimensionality reduction. Furthermore, our results on benchmark datasets and single-cell expression data demonstrate the effectiveness and efficiency of the proposed method.

Abstract:
In domains such as recommender systems and information retrieval, learning from humangenerated feedback is especially challenging because the information provided is often sparse and incomplete. In this work, we address the problem of learning the top-k items from incomplete rankings. Most existing models for incomplete rankings rely on rigid assumptions regarding both the ranking model that generates the latent ranking and the censoring mechanism that determines which comparisons remain unobserved. On the one hand, the ranking model is often assumed to follow a Plackett-Luce (PL) or Mallows distribution. On the other hand, the censoring mechanism is typically assumed to be Missing Completely At Random (MCAR) or to exhibit well-behaved dependencies on the latent ranking, such as winner feedback or top-h feedback. We introduce a new, general framework for learning from incomplete rankings that unifies and strictly generalizes the established frameworks in the literature. We consider the broad class of ranking models that satisfy the complete consensus property, which comprehends all widely adopted models, including PL and Mallows. Furthermore, we present a new preference-based feedback model, named positional censoring, which generalizes winner and top-h feedback. We show that it is possible to learn in this general setting by presenting the PIRATE algorithm and providing a near-optimal instance-dependent bound to the sample complexity. Finally, we show that, under the PL ranking, PIRATE matches the sample complexity of state-of-the-art algorithms in the relevant scenarios of winner and top-h feedback.

Abstract:
Recovering sceneconsistent 4D crowd motion from monocular video in large-scale scenes remains challenging due to severe depth ambiguity and complex scene geometry. Existing monocular crowd reconstruction methods typically rely on single-plane assumptions, leading to unreliable metric scale and spatial drift under complex terrain. We propose Crowd4D, the first scene-aware 4D crowd reconstruction framework that jointly optimizes the crowd and scene from a monocular RGB video in large-scale scenes. Crowd4D explicitly incorporates scene geometry and ensures consistency across image and scene spaces via a multi-stage optimization strategy. A key bottleneck of this task lies in accurate human–scene alignment, particularly in scale and position. However, human and scene reconstructions are typically decoupled. To address this, we introduce the Human–Scene Interaction Proxy (HSIP) as an intermediate representation, derived from Scene Interaction Point Clouds and a Scene Interaction Surface (SIPC&SIS), which encode explicit scene-aware geometric priors and redefine the optimization space for large-scale monocular 4D crowd reconstruction. To further improve temporal stability under occlusions, we introduce Crowd Structural Coherence Regularization (CSCR), which leverages HSIP-based spatial priors to impose soft temporal consistency on pairwise relative displacements and directions within local crowd neighborhoods. Extensive experiments demonstrate that Crowd4D consistently outperforms existing state-of-the-art methods and enables robust monocular 4D crowd reconstruction in complex, large-scale real-world scenes.

Abstract:
Pretrained scorebased diffusion models provide strong unconditional priors, yet enforcing measurement or physics consistency in inverse problems is often handled by heuristic guidance, intermittent projections, or task-specific conditional training, with limited guarantees of feasibility at the end of inference. We propose terminal-conditioned inversion for score-based SDE priors. Given a frozen Score-SDE prior and a task-defined terminal feasibility specification, we construct an associated backward stochastic differential equation whose adapted solution defines a principled inverse map from the terminal requirement to a prior state at a chosen noise level. Under standard regularity conditions, we establish existence and uniqueness of the adapted solution and obtain terminal consistency by construction. We further develop a practical neural BSDE solver that composes arbitrary pretrained diffusion priors with domain constraints without modifying the score-defined coefficients, producing an anchored prior state that enables neighborhood sampling for uncertainty characterization. Experiments on toy datasets validate stable terminal-conditioned inversion and distributionally consistent neighborhood sampling. As a real-world case study, we apply the framework to low-dose CT reconstruction and achieve improved reconstruction quality over representative training-free baselines while satisfying strict measurement feasibility under the prescribed terminal specification.

Abstract:
Missing data is a common challenge in spatiotemporal systems, arising in applications such as air quality monitoring and urban traffic management. Traditional machine learning approaches, like recurrent and graph neural networks, rely on iterative propagation, which tends to accumulate errors over time and space. Recent diffusionbased methods mitigate error propagation but require iterative sampling and often depend on problem-agnostic Gaussian priors, limiting both efficiency and effectiveness. To address these limitations, we propose GiFlow, a Graph-Informed Flow Matching framework for spatiotemporal imputation. GiFlow replaces the typical Gaussian prior with a graph-informed prior constructed via spatiotemporal filtering of observable signals, which better aligns the source distribution to the target and thereby simplifies the generation trajectory. The flow field is parameterized by a hybrid vector field model that integrates spatial attention, temporal attention, and spatiotemporal propagation, enabling joint modeling of spatial and temporal dependencies. Unlike diffusion models, GiFlow is trained via direct regression and supports deterministic, few-step generation at inference. Extensive experiments on both synthetic and real-world datasets with different missing patterns and missing rates demonstrate that the proposed GiFlow outperforms the state-of-the-art approaches in spatiotemporal imputation.

Abstract:
Hallucinations pose a key challenge for large language models. Chainof-Thought prompting exposes intermediate reasoning, but reasoning traces are treated as linear traces, making it hard to capture cross-step dependencies and localize unsupported intermediate claims. We propose a \emphstructural reasoning model to describe the interactions among local steps. To detect hallucinations, we extract a reasoning directed acyclic graph over conditions and intermediate claims, verify each claim against its parent nodes, and aggregate the step signals with a simple mass-flow rule. Under a probabilistic model, we provide an information-theoretic interpretation of this aggregation as measuring information loss along the reasoning graph. Experiments on GSM8K and MATH across multiple model families show that the proposed method improves detection performance over recent sampling-based baselines and judge-based methods. These findings provide a new perspective on the evaluation of chain-of-thought outputs and confirm the advantages of structural reasoning in hallucination detection.

Abstract:
Macrocycles are ringshaped molecules that offer a promising alternative to small-molecule drugs due to their enhanced selectivity and binding affinity against difficult targets. Despite their chemical value, they remain underexplored in generative modeling, likely owing to their scarcity in public datasets and the challenges of enforcing topological constraints in standard deep generative models. We introduce MacroGuide: Topological Guidance for Macrocycle Generation, a diffusion guidance mechanism that uses Persistent Homology to steer the sampling of pretrained molecular generative models toward the generation of macrocycles, in both unconditional and conditional (protein pocket) settings. At each denoising step, MacroGuide constructs a Vietoris-Rips complex from atomic positions and promotes ring formation by optimizing persistent homology features. Empirically, applying MacroGuide to pretrained diffusion models increases macrocycle generation rates from 1% to 99%, while matching or exceeding state-of-the-art performance on key quality metrics such as chemical validity, diversity, and PoseBusters checks.

Abstract:
Twosample comparison is a fundamental problem in machine learning, with broad applications such as generative modeling. Although the maximum mean discrepancy (MMD) is widely used, MMD-based tests often exhibit poor or even counterintuitive performance under covariance- and location-shift alternatives, partly due to cancellation effects induced by their sum-of-differences construction. To address this issue, we propose a kernel-based maximum-of-difference (MOD) test, which maximizes the squared discrepancy between within-sample and between-sample average distances, thereby improving sensitivity to subtle distributional differences. We further develop a fused MOD procedure to adaptively combine multiple kernels. Extensive experiments demonstrate clear performance gains over existing MMD-based methods.

Abstract:
The generalized Ising problem captures a broad spectrum of hard combinatorial problems, including MAXCUT, Number Partitioning (NPP), and Maximum Independent Set. In this work, we consider the notion of one-flip local minima for this problem. We construct a polynomial relaxation and prove that there exists a one-to-one correspondence between the local minima of the relaxation and the one-flip minima of the original Ising problem. This guarantee reduces the Ising problem to finding the local minima of a smooth function, allowing us to leverage scalable gradient-based optimizers such as ADAM. We demonstrate that our method achieves strong performance across challenging benchmarks, including spin-glass models, MAX-CUT, and NPP.

Abstract:
Deep learning methods have vastly expanded the capabilities of motion planning in robotics applications, as learning priors from largescale data has been shown to be essential in capturing the highly complex behavior required for solving tasks such as manipulation or navigation for autonomous vehicles. At the same time, model-based planning algorithms based on search or optimization remain an essential tool due to their flexibility, efficiency, and the ability to incorporate domain knowledge via expert-designed algorithms and objective functions. We propose a new generative framework to unify these two paradigms. First, we learn an autoencoder with a high compression ratio and a latent space of hierarchically ordered, discrete-valued tokens. Leveraging both the dimensionality reduction and the hierarchical coarse-to-fine structure learned by this autoencoder, we then perform motion planning by directly searching in the latent space of tokens. This search can optimize arbitrary objective functions specified at test time, providing a large degree of flexibility while maintaining efficiency and producing realistic solutions by relying on the generative capabilities of the highly compressed autoencoder. We evaluate our method on nuPlan and the Waymo Open Motion Dataset, showing how latent space search can be used for a variety of guided behavior generation tasks, achieving strong performance for closed-loop motion planning and multi-agent guided scenario synthesis without requiring any task-specific training.

Abstract:
In many RL domains, environments are linked by prerequisite relations—e.g., difficultyincreasing edits or parameter increments—which induce a directed acyclic curriculum graph (DAG). In practice, this structure is often exploited only implicitly, yet it can yield clear gains in training. We introduce PATH, a curriculum learning framework that performs active learning on the curriculum graph. PATH first expands coverage by sampling diverse curriculum paths, then reallocates training toward regions that remain unmastered. Experiments show that PATH leverages the graph structure to achieve strong robustness and generalization across diverse environments.

Abstract:
Learning meshbased operators from data can match training objectives while implicitly violating algebraic consistency constraints that classical discretizations satisfy by construction. Such violations can introduce near-kernel directions, degrade conditioning as resolution increases, and distort the low-frequency spectral structure on which downstream solvers and diagnostics rely. This work introduces a lightweight compatibility certificate for learned operator pairs, together with a closed-form projection that maps a learned pair to its Frobenius-nearest chain-compatible operator. The certificate provides an explicit distance-to-compatibility and yields perturbation bounds for the discrete operator. These bounds imply stability guarantees for elliptic solves and for low-frequency spectral counts, provided a spectral gap separates the kernel from the rest of the spectrum and boundary treatments are well posed. Experiments on standard elliptic problems show that defect-aware training prevents condition-number blow-up at higher resolutions, improves robustness under mesh and topological distribution shifts, and maintains predictive accuracy relative to unconstrained learning. Overall, these results support the use of non-invasive, computable algebraic consistency checks to detect and control failure modes that are not revealed by loss values alone.

Abstract:
Scene graph generation (SGG) aims to parse an image into a structured graph of objects and their predicates, enabling explicit relational reasoning for visual understanding. However, prevailing methods often overpredict geometric predicates, resulting in scene graphs that are factually correct yet semantically shallow. While recent works effectively attribute this phenomenon to the long-tailed data distribution, we identify another critical factor driving such biased prediction: co-occurrence-induced representation entanglement, where geometric and non-geometric predicates that frequently co-occur are encoded into overly similar representations. To this end, we introduce Dual-stream Synergistic Network (DS-Net) that models geometric and non-geometric predicates with two specialized streams, coupled with a bidirectional cross-stream fusion mechanism. The space stream focuses on spatial and structural cues, while the vision stream captures fine-grained visual evidence and semantic priors. Extensive experiments show that DS-Net consistently improves predicate inference, achieving 1.3% ～ 6.1% absolute gains in mR@100 on the SGGen task when integrated into existing SGG methods. These results highlight the importance of synergistic modeling of geometric and non-geometric predicates for generating semantically richer scene graphs.

Abstract:
Covariance matrices compactly summarize feature distributions in many machinelearning pipelines, including domain adaptation and Gaussian embeddings. Under a Gaussian approximation, the unregularized Wasserstein-2optimal transport(OT) discrepancy reduces to theBures--Wasserstein(BW) distance betweensymmetric positive definite(SPD) covariance matrices. We introduceITSPACE(Iterative Transport for Stable Proximal Alignment of Covariance Embeddings), a lightweightfew-stepmethod for covariance alignment under tight compute and memory budgets: it maintains a low-rank representation and produces a valid covariance estimate at every iteration through simple closed-form updates.ITSPACEis designed for the rank-budgeted, anytime regime relevant to covariance-based domain adaptation and test-time moment matching: under exact computations, each step provably decreases theexactBW distance, and under approximate linear-algebra steps we provide a computable certificate bound that quantifies any deviation from monotone descent. Empirically, in the strict few-step regimeITSPACEreaches the same BW distance thresholds faster than BW-targeting gradient descent under a common rank budget, and is more stable than Euclidean, alternative-geometry, and entropically regularized baselines.

Abstract:
We study flowmatching transformers when data lie on a low-dimensional manifold. Our key insight is a flow decomposition that splits motion along the manifold from motion off the manifold. The scheme works for first- and higher-order flow matching and ties model complexity to the intrinsic manifold dimension. Building on these, we establish tighter sample-complexity bounds for velocity approximation, velocity estimation, and distribution estimation. These bounds meet near-minimax rates for flow-matching transformers of any order. Our results show how flow-matching transformers escape the curse of dimensionality by utilizing intrinsic data structure.

Abstract:
In recent years, decentralized optimization has gained significant attention for solving machine learning problems where data are distributed across multiple devices. However, existing decentralized optimization algorithms are primarily designed for singlelevel and two-level optimization tasks, limiting their application to more complex problems such as decentralized stochastic K-level optimization, where K>2. In this work, we propose a novel decentralized stochastic K-level variance-reduced gradient descent algorithm to address the significant computation and communication overhead caused by the multi-level structure in decentralized stochastic K-level optimization problems. Moreover, we propose a novel theoretical analysis to tackle the recursive dependence issue caused by the multi-level structure when establishing the convergence rate of our algorithm. Finally, the experimental results confirm the effectiveness of our proposed algorithm.

Abstract:
Fair graph clustering has emerged as a critical research area for addressing algorithmic bias in machine learning. The objective is to ensure that the proportion of each protected group within a cluster is consistent with its representation in the entire dataset. However, most existing spectral solutions rely on computationally expensive eigendecompositions of the graph Laplacian, limiting their scalability. In this paper, we propose Riemannian Fair Spectral Clustering (RFairSC), a novel method that formulates fair spectral clustering as a constrained optimization problem on a Riemannian manifold. We develop a Riemannian alternating direction method of multipliers employing a variable-splitting strategy to efficiently solve the associated subproblems. Numerical experiments on large synthetic and real-world graphs demonstrate that R-FairSC significantly improves computational efficiency over state-of-the-art methods while maintaining high clustering quality and fairness.

Abstract:
A central insight in lossless data compression is the close connection between probabilistic nextsymbol prediction and efficient sequence compression, whereby predictive models can be combined with classical coding techniques to achieve strong compression performance. Applying this approach with powerful modern learned models, such as LLMs, has been shown to achieve markedly better compression than traditional techniques across a wide range of domains. However, significant practical challenges remain, including model non-determinism, in which a model produces different predictions on different machines despite identical parameters and inputs; such mismatches between the encoder and decoder can lead to complete decoding failure. Probability Matching Interval Coding (PMATIC) was recently introduced as a drop-in framework for mismatch-robust coding and shown to enable reliable compression and decompression in the presence of bounded prediction mismatch (Adler & Tang, 2026). In this work, we present PMATIC+, a generalization of PMATIC that allows the incorporation of tight theoretical results into the design and more flexible parameter optimization, resulting in substantial improvements in compression efficiency and robustness.

Abstract:
Continuousdepth neural networks (CDNNs), including Neural Ordinary Differential Equations (ODEs) and liquid-time-constant (LTC) networks, suffer from high computational costs due to solving numerous nonlinear ODEs during training and inference. We introduce Continuous Depth Acceleration (CoDA), a framework that leverages Mori–Zwanzig/Koopman operator theory to replace continuous-depth layers requiring multiple nonlinear ODEs with a compact GRU module, a single low-dimensional linear ODE, and a dense layer. We prove PAC learnability of CoDA, establishing that this transformation preserves accuracy and can be applied repeatedly across multiple layers with unified backpropagation. Experiments on the Liquid Foundation Model (LFM-1.2B) demonstrate 6.7× training speedup and 1.8× inference speedup without loss of accuracy. Across six real-world LTC applications, CoDA consistently outperforms state-of-the-art acceleration techniques—including neural flows, model order reduction, and variational formulations—in both training and inference time while maintaining competitive or superior accuracy.

Abstract:
Automatic verification is a critical component in building mathsolving agents and reinforcement learning, yet it often falls short in generalizability, performance, and cost-efficiency. Identifying that the primary bottleneck of verification lies in error detection capability, we propose pessimistic verification, a paradigm of agentic workflows that rejects a solution if any of multiple parallel verifiers identifies a flaw. We further introduce progressive pessimistic verification, which employs fine-grained proof decomposition to significantly enhance verification accuracy and efficiency. Our approach surpasses the performance and token efficiency of extended long chain-of-thought (long CoT) and mainstream verification workflows, crucially, our analysis reveals that existing benchmarks underestimate its effectiveness on stronger models due to inherent annotation errors. To further validate the effectiveness of our method, we applied a verification-based solving workflow on the IMO 2025 and MathArena Apex 2025 datasets, where the workflow with progressive pessimistic verification exhibits remarkable improvements in both efficiency and accuracy on highly challenging contest-level math problems with state-of-the-art models.

Abstract:
Domain generalization (DG) aims to learn representations that remain predictive under distribution shifts. A key challenge is that the target domain is unobserved during training, which complicates the search for invariant representations: alignment objectives that do not account for the preservation of discriminative structure may become illconditioned or lead to degenerate solutions, especially under finite samples. We propose GeometricRate–DistortionInvariance (RDI), a DG framework that addresses this challenge by generalizing classical rate–distortion theory to Grassmann manifolds.RDIexplicitly models class-conditional representations as low-dimensional subspaces and formulates DG as a joint optimization of (i) cross-domain subspace alignment (geometric distortion) and (ii) spectral–volumetric complexity (a capacity-regularized rate term). This integrated approach is designed to promote stable alignment while preventing the collapse of discriminative geometry, adapting to dataset-specific regimes. We provide finite-sample stability guarantees under bounded shifts. Experiments on DomainBed demonstrate thatRDIis competitive with strong DG baselines, and ablations verify that reliable generalization necessitates the concerted action of both alignment and complexity control.

Abstract:
Recent trainingtime defenses against neural backdoors isolate a benign subset from poisoned training data, to learn a backdoor-free model from it. In this paper, we formulate this defense strategy as a coreset selection problem, giving rise to so-called “Anti-Backdoor Coreset Selection.” Since poisonous samples have a) lower prediction uncertainty and are b) less frequent than benign samples, coreset selection naturally focuses more on samples associated with benign functionality than the backdoor functionality. We use the Cumulative Entropy as selection criterion to further facilitate this effect. The metric tracks the learning dynamics of training samples and allowing us to select benign samples with high informativeness for the coreset. Additionally, we unlearn the chosen samples in each epoch to facilitate the separability between benign and poisonous samples. Together, this yields an exceptionally effective training-time defense that constructs a benign coreset to train a backdoor-free model. Unlike prior defenses that compromise natural accuracy and fail against certain attacks, our method mitigates backdooring attacks consistently with a negligible impact on natural performance.

Abstract:
Estimating the Conditional Average Treatment Effect (CATE) is constrained by the high cost of obtaining outcome measurements, making active learning essential. However, conventional strategies suffer from a fundamental objective mismatch: they reduce uncertainty in model parameters or observable outcomes rather than the unobservable causal quantities of interest. We address this via the principle of causal objective alignment, positing that acquisition functions must target unobservables like potential outcomes or CATE directly. We operationalize this through CausalEPIG, a framework adapting Expected Predictive Information Gain to quantify uncertainty reduction in causal quantities. We derive two distinct strategies: a comprehensive approach that robustly models the full causal mechanisms via the joint potential outcomes, and a focused approach that directly targets the CATE estimand for maximum sample efficiency. We provide theoretical justification for our framework, establishing a formal link between our information-theoretic objective and the minimization of CATE estimation error. Extensive experiments demonstrate that our strategies consistently outperform standard baselines, and crucially, reveal that the optimal strategy is context-dependent, contingent on the base estimator and data complexity. Our framework thus provides a principled guide for sample-efficient CATE estimation in practice.

Abstract:
Machine learning models are often intended to augment rather than replace human decisionmakers, by providing information that is complementary to human judgement. Yet, in practice, human decision makers routinely fail to realize such complementary gains, even when models provide useful signal. In this work, we study how asymmetric information about the quality of information available to a human decision maker vs. an AI impacts the ability of a decision maker to extract complementary value from AI predictions. We show that a key factor is the error correlation structure between human and AI predictions. In particular, when the AI's prediction errors are negatively correlated with those of the human, the decision-maker can construct robust strategies which guarantee improvements in expected utility. We empirically investigate whether these conditions for complementarity arise in practice, using real-world forecasting benchmarks.

Abstract:
While Randomized controlled trials (RCTs), or A/B tests, are the gold standard for optimizing onlineplatform policies, they are limited by discrete testing levels. This approach is suboptimal for continuous variables (e.g., prices and incentives), as it fails to extrapolate to untested values or account for user heterogeneity. We address this by developing Deep Learning for Policy Targeting (\textsfDLPT) to learn personalized continuous policies from discrete RCTs using high-dimensional features. We prove our estimators are asymptotically unbiased and consistent, achieving a \sqrtn-regret bound. In a collaboration with a leading social media platform to optimize creator incentives, we show that \textsfDLPT substantially outperforms existing benchmarks. In a collaboration with a leading social media platform to optimize creator incentives, we show that \textsfDLPT substantially outperforms existing benchmarks.

Abstract:
We introduce a rankstatistic approximation of f-divergences that avoids explicit density-ratio estimation by working directly with the distribution of ranks. For a resolution parameter K, we map the mismatch between two univariate distributions \mu and \nu to a rank histogram on \0,\dots,K\ and measure its deviation from uniformity via a discrete f-divergence, yielding a rank-statistic divergence estimator. We prove that the resulting estimator of the divergence is monotone in K, is always a lower bound of the true f-divergence, and we establish quantitative convergence rates for K\to\infty under mild regularity of the quantile-domain density ratio. To handle high-dimensional data, we define the sliced rank-statistic f-divergence by averaging the univariate construction over random projections, and we provide convergence results for the sliced limit as well. We also derive finite-sample deviation bounds along with asymptotic normality results for the estimator. Finally, we empirically validate the approach by benchmarking against neural baselines and illustrating its use as a learning objective in generative modelling experiments.

Abstract:
Generative models frequently suffer miscalibration, wherein statistics of the sampling distribution such as class probabilities deviate from desired values. We frame calibration as a constrained optimization problem and seek the closest model in KullbackLeibler divergence satisfying calibration constraints. To address the intractability of imposing these constraints exactly, we introduce two surrogate objectives for fine-tuning: (1) the relax loss, which replaces the constraint with a miscalibration penalty, and (2) the reward loss, which converts calibration into a reward fine-tuning problem. We demonstrate that these approaches substantially reduce calibration error across hundreds of simultaneous constraints and models with up to nine billion parameters, spanning applications in protein design, image generation, and language modeling.

Abstract:
ComputeIn-Memory (CIM) accelerators are promising for energy-efficient edge inference, yet they faces fundamental challenges when deploying Deep Neural Networks (DNNs), as hardware-induced weight perturbations from intrinsic noise and device drift degrade accuracy and impede reliable inference. To tackle this challenge, we propose Diversity-aware Weight Perturbation (DWP), an immune-system-inspired training method that emulates affinity-based selection by exploiting sample-level prediction disagreement under diverse noise realizations to guide adaptive sample weighting, building robustness to weight perturbation. Experiments show that DWP-trained models consistently yield superior robustness, achieving over 15% accuracy improvements compared to standard-trained models under severe weight perturbations (mismatch level up to 70%) and maintaining inference accuracy at 90% over a simulated one-year CIM operation with only 2%–4% variation in accuracy. Moreover, under matched model and inference configurations, deployment on low-precision CIM hardware reduces inference energy by 38% compared to a GPU baseline. These results demonstrate that DWP enables robust and energy-efficient neural network deployment on resource-constrained edge devices with inherent hardware uncertainties.

Abstract:
As LLMs increasingly generate production code, protecting intellectual property demands watermarking techniques that respect code's strict syntactic constraints. In this work, we introduce CodeTracer, an innovative adaptive code watermarking framework underpinned by a reinforcement learning training paradigm. At its core, CodeTracer features a policydriven approach that utilizes a parameterized model to intelligently bias token choices during next-token prediction. This strategy ensures that embedded watermarks maintain code functionality while exhibiting subtle yet statistically detectable deviations from typical token distributions. To facilitate policy learning, we devise a comprehensive reward system that seamlessly integrates execution feedback with watermark embedding signals, balancing process-level and outcome-level rewards. To enable gradient-based optimization of these discrete watermarking decisions, we employ Gumbel Top-k reparameterization. Extensive comparative evaluations demonstrate that CodeTracer outperforms state-of-the-art baselines across multiple benchmarks in both watermark detectability and code functionality. Our code is available at https://anonymous.4open.science/r/CodeTracer-B8EE.

Abstract:
PhysicsInformed Neural Networks (PINNs) represent a significant advancement in computational methods for solving partial differential equations (PDEs). However, the adoption of deeper neural network architectures presents significant challenges, as they struggle to address differential-related complications that arise during the computation of derivatives over the input of PINNs. These complications extend beyond traditional vanishing and exploding gradients to include vanishing and exploding differentials, with both phenomena becoming more severe as networks grow deeper. By examining the computation graph of derivatives in deep neural networks, we identify key bottlenecks causing numerical instabilities in deep architectures. In response, we introduce a novel approach that utilizes Coupling Layers with carefully regulated spectral norms of Jacobian matrices to stabilize and facilitate deep PINN training, effectively addressing differential-related challenges and improving model stability. Our proposed architecture successfully mitigates the fundamental constraints of deeper PINNs while maximizing their capabilities through consistent differential propagation. Comprehensive evaluations show that our approach surpasses conventional shallow PINN methods and alternative deep PINN designs across a range of challenging problems, particularly in cases featuring high-frequency solution components.

Abstract:
Evaluation of longcontext Large Language Models (LLMs) has advanced rapidly. However, most existing benchmarks are limited to the document level and focus mainly on high-resource languages, leaving many fine-grained challenges insufficiently evaluated. To address this gap, we present MGAL, the first multilingual, granularity- and position-aware long-context benchmark. MGAL is constructed from United Nations (UN) reports spanning 8K to 128K tokens across the six official UN languages. It covers four coherent levels of linguistic granularity (word, sentence, paragraph, and document) and further stratifies entries by their position within the document (begin, middle, and end), indexed at both the document and paragraph levels. This design enables systematic diagnosis of multilingual long-context comprehension across different granularities. Through extensive experiments and analyses, we find that: (1) LLMs perform well at word-level tasks but struggle with coarser-grained ones; and (2) Closed-source models retain a clear performance advantage in lower-resource languages. We further identify two new challenges: (1) Under local semantic crowding, where neighboring sentences share topics and entities, models tend to follow surface cues (e.g., connectives like 'however' or repeated entities) rather than the discourse role of the sentence in surrounding context (e.g., background, outcome); and (2) A gap between fluency and consistency in generated outputs, where models produce text that reads smoothly but drifts from the source facts. In addition, we observe several patterns in line with prior studies, including reliance on nearby evidence and reuse of options under uncertainty.

Abstract:
Reflection removal is a highly challenging problem. Though remarkable progress has been made, current methods primarily exploit static image priors from a single frame. Due to the inherent ambiguity between layers, existing methods still suffer from severe residual artifacts. In this paper, we propose leveraging event signals to break this ambiguity. By employing event cameras to capture microdynamics, we reveal the differential motion between the reflection and background layers. We thereby present a novel event-driven reflection removal network, EvReflection, that utilizes these dynamic cues for layer separation. Specifically, we design a Micro-Dynamics Decoupler to disentangle layer-specific motions from event streams as priors, which then guide a Parallax-Attention Rectifier to cleanly remove artifacts from the RGB image. Furthermore, to address the data shortage, we develop a physics-based simulation pipeline and construct the EVR^2 benchmark, the first real-world dataset for this task. Extensive experiments demonstrate that EvReflection significantly outperforms existing methods, recovering clean images in challenging real-world scenarios.

Abstract:
A significant gap exists between theory and practice in deep learning. Generalization and approximation error bounds are often derived for simplified models or are too loose to be informative. Many rely on the manifold hypothesis and on geometric regularity such as intrinsic dimension, curvature, and reach. Progress requires insight into datamanifold geometry and suitable benchmarks, yet existing options are polarized: analytic manifolds with known geometry but limited applicability, or real-world datasets where geometry is only coarsely estimable. We introduce a benchmarking framework for studying data geometry by repurposing and extending dSprites and COIL-20 with additional transformation dimensions and denser sampling, enabling accurate finite-difference estimates of curvature, reach, and volume that are otherwise difficult to estimate reliably and implement in practice. As applications, we assess bounds by Genovese et al. and Fefferman et al., and analyze how geometry evolves across network layers in \beta-VAEs, highlighting the behavior of current bounds and the value of controlled benchmarks for guiding and validating future theory. Code to reproduce the framework and experiments is included with the submission and will be released as open-source library upon publication.

Abstract:
Privacy auditing has emerged as a practical tool for empirically estimating training data leakage in machine learning models, in contrast to the provable but often overly pessimistic bounds provided by differential privacy analysis. A common strategy is to use membership inference attacks to detect the presence of specific canaries—data points chosen to maximize attack success—in training data. However, existing canary designs are largely heuristic, relying on mislabeled or outof-distribution samples. We address this gap by formulating canary design as a bilevel optimization problem, where the model is trained in the inner loop and the canary is optimized in the outer loop to maximize its detectability. To solve this problem, we develop OptiFluence, a scalable optimization framework that combines (i) initialization by selecting candidates using influence functions and (ii) unrolled optimization with memory-efficient techniques. Our approach achieves remarkable empirical performance on four datasets. Optimized canaries demonstrate 415× (CIFAR-10/100) higher detectability than in-distribution baselines, achieving near-perfect detection rates of 99% true positive rate at 0.1% false positive rate. Critically, these canaries transfer effectively across different model architectures without retraining, enabling practical third-party privacy audits. This transferability allows regulators and auditors to assess model privacy without requiring access to proprietary training infrastructure or substantial computational resources.

Abstract:
Natural language to optimization (NL2Opt) requires translating unstructured text into executable mathematical models. Beyond simple syntax errors, this task suffers from silent modeling failures, where incorrect formulations execute successfully but yield invalid results. We propose Draftand-Audit RL (DA-RL), a framework that learns optimization modeling as a two-step iterative workflow. Unlike inference-time scaffolds that rely on intermediate solver feedback to guide repairs, DA-RL optimizes a shared-parameter policy using terminal-only verification: the model is rewarded solely based on the execution of the final audited program. This constraint forces the model to internalize rubric-guided revision as a learned capability and encourages the emergence of cross-turn synergy, where the policy learns to generate drafts that are structurally amenable to self-correction.

Abstract:
AIgenerated image (AIGI) detection is undergoing a critical transition from laboratory benchmarks to open-world adversarial defense. The prevalent paradigm focuses on finding static feature spaces, assuming that some invariant artifacts learned from historical data can achieve universal zero-shot generalization. While achieving saturation on several AIGI benchmarks, this static hypothesis suffers a severe performance drop against rapidly evolving generators (e.g., SD3, Nano Banana Pro). To address these limitations, we propose that the field should expand beyond “static generalization” to a new paradigm of “dynamic adaptation”. We introduceFleet,ForensicLearning viaEvolvingExemplarTuning, a framework that pioneers a dynamic paradigm of continuous few-shot evolution, enabling rapid alignment with emerging generative threats. By employing dual-space orthogonal fine-tuning, Fleet surgically adapts to novel artifacts via a lightweight subspace without disrupting the pre-trained semantic manifold. To validate this, we presentTreasure, a benchmark spanning 64 models and 360k images, featuring diverse architectures and 20 closed-source commercial engines. Experiments reveal that while static SOTA methods fail catastrophically on modern generators, Fleet restores performance from 20.4% to 73.1% with only 10-shot adaptation on Doubao Seedream 4.0. Code and data will be released.

Abstract:
Building reliable applications that leverage large language models (LLMs) remains a significant challenge. While LLMs offer impressive capabilities across diverse tasks, their outputs often lack accuracy and provide no clear measure of confidence. This uncertainty compounds in flows of multiple calls to LLMs and other tools, making it difficult for developers and endusers to trust the results. This paper introduces a probabilistic language for programming LLM-based flows. It enables developers to quantify and propagate uncertainty throughout the application's flow, and experiment with different inference scaling techniques without adding a single line of code beyond the flow's logic. We present an experimental study to demonstrate this capability, and a case study building a theorem proving agent for the Rocq theorem prover.

Abstract:
With the rapid development of deep learning, the issue of data scarcity has become increasingly prominent, inspiring emerging interests towards research on data augmentation techniques over recent years. However, our literature survey indicates that existing efforts often suffer from two issues of semantic infidelity, including: (i) visual semantics infidelity, such as visual artifacts, manifold intrusion, and unnatural blending boundaries etc, and (ii) label semantic infidelity, where augmented images do not match the original labels, creating extra label noises. To address these issues, we propose a Semantics Calibrated and DiffusionEnhanced Augmentation (SCalDA) scheme to achieve accurate semantics calibration across image, label and feature domains. Compared with the existing approaches, our proposed features in precise guidance in label domain, semantics driven synthesis across three domains (image, label and feature), and semantics-aware metric learning. Extensive experiments on multiple datasets demonstrate that SCalDA yields consistent and significant performance improvements for both fine-grained and general classification tasks, validating the effectiveness and broad applicability of the proposed.

Abstract:
Incomplete multiview clustering (IMVC) aims to uncover shared clustering structures from heterogeneous views with partial observations. Recently, existing generative IMVC methods have made significant progress in this field; however, they still remain limited in two aspects. On the one hand, they rely on weak cross-view signals, resulting in unstable latent recovery when facing heterogeneous missing data. On the other hand, they overlook stable cross-view neighborhood structures, leading to weak structural constraint. To address these limitations, we propose neighborhood-conditioned diffusion for incomplete multi-view clustering (IMVC-NCD), which achieves robust latent completion. Our method learns compact view-specific latent representations and constructs a unified conditioning vector by aggregating stable local neighborhood structures from available views while encoding heterogeneous missingness states, providing reliable guidance for diffusion-based denoising. With neighborhood-level conditioning, IMVC-NCD produces semantically aligned and view-consistent latent representations that are well suited for clustering, even under high missing-view ratios. Extensive experiments on four benchmark datasets demonstrate the effectiveness and robustness of our method compared with state-of-the-art IMVC approaches.

Abstract:
Observational studies are a key resource for causal inference but are often affected by systematic biases. Prior work has focused mainly on detecting these biases, via sensitivity analyses and comparisons with randomized controlled trials, or mitigating them through debiasing techniques. However, there remains a lack of methodology for uncovering the underlying mechanisms driving these biases,e.g., whether due to hidden confounding or selection of participants. In this work, we show that the relationship between bias magnitude and the predictive performance of nuisance function estimators (in the observational study) can help distinguish among common sources of bias. We validate our methodology through extensive synthetic experiments and a realworld case study, demonstrating its effectiveness in revealing the mechanisms behind observed biases. Our framework offers a new lens for understanding and characterizing bias in observational studies, with practical implications for improving causal inference.

Abstract:
As larger language models deliver stronger capabilities, their autoregressive inference becomes increasingly expensive. Speculative decoding accelerates generation by letting a fast draft propose tokens that the target model verifies in parallel. Yet under sampling (T>0), observed speedups consistently lag behind those under greedy decoding, as the classical lossless verification rule tends to overreject low-risk drafts, leading to lower acceptance rates and limited acceleration. To address this gap, we propose ARC-Decode (Acceptance with Risk Control), a training-free method that augments speculative decoding without extra forward passes. ARC-Decode enables relaxed acceptance by identifying drafts whose acceptance preserves the output distribution of the target model, under a risk-controlled criterion based on Jensen--Shannon divergence. It combines confidence-based pre-verification filtering with a risk-bounded acceptance criterion derived from an analytic upper bound on the potential distributional deviation. Integrated into the state-of-the-art EAGLE-3 pipeline, ARC-Decode increases accept length per cycle and reduces verification compute, achieving up to 1.6× end-to-end speedup over EAGLE-3 under sampling with negligible quality change across benchmarks.

Abstract:
Hiddenaction principal-agent problems model scenarios in which a principal induces an agent to take a costly and unobservable action through the provision of outcome-dependent payments. These problems find application in a variety of real-world settings, such as crowdsourcing, online labor platforms, and machine learning task delegation. Recently, much of the literature has focused on how to handle the principal’s uncertainty about the agent and the surrounding environment, which is often the main challenge in practice. One prominent approach is to adopt an online learning framework, where the principal repeatedly interacts with the agent to learn optimal payments from experience. However, existing learning algorithms, while achieving regret that scales sublinearly in the number of interaction rounds T, typically suffer from an exponential dependence on the size of the problem instance. In this paper, we show that this problematic exponential growth can be avoided by assuming that the principal has knowledge of a set of possible actions of the agent, while remaining unaware of which actions are actually available---an assumption that is reasonable in many real-world settings.

Abstract:
Current stateof-the-art multi-class unsupervised anomaly detection (MUAD) methods rely on training encoder–decoder models to reconstruct anomaly-free features. We first show these approaches have an inherent fidelity–stability dilemma in how they detect anomalies via reconstruction residuals. We then abandon the reconstruction paradigm entirely and propose Retrieval-based Anomaly Detection (RAD). RAD is a training-free approach that stores anomaly-free features in a memory and detects anomalies through multi-level retrieval, matching test patches against the memory. Experiments demonstrate that RAD achieves state-of-the-art performance across four established benchmarks (MVTec-AD, VisA, Real-IAD, 3D-ADAM) under both standard and few-shot settings. On MVTec-AD, RAD reaches 96.7% Pixel AUROC with just a single anomaly-free image compared to 98.5% of RAD's full-data performance. We further prove that retrieval-based scores theoretically upper-bound reconstruction-residual scores. Collectively, these findings overturn the assumption that MUAD requires task-specific training, showing that state-of-the-art anomaly detection is feasible with memory-based retrieval. Our code is available at https://anonymous.4open.science/r/RAD-3A6C.

Abstract:
Generalized Category Discovery (GCD) requires models to categorize an unlabeled pool containing both known and novel classes under sparse supervision. We identify a systemic confidence bias inherent in existing parametric methods: while entropy regularization prevents class collapse, it indiscriminately suppresses predictive certainty on all unlabeled instances. This bias drives a distributional wedge between labeled and unlabeled samples of the same category, forcing models to sacrifice their stability on known classes to achieve plasticity for new ones. To resolve this, we propose Reliable Confidence Alignment (RCA), a plugand-play framework grounded in Evidential Deep Learning. RCA first establishes high certainty anchors on labeled data using a Reliable Anchor for Certainty (RAC) module. Then, we introduce Cross-view Confidence Alignment (CCA) to propagate this grounded reliability to the unlabeled discovery set. Thus, RCA captures the fine-grained geometry of the probability simplex, effectively calibrating the model's epistemic uncertainty. Extensive evaluations on coarse- and fine-grained benchmarks demonstrate that RCA effectively rectifies the confidence landscape, significantly mitigating performance decay on known classes without compromising novel-class discovery.

Abstract:
Toolaugmented LLM agents can harbor implicit state that persists across sessions, activates through events, and propagates across agent boundaries—all invisible to standard debugging. We formalize this as Persistent Semantic Entities (PSEs): constructs defined by name binding, event triggering, and cross-boundary propagation. Experiments across twenty models from nine families (OpenAI, Anthropic, Google, Meta, Alibaba, DeepSeek, Mistral, Zhipu, Moonshot) spanning 1.5B to 1 trillion parameters reveal three findings. First, PSE susceptibility affects all tested architectures including Claude (88%) and Gemini (84–96%), with rates ranging 20–100%; even the largest model (1T parameters) shows 50% susceptibility. Second, contamination does not decay—it increases over conversation turns, as instruction-tuned models reinforce rather than forget injected state. Third, self-reflection provides inconsistent protection—from no effect to negative effect (contamination increases 14% on Claude-Sonnet-4)—while quarantine-based validation consistently achieves 57–85% reduction across models. We validate findings against documented production incidents. Our work establishes PSEs as a distinct phenomenon requiring architectural solutions beyond prompt engineering.

Abstract:
Imageto-video (I2V) diffusion models have recently made generative inbetweening a practical reality by synthesizing semantically plausible intermediate frames between two keyframes. Among them, inference-time sampling schemes that re-use large pre-trained I2V backbones without any additional training are especially attractive. Yet current methods frequently exhibit temporal inconsistency and artifacts such as ghosting or reverse motion. A key reason is that the two trajectories are driven by distinct motion priors, each inherited from its own conditioning frame, and are simply stitched together without explicitly reconciling these priors. We introduce Motion-Residual Conflict-Aware Time Reversal (MR‑CATR), an inference-time sampling framework that aligns conflicting motion priors instead of discarding one of them or collapsing to a single start-conditioned prior. MR‑CATR first derives a motion-residual–based direction from the forward path, combined with an end-conditioned residual to form a consensus motion axis. This design suppresses bidirectional motion conflicts while still allowing end-frame information to refine the trajectory and enforce endpoint consistency. MR‑CATR can be seamlessly integrated into existing time-reversal samplers without changing model parameters. Experiments on generative inbetweening benchmarks show that our method produces videos with smoother motion, fewer artifacts, and consistently better quantitative scores and user preferences than prior strategies.

Abstract:
Recent advances in OmniMultimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks. We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI.

Abstract:
Domain generalization aims to improve model performance on unseen, outof-distribution (OOD) domains, yet existing methods often overlook the crucial aspect of uncertainty quantification in their predictions. While ensemble learning combined with knowledge distillation offers a promising avenue, naively combining these techniques is non-trivial and remains largely unexplored in the context of domain generalization. In this work, we systematically investigate different ensemble and distillation strategies for domain generalization tasks and design a tailored data allocation scheme. This approach trains teacher models on distinct subsets of domains and performs distillation on complementary (unseen) subsets, thereby fostering model diversity and training efficiency. Moreover, our theoretical analysis demonstrates that distilling from teachers on unseen domains effectively filters out domain-specific spurious correlations. To address the accuracy degradation often observed with standard uncertainty distillation, we further develop a novel technique that decouples uncertainty distillation from the standard distillation process, enabling accurate uncertainty estimation without compromising model accuracy. Our proposed method,Domain-aware Ensemble Distillation(DomED), is extensively evaluated against state-of-the-art domain generalization and ensemble distillation techniques across multiple benchmarks, achieving competitive accuracies and substantially improved uncertainty estimates.

Abstract:
This work presents WorldCompass, a novel Reinforcement Learning (RL) posttraining framework for the long-horizon, interactive video-based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-level Roll-out Strategy: We generate and evaluate multiple samples at a single target clip, which significantly boosts roll-out efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: We design reward function for both interaction-following accuracy and visual quality, which provide direct supervision and effectively suppress reward-hacking behaviors. 3) Efficient RL Algorithm: We employ the negative-aware fine-tuning strategy coupled with various efficiency optimizations to efficiently and effectively enhance model capacity. Evaluations on the SoTA open-source world model, WorldPlay, demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.

Abstract:
Large Reasoning Models (LRMs) excel at complex reasoning but suffer from inefficient reasoning, like overthinking and overshoot. These issues stem from excessive or misdirected reasoning triggered by the model's "doubt", manifested as selfvalidation and exploratory extension, increasing computational cost and degrading performance. Existing efficient reasoning methods seek to regulate reasoning via internal signals or static schedules, lacking specialization in the "doubt" characteristics of LRMs. To address this, we propose a Conjunction-Guided Intervention (CGI) reasoning framework that intervenes when the model shows signs of doubt. Our key insight is that overthinking and overshoot in LRMs arise from conjunction-triggered extensions where LRMs exhibit "doubt" through transitional conjunctions, extending redundant self-validation and exploration without timely state-based correction. Building on this insight, CGI pauses reasoning at conjunction markers of doubt for external state-based feedback, adaptively extending or terminating reasoning to reduce redundancy while preserving accuracy. The feedback is generated via criteria evaluation (rationality and completeness) and comes from either human or LLM proxies. We train the target model by Group Relative Policy Optimization (GRPO) to adapt to the interactive mode. Experiments show that our framework achieves a superior balance between accuracy and reasoning length.

Abstract:
For some time, LLMs have sped up life sciences research by synthesizing published literature, but now leading LLMbased tools can also perform certain in silico tasks that had previously been the domain of experienced biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they are also changing the landscape of biosecurity risks. To address this, we introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of evaluations to measure agentic biosecurity-relevant capabilities. ABC-Bench evaluates LLM-based agents on both benign and potentially harmful biosecurity-relevant tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening. These tasks require a combination of biology and software expertise; indeed, when PhD biologists with at least two years of coding experience attempted the tasks in ABC-Bench, they scored only 24% on average. By contrast, the top-performing LLM, Grok 3, achieves 53% across tasks, outperforming 60%, 100%, and 54% of experts on the Liquid Handling Robot, Fragment Design, and Screening Evasion tasks, respectively. In three additional experiments, we found that OpenAI's GPT-4o-mini-high produced code that, when run on an OpenTrons robot, successfully assembled DNA with the expected sequences.

Abstract:
Multiagent debate has shown promise for improving the reasoning of large language models, yet recent theory suggests its benefits are highly regime-dependent. While interaction can amplify informative signals under corrective conditions, symmetric debate dynamics are neutral in expectation, often making majority voting preferable. We reconcile these views by arguing that debate is effective only when invoked at the right time and with appropriate structure. Based on this insight, we propose LASE, a leader-centric multi-agent debate framework that selectively engages interaction only in non-neutral regimes. LASE introduces an asymmetric leader–supporter architecture that enables directed information flow and selective signal amplification, while defaulting to simple aggregation otherwise. Experiments across diverse reasoning benchmarks show that LASE achieves multi-agent-level performance with near single-agent token cost, substantially improving efficiency over static debate and voting baselines.

Abstract:
Selfsupervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. High-level semantic SSL (e.g., DINO) relies on global tokens that are forced to be location-invariant for augmentation alignment, a process that inherently discards the spatial coordinates required for reconstruction. Conversely, generative models preserves dense feature grids for reconstruction but fails to produce high-level abstractions. We introduce STELLAR, a framework that resolves this tension by factorizing visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows us to perform DINO-style augmentation alignment on the semantic tokens while maintaining the precise spatial mapping in the localization matrix necessary for pixel-level reconstruction. We demonstrate that as few as 16 sparse tokens under this factorized form are sufficient to simultaneously support high-quality reconstruction (2.60 FID) and match the semantic performance of dense backbones (79.10% ImageNet accuracy). Our results highlight STELLAR as a versatile sparse representation that bridges the gap between discriminative and generative vision by strategically separating semantic identity from spatial geometry.

Abstract:
Processing long sequence data such as speech requires models to maintain longterm dependencies, which is challenging for recurrent spiking neural networks due to high temporal dynamics in neuron models that leak stored information in their membrane potentials, and due to vanishing gradients during backpropagation through time. These issues can be mitigated by employing more complex neuron designs, such as ALIF and TC-LIF, but these neuron-level solutions often incur high computational costs and complicate hardware implementation, undermining the efficiency advantages of Spiking neural networks. Here we propose an architectural-level solution that leverages the dynamical interactions of a few leaky integrate-and-fire (LIF) neurons to enhance long-term information storage. The memory capability of this LIF-based micro-circuit is adaptively modulated by global recurrent connections of the recurrent spiking neural network, contributing to selective enhancement of temporal information retention, and promoting stable gradient propagation through time. The proposed model outperforms previous methods including LSTM, ALIF, and TC-LIF in long sequence tasks, achieving 96.52% accuracy on the PS-MNIST dataset. Furthermore, our method also provides a compelling efficiency advantage, yielding up to 277× computational efficiency improvement compared to conventional models such as LSTM. This work paves the way for building cost-effective, hardware-friendly, and interpretable spiking neural networks for long sequence modeling.

Abstract:
Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and improve reasoning in language models, but their effects on model alignment haven't been well studied. Could the selfbootstrapping nature of these methods amplify undesired behavior in models? We test seven consistency training methods on 108 ``model organisms'': open-source models (7B--70B) fine-tuned to exhibit various forms of controlled misaligned behavior. We find that outcomes vary significantly: consistency training generally suppresses reward hacking and emergent misalignment but amplifies sycophancy. We present evidence that distribution shifts induced by the consistency labeling process, rather than variation in the selection operators, may be the primary driver of systematic alignment effects. Finally, we present a unifying theoretical framework to derive conditions under which consistency training will amplify or suppress misalignment. In total, our study establishes that consistency training is not alignment-neutral, and that its use in critical systems should be carefully audited.

Abstract:
Despite the remarkable progress of Large Language Model (LLM) based MultiAgent Systems, most research focuses on optimizing coordination topology while largely underexploring the equally critical problem: how to transmit and optimize messages among agents effectively? Current communication schemes typically rely on the direct concatenation of first-order neighbor responses, which induces a restricted evidence receptive field and leads to the dilution of crucial insights over multi-hop paths. To address these limitations, we propose the Multi-Order Communication (MOC) scheme, which reconstructs the inter-agent communication to capture multi-hop dependencies and incorporates a structural message consolidation strategy to ensure efficiency. Specifically, we formalize the communication mechanism to construct a structured multi-order evidence stream, and subsequently design a Semantic-Topological Merging algorithm to optimize semantic fidelity within token constraints. Extensive experiments across six diverse datasets and LLM backbones of varying parameter scales demonstrate that MOC consistently improves task performance and reduces communication costs.

Abstract:
Operator splitting methods are at the foundation of many numerical solvers for partial differential equations. In parallel, unrolled and hybrid learningbased architectures have been introduced to enhance classical solvers, but their design is rarely linked to the underlying problem structure. In this work, we propose a unifying framework that explicitly links operator splitting algorithms from optimization with unrolled hybrid architectures. We show that each operator splitting scheme naturally defines an unrolled architecture, which recovers a wide range of existing plug-and-play and hybrid models as special cases. Using this framework, we design new unrolled hybrid architectures and validate them on benchmark fluid dynamics simulations, where they achieve improved accuracy and stability.

Abstract:
Targeted instruction tuning requires selecting pertinent samples from massive mixedcandidate datasetsguided by a smallreference datasetreflecting the desired capability, yet efficiently identifying highquality data amidst noise remains challenging. To address this, we proposeTarGATE(Target-awareGATEs, a simple yet effective data selection framework that leverages the model's inherent data understanding. TarGATE computes a token-level Information Retention Ratio (IRR) to scale the output of the feed-forward network, where the instance-level average IRR serves as a quantitative metric for data quality. To align gates' preferences with the target task, we employ a joint optimization strategy utilizing the reference set and a subset of candidate data, which encourages the gates to assign higher IRRs to reference-aligned data while suppressing low-quality samples. Extensive experiments across noisy and real-world scenarios demonstrate that TarGATE outperforms related baselines. Furthermore, TarGATE exhibits superior computational efficiency and strong cross-model transferability, enabling smaller selector to effectively curate high-quality fine-tuning data for larger foundation models. The code is available athere.

Abstract:
Activities in numerous evolving systems can be represented as dynamic graphs in snapshot form at different time intervals, i.e., discretetime dynamic graphs (DTDGs). Existing methods show impressive advances in capturing historical temporal evolution patterns in DTDGs, but they focus on addressing an offline learning setting, where models are trained using historical snapshots once and then evaluated to all subsequent graph snapshots without further updating. This fails to capture 1) the nature of evolving complexities across graph snapshots and 2) the distribution shift in the testing graph snapshots. To address these problems, we propose PromptDyG, a novel framework that leverages unsupervised test-time Prompt adaptation for Dynamic Graph learning under a live-update online setting. The key insight is that an expressive dynamic graph prompt can be learned on a frozen backbone via minimization of feature-wise, label-free entropy to efficiently and continuously model the evolving patterns. We show theoretically that this unsupervised prompt adaptation can guarantee a larger similarity margin between positive and negative pairs, facilitating more accurate dynamic predictions. It is further confirmed by our extensive empirical results on six benchmark datasets that show consistent and significant improvements of PromptDyG over state-of-the-art baselines. Code is available at https://anonymous.4open.science/r/PromptDyG-3887.

Abstract:
This paper studies secondorder methods for nonconvex-strongly-convex bilevel optimization. We propose a novel fully second-order bilevel approximation method (FSBA) that achieves an iteration complexity of \tilde\mathcalO(\epsilon^-1.5) for finding the (\epsilon, \mathcalO(\sqrt\epsilon)) second-order stationary point of the hyper-objective function. Our results demonstrate that second-order methods can achieve an accelerated convergence rate than first-order methods in bilevel optimization. To address the heavy computational cost associated with the second-order oracle, we introduce a lazy variant of FSBA, called LFSBA, which reuses second-order information across several iterations. We prove that LFSBA exhibits better computational complexity than FSBA by a factor of \sqrtd, where d is the dimension of the problem. We also apply a similar idea to nonconvex strongly-concave minimax optimization and propose the lazy minimax cubic-regularized Newton (LMCN) method with better computational complexity compared to existing second-order methods.

Abstract:
Instruction data curation is central to improving the instructionfollowing ability of large language models. However, existing approaches often struggle to simultaneously maintain data quality, diversity, and distributional consistency, largely because they do not explicitly distinguish semantic redundancy from quality defects and rely on coarse-grained modeling of instruction data quality. To address this issue, we propose Cure-SFT, a coarse-to-fine, diagnostic-guided method for instruction data curation that explicitly disentangles semantic redundancy from quality defects. Specifically, Cure-SFT removes redundant samples via stratified semantic-geometric sampling, applies teacher models for diagnostic triage, and performs targeted defect remediation on fixable samples. Our experiments show that Cure-SFT can surpass full-data instruction tuning using only 10% of the data budget. Moreover, Cure-SFT consistently outperforms strong selection-based and rewriting-based baselines across data budgets, supporting the effectiveness of diagnostic-guided data curation.

Abstract:
Tokenlevel adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. However, prior work is primarily evaluated on natural-language benchmarks using task-level metrics, where token-level difficulty is unobservable and confounded with architectural factors, making it unclear whether compute allocation truly aligns with underlying complexity. We address this gap through three contributions. First, we introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty, enabling direct testing of token-level compute allocation. Second, we propose ANIRA, a unified recurrent Transformer framework that supports per-token variable-depth computation while isolating compute allocation decisions from other model factors. Third, we use this framework to conduct a systematic analysis of token-level adaptive computation across alignment with complexity, generalization, and decision timing. Our results show that compute allocation aligned with task complexity can emerge without explicit difficulty supervision, but such alignment does not imply algorithmic generalization: models fail to extrapolate to unseen input sizes despite allocating additional computation. We further find that early compute decisions rely on static structural cues, whereas online halting more closely tracks algorithmic execution state.

Abstract:
Despite strong performance, large language models (LLMs) still suffer from hallucinations. Most existing mitigation methods operate at inference time, without addressing the underlying cause: LLMs are not trained to recognize their own lack of knowledge, and therefore tend to generate plausible responses even when the required knowledge is missing. Alignmentbased approaches encourage uncertainty expression or refusal to improve truthfulness, but often consequently degrade helpfulness. To address this trade-off, existing alignment methods typically treat truthfulness and helpfulness as either universally collaborative or universally conflicting objectives across all samples. In contrast, we show that these objectives are consistent for most samples and conflict only in a small subset—where adaptive trade-off is truly needed. Based on this insight, we propose Conflict-Aware Adaptive Margin Preference Alignment (CAMP), which explicitly models when conflicts arise and adaptively regulates optimization strength. Experiments on UltraFeedback and representative hallucination benchmarks demonstrate that CAMP consistently improves truthfulness while maintaining a favorable helpfulness trade-off compared to strong hallucination mitigation and multi-objective alignment baselines.

Abstract:
Lowlight video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxiliary modalities, such as event streams and infrared images. However, these methods typically assume the availability of these modalities at inference, which is often not feasible in real-world scenarios. To solve this problem, in this work, we propose AMNet, a unified multimodal framework for LLVE, to support flexible modality-agnostic inference, where auxiliary modalities may be unavailable. To address the issue of modality absence, we introduce a Spatial-Spectral Dual-Gated Translator that learns the correspondence between auxiliary modalities and RGB inputs, producing implicit auxiliary representations to support the robust enhancement. Additionally, to fully facilitate the learning of cross-modal correspondence, we conduct large-scale multimodal pretraining based on the RGB-only dataset with synthetic auxiliary modalities. Extensive experiments demonstrate that AMNet could handle arbitrary inference-time modality combinations and exhibits superior performance for LLVE under modality absence conditions.

Abstract:
Multimodal large language models (MLLMs) often introduce errors when generating image captions, resulting in misaligned imagetext pairs. Our work focuses on a class of captioning errors that we refer to as systematic misalignments, where a recurring error in MLLM-generated captions is closely associated with the presence of a specific visual feature in the paired image. Given a vision-language dataset with MLLM-generated captions, our aim in this work is to detect such errors, a task we refer to as systematic misalignment detection. As our first key contribution, we present Symbal, which utilizes a structured, dual-stage setup with off-the-shelf foundation models to identify systematic misalignments and summarize results in natural language. As our second key contribution, we introduce SymbalBench, a benchmark designed to evaluate automated methods on our proposed task. SymbalBench consists of 420 vision-language datasets from two domains (natural images and medical images) with annotated systematic misalignments. Symbal exhibits strong performance on this benchmark, correctly identifying systematic misalignments in 63.8% of datasets, a nearly 4x improvement over the closest baseline. We supplement our evaluations on SymbalBench with real-world evaluations, showing that (1) Symbal can accurately surface systematic misalignments in captions generated by four MLLMs and (2) Symbal is a powerful tool for auditing off-the-shelf image-caption datasets. Ultimately, our novel task, method, and benchmark can aid users with auditing MLLM-generated captions and identifying critical errors, without requiring access to the underlying MLLM.

Abstract:
Visionlanguage models such as CLIP have shown strong zero-shot performance, but their reliability degrades in realistic multi-label settings under distribution shift. Standard test-time adaptation (TTA) methods either rely on costly gradient-based updates or adopt lightweight statistical schemes that implicitly assume label independence. The latter is particularly harmful in multi-label scenarios, where visually dominant classes suppress semantically correlated yet weaker labels, leading to severe recall loss. We revisit multi-label TTA from a Bayesian perspective and propose Bayesian Conditional Priors~(BCP) estimation, a backpropagation-free framework that injects label dependencies into CLIP's zero-shot predictions. Treating the zero-shot scores as approximate marginal posteriors, BCP derives an anchor-conditioned Bayesian refinement in which each logit is corrected by a term determined solely by the conditional prior P(c_i=1 \mid c_a=1). These conditional priors are estimated online via second-order co-occurrence statistics over the test stream and instantiated as closed-form, monotonic logit transformations, without backpropagation or architectural changes. Experiments on multi-label benchmarks show that this structure-aware adaptation consistently improves mean average precision over entropy-based and retrieval-augmented TTA baselines, while incurring negligible computational overhead.

Abstract:
We introduce 3DDLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Open-source code will be made publicly available. Video results are available on the project website: https://sites.google.com/view/3d-dlp/home

Abstract:
While multimodal agents can achieve strong performance through finetuning, their ability to generalize remains limited in complex real-world tasks such as mobile navigation, where diverse applications, frequent system changes, and customized workflows are common in practice. We argue that a fundamental bottleneck lies in whether an agent possesses sufficient task-specific procedural knowledge to accomplish a given goal. Such procedural knowledge may be provided by the general capabilities of large language models, or obtained from additional external resources such as web search when necessary. Based on this view, we propose Procedure-Aware Multimodal Agent with Meta Reasoning, a framework that explicitly represents task knowledge as natural-language procedures and trains a procedure-aware grounded agent to condition its actions on this knowledge. By learning to leverage procedural knowledge from different sources, our approach enables robust generalization across tasks, applications, interface versions, and multi-app workflows, achieving substantial improvements on challenging Android benchmarks.

Abstract:
Identifiable virtual face (IVF) generation aims to transform a user's original face into a virtual face for high utility privacy protection. The IVF is visually and statistically different from the original face, which can still be used for recognizing the user's identity. Despite the advantage, these schemes are unable to verify the trustworthiness of the IVF, the quality and controllability of which is often limited. To address these issues, we propose TIVDiff, a diffusionbased framework for trustworthy and identifiable virtual face generation. TIVDiff learns a virtual identity (VID) space via Virtual Identity Projection (VIP) and synthesizes high-quality virtual faces conditioned on VID and 3D facial geometry for pose and expression preservation. To enable the trustworthiness of IVF, we introduce Identity-Guarded Generative Watermarking (IGGW) to bind the diffusion initial noise with VID through a reversible mapping. This enables the embedding of an imperceptible cue into IVF for legitimacy verification. Experiments demonstrate the advantage of our TIVDiff over the state-of-the-art IVF generation schemes in terms of image quality, identifiability and trustworthiness.

Abstract:
Evaluation metrics establish a standard assessment framework for models, playing a pivotal role in model optimization and advancement. However, widely adopted pixelwise metrics (e.g., IoU, Dice) rely heavily on pixel-level statistics, often failing to capture the structural integrity of predictions. While the S-measure (S_m) incorporates structural perception to some extent, it struggles to differentiate critical structural violations and remains insensitive to background false positives and small objects. To address these limitations, we propose the Topology-aware measure (T_m), a novel metric designed to explicitly quantify topological consistency. T_m employs the Fuzzy Jaccard Index as a foundational score, integrates a Topological Integrity term (I_topo) to penalize critical structural fragmentation, and utilizes a Boundary Alignment term (\mathcalA_bdy) to evaluate boundary alignment. These three components synergize to achieve robust evaluation of prediction maps at the topological level. We establish a rigorous Meta-Measure validation framework and benchmark our method against nine mainstream metrics across diverse complex scenarios. Extensive experiments demonstrate that T_m performs exceptionally in downstream tasks and maintains high consistency with human visual perception.

Abstract:
Anyto-any modeling aims to flexibly relate arbitrary modalities within a single system, a requirement that arises across multimodal learning and scientific domains such as ecology and astronomy. However, existing any-to-any approaches are typically trained from scratch using encoder–decoder or diffusion architectures, limiting empirical performance and the use of pretrained models. We investigate decoder-only any-to-any multimodal modeling, which treats all modalities symmetrically and supports arbitrary modalities as inputs and outputs without modality-specific heads, losses, or task pipelines. As a consequence of this unified design, the resulting model MODUS naturally enables chained generation through intermediate modalities, cross-modal consistency verification, and analysis of visual representations by combining semantic and reconstruction features. Across a range of benchmarks, MODUS demonstrates strong out-of-the-box performance and flexible multimodal composition within a single model.

Abstract:
Traditional 3D visual grounding typically assumes that natural language expressions unambiguously refer to target objects in a 3D scene. However, in practical applications, human instructions are often ambiguous or insufficient, which may lead existing models to associate the query with multiple possible objects, resulting in incorrect results. In this paper, we propose a new task, 3D visual grounding with referential ambiguity, which allows for referential ambiguity in language descriptions, making it more broadly applicable to realworld scenarios. To tackle this task, we propose an interactive grounding framework that performs multi-round question-answer interactions, in which the model actively generates clarifying questions and receives human-provided answers to acquire additional object attributes, spatial relationships, and other contextual information, thereby resolving referential ambiguity and achieving accurate grounding. To support the learning of interactive grounding, we construct a large-scale dataset named AmbiRefer3D, which contains 47,085 samples with 141,255 annotations of question-answer dialogues that capture interactive disambiguation processes, covering 7,316 indoor 3D scenes. Furthermore, we establish multi-round evaluation metrics to measure both disambiguation efficiency and grounding accuracy.

Abstract:
Flowmatching policies have recently emerged as a powerful class of generative models for offline reinforcement learning (RL), capable of capturing complex, multi-modal action distributions from static datasets. However, standard training objectives are largely agnostic to the global properties of the generative path, permitting learned vector fields that are irregular and unstable, which can hinder performance. In this work, we introduce PDE-regularized Q-Learning (PQL), a novel algorithm that addresses this limitation by imposing a principled structure on the entire probability flow. PQL makes two synergistic contributions: first, a partial differential equation based regularizer derived from the continuity equation enforces global smoothness and stability on the flow. Second, to solve the complex optimization problem introduced by this regularizer, we propose a Beta-distributed timestep sampling strategy that focuses learning on the critical trajectory segments where the trade-off between imitation and smoothness is most acute. Through extensive experiments, we demonstrate that by structuring the generative journey and not just its destination, PQL achieves state-of-the-art performance on a wide range of challenging offline RL tasks.

Abstract:
Openvocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or introduce token-level alignment with explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, including a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained free-energy minimization. Extensive experiments show that ExpAlign consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories. Most notably, it achieves 36.2 AP_r on the LVIS minival split, outperforming other state-of-the-art methods at comparable model scale, while remaining lightweight and inference-efficient.

Abstract:
Image generative models are trained on massive, largely uncurated internetscale datasets that contain undesirable visual concepts. Efficiently removing such concepts from the model generations without degrading the quality of output images remains challenging. We introduce a novel concept removal method for frontier diffusion and image autoregressive models, such as, SD3.5, Flux, and Infinity. Our intervention replaces the internal bottleneck layer present in all these modern models with a transcoder that is trained to replicate the original layer while structuring it into distinct activation features. This in‑place substitution creates an integrated filter through which concept‑specific signals can be selectively disabled while preserving the rest of the model’s behavior. Since the intervention modifies the model backbone rather than attaching an external component, it remains persistent under white‑box access. Empirically, the approach achieves state‑of‑the‑art concept removal performance across modern diffusion and autoregressive models, maintains visual generation quality, provides robustness against adversarial prompts, and supports sequential removal of diverse concepts. This positions our method as a practical approach for concept removal in frontier image generative models.

Abstract:
Transformers consistently fail to learn certain simple functions such as Paritywhich returns whether the input has an even number of ones---even when they can provably compute them with specific parameter settings. This gap betweenlearnabilityandexpressivityis particularly prominent for sensitive functions---functions whose output is likely to change if a single bit of the input is changed. While prior work has established that transformers exhibit a bias toward low-sensitivity functions, the precise mechanism underlying this bias remains poorly understood. To shed light onto this phenomenon, we study the geometry of transformers' parameter space. We show that sensitive functions---even when representable---occupy a vanishingly small region that random initialization is unlikely to reach. More specifically, we prove that randomly initialized transformers almost surely compute functions with many low-sensitivity inputs, where flipping a bit is unlikely to change the output. Our results provide a novel theoretical grounding for the empirical observation that transformers exhibit a strong bias toward low-sensitivity functions, shifting the focus from average sensitivity to the fullsensitivity profile.

Abstract:
Diffusion models have advanced 3D shape generation, yet most methods still denoise in highcardinality spaces (e.g., voxel/SDF grids, meshes, or point clouds), which is computationally and memory intensive and makes it difficult to scale in terms of both higher resolution and stronger controllability. We rethink the diffusion representation and propose to move diffusion from dense geometry to compact geometric primitives, representing each shape as a small set ofsuperquadrics. Instead of operating on thousands to millions of geometric representation values, we leverage 7KB superquadric parameters (pose, size, and shape), drastically reducing diffusion-state dimensionality and per-step compute/memory. Our diffusion-over-superquadrics improves scalability by supporting broader capabilities (e.g., resolution-free point-cloud decoding, part-level editing, and constraint-based design) and achieving competitive surface-fidelity and distributional performance on standard benchmarks after point-cloud decoding, while enabling efficient generation within 0.6s per shape for most conditions.

Abstract:
Influential benchmarks incentivize competing model builders to strategically allocate posttraining resources towards improvements on the leaderboard, a phenomenon dubbed \emphbenchmaxxing or \emphtraining on the test task. In this work, we initiate a principled study of the incentive structure that benchmarks induce. We model benchmarking as a Stackelberg game between a benchmark designer who chooses an evaluation protocol and multiple model providers who compete simultaneously in a subgame given by the designer’s choice. Each competitor has a model of unknown latent quality and can inflate its observed score by allocating resources to benchmark-specific improvements. First, we prove that current benchmarks induce games for which no Nash equilibrium between model developers exists. This result suggests one explanation for why current practice leads to misaligned incentives, prompting model providers to strategize in opaque ways. However, we prove that under mild conditions, a recently proposed evaluation protocol, called tune-before-test, induces a benchmark with a unique Nash equilibrium that ranks models by latent quality. This positive result demonstrates that benchmarks need not set bad incentives, even if current evaluations do.

Abstract:
Diagnosing rare diseases remains a persistent challenge, often hindered bycognitive anchoring: once clinicians settle on a common diagnosis, they often discount alternative explanations, including rare conditions. To address this, we propose a humancentered counterfactual reasoning framework using a Denoising Masked AutoEncoder (DMAE) to simulatewhat-ifdiagnostic scenarios that disrupt clinicians’ initial assumptions. Our model jointly learns (1) the true distribution of diseases and symptoms, and (2) human diagnostic behavior, revealing critical gaps betweenmedically possibleandclinically considereddiagnoses. By strategically perturbing latent patient representations, it generatescontrastive counterfactualsthat highlight rare yet plausible diseases that cognitive bias often obscures. Unlike traditional decision-support tools, our systemproactivelysuggests rare diseases not because they are statistically probable, but because they arecognitively neglected. Across four public and three private rare-disease datasets, our approach outperforms standard machine learning classifiers in detecting rare conditions while maintaining strong performance on common diagnoses. Beyond boosting accuracy, the counterfactual evidence encourageshypothesis-driven reasoningand supports clinical learning.

Abstract:
In personalized federated learning (PFL), collaboration graphs specify model aggregation among clients. However, without constraints on the collaboration geometry, training can drift into two degenerate regimes: global consensus or spontaneous clustering. This paper provides a unified dynamical analysis: under the same budget of representative models, collaborative PFL is more expressive and achieves higherorder approximation accuracy than clustered PFL. An upper bound on disagreement further reveals two degeneration mechanisms—overly strong collaboration drives consensus (reducing to standard federated learning), while similarity-driven weight updates make the graph nearly reducible and induce self-clustering (collapsing to clustered PFL). Motivated by these findings, we propose pFedCCG. pFedCCG preserves the expressivity advantage via controlled collaboration geometry (CCG): it builds a static similarity-based collaboration template decoupled from training, optimizes a Markovian collaboration matrix with a prescribed stationary distribution via reversible parameterization and Euclidean projection, and schedules collaboration strength to avoid self-clustering. Experiments across diverse heterogeneity settings show consistent personalization gains and markedly reduced collapse and self-clustering. Code will be available at https://anonymous.4open.science/r/pFedCCG-CB88.

Abstract:
Compositional datavectors encoding relative proportions---arise across scientific domains, including ecology, geochemistry, and genomics. The features in these data often come with known hierarchical structure (e.g., taxonomies, phylogenies, ontologies), yet existing methods either ignore this structure, discard the intrinsic Aitchison geometry, require assumptions such as binary trees, or yield incomplete coordinate systems. We describePolyILR, a canonical orthonormal decomposition of the Aitchison tangent space aligned with any tree topology. Our construction defines a weighted local geometry at each internal node capturing full branching structure, then lifts these to a global orthonormal basis where every coordinate corresponds to a specific tree location. On microbiome and single-cell benchmarks, PolyILR yields stable, interpretable features and enables inference at multiscale tree resolution. We also establish a novel theoretical connection to softmax classifiers, suggesting possible applications to probabilistic modeling.

Abstract:
Diffusionbased language models (dLLMs) enable parallel token generation through iterative denoising, but existing decoding strategies collapse to single-token generation under low confidence, severely limiting throughput. Unlike autoregressive models where speculative decoding operates on token sequences in a fixed left-to-right order, dLLMs require speculating over \emphdenoising trajectories—sequences of multi-token updates with explicit positions and unmasking orders. We develop a trajectory-level speculative framework that constructs draft denoising trajectories via confidence-stratified tree exploration and verifies them through blockwise parallel evaluation with bidirectional attention masking. Our method further introduces inter-block speculation, exploiting diffusion models' bidirectional structure to perform cross-block lookahead. We formally characterize when this approach is exact and identify trajectory drift as the fundamental cost of increased parallelism. Building on Fast-dLLM's dual-cache infrastructure, our framework reduces denoising iterations by 30-40% and increases tokens-per-step from 2.6 to 4.3, achieving 7-14× speedup over vanilla dLLMs and 1.3× over Fast-dLLM with less than 1% accuracy change across reasoning and code benchmarks.

Abstract:
Static program analysis is a foundational technique in software engineering for reasoning about program behavior. Traditional static analysis algorithms model programs as logical systems with welldefined semantics, but rely on uniform, hard-coded heap abstractions. This limits their precision and flexibility, especially in dynamic languages like JavaScript, where heap structures are heterogeneous and difficult to analyze statically. In this work, we introduce ABSINT-AI, a language-model-guided static analysis framework that augments abstract interpretation with adaptive, per-object heap abstractions for Javascript. This enables the analysis to leverage high-level cues, such as naming conventions and access patterns, without requiring brittle, hand-engineered heuristics. Importantly, the LM agent operates within a bounded interface and never directly manipulates program state, preserving the soundness guarantees of abstract interpretation. To evaluate our approach, we focus on a soundness-critical task: determining whether object property accesses may result in undefined or null dereferences. This task directly models a common requirement in compiler optimizations, where proving that an access is safe enables the removal of dynamic checks or simplifies code motion. On this task, ABSINT-AI reduces false positives by up to 34% compared to traditional static analyses with fixed heap abstractions, while preserving formal guarantees. Our ablations show that the LM’s ability to interact agentically with the analysis environment is crucial, outperforming non-agentic LM predictions by 25%.

Abstract:
Can we determine someone’s geographic location solely from the sounds they hear? Are acoustic signals enough to localize within a country, state, or even city? In this work, we tackle the challenge of globalscale audio geolocation, with a particular focus on wildlife and natural sounds. We posit that bioacoustic signals contain informative geolocation cues because of well-defined geographic ranges of species. To test this hypothesis, we benchmark image geolocation and soundscape mapping methods, design oracles and species-centric baselines, and propose a hybrid approach that combines species range prediction with retrieval-based geolocation. We further ask whether geolocation improves with species-diverse recordings and spatiotemporal aggregation across neighboring samples. Finally, we extend our study to multimodal geolocation with case studies from movies that combine both audio and visual content. Our results highlight the potential of incorporating bioacoustic signals into geospatial tasks, motivating future work on species recognition and audio geolocation.

Abstract:
Over the past decade, Graph Neural Networks (GNNs) have become a standard tool for solving machine learning problems on graphs. While many aspects of GNNs have been studied in depth, including their efficiency and expressive power, the invertibility of these models has remained largely unexplored. Standard aggregation functions, such as the mean, max and sum operators, are not invertible, which limits their applicability in tasks requiring invertible transformations. In this work, we introduce an invertible GNN layer. By stacking multiple such layers, we construct fully invertible GNN models, which we refer to as InvGNNs. These models inherit the benefits of invertible neural networks, including low memory usage for deep architectures, exact likelihood computation, and generative modeling capabilities. We demonstrate that InvGNNs can match the expressive power of the 1dimensional Weisfeiler-Leman algorithm, showing that invertibility does not compromise model expressiveness. On standard graph classification benchmarks, our model performs comparably to other well-established GNNs, such as GIN. Beyond classification, we demonstrate the potential of invertible layers through density estimation tasks, including outlier detection and node feature generation. Our experiments confirm that InvGNNs effectively handle tasks that benefit from invertible layers.

Abstract:
Generative tabular augmentation is appealing in datascarce domains, yet the prevailing focus on distributional fidelity does not reliably translate into better downstream models. We formalize afidelity-utility gap: common generative objectives prioritize distributional plausibility, whereas augmentation succeeds only when injected samples reduce the current learner's held-out evaluation loss. This gap motivates learning not just how to generate, but what to generate and when to inject as training evolves. We propose TAP (Tabular Augmentation Policy), which couples diffusion inpainting with a lightweight, learner-conditioned policy to steer generation toward high-utility regions and controls safe injection via explicit gating and conservative windowed commitment. Under severe data scarcity, TAP consistently outperforms strong generative baselines on seven real-world datasets, improving classification accuracy by up to 15.6 percentage points and reducing regression RMSE by up to 32%.

Abstract:
Phenotypic screening experiments produce many microscope images of cells under diverse perturbations, with biologically significant responses often subtle or difficult to identify visually. A central challenge is to extract image representations that distinguish activity from controls and group phenotypically similar perturbations. In this work we propose new adaptations of contrastive loss functions that incorporate experimental metadata as learned class vectors, and a geometrically inspired variant, called SPC, where class vectors are confined to the unit sphere and updated only by attractive terms (allowing more overlap of phenotypically similar classes). The approach is tested on two popular benchmarking datasets, BBBC021 and RxRx3core; and we also evaluate performance on uncurated screens of HaCaT cells to gauge effectiveness in a realistic use-case scenario. We find we outperform prior methods across the three datasets and on a wide array of metrics measuring phenotype grouping, biological recall, drug-target interaction and mechanism-of-action inference. We also show we maintain this improved performance compared to models over 10x larger in parameter count, and that SPC can be used as an effective fine-tuning technique. The method is easy to implement and is well suited to settings with limited data or compute resources.

Abstract:
VisionLanguage Models (VLMs) suffer from high inference latency due to long visual sequences. To enable efficient, on-demand utilization of visual information, we argue that visual necessity should be assessed by its semantic impact on the output distribution, rather than inferred from intermediate interaction signals such as attention weights. We propose a training-free framework based on token embedding subspace decomposition, which we term a prediction-conditioned Semantic Lens. Specifically, at fixed decoding intervals, we perform QR decomposition on the Top-K candidate token embeddings to construct an orthogonal semantic basis. We then introduce Semantic IImpact–Driven Visual Scheduling (SIVS), which measures how visual inputs impact model predictions by projecting visual-induced hidden-state variations onto this semantic lens. SIVS provides a geometrically grounded, impact-driven criterion for dynamic visual KV scheduling. Empirical results demonstrate that SIVS achieves ~87% visual KV compression while maintaining over 99% of model performance.

Abstract:
Counterfactual explanations (CEs) are essential for actionable recourse, yet their reliability is often compromised in lowdensity regions, where classifiers exhibit high variance. Unlike existing methods that rely on expensive ensemble intersections to define stability, we propose DensityFlow, a generative framework that constructs robust CEs by adhering to the high-confidence data manifold. Specifically, we model the counterfactual generation as continuous-time dynamics parameterized by Neural ODE, guided by a differentiable density score to actively avoid uncertain, low-density areas. This density score is learned via Noise Contrastive Estimation, effectively leveraging a (K+1)-way discriminator to estimate density ratios. For black-box settings, we introduce a local proxy distillation mechanism that aligns a lightweight surrogate with the target model strictly within the trajectory of CE generation, enabling efficient gradient-based optimization with minimal queries. Experiments demonstrate that DensityFlow achieves superior validity under model multiplicity while significantly reducing query costs compared to ensemble-based baselines. Our implementation is available in the anonymous repository.

Abstract:
Planningdriven model-based (modelic) reinforcement learning has achieved impressive success in continuous control tasks but predominantly relies on zero-order optimizers like Model Predictive Path Integral (MPPI). While robust for global exploration, MPPI updates actions solely through sampling and neglects the smooth return gradients inherent in structured dynamics that guide fine-grained search. To complement MPPI’s robustness with gradient-guided precision, we first propose Langevin Rollout Optimization (LaRO), which leverages return gradients to refine actions via Langevin dynamics, achieving reliable local convergence without sacrificing multimodal exploration. This is supported by a score-augmented world model that jointly learns dynamics and a score function within a unified latent space, facilitating efficient and accurate gradient estimation for real-time planning. Second, we combine MPPI and LaRO through a simple yet effective choice mechanism, termed Maximum Look-Ahead Planning (MLAP). Finally, we instantiate MLAP within the latest BOOM algorithm, replacing its MPPI-only planner and yielding BOOM-L. Empirical results on the DeepMind Control Suite and Humanoid Bench demonstrate that BOOM-L consistently outperforms strong baselines in both sample efficiency and final performance.

Abstract:
Recent genomic foundation models largely adopt large language model architectures that treat DNA as a onedimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present \textscOpticalDNA, a vision-based framework that reframes genomic modeling as OCR-style document understanding. \textscOpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision--language model with a \emphvisual DNA encoder and a \emphdocument decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, \textscOpticalDNA defines prompt-conditioned objectives over core genomic primitives—reading, region grounding, subsequence retrieval, and masked span completion—thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, \textscOpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly 20× fewer effective tokens, and surpasses models with up to 985× more activated parameters while tuning only 256k \emphtrainable parameters.

Abstract:
Diverse Visionlanguage-action (VLA) models have been proposed and demonstrated remarkable capabilities in robotic manipulation. However, how to effectively ensemble VLAs to further enhance performance remains largely unexplored, as conventional ensemble techniques designed for discriminative tasks cannot be directly applied to generative action policies with high-dimensional, multimodal distributions. To address this challenge, we propose EnsembleVLA, an energy-based framework that enables principled ensemble of diverse VLA models. We establish a unified theoretical framework showing that both diffusion-based and flow-based VLA models can be formulated as energy-based models, where additive energy combination naturally induces policy composition at the distribution level. This theoretical foundation enables multiple pre-trained policies to be seamlessly aggregated into a stronger ensemble policy. Building upon this compositional framework, EnsembleVLA further incorporates learnable composition weights for dynamic policy balancing, coupled with a confidence-aware gating mechanism that adaptively modulates bounded residual corrections, collectively ensuring stable and robust task execution. Extensive experiments demonstrate that EnsembleVLA achieves competitive performance across various tasks in both simulated and real-world environments.

Abstract:
Communication is a key component in multiagent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state information. To address this, we propose LLM-driven Multi-Agent Communication (LMAC), which leverages an LLM's reasoning capability to design a communication protocol that enables all agents to reconstruct the underlying state as accurately and uniformly as possible. LMAC iteratively refines the protocol using an explicit state-awareness criterion, improving state recovery while narrowing differences in agents' knowledge. Experiments on diverse MARL benchmarks show that LMAC improves state reconstruction across agents and yields substantial performance gains over prior communication baselines.

Abstract:
As large language models (LLMs) improve in mathematical reasoning and formal understanding, a promising approach for automated theorem proving (ATP) is to enable LLMs construct proof sketches, which plan a highlevel proof strategy and decompose complex theorems into independently provable subgoals. However, most existing proof sketches are immutable. As a result, any revision typically requires rebuilding the entire sketch, which discards already proved subgoals and bring additional cost. In this paper, we address this limitation by introducing EditableSketch, an editable proof-sketch structure that supports in-place edits for error correction and further subgoal decomposition while preserving previously proved subgoals. Building on EditableSketch, we introduce SketchRefine, a proof-generation framework for ATP by iteratively refining proof sketches through localized, incremental edits. Experiments show that our method not only reduces the cost of the proof process, but also achieves superior performance. For example, our method realizes 76.0% pass rate on FormalMath-Lite (+14.1% vs. DeepSeek-Prover-V2-671B). Meanwhile, compared with Hilbert, our method significantly reduces token overhead while achieving comparable performance.

Abstract:
We introduce the Perception Rubric Benchmark (PRB), a rubricbased evaluation framework for Multimodal Large Language Models (MLLMs) that addresses the growing gap between benchmark scores and human-perceived quality. While standard perception metrics approach saturation, they produce compressed rankings that obscure meaningful performance differences, largely due to their linear and lenient reward designs. PRB reframes evaluation from holistic scoring to rubric-based verification. It is built through a scalable hybrid automation pipeline over a stratified collection of complex, multi-domain visual inputs. Using pair-wise contrastive generation, PRB distills over 15,000 diagnostic rubric criteria that function as explicit unit tests for perception and are evaluated via a ternary protocol distinguishing benign approximations from perceptually critical errors. Experiments show that PRB decouples compressed leaderboard rankings, reveals perceptual blind spots in top-performing models, and aligns more closely with human preference than conventional metrics. Beyond evaluation, generated rubrics can be reused as inference-time verifiers, yielding consistent gains on multiple perception benchmarks. PRB provides a principled foundation for reliable and discriminative multimodal evaluation.

Abstract:
The rapid rise of vlogs as a personalized storytelling medium has created a demand for automated systems to evaluate and refine vlog editing plans. However, vlog assessment is highly subjective and remains challenging due to a lack of standardized criteria, dataset and benchmark, and effective reward models. To address these challenges, we define a comprehensive vlog evaluation framework guided by professional vlog creators and product managers, establishing a taxonomy of six key dimensions,i.e.,Creativity,Consistency,Concept Design,Cinematography,Narration, andPacing. Subsequently, we curate a largescale dataset of 100k vlog edits and a dedicated benchmark,VRMBench, to evaluate the vlog rewarding capabilities of Multimodal Large Language Models (MLLMs). Finally, we presentVlogReward, a robust vlog reward model that can provide both fine-grained multi-dimensional scores and actionable feedback for iterative refinement. Technically, we enhance the Group Relative Policy Optimization (GRPO) framework by introducing an adjustable inter-group comparison reward, which mitigates the "direction blindness" issue of standard GRPO and enables the model to better distinguish varied-quality edits. VlogReward achieves state-of-the-art results that significantly outperform existing MLLMs, including GPT-5 and Gemini-3-Pro. We hope that our study can help vlog creators and foster automated vlog evaluation and refinement systems.

Abstract:
As the reasoning capabilities of Large Language Models (LLMs) expand, evaluating true inductive generalization on entirely unseen data becomes increasingly challenging. To this end, we introduce a modular incontext learning evaluation framework, that is scalable and extendable across its separate modules. This is based upon the notion of synthetic scenarios with controllable complexity across three independent axes: \ 1) the logic of the underlying data distribution (UDD) 2) their projection into diverse representations, and 3) the interaction dynamic determining how the model accesses and explores the data. For these scenarios, the model is tasked to perform in-context scientific discovery and produce an interpretable theory in natural language that explains the observations. In a separate conversation, the model is then tasked to convert this generated theory into executable code, which can be programmatically compared against the underlying data distribution. Using this modular framework we produce an initial suite of 600 diverse scenarios that we use to evaluate and analyze various state-of-the-art LLMs. Although these experiments show that Gemini 3.0 Pro achieves the best overall score, each model performs the best at different tasks. For example: GPT 5.2 is the clear winner on pure symbolic data, Claude Opus 4.5 is the best at working with files, Gemini is the strongest model for the non-dynamic scenarios, and Grok 4.1 is the strongest model when UDD complexity scales. Furthermore, all models struggle with active exploration and are seemingly incapable of identifying informative data points, resulting in less efficient exploration than a random baseline. This highlights the room for improvement state-of-the-art LLMs have, even without further scaling of the complexity of the benchmark.

Abstract:
Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable longhorizon decision making remains a fundamental challenge. A key difficulty lies in the sparsity and delay of supervision: agents often receive feedback only at episode termination, leading to severe credit assignment issues. In this work, we propose a self-evolving framework for LLM agents that unifies automatic process-reward labeling and in-distribution policy learning within a principled offline reinforcement learning paradigm. Our method learns an in-distribution critic from a hybrid offline dataset that combines expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function is then used to derive step-wise process rewards through advantage estimation, enabling dense and reliable supervision without environment backtracking or human annotation. Leveraging these signals, we perform behavior-proximal policy optimization that evolves the agent while remaining strictly within the data support, allowing iterative self-improvement without exacerbating distribution shift. We evaluate our method on long-horizon interactive benchmarks, including AlfWorld, WebShop, and ScienceWorld, where it consistently outperforms strong baselines in sample efficiency, robustness, and overall task performance. Our results demonstrate that stable self-evolution of LLM agents is achievable by grounding process-level supervision and policy improvement within a shared in-distribution learning loop.

Abstract:
The rapid expansion of remote sensing technology has generated massive amounts of unlabeled multiview data distributed across different institutions. Analyzing this data presents significant challenges, as centralized processing incurs prohibitive communication costs and raises data privacy concerns. To address these issues, this paper proposes a novel deep federated multi-view clustering (MVC) framework tailored for remote sensing data. Unlike existing methods that transmit sensitive data features, our approach shares only privatized prototypes masked with adaptive noise, ensuring both communication efficiency and privacy preservation. First, we employ superpixel segmentation to reduce the spatial dimensionality of remote sensing data, lowering computational burdens. Furthermore, to resolve the inconsistency of cluster assignments across different clients, we design a co-occurrence structural alignment module that synchronizes local models. Finally, we incorporate a wasserstein prototype contrastive learning mechanism, which models clusters as distributions rather than points, to enhance global consistency and robustness against data heterogeneity. Extensive experiments on four public datasets demonstrate that our framework achieves superior clustering performance and efficiency compared to state-of-the-art methods.

Abstract:
We investigate language generation in the limit (Kleinberg & Mullainathan, 2024; Li et al., 2025) in variants where the generator receives some feedback based on its “actions.” We study two such variants. In the first, which is inspired by Littlestone’s model of online learning, the generator observes whether it made a mistake at each iteration. In the second, introduced by Charikar & Pabbaraju (2025a), the generator can query whether a string belongs to the target language. Our main result is a characterization of collections that are generable with mistake feedback. Using similar techniques, we also characterize when generation is possible in the query model with setbased generators; set-based generators have been studied in several works (Charikar & Pabbaraju, 2025a; Kalavasis et al., 2025; Kleinberg & Wei, 2025a; Li et al., 2025). Beyond the characterizations themselves, we derive several implications. First, our results imply new closure properties for generation with mistake and query feedback. Second, our results show that, under feedback, generation is robust to noise: it remains possible with arbitrary contamination in the adversary’s examples and with finite contamination in the feedback. Third, our techniques also yield new sufficient and necessary conditions for generation without feedback among other implications.

Abstract:
Offline contextual pricing often relies on logged observational data, but faces challenges from distributional shifts between training and deployment environments. Distributionally robust optimization (DRO) provides a principled approach to offpolicy evaluation and learning (OPE/L). However, existing methods are mostly limited to discrete actions. Recent work has explored DRO for continuous treatments using inverse propensity weighting (IPW), while such IPW-based estimators can be sensitive to the convergence rate of propensity score estimates, particularly when estimated nonparametrically, which may lead to larger estimation errors and regret. In this work, we develop a doubly robust (DR) framework for distributionally robust OPE/L in continuous pricing settings. For evaluation, we propose a localized DR estimator that addresses the computational challenges of worst-case expectations by fitting only a small number of regressions, comparable to standard non-robust DR, while achieving semiparametric efficiency under mild product rate conditions. For learning, we leverage the inherent smoothness of demand noise to handle pricing-specific discontinuities in revenue outcomes (e.g., threshold-based purchase decisions), establishing a finite-sample regret bound of \tilde\mathcalO_p(T^-s/(2s+1)) for smoothness orders s=1,2. This bound improves upon existing regret rates in existing DRO-based off-policy learning (OPL) for continuous treatments. Extensive experiments under various levels of distribution shift validate our proposed framework.

Abstract:
MARS has recently emerged as a stateof-the-art optimizer, consistently outperforming AdamW in large language model (LLM) training. It modifies the momentum-based variance reduction (MVR) update by introducing a multiplicative coefficient \gamma, which scales the momentum correction term. However, the existing theory of Yuan et al. (2025) does not explain why this modification improves the convergence of MARS over MVR. In this paper, we provide a rigorous theoretical explanation for the superiority of MARS over MVR. We introduce the novel similarity condition, called \gamma-similarity, which generalizes standard similarity and smoothness assumptions for analyzing stochastic algorithms. Under this condition, we derive gradient complexity guarantees for MARS, which explicitly depend on \gamma and a \gamma-similarity constant \delta_\gamma. We prove that by appropriately tuning \gamma \in [0,1], MARS achieves strictly lower complexity than MVR. Finally, experiments on GPT pretraining corroborate our theoretical findings, demonstrating that MARS with an optimal choice of \gamma improves token efficiency over MVR, and yields substantial gains compared to AdamW.

Abstract:
Offline goalconditioned RL (OGCRL) learns to reach arbitrary goals from offline dataset, but long-horizon performance hinges on crossing a handful of hard-to-cross bottlenecks. These bottlenecks not only dictate the feasible paths toward the goal but also act as critical keypoints, marking the transitions between adjacent regions and providing the agent with essential directional guidance. Prior hierarchical methods pick subgoals by time or short-horizon value heuristics, which do not localize the bottleneck, as a result, the agent losing the clear guidance that bottlenecks could provide about where to pass next. We instead model long-horizon planning as “cross the next bottleneck”: we apply Laplacian spectral clustering to offline dataset to expose bottlenecks and then identify trajectories from the offline dataset that cross these boundaries, and the intersects are defined as keypoints (KPs). Then the most representative KPs are automatically selected and a directed KP reachability graph \mathcal G_\mathrmKP is constructed based on the selected KPs. We then restrict high-level choices to these bottleneck states and use a pluggable low-level controller to execute the short transitions between them. We provide theory showing that under a standard metastable decomposition of the state space, routing through bottlenecks yields an (approximately) optimal one-step subgoal in terms of hitting-time, and that Laplacian spectra recover bottlenecks with high overlap. Thus, Laplacian spectral clustering can discover approximately optimal subgoals. Empirically, the same pattern holds: across D4RL and OGBench, our method achieves state-of-the-art results on a broad set of navigation and manipulation tasks and across diverse dataset regimes, for example, 96.5% on AntMaze and 84.5% on Franka-Kitchen.

Abstract:
This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight, suffering from performance degradation or prohibitive training costs with scaled codebook. To address this, we propose maskedBitAutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts discrete tokens through progressively generating their constituent bits. BAR achieves a new stateof-the-art gFID of0.99on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms, while significantly reducing sampling costs and converging faster than prior continuous approaches.

Abstract:
Multimodal image fusion integrates complementary information from different modalities. However, large crossmodal discrepancies and local conflicts often introduce uncertainty into fusion decisions. This uncertainty can bias modality allocation in inconsistent regions, leading to information loss or the propagation of artifacts. Therefore, we address this problem by formally casting image fusion as an integrated probabilistic decision system that couples prior decision-making with posterior risk minimization. Based on this view, we propose a dIfference-aware Decision-lEArning muLtimodal image fusion paradigm (IDEAL). It treats cross-modal differences as decision triggers and learns contribution policies conditioned on local conditions. Specifically, we use a difference-attention module to generate multi-scale difference maps as spatial decision conditions. We also obtain spectral conditions by projecting features into the frequency domain, where power-spectrum energy, complementary spectra, and spectral-entropy reliability characterize modality discrepancy and reliability. We then employ a symmetric Beta prior to map these decision conditions to gating weights, yielding explicit and interpretable modality contribution policies. To improve robustness, we introduce an uncertainty modulation mechanism that reverts the policy to conservative mixing when conditions is insufficient. Extensive experiments demonstrate stable and competitive performance.

Abstract:
Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNNbased architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. Based on this perspective, we propose TAC, a Transformer-based audio tokenizer that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction of general audio. We show that a simple, fully end-to-end learned tokenizer built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, the proposed tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging TAC’s discrete tokens, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, TAC enables competitive ASR performance without auxiliary encoders. Our findings position TAC as a unified, scalable interface for the next generation of native audio foundation models.

Abstract:
Multitask reinforcement learning trains generalist policies that can execute multiple tasks. While recent years have seen significant progress, existing approaches rarely provide formal performance guarantees, which are indispensable when deploying policies in safety-critical settings. We present an approach for computing high-confidence guarantees on the performance of a multi-task policy on tasks not seen during training. Concretely, we introduce a new generalisation bound that composes (i) per-task lower confidence bounds from finitely many rollouts with (ii) task-level generalisation from finitely many sampled tasks, yielding a high-confidence guarantee for new tasks drawn from the same arbitrary and unknown distribution. Across state-of-the-art multi-task RL methods, we show that the guarantees are theoretically sound and informative at realistic sample sizes.

Abstract:
Transformerbased Spiking Neural Networks (SNNs) combine Transformer performance with SNN energy efficiency through an event-driven self-attention mechanism. However, Spiking Transformers still lag behind their Artificial Neural Network (ANN) counterparts. Most existing studies address this issue through new architectural designs, yet none has considered optimization algorithms specific to Spiking Transformers. Here, we first analyze the gradient characteristics of Spiking Transformers and identify excessive noise from surrogate gradient learning as a major challenge to stable training. We then provide a quantitative definition of noise in the gradient update direction and propose an adaptive gradient descent method for spiking transforms, named AdaS. Since moderate update direction noise can enhance generalization, whereas excessive noise degrades training, AdaS adaptively adjusts the update direction noise to an optimal level, thereby improving the performance of Spiking Transformers. We conduct extensive experiments on various Spiking Transformer architectures and datasets from both computer vision and natural language processing. The results demonstrate that the proposed AdaS consistently enhances performance across different Spiking Transformers, validating its effectiveness and generalizability. This work presents the first systematic investigation of optimization algorithms specifically tailored for SNNs, offering a practical tool to narrow the accuracy gap with ANNs while preserving the energy advantages of spike-based computation.

Abstract:
Existing parameter isolationbased methods in continual learning employ diverse designs to learn more tasks within a limited model capacity. However, most of their designs inevitably incur substantial computational overhead if their model capacity is enlarged to accommodate further tasks as the task stream continually grows, resulting in a significant efficiency bottleneck. In this paper, we propose a novel GNN framework with a biological neuron-inspired architecture, termed the capacity-agnostic GNN (CAGNN), to simultaneously overcome catastrophic forgetting and boost efficiency under capacity expansion. Unlike other methods that employ full network propagation, CAGNN leverages graph contextual information to support the construction of task-specific subnetworks and decouples subnetworks during both training and inference, while enabling effective knowledge transfer between tasks. Intensive experiments demonstrate CAGNN's superiority to the state of the art, in terms of effectiveness as well as computational efficiency.

Abstract:
Developing reliable machine translation (MT) systems hinges on our ability to distinguish superior translations from inferior ones. However, existing evaluation paradigms, whether limited to coarse overall rankings or misaligned with human preferences, fail to deliver interpretable, fine‑grained feedback in reference‑free settings. We present a FineGrained Ranking Evaluation method (FiRE) that leverages off‑the‑shelf large language models to perform criterion‑driven pairwise comparison across three complementary dimensions: faithfulness, fluency, and consistency of style, instead of producing a single holistic judgment. To enable rigorous meta‑evaluation of evaluation paradigms in the absence of any suitable testbed, we construct the first human‑annotated, reference‑free benchmark for fine-grained ranking evaluation, achieving substantial inter‑annotator agreement. Through meta‑evaluation on this benchmark and existing MQM datasets, FiRE demonstrably outperforms regression‑based and error‑analysis metrics in aligning with human comparative judgments, while providing more informative insights into translation quality. Finally, our examination of LLM evaluator biases (position and self-enhancement) and their handling of tied cases offers guidance for more nuanced MT evaluation.

Abstract:
Existing deep spatial–spectral fusion (SSF) methods typically learn the fusion mapping in the coordinate domain using convolutions and attentions, making it hard to scale across varying spatial resolutions and offering limited control over the frequency content of the reconstructions, which may further lead to severe spectral distortion. In this work, we propose Latent Spectral Operators (LSO), a SSF framework that learns fusion mappings between spectral functions through a structured operator parameterization. Specifically, LSO first applies a crossattention projection, where learned latent tokens serve as spectral prompts, to compress high-dimensional observations into a compact latent representation, and then adopts a hierarchical, patch-based architecture to integrate rich multi-scale cues. Furthermore, to parameterize the latent fusion operator in a controllable manner, a Trigonometric Basis Solver is elaborated, which represents the mapping using a trigonometric basis expansion. This formulation naturally supports multi-frequency modeling, with a capacity–stability trade-off governed by the number of basis functions. Extensive experiments on the CAVE and Harvard benchmarks demonstrate that LSO achieves consistent state-of-the-art performance and exhibits strong transferability across different spatial scales. Codes are attached.

Abstract:
Stochastic FrankWolfe is a classical optimization method for solving constrained optimization problems. On the other hand, recent optimizers such as Lion and Muon have gained quite significant popularity in deep learning. In this work, building on recent initiatives, we provide a unifying perspective by interpreting these seemingly disparate methods through the lens of Stochastic Frank-Wolfe. Specifically, we show that Lion and Muon with weight decay can be viewed as special instances of a Stochastic Frank-Wolfe, and we establish their convergence guarantees in terms of the Frank-Wolfe gap, a standard stationarity measure in non-convex optimization for Frank-Wolfe methods. We further find that convergence to this gap implies convergence to a KKT point of the original problem under a norm constraint for Lion and Muon. Moreover, motivated by recent empirical findings that stochastic gradients in modern machine learning tasks often exhibit heavy-tailed distributions, we extend Stochastic Frank-Wolfe to settings with heavy-tailed noise by developing two robust variants with strong theoretical guarantees that hold for general compact convex sets without the need for a large batch size, filling the gap in the literature on Stochastic Frank-Wolfe for non-convex optimization. Our contributions in the later part of this work, in turn, yield new variants of Lion and Muon, that better accommodate heavy-tailed gradient noise, thereby enhancing their practical scope.

Abstract:
Federated learning enables ondevice training without centralizing data, yet existing systems still struggle to provide explanations that are both locally faithful and globally consistent under strict privacy and bandwidth constraints. Prior approaches either keep explanations siloed across clients, transmit heavy or sensitive artifacts, or replace expressive task models with interpretable surrogates that sacrifice accuracy. We propose xFedAlign, a model-agnostic framework that decouples task optimization in parameter space from explanation coordination in a compact group space. Each client distills a lightweight surrogate to produce private, per-class top-k attribution artifacts, which are robustly aggregated by the server into a Global Explanation Prior that softly aligns client explanations without constraining task learning. Across image, text, and tabular benchmarks with IID and non-IID partitions, xFedAlign matches FedAvg accuracy while consistently reducing explanation drift and improving deletion and insertion AUC relative to Local-XAI, FedAttr-Agg, and Fed-XAI, with only a few kilobytes of additional communication per round. Privacy and robustness evaluations further demonstrate reduced membership inference advantage and increased resistance to attribution poisoning, enabling consistent and trustworthy explanations in federated learning.

Abstract:
Datadriven dynamical system models often fail to recover the long-term structure of the underlying system, as their behavior is weakly constrained off the data manifold. Conjugacy-based approaches address this limitation by learning a diffeomorphism that pushes forward a source vector field to match observed dynamics, inheriting qualitative topology from the source. However, such methods typically presuppose that the chosen source system is topologically compatible with the target data. When this assumption is violated, the conjugacy problem becomes ill-posed, and arbitrary corrections can be traded off against diffeomorphic variation, leading to non-identifiability. We propose a framework that relaxes this assumed prior by jointly learning the diffeomorphic conjugacy together with controlled adjustments to the source dynamics via low-dimensional context modulation. Inspired by versal unfolding theory, we enforce the modulation space to be orthogonal to the worst-case orbit-tangent directions, obtained by adversarially searching over a class of parameterized diffeomorphisms. This promotes an identifiable decomposition of dynamical variation into diffeomorphic and intrinsic, topology-changing components, enabling interpretable corrections that recover the canonical structure such as normal forms and symmetries.

Abstract:
Data valuation quantifies the intrinsic quality of individual samples to enable principled data curation, quality control, and robust learning. For time series in critical domains such as healthcare, finance, and industrial monitoring, effective valuation methods are essential yet fundamentally lacking. Existing approaches are either modeldependent, limiting their generalizability, or designed for i.i.d. data and thus fail to capture temporal dependencies, multi-scale patterns, and non-stationary dynamics inherent to sequential data. We introduce TimeLAVA, a learning-agnostic framework that values temporal segments by their marginal contribution to minimizing distributional discrepancy between evaluated and reference data. At its core is a novel Selective Wavelet-based Wasserstein (\mathcalW_\textSW) discrepancy combining multi-scale wavelet transforms for temporal localization with unbalanced optimal transport for robustness to distributional shifts. Segment values are efficiently computed via sensitivity analysis without requiring model training and aggregated into point-wise scores. We provide theoretical guarantees linking valuation to model-agnostic generalization and prove bounded sensitivity to outlier contamination. Extensive experiments across anomaly detection, data pruning, and label noise detection demonstrate that TimeLAVA produces significantly more informative value scores than existing methods on diverse real-world datasets.

Abstract:
Video motion transfer aims to synthesize novel content videos that strictly follow the motion trajectories of a reference video. However, existing methods typically operate in Euclidean space, treating motion as unconstrained pixel displacements or linear phase shifts. This simplification frequently causes severe shearing artifacts and perspective collapse under complex camera and object motions. In this work, we present LieWarper, a geometryaware motion transfer framework that reconceptualizes motion as coordinate evolution on a manifold rather than mere pixel displacement. Specifically, we derive an analytic solver on the \textSim(2) manifold to extract global evolution parameters from noisy optical flow. We then introduce a flow-guided phase modulation mechanism, enabling non-rigid dynamics to undergo coordinate transformation along the evolution path. This approach achieves accurate trajectory transfer while maintaining global geometric integrity. Extensive experiments show that LieWarper significantly outperforms state-of-the-art training-free baselines in both motion fidelity and geometric stability, while maintaining high generation quality.

Abstract:
Zeroshot classifier expansion aims to recognize unseen classes by learning a shared mechanism to map semantics of all classes to classifier weights without access to images. However, existing methods rely on a shared mapping, which is difficult to classify in scenarios containing a mixture of distinct and similar classes, especially with the continuous expansion of classes. Since this mapping prioritizes general attributes for distinct classes while neglecting subtle attributes for similar ones, this granularity mismatch, compounded by sensitivity to noise, induces optimization interference where gradients from distinct classes dominate the learning process. To overcome this limitation, a granularity-aware adaptive framework with interventions is introduced to balance them. Specifically, this method first generates multi-source semantics by intervening on non-causal noise, then discovers latent class structure to separate distinct classes, and finally refine similar classes to synthesize weights with invariance to non-causal noise. The effectiveness is demonstrated through theoretical and empirical analysis in multiple aspects.

Abstract:
Automatic Heuristic Design (AHD) has emerged as a transformative approach for solving combinatorial optimization problems. While recent Large Language Model (LLM)based methods have shown promise, they predominantly rely on fixed evolutionary operators and struggle to effectively accumulate and reuse historical search experience. This paper proposes RefineEvo, a novel evolutionary framework that transforms AHD from a static trial-and-error process into a planning-guided, experience-driven system. RefineEvo introduces a Planner to dynamically schedule evolutionary operators and trigger refinement based on the current search state, and a Reflector to distill valuable lessons into a Bidirectional Experience Pool containing both positive insights and negative pitfalls. This synergistic framework enables the system to adapt its search tools to the evolving complexity of the problem and leverage trajectory-aware, situation-conditioned insights to guide generation. Experiments on several classic combinatorial optimization benchmarks demonstrate that RefineEvo consistently outperforms strong baselines. In particular, RefineEvo delivers superior solution quality while improving token efficiency, enabling more efficient and autonomous heuristic design. Our code is available at https://anonymous.4open.science/r/RefineEvo-FDC4.

Abstract:
Stateof-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures or necessitate compressing geometry into latent spaces to leverage pre-trained latent diffusion models. In this work, we demonstrate that such architectural overhead is unnecessary. We introduce a minimalist pixel-space Diffusion Transformer built on a plain ViT, which operates directly on raw point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion-based approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. We show that this streamlined approach yields results superior to complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, our model produces sharper geometric structures and achieves significantly better results on highly ambiguous regions, such as transparent objects.

Abstract:
Reinforcement Learning from Human Feedback (RLHF) for Large Language Models increasingly relies on criticfree methods as a practical alternative to actor–critic training. Despite their simplicity, existing critic-free approaches propagate a trajectory-level learning signal uniformly across all tokens appearing in the trajectory. This requires full-trajectory update for every rollouts, leading to substantial optimization cost for long reasoning traces although the feedback signal is effectively determined early in the trajectory. We propose Prefix-Sampling Proximal Policy Optimization (PS-PPO), a compute-efficient critic-free method for RLHF that exploits this temporal redundancy. PS-PPO introduces a prompt-conditioned cutoff distribution and samples a cutoff timestep for each trajectory. Policy gradient updates are then applied only up to the sampled cutoff timestep, while a correction mechanism ensures that the resulting truncated gradient estimator remains unbiased with respect to the full-trajectory objective. This procedure bypasses later tokens whose contribution to the feedback signal is negligible, without distorting the underlying learning signal. Experiments on mathematical reasoning and RLHF benchmarks show that PS-PPO achieves large reductions in training compute and peak GPU memory, while maintaining accuracy comparable to strong critic-free baselines.

Abstract:
This paper investigates the entropy dynamics of Chainof-Thought (CoT) and uncovers a consistent two-phase structure: anUncertainty Regionof exploration transitioning sharply to aConfidence Regionof convergence. We demonstrate that the Confidence Region possesses two critical properties: 1)High Reliability—answers in confidence region become highly accurate and stable, and 2)High Redundancy—models generate unnecessary tokens long after reaching the correct answer. These properties unlock more efficient and reliable inference strategies: 1)Early Exitleverages reliability and redundancy to terminate computation safely when returns diminish, and 2)Test-Time Scalinguses the Confidence Region signal to prioritize converged trajectories. To operationalize these insights, we formulate Confidence Region detection as a sequential change-point detection problem, being the first to apply classical change-point methods to monitor CoT reasoning. Using the Cumulative Sum (CUSUM) algorithm, a statistically optimal change-point detector, we develop a training-free framework for real-time inference control. Experiments show our approach establishes a superior Pareto-frontier for early exit. CUSUM achieves 63.06% accuracy with 11.1% token reduction, outperforming DEER and Dynasor by 3.28% and 4.36% in accuracy respectively. For test-time scaling, CUSUM-weighted voting consistently outperforms self-consistency.

Abstract:
Federated Learning (FL) enables collaborative training across distributed clients while preserving data privacy. However, finetuning large-scale pre-trained models in FL is hindered by resource constraints and communication costs. Although introducing parameter-efficient fine-tuning strategies such as Low-Rank Adaptation (LoRA) effectively reduces trainable parameters, this low-rank constraint exacerbates noise sensitivity, leading to overfitting and aggregation bias. Existing robust federated fine-tuning methods rely on additional proxy data and treat low-rank adapters as generic weight vectors. In this paper, we investigate the structural properties of LoRA and reveal a robustness asymmetry. The down-projection matrix A extracts stable general features, whereas the up-projection matrix B is highly susceptible to fitting noise patterns. Based on this finding, we propose Federated Decoupled LoRA (FDLoRA), which employs a dual-branch mechanism to decouple robust feature learning from noise modeling and mitigates noise interference through noisy branch negative learning. During federated aggregation, we establish global consensus through aggregating B while preserving local feature alignment in A. Extensive experiments demonstrate that FDLoRA outperforms existing state-of-the-art methods across various noisy federated scenarios. Our code and models will be released.

Abstract:
Diffusion Transformers (DiTs) have achieved remarkable success in image generation, yet their deployment is hindered by high computational costs. We identify two sources of redundancy. First, temporal redundancy: ClassifierFree Guidance (CFG) applies costly dual forward passes at every timestep, yet guidance matters only at specific steps, and variable scales at critical steps can compensate for skipping others. Second, spatial redundancy: under variable guidance, different transformer blocks exhibit heterogeneous sensitivity, yet uniform calibration across all blocks wastes computation while failing to address their varying requirements. We present RSTR, the first framework to jointly reduce spatiotemporal redundancy in diffusion transformers. Stage-1 addresses temporal redundancy through evolutionary search, discovering sparse guidance schedules with variable scales. Stage-2 addresses spatial redundancy through adaptive rank allocation, assigning calibration capacities to transformer regions based on their sensitivity. Experiments on DiT-XL/2, PixArt-\alpha, FLUX, and state-of-the-art Qwen-Image demonstrate 50%-70% compute savings while maintaining or improving quality. On DiT-XL/2, REST achieves 57% savings with 15% FID improvement; on Qwen-Image, 3.43× speedup with preserved quality.

Abstract:
Reinforcement learning (RL) for large language models (LLMs) relies on imperfect reward supervision, necessitating constraints on policy updates to prevent overfitting. Nevertheless, the widely adopted KL constraint overpenalizes actions with low reference probabilities and lacks the sparsity to discard marginal policy shifts. In contrast, the L1-norm offers a distinct mechanism that is more tolerant of low-probability actions yet strictly suppresses minor probability perturbations. Motivated by this, we propose Magnitude-Regularized Policy Optimization (MRPO), which enforces an L1-norm constraint on policy updates. We demonstrate that MRPO permits substantial probability boosts for low-probability actions and induces sparse updates, ensuring invariance to noise that preserves the top-ranking order. Furthermore, MRPO guarantees convergence in general RL settings and achieves a tighter approach to optimality than KL-based methods in single-step scenarios. Empirically, MRPO delivers exceptional results across diverse scenarios, notably doubling the performance gains of GRPO in preference alignment, outperforming DAPO in mathematical reasoning, and surpassing DPO in offline settings using only binary rewards.

Abstract:
Conventional reinforcement learning strategies for visual generation typically employ samplewise reward functions, yet this practice frequently results in reward hacking that degrades image diversity and introduces visual anomalies. To address these limitations, we present a novel framework that finetunes generative models using distribution-wise rewards, ensuring better alignment with real-world data distributions. Unlike rewards that evaluate samples individually, distribution-wise reward accounts for the data distribution of the samples, mitigating the mode collapse problem that occurs when all samples optimize towards the same direction independently. To overcome the prohibitive computational cost of estimating these rewards, we introduce a subset-replace strategy that efficiently provides reward signals by updating only a small subset of a generated reference set. Additionally, we apply RL to optimize post-hoc model merging coefficients, potentially mitigating the train-inference inconsistency caused by introducing stochastic differential equation (SDE) in regular RL practices. Extensive experiments show our approach significantly improves FID-50K across various base models, from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2. Qualitative evaluation also confirms that our method enhances perceptual quality while preserving sample diversity.

Abstract:
VisionLanguage-Action models face significant challenges in real-world deployment due to the entanglement of high-level reasoning with low-level control, and the instability of policy optimization. In this paper, we introduce SyVLA, a robust VLA model trained with diversified experiences. We propose an Intention Decoupling algorithm to isolate control-relevant features from reasoning contexts and a similar-sample guided RL pipeline to stabilize policy updates and mitigate distribution shift. Extensive experiments on real-world robotic tasks and multi-modal benchmarks demonstrate that SyVLA achieves superior task success rates and stronger out-of-distribution generalization compared to existing methods, while effectively preserving core vision-language capabilities.

Abstract:
LowRank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning technique, and previous works have studied the update dynamics of LoRA, showing that updating via the low-rank matrix \mathbfA can be viewed as a process within the compressed subspace defined by \mathbfA^\top \mathbfA of the gradient \nabla f\left(\mathbfW \right). However, few works analyze how the properties of the low-rank matrices affect the performance of LoRA, since existing methods heuristically initialize the low-rank matrices as Gaussian matrices. In this paper, we provide a theoretical understanding of the update dynamics of LoRA. We reveal that the update dynamics can be viewed as a process within the subspace projected by \mathbfA^\top (\mathbfA \mathbfA^\top)^\dagger \mathbfA, and prove that when the gradient \nabla f\left(\mathbfW \right) is unavailable, if \mathbfA is an Equiangular Tight Frame (ETF), \mathbfA^\top \mathbfA and \mathbfA^\top (\mathbfA \mathbfA^\top)^\dagger \mathbfA can preserve the maximum information from the gradient \nabla f\left(\mathbfW \right). Thus, initializing \mathbfA as an ETF is the optimal solution for low-rank adaptation when the gradient \nabla f\left(\mathbfW \right) is unavailable. Furthermore, we establish the convergence of Low-Rank Adaptation with a rate of \mathcalO\left(\frac1T\right) when \mathbfA is an ETF. Extensive experiments show that initializing the low-rank matrices as ETFs significantly outperforms the commonly used Gaussian initialization for existing primary LoRA variants.

Abstract:
Singlepass uncertainty quantification (UQ) methods for classification represent uncertainty by predicting a tractable distribution over the class probability vector. While existing approaches primarily focus on enhancing the expressiveness of this distribution, they often provide limited insight into how predictive uncertainty is structured and aggregated, resulting in weak interpretability. We introduce thecourtroom analogy, which conceptualizes uncertainty-aware classification as a structured debate among class-specific advocates. Each advocate forms a probabilistic opinion, and a final verdict is reached by aggregating these opinions using input-dependent plausibility weights. In this framework, each advocate's opinion is modeled as a Dirichlet distribution whose concentration parameter is decomposed into shared evidence and class-specific advocacy. This yields a structured mixture of Dirichlet distributions with semantically interpretable parameters. To instantiate this formulation, we proposeMixture of Dirichlet EXperts(MoDEX), a single-pass neural architecture that predicts the courtroom parameters, enabling efficient and expressive UQ while explicitly modeling uncertainty aggregation. We demonstrate that MoDEX enjoys strong theoretical properties and achieves state-of-the-art UQ performance across diverse benchmarks, yielding interpretable uncertainty estimates with meaningful semantics.

Abstract:
Conformal prediction (CP) is a distributionfree method to construct reliable prediction intervals that has gained significant attention in recent years. Despite its success and various proposed extensions, a significant practical feature which has been overlooked in previous research is the potential skewed nature of the noise, or of the residuals when the predictive model exhibits bias. In this work, we leverage recent developments in CP to propose a new asymmetric procedure that bridges the gap between skewed and non-skewed noise distributions, while still maintaining adaptivity of the prediction intervals. We introduce a new statistical learning problem to construct adaptive and asymmetric prediction bands, with a unique feature based on a penalty which promotes symmetry: when the penalty intensity varies, the intervals smoothly change from symmetric to asymmetric ones. This statistical learning problem is based on reproducing kernel Hilbert spaces and the recently introduced kernel sum-of-squares framework. First, we establish representer theorems to make our problem tractable in practice, and derive dual formulations which are essential for scalability to larger datasets. Second, the intensity of the penalty is chosen using a novel data-driven method which automatically identifies the symmetric nature of the noise. We show that consenting to some asymmetry can let the learned prediction bands better adapt to small sample regimes, outliers or biased predictive models. Finally, our experiments illustrate the efficiency of such penalized kernel sum-of-squares to construct adaptive prediction bands.

Abstract:
Adaptive rounding techniques in PostTraining Quantization (PTQ) enable the efficient deployment of Large Language Models (LLMs) with low resource and data dependencies. While learning-based rounding methods are accurate yet costly, compensation-based approaches offer a highly efficient alternative. However, synergizing these two to realize their full potential is hindered by spatiotemporal misalignment in the decoupled paradigm. Key challenges include temporal parameter conflict, the invalidation of the initial Round-to-Nearest (RTN) assumption, and spatially-inconsistent optimization objectives. This paper introduces STLA, a novel rounding-optimized PTQ framework that achieves both fast and accurate LLM quantization. STLA resolves temporal inconsistency through cluster-wise integrated rounding optimization, which collocates the learning and compensation phases. STLA achieves spatial alignment through a unified global objective derived from the Schur Complement, enabling the solver to look ahead and align local rounding decisions with the optimal future compensation of remaining weights. Furthermore, we propose a Hessian-guided clustering strategy that exploits both diagonal and off-diagonal information to maximize intra-cluster error cancellation. Extensive experiments demonstrate that STLA establishes a new state-of-the-art for low-bit PTQ while maintaining high computational efficiency. The code is available at https://anonymous.4open.science/r/STLA.

Abstract:
Dynamical systems model trajectory data generated by an underlying fixed dynamics, with applications ranging from biological systems to flows in physics. The identification problem concerns recovering the parameters of a system from observed trajectories. In many scientific settings, however, dynamical systems are not generic and instead exhibit symmetries imposed by physical laws, formalized as equivariance with respect to a group action. In this work, we study \emphadaptive symmetry discovery for dynamical system identification and address how a system can be identified from a single trajectory when it is equivariant with respect to an unknown symmetry group. To this end, we first show that for known symmetries, the system can be identified from a significantly shorter single trajectory than in the generic setting, and we precisely characterize this improvement. We then consider the automatic symmetry discovery setting, proposing a method to learn the symmetry group directly from a single trajectory and incorporate it into the identification procedure, achieving the same optimal trajectory length as in the knownsymmetry case. Our analysis relies on tools from group representation theory and the expander properties of Cayley graphs, and may be of independent interest for the study of symmetries in dynamical systems.

Abstract:
Multimodal neuroimaging diagnosis must integrate cross-modal agreement with modality-specific complementarity, yet in real multi-site cohorts these signals are frequently entangled with site- and cohort-dependent correlations, yielding shortcut-driven predictions, fragile transfer, and limited interpretability. We propose Structured Multi-modal Graph Disentanglement (SMGD), which explicitly factorizes multi-modal graph representations into four components with distinct roles: shared diagnostic evidence, complementary diagnostic evidence, incidental cross-modal agreement, and modality-specific non-robust correlations. SMGD is realized as geometry-driven structure learning: under a mild distributional assumption, we develop mini-batch estimable surrogate regularizers that shape subspace organization and cross-modal relations, enforcing semantic consistency through relational geometry rather than centroid coincidence while suppressing confounded dependencies. Experiments on large multi-site datasets (ABIDE-I, SRPBS) show improved in-domain diagnosis and more reliable cross-dataset generalization under modality gap, without expert-crafted features. Code is available at: \urlhttps://anonymous.4open.science/r/anonymousICML2026/README.md.

Abstract:
Tooluse agents that address day-to-day digital tasks such as ordering groceries must not only operate applications, but also interact with the user, e.g., to ask clarification questions, prompt for confirmation, and inform the user when the instruction is infeasible. However, current benchmarks for evaluating agent-user interactions do not capture the diversity of such interactions. Further, they operate in small environments with few, often non-state-changing, APIs. To address this gap, we introduce AppWorld-UL, a ``user-in-the-loop'' benchmark of 306 challenging tasks requiring diverse agent-user interactions. Building upon the AppWorld framework with 9 popular simulated apps like Amazon and Spotify, we systematically modify original tasks to introduce ambiguities and constraints that necessitate various types of agent-user interaction. User behavior is simulated by an LLM prompted to respond with carefully designed knowledge boundaries, offering more reliable simulation than the unconstrained or overly rigid alternatives used in prior work. Our evaluation reveals that a state-of-the-art LLM, GPT-5, achieves only 38.2% success on AppWorld-UL and that correct user-interaction is crucial for success. This demonstrates the benchmark's difficulty and its potential to advance research on user-in-the-loop tool-use agents.

Abstract:
Genomic language models (gLMs) achieve strong performance across diverse genomic prediction tasks, but their internal biological representations remain poorly understood. Sparse autoencoders (SAEs) have emerged as an interpretability tool in vision and natural language models, yet their applicability to gLMs remains unexplored. We present a systematic study of SAEbased interpretability for gLMs, introducing a diverse benchmark of human genomic annotations and a suite of genome-tailored interpretability metrics. Using Evo2 as a primary case study, we show that SAE features, particularly those from intermediate layers, are more interpretable than raw model embeddings across 42/55 (76%) of our genomic concept evaluations, with 26 of them having an F1 score greater than 0.7. We further find that interpretability depends on SAE training data properties such as evolutionary proximity and context length, with mixed-species and longer-context training improving recovery of human genomic features. Finally, we develop a graph-based representation method to construct a feature atlas that organizes semantically related genomic concepts learned by an SAE, outperforming the baseline approach of using SAE model weights. Our results establish SAEs as a powerful framework for better understanding gLMs, broadening their accessibility and utility for disease-driven genomic analysis.

Abstract:
Federated Continual Learning (FCL) aims to empower distributed devices to learn a sequence of tasks over time. However, existing FCL research largely relies on the impractical assumption of synchronous new task arrival. This overlooks the reality of asynchronous user behavior and system latencies, forcing more efficient clients to endure costly inactivity. The practical necessity of an asynchronous method gives rise to Asynchronous Federated Continual Learning (AFCL). The server constantly receives a mixture of updates from clients at different time steps, leading to a catastrophic task drift that corrupts the global model and prevents effective learning. In this paper, we introduce a novel Crosstask Calibration framework called C^2-AFCL that is the first to tackle task drift at a semantic level within an Asynchronous FCL setting. Its core is a two-stage orthogonal calibration mechanism. First, intra-client calibration uses task-aware caches to mitigate variance from local client drift. Second, and more critically, inter-task interference calibration dynamically estimates an interference subspace from historical task knowledge. New updates are orthogonally projected to isolate and remove components that conflict with this subspace, preserving previous knowledge while learning new tasks. Extensive experiments show that C^2-AFCL significantly outperforms existing methods, demonstrating robust and efficient learning in dynamic federated environments.

Abstract:
Slice sampling is a standard Monte Carlo technique for Dirichlet process (DP)based models, widely used in posterior simulation. However, formal assessments of the scalability of posterior slice samplers have remained largely unexplored, primarily because the computational cost of a slice-sampling iteration is random and potentially unbounded. In this work, we obtain high-probability bounds on the computational complexity of DP slice samplers. Our main results show that, uniformly across posterior cluster-growth regimes, the overhead induced by slice variables, relatively to the number of clusters supported by the posterior, is O_\mathbb P(\log n). As a consequence, even in worst-case configurations, superlinear blow-ups in per-iteration computational cost occur with vanishing probability. Our analysis applies broadly to DP--based models without any likelihood-specific assumptions, still providing complexity guarantees for posterior sampling on arbitrary datasets. These results establish a theoretical foundation for assessing the practical scalability of slice sampling in DP-based models.

Abstract:
The persistent storage requirements for highresolution, spatiotemporally evolving fields governed by large-scale and high-dimensional partial differential equations (PDEs) have reached the petabyte-to-exabyte scale. Transient simulations modeling Navier-Stokes equations, magnetohydrodynamics, plasma physics, or binary black hole mergers generate data volumes that are prohibitive for modern high-performance computing (HPC) infrastructures. To address this bottleneck, we introduce ANTIC (Adaptive Neural Temporal in situ Compressor), an end-to-end in situ compression pipeline. ANTIC consists of an adaptive temporal selector tailored to high-dimensional physics that identifies and filters informative snapshots at simulation time, combined with a spatial neural compression module based on continual fine-tuning that learns residual updates between adjacent snapshots using neural fields. By operating in a single streaming pass, ANTIC enables a combined compression of temporal and spatial components and effectively alleviates the need for explicit on-disk storage of entire time-evolved trajectories. Experimental results demonstrate that ANTIC achieves storage reductions of approximately ～ 400× for 2D Kolmogorov flow simulations and ～ 7000× for large-scale physics simulations such as binary black hole mergers.

Abstract:
Autoregressive models have emerged as powerful tools for sequential data, from language to video. Understanding how and why these models learn latent representations remains an open theoretical question. In this work, we demonstrate that when trained by empirical risk minimization on data from partially observed linear dynamical systems, two-layer linear auto-regressive models naturally learn to approximate Kalman filtering. In particular, we show that the learned hidden representation coincides, up to a similarity transformation, with the state estimates produced by the optimal (Kalman) filter, even though the model has no explicit knowledge of the underlying dynamics or state. The result follows from three main insights. First, we establish that the Kalman filter is well approximated by an auto-regressive model with bounded truncation error. Second, we show that despite non-convexity, the two-layer optimization landscape is benign, i.e., all stationary points are either strictly saddle or global minima. Finally, as our main contributions, we provide finite-sample guarantees on prediction error, parameter estimation error, and latent state recovery. Numerical simulations support the theoretical results and demonstrate that auto-regressive models automatically represent latent state estimates.

Abstract:
Graph Contrastive Learning (GCL) has significantly advanced selfsupervised representation learning on graphs, yet its practical efficacy remains hindered by random augmentations that induce semantic distortion and rigid one-to-one sampling strategy that amplifies inter-class entanglement and intra-class dispersion. To address these limitations, we develop CL-GCL, a Comprehensive and Lightweight Graph Contrastive Learning framework. Specifically, we exploit graph coarsening to preserve structural semantics through community-level representations and manifold learning to capture local geometric relations without costly pairwise distance computations. This design naturally aligns with the neighborhood aggregation principle of Graph Convolutional Networks, enhancing structural consistency while eliminating negative sampling bias. We theoretically prove that CL-GCL approximates node-level contrastive loss under mild conditions. Extensive experiments demonstrate consistent superiority in both accuracy and efficiency over state-of-the-art GCL methods.

Abstract:
Higherorder structures are powerful relational modeling tools, yet existing spectral operators decompose topology into separate ranks, leaving practitioners to fuse information back to vertices through ad-hoc choices. We introduceCollapsed Effective Operators, which marginalize higher-order structures into a single vertex-level operator via Schur complementation of a graded Laplacian. This yields a dense operator that encodes long-range interactions mediated by topology and is applicable to arbitrary higher-order constructs. We show it preserves positive semi-definiteness with a strict spectral upper bound relative to the rank-0 Laplacian, effectively lowering system energy under higher-order connectivity. Empirically, our operator significantly improves spectral clustering, enables diffusion over topological structures, and accelerates the processing of higher-order structures with neural networks.

Abstract:
Despite rapid progress in Grounded Visual Reasoning (GVR) with MLLMs and RLstyle fine-tuning, existing approaches often lack effective learning signals for intermediate grounding decisions and are prone to shortcut solutions. In this work, we explicitly decompose GVR into Evidence Generation followed by Counterfactual Answer Reasoning, and formalize this structure as a Causal Grounding Graph (CGG) in which the generated evidence acts as a causal mediator. Building on this formulation, we propose CofactGVR, which estimates the mediator’s utility via a matched counterfactual intervention that perturbs the predicted region while keeping the original image–question context fixed. The factual–counterfactual reward gap yields a principled intermediate bonus, selectively assigned to high-quality factual rollouts to promote evidence-faithful reasoning. To further stabilize and efficiently exploit this causal training signal, we incorporate a Quantile-filtered Prioritized Advantage Sampling (Q-PAS) strategy that preferentially updates on trajectories with high-magnitude advantages while filtering low-signal samples. Extensive experiments across GVR benchmarks show consistent improvements, indicating that CofactGVR strengthens reliance on informative visual evidence under controlled interventions. The source codes will be publicly available.

Abstract:
Visual tracking aims to estimate target states in video sequences, with applications spanning diverse computational requirements. Recent methods optimize trackers using manually pruned image tokens with a fixed budget to reduce computational costs. However, these trackers, once trained, are constrained to perform tracking under a fixed computational budget, limiting their adaptability to realworld computational diversity. To address the above limitation, we provide the first exploration of the elastic token budget training framework (ETBTrack), enabling trackers to perform robust tracking under varying computational budgets. It enjoys several merits. First, we present a novel result-driven importance criteria, in which we optimize a policy network guided by the localization precision of the tracker to estimate token importance, thereby aligning the objectives of importance estimation and tracking precision. Second, we develop a new budget-collaborative optimization strategy, in which we collaboratively optimize the tracker across varying budgets, thereby enabling the tracker to be compatible with diverse budgets. Two optimization processes are performed alternately to enhance the capability of elastic inference. Extensive experiments on large-scale benchmarks demonstrate the effectiveness of our method. Codes will be released.

Abstract:
This paper presents a theoretical investigation into the generalization capabilities of crossdomain knowledge distillation. Utilizing a high-dimensional asymptotic analysis of a linear teacher–student model, we characterize the excess risk while accounting for both model and covariate shifts. Our results provide a formal guarantee for the efficacy of distillation: even when the source and target domains differ substantially, there still may exist a regime where the student model achieves superior generalization ability over the student-only baseline. Moreover, we identify a crossed double descent phenomenon: the excess risk can vary non-monotonically with the teacher’s and student’s dimension-to-sample-size ratios. These results provide rigorous insight into when and why distillation helps across domains.

Abstract:
Graph foundation models (GFMs), pretrained on massive graph data, have transformed graph machine learning by supporting generalpurpose reasoning across diverse graph tasks and domains. Existing GFMs pretrained with fixed-hop subgraph sampling impose a fixed receptive field, causing scale mismatch on diverse tasks, which often require heterogeneous and unknown structural contexts beyond a fixed sampling scale. We proposeR-GFM, a Riemannian Graph-of-Graphs (GoG) based foundation model, that treatsstructural scaleas a first-class citizen in modeling. R-GFM constructs a multi-scale GoG over-sampled subgraphs at different hop distances and learns geometry-adaptive representations from Riemannian manifolds. Theoretical analysis shows that R-GFM reduces structural domain generalization error compared to fixed-scale GFMs. Experiments on various datasets demonstrate that R-GFM achieves state-of-the-art performance, with up to a49%improvement on downstream tasks.

Abstract:
Trainable denoisers with Lipschitz control have become central to convergent image reconstruction. However, training neural networks that simultaneously offer strong denoising performance and global Lipschitz guarantees is challenging. Existing approaches enforce Lipschitz control only empirically, providing no guarantees beyond the training data. In this work, we show that by exploiting the action of permutations on the image lattice, we can constrain a neural architecture that is globally nonexpansive (Lipschitz bound \leqslant 1). We integrate the proposed denoiser with forward imaging operators to develop a reconstruction mechanism that is provably contractive and therefore globally convergent. Experiments on standard inverse problems, such as superresolution and deblurring, demonstrate that our reconstruction performance is competitive with softly constrained baselines while providing Lipschitz guarantees.

Abstract:
Recent decompositionbased approaches have achieved significant success in Multi-Objective Combinatorial Optimization (MOCO). However,existing methods typically rely exclusively on node-centric representations, failing to capture the complementary representations provided by edge features for problem instances, resulting in a persistent optimality gap. To address this , we propose a Preference-Modulated Structural Attention mechanism to enhance problem representation by synergizing node and edge features. It includes: (1) Utilizing preference-modulated edge features as explicit structural biases during attention calculation, enabling model to perceive sub-problem structures conditioned on specific preferences,and (2) an edge feature aggregation strategy that dynamically incorporates node-specific context into edge representations to enhance the perception of preference-aware structures. Experiments on classic MOCOP benchmarks demonstrate the superiority of our approach in terms of both performance and generalization capabilities.

Abstract:
Large language models (LLMs) often fail in systematic, modelspecific ways under meaning-preserving question rewrites (paraphrases, format changes, benign distractors). In this work, we address this instability by identifying where the model's reasoning process diverges across semantically-equivalent inputs. For each target LLM, we sample multiple solution traces under rewrites and aggregate them into a graph of recurring intermediate steps, which pinpoints where incorrect traces diverge from correct ones. We then generate a small set of semantics-preserving examples that mirror the rewrite patterns most responsible for these divergences, and use them to steer the model (\emphtargeted alternation training), either via fine-tuning or via in-context learning. Across MMLU-Pro, Big-MATH, and DROP, this yields consistent gains and cross-dataset generalization. On Humanity’s Last Exam, using 200 in-context examples, it improves GPT-5.2 (xhigh) from 35.4% to 38.1%, demonstrating that targeted alternation training can materially improve a frontier, API-accessible closed model under realistic access constraints.

Abstract:
Humans do not always express what they need perfectly. Helpful assistants must be able to ask for clarification to handle realworld user commands which may be underspecified or poorly stated. Yet, optimal clarification remains challenging. The space of possible tasks is large, and not all missing information is equally valuable. We study real software engineering tasks. These tasks require the user to provide many details, some of which are more important than others, and an accurate assistant should be able to determine which clarification questions to ask when necessary information is missing. We systematically quantify which types of information in prompts types most impact task success and what types of clarification questions enable productive answers by simulated users. Using Shapley attribution and distributional comparisons, we identify two learnable properties of effective clarification: task relevance (which information impacts success) and user answerability (what users can realistically provide). We operationalize these as multi-stage reinforcement learning rewards to train an 8B parameter module matching GPT-5's performance while generating 41% fewer questions. Our work demonstrates that effective clarification emerges from grounding reward design in empirical analysis of information impact and user answerability, establishing a generalizable reward framework applicable across interactive task domains.

Abstract:
In crowdsourcing scenarios, to mitigate the impact of noisy labels assigned by nonexpert workers, each instance is typically annotated multiple times by different workers. However, repeated annotation can introduce instance- or label-level redundancy, thereby inflating annotation costs. Despite its practical importance, research on repeated annotation strategies remains limited, and no existing strategy simultaneously avoids being offline, instance-unaware, and model-centric. In this paper, we propose a model-agnostic active annotation strategy, MA^3S, that addresses these limitations: (1) To reduce label redundancy caused by offline procedure, MA^3S estimates instance uncertainties with a general Beta distribution and updates them online as new labels arrive. (2) To prevent instance redundancy induced by instance-unaware designs, MA^3S constructs a nearest-neighbor graph to propagate instance uncertainties, reducing repeated annotations of similar instances. (3) To avoid being model-centric, MA^3S actively selects instances for annotation based solely on the estimated uncertainties, without relying on model feedback. Extensive experiments on synthetic and real-world datasets demonstrate that MA^3S consistently outperforms existing annotation strategies.

Abstract:
Modern conversational agents condition on an evergrowing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revision, causing information loss and compounding errors in long dialogues. We revisit the context compression under conversational dynamics and empirically present its fragility. To improve both efficiency and robustness, we introduce Context-Driven Incremental Compression (C-DIC), which treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single, compact dialogue memory. At each turn, a lightweight retrieve → revise → write-back loop shares information across turns and updates stale memories, stabilizing long-horizon behavior. In addition, we adapt truncated backpropagation-through-time (TBPTT) to our multi-turn setting, learning cross-turn dependencies without full-history backpropagation. Extensive experiments on long-form dialogue benchmarks demonstrate superior performance and efficiency of C-DIC; notably, C-DIC maintains near-constant inference time and stable perplexity even over hundreds of dialogue turns, supporting a scalable path to high-quality dialogue modeling.

Abstract:
Flow matching VisionLanguage-Action (VLA) models promise precise continuous control, but their iterative denoising nature introduces fundamental incompatibilities with real-time robotics: global timestep injection invalidates KV-caching, forcing a choice between slow O(N^2) re-computation or mathematically incorrect cache reuse. We present Reflex, a framework that enables real-time streaming inference for flow matching policies by exploiting the Timestep-Invariance Property---that perception encoders are functionally independent of the denoising loop. Reflex partitions the attention context into static, sliding, and dynamic regions, enabling O(1) incremental cache updates that guarantee outputs identical to full-batch inference. To ensure stability under continuous high-frequency inference, we introduce AdaRMSNorm, an adaptive normalization layer that prevents BFloat16 numerical collapse by gating on flow phase. We further maximize throughput through an async pipeline that decouples visual encoding from action generation, combined with operator fusion that reduces kernel overhead. On LIBERO and Kinetix benchmarks, Reflex achieves a 2.58× inference speedup and 50Hz stable streaming, reducing reaction latency by up to 54% and enabling efficient deployment without performance degradation.

Abstract:
Regression with functional covariates poses fundamental challenges due to the infinitedimensional nature of functional data, and its theoretical properties have been studied under specialized frameworks in classical nonparametric statistics. While deep neural networks (DNNs) have demonstrated remarkable empirical success in high-dimensional regression, their theoretical behavior in settings involving infinite-dimensional covariates remains largely unexplored. In this work, we study the theoretical performance of DNN-based estimators for regression problems with functional covariates. We extend existing theoretical techniques, which are developed for finite-dimensional covariates supported on compact sets, to the infinite-dimensional and non-compact functional data setting. Under mild conditions, we show that DNN estimators attain minimax-optimal polynomial rates of convergence for both functional linear models and functional generalized linear models. For fully nonparametric regression with functional covariates, we establish a lower bound on the prediction error, and further discuss the fundamental obstacles inherent to this problem and their connections to existing state-of-the-art methods in the literature.

Abstract:
The probabilistic bisection algorithm (PBA) extends the classical binary search to settings with noisy responses, and is a foundational algorithm commonly used in basic problems such as rootfinding. Despite its strong empirical success, its theoretical property, particularly the convergence rate, remains unclear. This paper establishes that PBA converges at a geometric rate, providing a rigorous justification for its empirical efficiency. Notably, this rate is optimal in the sense that it matches the performance of classical binary search under noiseless responses. The core of our analysis lies in directly characterizing the dynamics of PBA queries, which had not been examined in the prior literature. We show that the queries oscillate around the truth but steadily draw closer, thus leading to an estimator that rapidly concentrates on the truth. Beyond resolving the long-standing question of PBA’s convergence, our developed techniques offer new tools for analyzing PBA's dynamics, which may be of independent interest.

Abstract:
The rapid development of multimodal large language models (MLLMs) has created an urgent demand for more reliable and robust evaluation protocols. However, existing static benchmarks are prone to data contamination and performance saturation, which can result in inflated or misleading evaluation results. To address these limitations, we first introduce a graph formulation to represent both static and dynamic visual question answering (VQA) samples. Building upon this frmulation, we propose KnowledgeEnhanced Benchmark Evolution (KBE), a dynamic multimodal evaluation framework that first analyzes the original static benchmark, then expands it by integrating multimodal knowledge, transforming the static benchmark into a controllable, dynamic evolving version. Crucially, KBE can both reconstruct questions by Re-selecting visual information in the original image and expand existing questions with external textual knowledge. By explicitly controlling the degree of question exploration, KBE enables difficulty-controllable evaluation across a wide range of model capabilities. Extensive experimental results demonstrate that KBE effectively mitigates data contamination and benchmark saturation, while providing a more comprehensive and flexible assessment of MLLM performance.

Abstract:
Deep neural networks achieve strong performance on many supervised learning tasks but remain vulnerable to adversarial perturbations. Neural network verification provides mathematically rigorous robustness guarantees, yet at substantial computational cost. To mitigate this, certified training techniques optimise for verifiable robustness during training, typically inducing a tradeoff between natural and certified accuracy controlled by method-specific hyperparameters. Because these metrics are inherently conflicting, the common practice of reporting a single configuration is problematic: it can mislead conclusions about overall performance and prevents unbiased assessments of the state of the art. We address this by evaluating certified training methods via Pareto front comparisons over the natural--certified accuracy trade-off. To enable fair, method-agnostic comparisons, we perform efficient automated multi-objective hyperparameter optimisation to identify a set of Pareto-optimal configurations for each method. This approach often uncovers substantial undertuning in previously reported configurations, yielding superior performance and establishing a new state of the art. Leveraging these fronts, we present the first comprehensive multi-objective comparison of certified training approaches, showing that prior advancements are less pronounced than assumed and revealing previously unreported performance complementarities.

Abstract:
Most existing visionlanguage manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. \ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoiding. We anticipate ManiSoft to serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation.

Abstract:
Agent planning faces dynamic heterogeneity—nonstationary observations, dynamics, and objectives with sparse, delayed rewards—which dominant methods largely ignore, leading to poor generalization under environment shifts. We propose FlowMatching for Agent Planning (FlowMAP), which formulates planning as a continuous-time flow matching by learning a planning-time velocity field that transports an initial meta-state distribution toward a task-conditioned target. FlowMAP introduces Value-Transport Flow Matching to provide distribution-level planning objective that steers transport toward high-value regions in meta-state distribution, mitigating error accumulation under environmental shifts. To enforce alignment between meta-state distributions transport and action-environment interaction, FlowMAP further propose Flow-Policy Co-Training, which jointly optimizes the planning flow and policy so that the flow transport directly regularizes the policy-induced meta-distribution dynamics. Across diverse agent planning benchmarks, FlowMAP consistently outperforms strong baselines, yielding improvement in planning generalization.

Abstract:
Ordinary differential equations (ODEs) are central to scientific modelling, but inferring their vector fields from noisy trajectories remains challenging. Current approaches such as symbolic regression, Gaussian process (GP) regression, and Neural ODEs often require complex training pipelines and substantial machine learning expertise, or they depend strongly on systemspecific prior knowledge. We propose FIM-ODE, a pretrained Foundation Inference Model thatamortisesODE inference by predicting the vector field directly from noisy trajectory datain a single forward pass. We pretrain FIM-ODE on a prior distribution over ODEs with low-degree polynomial vector fields and represent the target field with neural operators. FIM-ODE achieves strongzero-shotperformance, matching and often improving upon ODEFormer, a recent pretrained symbolic baseline, across a range of regimes despite using a simpler training prior. Pretraining also provides a strong initialisation for finetuning, enabling fast and stable adaptation that outperforms modern neural and GP baselines without requiring machine learning expertise. We release pretrained weights with the supplementary material.

Abstract:
The periodicity misalignment remains a challenge problem in generating timeseries data across multiple domains. Existing methods model time-series interactions either at the granularity of individual points or fragmented segments. This limits their ability to capture and adapt to complex periodic patterns inherent in diverse domains. To address this, we introduce Winformer, a novel diffusion framework built on window-wise attention mechanism. We shift the fundamental processing unit in the attention mechanism from pairwise points similarity to continuous windows comparison of the entire horizon. Leveraging the adaptive window-alignment kernels derived from the frequency decomposition, Winformer brings semantically richer window representations, and effectively captures and transfers complex periodic patterns across domains. Extensive experiments on 12 real-world datasets demonstrate Winformer's effectiveness, achieving an average performance gain of 10.67% over SOTA baselines.

Abstract:
This work proposes a novel method for solving learning from label proportion problems. For this purpose, we learn a classifier that minimizes three key objectives: (i) a baglevel loss, which quantifies the discrepancy between true and predicted label proportions in bags, (ii) an instance-level loss, inspired from domain adaptation, which leverages anchor samples with known labels and trainable supports and (iii) a distribution discrepancy that aims at aligning anchor's learned support with those of the bag samples. The problem is formulated as an alternating optimization process, iteratively updating the classifier and aligning distributions via a particle flow method. The flow of anchor samples is governed by a vector field designed to minimize the anchor loss while ensuring alignment between anchor and bag distributions. We provide a theoretical analysis, guaranteeing the convergence of the flow and identifying conditions under which the method achieves effective alignment. Our analysis highlights that gap and diversity in label proportions within bags is a critical factor for learnability. Empirical results on tabular and image datasets demonstrate the method's effectiveness, outperforming state-of-the-art approaches.

Abstract:
Recent advances in visual representation learning have seen the rise of clusteringbased vision backbones, which adopt clustering as a core paradigm for feature extraction. However, existing clustering-based backbones typically rely on a single clustering algorithm, whose inherent inductive bias limits their representational capacity. To address this, we propose EnFormer, which embeds ensemble clustering as a core component of feature extraction. EnFormer structures feature extraction around two steps: (i) Ensemble Generation, where several differentiable base clustering methods are introduced to capture diverse semantic structures; and (ii) Consensus Aggregation, which employs a differentiable mechanism to fuse the results of all base clusterings to reconstruct refined visual features. Extensive experiments show that EnFormer consistently outperforms existing clustering-based backbones across core vision tasks, with higher performance and significantly improved throughput.

Abstract:
LowRank Adaptation (LoRA) has become the de facto paradigm for parameter-efficient fine-tuning, with its effectiveness critically influenced by rank allocation across modules. However, existing approaches face a fundamental dilemma: uniform allocation ignores module heterogeneity, while adaptive methods introduce expensive training overhead or lack reusability across configurations. We propose CSPLoRA (Confidence-guided Structural Planning for LoRA), a decoupled framework that reweights probe samples by prediction uncertainty to obtain higher-fidelity module importance estimates. The key insight is that hard samples---those the model struggles with---provide more informative gradient signals for identifying critical modules than easy samples. Combined with scale-invariant allocation, our method produces reusable structural priors that transfer across different rank budgets and LoRA backends, enabling "probe once, deploy everywhere." Experiments on GLUE, commonsense reasoning, and arithmetic tasks show that CSPLoRA consistently improves over uniform LoRA (+1.25 points on LLaMA-2-7B commonsense reasoning) while maintaining comparable parameters, with the same structure transferring directly to other LoRA variants.

Abstract:
VisionLanguage-Action (VLA) models have become a central paradigm for embodied intelligence. However, most existing approaches are built on large-scale Transformers, resulting in substantial inference latency and energy consumption that limit their practical deployment in low-power, real-time scenarios. We propose SpikeVLA, an end-to-end spiking VLA framework for embodied navigation with energy-efficient inference, consisting of three key components. (i) a spiking vision encoder, Spike-V, that replaces dense continuous computation with event-driven spiking representations to reduce the energy cost of visual representation learning, (ii) a multimodal spiking large language model, Spike-L, that reformulates cross-modal reasoning with spiking dynamics and token-level event-driven sparsity to further lower inference overhead, and (iii) a spiking action policy network, Spike-A, that uses Laplacian-kernel population coding and end-to-end reinforcement learning to produce stable, robust continuous control under low-energy constraints. Experiments on multimodal interaction and robotic control tasks show that SpikeVLA significantly reduces energy consumption and computational overhead while maintaining competitive performance, highlighting its potential for low-power, real-time embodied intelligence.

Abstract:
LLM serving exhibits extreme length variability, making sizebased scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or rank and primarily report mean-centric metrics (e.g., TTFT/TBT). We show these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, and still offer limited control over tail latency (P90–P99) that dominates user experience—even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft, \gamma-parameterized priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling with cache-aware preemption to account for memory-coupled decode dynamics that vary across workload mixes. Evaluated on Azure production traces, our method achieves a P99 TTLT up to 35--50% lower than SRPT with perfect length prediction and a TTFT 34--47% lower across various workloads, including reasoning-heavy and chat-heavy tasks, demonstrating a robust alternative for tail-latency optimization in online LLM serving.

Abstract:
Tabular data is one of the most fundamental and widely used formats for representing structured information. Classical machine learning algorithms continue to achieve substantial success in extracting predictive patterns and constructing accurate models from structured data; however, representation learning approaches that extend languagemodel-based methods to the tabular setting have opened new opportunities. Nevertheless, conventional tokenization procedures and token embedding mechanisms are not well-suited to numerical variables, as they fail to preserve key numerical properties, including proximity structure and ordinal relationships. To address this limitation, we propose TabularBERT, a Transformer-based model that discretizes numerical variables via binning-based tokenization and learns representations that preserve numerical proximity and ordinal information while capturing conditional dependencies among variables through masked self-supervised pretraining. We empirically demonstrate the effectiveness and interpretability of the proposed approach, highlighting the benefits of language-model-based representation learning in the tabular domain.

Abstract:
Existing image inpainting frameworks rely on strictly supervised training paradigms, often suffering from an overreliance on ground-truth reconstruction, which leads to conservative outputs with misaligned creativity and limited diversity. To this end, we propose the first framework to explore Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) for text-guided image inpainting, formulating an efficient online reinforcement learning pipeline that enables flexible, human-aligned aesthetic control via a preference scoring model. Crucially, by decoupling the rigid one-to-one correspondence between text prompts and masked images, our method enables the model to explore diverse, controllable, and high-quality solutions beyond a single target. Furthermore, to balance semantic consistency with physical naturalness at mask boundaries, we introduce a scale-aware dynamic reward mechanism that adaptively emphasizes boundary gradient coherence for small occlusions while prioritizing visual aesthetics in large-scale generation. Extensive experiments demonstrate that our approach consistently produces higher-quality results across different backbone architectures such as Stable Diffusion and FLUX, significantly enhancing the generative capacity of base models. Code is available at https://anonymous.4open.science/r/E3F47R.

Abstract:
The state space duality (SSD) framework, central to modern statespace models (SSMs) such as Mamba, has established an efficient attention-like mechanism by leveraging the commutative property of linear recurrences. However, existing formulations are limited to single-input single-output (SISO) systems that enforce commutativity with a restrictive scalar-identity constraint, which prevents cross-dimensional interactions within the state dynamics. In this work, we generalize SSD to the multi-input multi-output (MIMO) setting by introducing a matrix polynomial parameterization. This approach not only provides a principled way to ensure commutativity for generalized duality but also induces a shared algebraic structure across state transitions, thereby significantly reducing parameter redundancy. Building on this foundation, we present MIMOMamba, a multi-head SSM architecture that captures rich cross-dimensional dynamics while retaining linear-time training. Empirical results on a sequence modeling benchmark show that MIMOMamba matches or exceeds the performance of standard Transformers with only approximately one-third the parameters of the baseline.

Abstract:
Reliable longhorizon precipitation nowcasting requires preserving fine-scale echo structures while maintaining coherent transport. Although Rectified Flow (RF) can generate detail-preserving future sequences, numerical ODE integration compounds velocity estimation errors and induces progressive off-manifold drift, causing morphological distortions at extended lead times. We propose Physically-guided Data-space Rectified Flow (PDRF), which re-parameterizes the generative ODE in data space: the network predicts the clean future sequence, analytically inducing a coupled vector field with an implicit restoring effect that suppresses drift. We also introduce a soft Semi-Lagrangian teacher based on an advection prior to regularize large-scale transport, while allowing local growth/decay/deformation to be learned from data. Experiments on four public benchmarks demonstrate consistent improvements in event-based skill and better preservation of intense-echo morphology over long horizons.

Abstract:
Recently, machine learning Hamiltonian (MLH) models have gained traction as fast approximations of electronic structures such as orbitals and electron densities, while also enabling direct evaluation of energies and forces from their predictions. However, despite their physical grounding, existing Hamiltonian models are evaluated mainly by reconstruction metrics, leaving it unclear how well they perform as energy–force predictors. We address this gap with a benchmark that computes energies and forces directly from predicted Hamiltonians. Within this framework, we propose QHFlow2, a stateof-the-art Hamiltonian model with an SO(2)-equivariant backbone and a two-stage edge update. QHFlow2 achieves 40% lower Hamiltonian error than the previous best model with fewer parameters. Under direct evaluation on MD17/rMD17, it is the first Hamiltonian model to reach NequIP-level force accuracy while achieving up to 20× lower energy MAE. On QH9, QHFlow2 reduces energy error by up to 20× compared to MACE. Finally, we demonstrate that QHFlow2 exhibits consistent scaling behavior with respect to model capacity and data, and that improvements in Hamiltonian accuracy effectively translate into more accurate energy and force computations.

Abstract:
Differentially private (DP) stochastic optimization algorithms are widely used in privacypreserving deep learning, where per-sample gradient clipping and noise injection protect sensitive information. However, these operations limit existing DP algorithms to converge within a constant-radius neighborhood of the first-order stationary point, leading to solution bias and the well-known privacy-utility trade-off. To enhance model utility, we propose a novel algorithmic framework called DP-C4, which is designed to be error-Consistently-vanishing, Coupledly-clipped, solution-Calibrated, and Convergence-guaranteed. Specifically, it incorporates a carefully designed coupled clipping scheme with shifted threshold strategy, ensuring that both clipping bias and noise variance asymptotically vanish, thereby eliminating the DP-induced solution bias. Moreover, we extend existing sensitivity analysis techniques and develop a tailored privacy budget allocation to guarantee the privacy of DP-C4. Compared with the well-recognized DP-SGD, our framework injects significantly less noise under the same privacy level. In addition, we prove that our framework converges to the optimum in strongly-convex case and to a diminishing neighborhood of the first-order stationary point in non-convex case. Experiments show that DP-C4 achieves superior privacy-utility trade-off over existing baselines across various tasks and datasets.

Abstract:
Most deep video codecs emphasize lowlevel motion modeling and remain largely semantics-agnostic, which can degrade perceptual quality in complex scenes. We propose MoVie, a Multimodal Video compression framework built on a Text-guided Video Transformer–CNN Mixed block (Text-VideoTCM). MoVie adopts a video-centric architecture that jointly models local spatial structures and temporal dynamics via window-based processing, delivering a favorable computation--perception trade-off. To incorporate semantics, we introduce dual-stage text fusion with Extractor and Injector modules. We further present history-conditioned coding that leverages both previous and aggregated historical frames, and a spatial--channel factorized entropy model that estimates probabilities over spatial neighborhoods and channel groups for adaptive bit allocation. Together, these designs reduce redundancy and improve rate control and temporal coherence, yielding reconstructions at low bitrates. On UVG and MCL-JCV, MoVie achieves -50.23% BD-rate for FID and -14.64% for LPIPS (VGGNet) relative to HM, while requiring only 55.76% of DCVC-FM's per-pixel kMACs.

Abstract:
Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of realworld development. Such environments typically involve both complex code and large-scale data (i.e., file system). However, existing benchmarks usually evaluate code-centric or data-centric capabilities in isolation, leaving a clear gap with real development scenarios. In this paper, we bridge this gap by introducing CoDA-Bench, the first benchmark to jointly evaluate code and data intelligence in a data-intensive environment. We construct a data-intensive Linux sandbox based on the Kaggle ecosystem (containing hundreds of datasets), where agents must actively explore complex file hierarchies to identify relevant resources and generate code for data-driven analytical tasks. CoDA-Bench comprises 1,202 tasks spanning 53 communities, with each task environment containing an average of 700 files, simulating realistic data scale and noise. Evaluations of advanced agents reveal that even top-performing systems struggle to effectively integrate data discovery with code execution, achieving a success rate of only 56.1%. These results highlight a substantial gap in current agentic capabilities for data-intensive tasks and point to promising directions for future research.

Abstract:
Federated learning often face both data and model heterogeneity, with the latter often more challenging. Architectural differences yield incompatible representation, making the knowledgesharing carrier central to heterogeneous collaboration. Using proxy model enables distillation-based collaboration but incurs high communication and computation costs. Prototype-based carriers are lighter yet cause semantic confusion when incompatible features are mixed. Therefore, we propose rule-based federated learning (RFL) that shares interpretable, class-discriminative rules to enable heterogeneous collaboration, avoid feature confusion, and keep communication lightweight. RFL uses a rule network to unify clients’ decision features and collaborates at the rule level, avoiding forcible averaging of incompatible representations. RFL selects sparse, high-coverage, beneficial rules for broadcasting, compressing shared knowledge into an interpretable class-rule set and reducing communication and computation costs. Each client selectively activates only rules relevant to its local classes, mitigating negative transfer while preserving personalization. Across heterogeneous settings, RFL achieves a better accuracy–communication trade-off.

Abstract:
Selfspeculative decoding accelerates LLM inference by using a lightweight draft model for generation and a target model for verification, where the draft model is constructed by a subset of the target model’s layers, and the key challenge lies in layer configuration strategies. To address this challenge, we propose LEAP, a plug-and-play approach that formulates and optimizes the draft model construction problem as a sequential decision-making process by Monte Carlo Tree Search (MCTS). To navigate the prohibitive search space of deep LLMs, we leverage two empirical observations: (i) the prefilling-derived redundancy information remains informative during decoding, and (ii) the layer redundancy exhibits zone-wise characteristics. These observations enable a structured search space through zone partitioning and layer grouping, which serves as an inductive bias to facilitate efficiency of MCTS. Experimental results show that LEAP achieves a speedup of 1.7×～2.0× for LLM inference.

Abstract:
Extracting causally meaningful features from timeseries data is fundamental for robust machine learning under distribution shifts. In process monitoring, existing methods struggle to maintain detection performance when operating conditions change. Current approaches capture either temporal causal relationships or cross-environment invariance, but not both simultaneously. We propose Causal Feature Learning (CFL), a unified framework that jointly optimizes for temporal relevance and environment mean invariance. CFL formulates feature extraction as a generalized Rayleigh-quotient problem, maximizing correlation with target variables while penalizing sensitivity to environment-dependent mean shifts. Theoretical analysis establishes conditions under which CFL identifies a mean-invariant predictive subspace. Experiments on the Tennessee Eastman Process demonstrate that CFL achieves 93.69% average fault detection rate, outperforming 15 baseline methods and validating the benefit of jointly capturing both aspects of causality.

Abstract:
Implicit neural representations (INRs) encode data as continuous coordinatebased functions parameterized by neural networks, shifting downstream tasks such as image recognition to operate on functional rather than discrete representations. Despite their increasing adoption, the adversarial robustness of INR-based classification pipelines remain largely underexplored. In this work, we present the first systematic study of adversarial robustness in INR-based classifiers. A key challenge is that generating an INR requires \text\emphtraining a neural network for each input sample, resulting in an optimization-in-the-loop forward pass that renders standard gradient-based attacks computationally prohibitive. To address this, we design surrogate models that amortizes the INR-generation process, serving as a practical proxy for attacking INR-based classifiers. We also develop speed-up techniques that substantially reduce the training cost of the surrogate. We show that in contrast to recent work, INR-based classifiers are vulnerable: under adversarial input perturbations, classification accuracy collapses to near zero. Moreover, existing countermeasures designed to operate on discrete representations offer limited protection.

Abstract:
We consider imitation learning in the lowdata regime, where only a limited number of expert demonstrations are available. In this setting, methods that rely on large-scale pretraining or high-capacity architectures can be difficult to apply, and efficiency with respect to demonstration data becomes critical. We introduce Noise-Guided Transport (NGT), a lightweight off-policy method that casts imitation as an optimal transport problem solved via adversarial training. NGT requires no pretraining or specialized architectures, incorporates uncertainty estimation by design, and is easy to implement and tune. Despite its simplicity, NGT achieves strong performance on challenging continuous control tasks, including high-dimensional Humanoid tasks, under ultra-low data regimes with as few as 20 transitions.

Abstract:
Perception for embodied agents is videobased, often multi-view (ego, exo, or both), and inherently continual, with simultaneous task and viewpoint shifts. Yet continual learning (CL) remains dominated by exo-only recognition tasks, obscuring behavior under these real-world coupled shifts. We introduce Continual Ego, Exo, and Ego-Exo Learning (CE^4L), a unified multi-view CL benchmark spanning four representative tasks: cross-view referenced skill assessment, temporal action segmentation, cross-view association, and action anticipation \& planning. CE^4L highlights challenges largely absent in prior CL benchmarks, including cross-view correspondence, view-dependent asynchrony, and heterogeneous semantic objectives. To this end, we propose Video Incremental Subspace-routed Task Adapters (VISTA), a parameter-efficient baseline method that stores task-specific updates in lightweight adapters and performs training-free routing via residual distance to task-specific whitened subspaces estimated from second-order statistics. Extensive experiments demonstrate the significantly varied efficacy of representative CL methods across CE^4L settings, while VISTA is consistently competitive and achieves state-of-the-art overall performance.

Abstract:
Standard machine learning pipelines often admit many nearoptimal models. These “Rashomon sets” pose a range of challenges and opportunities for uncertainty-aware, robust decision making. They allow users to incorporate domain knowledge and preferences that would otherwise be difficult to specify directly in an objective, and they quantify diversity among valid models for a given training dataset and objective function. However, computation of Rashomon sets, even for simple, interpretable model classes such as sparse decision trees, continues to require immense memory and runtime resources. We present PRAXIS, an algorithm to approximate this Rashomon set with orders of magnitude improvement in runtime and memory usage. We validate that PRAXIS regularly recovers almost all of the full Rashomon set. PRAXIS allows researchers and practitioners to scalably model the Rashomon set for real-world datasets.

Abstract:
Offline imitation learning typically learns from expert and unlabeled demonstrations, yet often overlooks the valuable signal in explicitly undesirable behaviors. In this work, we study offline imitation learning from contrasting behaviors, where the dataset contains both expert and undesirable demonstrations along with an unlabeled set of demonstrations. We propose a novel formulation that optimizes a difference of KL divergences over the stateaction visitation distributions of expert and undesirable (or bad) data. Although the resulting objective is a DC (Difference-of-Convex) program, we prove that it becomesconvexwhen expert demonstrations outweigh undesirable demonstrations, enabling a practical and stable non-adversarial training objective. Our method avoids adversarial training and handles both positive and negative demonstrations in a unified framework. Extensive experiments on standard offline imitation learning benchmarks demonstrate that our approach consistently outperforms state-of-the-art baselines.

Abstract:
Textto-motion (T2M) generation has emerged as a fundamental task. However, existing evaluation metrics often fail to accurately capture the semantic alignment between textual descriptions and generated 3D motions. In this work, we propose VeMo, a novel evaluation framework that leverages the zero-shot reasoning capabilities of Video-Language Models (VLMs) for T2M assessment. Our basic idea is: render the generated human motion into a skinned video, and then use a VLM for evaluation. To mitigate the information loss inherent in 3D-to-2D projections, we introduce an entropy-based uncertainty analysis that ensures the reliability of the evaluation scores. To address the lack of rigorous standards in the field, we contribute a meta-evaluation benchmark featuring manual annotations of coarse-grained alignment and fine-grained rationales. Extensive experiments demonstrate that VeMo significantly outperforms traditional metrics in human-alignment, offering a scalable and data-independent solution for the reliable assessment of T2M models.

Abstract:
Large language models built on autoregressive Transformers excel at nexttoken prediction, but it is unclear how their internal computations capture the latent hierarchical dependencies that often underlie language. We study this question in a controlled formal-language setting based on probabilistic context-free grammars (PCFGs), where sequences are generated by a latent hierarchical process. Empirically, standard autoregressive Transformers can be trained to accurately match the grammar-induced next-token distribution. Using probing analyses, we find that Transformer hidden states contain information used by classical parsing algorithms. Moreover, this information emerges through a layer-wise progression, revealing a local-to-global mechanism: early layers accumulate local patterns, while later layers aggregate them into a compact summary for next-token prediction. Complementing these empirical findings, we provide an explicit construction of Transformers that can parse binary PCFGs with depth \emphlogarithmic in the grammar's sequence length. Surprisingly, trained Transformers in this setting exhibit prediction behavior and internal representations that closely mirror our construction. Together, our results offer a mechanistic account of how Transformers integrate hierarchical parsing with autoregressive generation.

Abstract:
Automated redteaming of Large Language Models (LLMs) commonly relies on attack success rates (ASR) as a proxy for real-world harm, implicitly assuming that judge-detected violations correspond to actionable risk. In practice, safety judges are imperfect, and outputs that satisfy automated criteria for harm can vary widely in their operational usefulness. In this work, we investigate whether model failure modes can be reshaped so that, when defenses fail, they preferentially produce low-utility, non-actionable outputs rather than highly actionable harm. Inspired by honeypots in computer security, we construct responses that are frequently flagged as harmful by automated judges yet provide little real-world operational value, and treat them as hard negatives in the safety training pipeline. Our findings show that shaping how models fail under attack can improve overall safety by reducing both the real-world impact and the frequency of harmful failures, and serves as a practical complement to ASR-based evaluations.

Abstract:
Video Diffusion Models (VDMs) is constrained by immense computational costs. While offline calibrationbased acceleration suffers from calibration data dependency, prohibitive calibration duration, and susceptibility to distribution shifts, offline calibration-free methods eliminate these hurdles. However, since they rely on instantaneous zero-order approximations where the mapping between input and output differences varies in real-time, they are susceptible to observational noise and ignore the intrinsic momentum within the diffusion trajectory. In this paper, we propose NaviCache, a plug-and-play test-time self-calibration method re-conceptualizing feature evolution as an Inertial Navigation System (INS) problem. NaviCache bridges the fundamental domain gap and the non-stationary nature of diffusion by modeling the relative coupling between input and output variations. We introduce a dual-state estimation architecture that adaptively tracks the feature change ratio and its latent drift, initialized via a specialized Initial Alignment phase. By integrating a time-dependent noise schedule with an uncertainty-aware Measurement Update mechanism, NaviCache provides a theoretically grounded mechanism for error-bounded block skipping. Extensive experiments on the HunyuanVideo, Wan, and Open-Sora series demonstrate that NaviCache exhibits more accurate error judgment for block skipping and achieves outstanding comprehensive performance.

Abstract:
We study the problem of dynamic facility location with nonuniform costs. Facility location is a central problem in unsupervised learning and in recent years the dynamic version of the problem has been extensively studied. In this paper, we study the setting where clients are added and deleted in real-time and one is interested in maintaining efficiently a stable and high-quality solution. Interestingly, we are able to show that on High Dimensional Euclidean metrics it is possible to obtain efficient algorithms for this problem. More formally, we obtain a randomized algorithm for dynamic facility location in d-dimensional Euclidean spaces with \gamma approximation ratio, O(\log m) amortized recourse and \textpoly(d) \cdot (m+n)^O(1/\gamma) amortized update time, for every sufficiently large constant \gamma \geq 1. Our result is the first efficient dynamic algorithm for the \emphnon-uniform dynamic facility location problem in high-dimensional Euclidean spaces. It also provides a stronger recourse bound than the existing solutions.

Abstract:
In this paper, we study the fullydynamic k-center with outliers problem. In this setting, the clustering data changes over time through a sequence of updates. The goal is to maintain an approximate k-center solution with efficient update and query time, while allowing up to z points to be discarded as outliers. To achieve provable guarantees, existing approaches typically rely on radius-guessing to maintain clusters with a uniform radius associated with the optimal one. Although effective, these methods can yield additional update overhead to maintain pairwise distance bounds, and the update and query time typically have explicit dependence on the aspect ratio Δ (the ratio of the maximum to the minimum pairwise distance). This may limit the scalability of the algorithms in large-aspect-ratio settings. To address these issues, we propose a layered-sampling method that can achieve sub-linear update and query time while eliminating prior knowledge and assumptions on aspect ratio terms. The proposed method avoids traditional radius guessing process by constructing layered structures in different stages, where the initial layers summarize most inliers, the transition layer finds good representatives for the remaining points, and the final layers carefully adjusts the outliers discarded using a fine-grained division manner. By adopting an update-delay strategy, the proposed algorithm achieves \tildeO(k^2/\epsilon^4) update time and \tildeO(k^2/\epsilon^4) query time that are independent of Δ, while guaranteeing constant approximation with at most (1+\epsilon)z outliers discarded. Under mild assumptions on optimal cluster sizes, the time bounds can further be improved by a factor of z via sampling-based data reduction techniques. This is complemented with a lower bound update time of \Omega(k^2/z) in the general metric space query model.

Abstract:
While Multimodal Diffusion Transformers (MMDiTs) have achieved remarkable success in highfidelity generation, maintaining semantic faithfulness and structural consistency during the image editing process remains a fundamental challenge. DiT-based editing is primarily limited by cumulative drift and semantic leakage induced by new textual conditions. To address these challenges, we propose T-Edit, a training-free framework that formalizes consistent editing as a trajectory anchoring process. T-Edit explicitly decouples the inversion, reconstruction, and editing trajectories, leveraging the reconstruction branch as a structural reference to achieve real-time compensation for deviations in the latent manifold. To further reveal the internal regulation mechanism of DiTs, we analyze the spatio-temporal heterogeneity of their layer-wise structural sensitivity and accordingly propose a Dynamic Vital Layer (DVL) localization mechanism based on information energy. Furthermore, addressing the asymmetry of textual perturbations in the frequency domain distribution, we introduce a frequency-aware strategy based on tensor Singular Value Decomposition (t-SVD) to anchor (TA) high-rank structural components. Experiments show that T-Edit achieves state-of-the-art performance in both semantic alignment and structural fidelity, and can be seamlessly extended to multi-step editing and video scenarios, providing a new perspective for understanding and controlling the internal stability of DiTs.

Abstract:
In selfsupervised goal-conditioned reinforcement learning (RL) without external rewards, goals are typically specified by observations sampled from experience. However, depending on the observation structure, such a fixed representation of goals may be either too concrete (requiring exact pixel-level matches) or too abstract (involving ambiguous observations). Here we propose the construction of hierarchical latent goal spaces that integrate both concrete and abstract goals. To this end, we use an energy function to learn a partially ordered space, in which a subset relation between observations naturally induces a hierarchy from concrete to abstract goals. This representation enables agents to disambiguate specific states while also generalizing to shared concepts. In experiments on navigation and robotic manipulation, agents trained with our hierarchical goal space achieve higher task success and greater generalization to novel tasks compared to agents limited to purely observational goals.

Abstract:
Multimodal prompt learning is a parameter-efficient approach to adapt large vision--language models to downstream classification tasks. However, prompts can inadvertently evolve into a high-capacity pathway encoding environment-dependent spurious correlations that are only predictive in the source domain, thereby undermining transferability. To address this issue, this paper introduces Do-Prompt, a \emphcompress-and-intervene framework that brings together variational bottlenecks and causal interventions for robust prompt tuning. We model prompts as stochastic latent variables and impose a \emphvariational prompt bottleneck to explicitly regulate the information transmitted through prompts, effectively mitigating their propensity to memorize spurious nuisance cues. Building on this capacity constraint, we propose lightweight \emphprompt-level interventions by perturbing the environment-related prompt components and enforcing prediction consistency under these do-style perturbations. This synergistic integration encourages reliance on task-stable, invariant semantics rather than spurious prompt content. Notably, Do-Prompt is plug-and-play compatible with existing multi-modal prompt tuning pipelines with negligible computation overhead. Extensive experiments on base-to-novel generalization, cross-dataset transfer, and ImageNet distribution shifts demonstrate consistent performance gains, with particularly notable improvements on datasets exhibiting pronounced domain or texture biases.

Abstract:
Crosssilo recommendation from implicit feedback is a key task in modern recommender systems, where user-item interaction data are distributed across multiple parties and cannot be centrally collected. Unlike explicit feedback, which provides fully observed real-valued ratings, implicit feedback is one-class and extremely sparse, recording only users' actions or inactions (e.g., clicks, visits, or bookmarks), yet it is far more prevalent in real-world applications. Such behavioral data are often highly sensitive, raising significant privacy concerns when used for collaborative model training. Although differential privacy (DP) has been widely applied to explicit feedback-based models, the resulting utility degradation makes it difficult to apply DP effectively to implicit feedback learning. In this work, we propose DPIMF, a differentially private implicit matrix factorization framework for cross-silo recommendation based on objective perturbation. To improve utility, we redesign the loss function and adopt an importance sampling scheme to reduce the noise scale required for privacy preservation. We further provide formal utility guarantees for the proposed techniques and characterize the conditions under which utility improvements are maximized. Experiments on three benchmark datasets validate our theoretical analysis and demonstrate that DPIMF achieves a better privacy-utility trade-off than state-of-the-art methods.

Abstract:
We present WhisperSplat, the first lossless steganography method for 3D Gaussian Splatting (3DGS) models that hides a full‐resolution 2D image in a single view without any degradation of the model's rendering quality elsewhere. Prior work embeds data by retraining or modifying model weights, altering novel‐view synthesis fidelity and limiting capacity. Instead, we learn a small, view‐specific noise key applied to each Gaussian's spherical‐harmonic (SH) features while keeping all other views remain indistinguishable from the original renders. We further propose a Gradual Pixel Perturbation (GPP) strategy with a cosinedecay schedule, bootstrapping fast divergence from the clean render before transitioning to a combined reconstruction and SSIM loss. Unlike prior works that are highly dependent on accurate and large pretrained decoders, our method is able to recover the hidden image through rendering with noise key, and an optional lightweight refiner to enhance recovery image quality. Across nine standard 3DGS data scenes, WhisperSplat demonstrates superior hidden image recovery quality without sacrifice in clean 3DGS model performance, when compared to prior work such as GS-Hider and KeySS.

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for improving the reasoning capabilities of large language models (LLMs). However, existing RLVR methods typically rely on onpolicy optimization from scratch, resulting in high sampling costs and inefficient utilization of accumulated experience. As model capabilities and policy behaviors evolve during training, recent attempts to reuse experience via fixed reasoning trajectories further suffer from policy mismatch. Motivated by these limitations, we argue that experience in RLVR should not be reused as fixed reasoning trajectories, but instead expressed in a policy-adaptive manner. In this work, we propose Experience-Augmented Policy Optimization (EAPO), which leverages a prior RL-optimized policy as an action-level experience prior and selectively injects experience at critical decision points during rollout. To ensure stable and unbiased learning from experience-augmented rollouts, EAPO further incorporates an adapted importance sampling scheme. Experiments on using Qwen-2.5-math 7b and Qwen-3-8B on five different benchmarks demonstrate that EAPO consistently improves reasoning performance over state-of-the-art RLVR methods.

Abstract:
Generative diffusion models can provide powerful priors for inverse problems in imaging, but existing implementations suffer from two key limitations: (i) they learn only an implicit approximation of the prior density, and (ii) they rely on crude likelihood approximations that introduce biases in the sampling. We address these challenges by introducing a new energybased model trained using denoising score matching with a covariance-based regularization that enforces consistency across different inverse problems. Our approach learns explicit, normalized posterior densities for diverse linear inverse problems using a single model, while preserving the sampling capabilities of diffusion models. This enables new capabilities unavailable to score-based diffusion models: energy-guided adaptive sampling that adjusts schedules on-the-fly, unbiased MCMC correction with Metropolis-Hastings acceptance, and blind degradation estimation via Bayes rule. We validate our method on multiple datasets (MNIST, CelebA, ImageNet) and tasks (inpainting, deblurring), demonstrating competitive or superior performance to established baselines.

Abstract:
Cognitive structure (CS), a student's construction of concepts and interconcept relations, has long been recognized as a foundational notion in psychology and intelligent education, yet remains largely unassessable in practice. Existing approaches such as knowledge tracing (KT) and cognitive diagnosis (CD) simplify and indirectly approximate CS, but they intertwine representation learning with prediction objectives, limiting generalization, interpretability, and reuse across tasks. To address this gap, we propose Cognitive Structure Generation (CSG), a task-agnostic framework that explicitly models CS through generative modeling. Based on educational theories, CSG first pretrains a Cognitive Structure Diffusion Probabilistic Model (CSDPM) and then applies reinforcement learning with SOLO-based hierarchical rewards to capture plausible patterns of cognitive development. By decoupling cognitive structure representation from downstream prediction, CSG produces interpretable and transferable cognitive structures that can be seamlessly integrated into diverse student modeling tasks. Experiments on five real-world datasets show that CSG yields more comprehensive representations, substantially improving performance while offering enhanced interpretability and modularity.

Abstract:
Textto-image diffusion models have achieved remarkable success in generating high-quality images, yet existing safety mechanisms exhibit critical cross-seed instability where defense performance varies significantly under different random seed conditions. This instability stems from the fact that a single malicious prompt generates diverse harmful variants across different noise initializations, forming complex distributional clusters that current methods cannot adequately address. We investigate extending Noise Contrastive Alignment (NCA) to diffusion models due to its native capability of handling multiple negative samples through probabilistic weighting, but our theoretical analysis reveals two fundamental flaws in direct extension: gradient reversal caused by positive regularization terms that paradoxically penalize safe content generation, and uniform suppression of harmful samples that ignores severity variations. To tackle these issues, we propose Noise Contrastive Diffusion (NCD), which incorporates targeted algorithmic modifications including elimination of problematic regularization and introduction of pairwise regularization mechanisms that establish individualized preference relationships between safe and harmful variants. Extensive experiments further demonstrate that NCD achieves superior cross-seed stability, reducing attack success rates (ASRs) from 11.1% to 6.2% compared to SOTA methods at the seed level while maintaining exceptional generation quality, exhibiting robust resistance against sophisticated jailbreak prompts and strong generalizability across different T2I architectures. WARNING: This paper may contain examples of harmful texts and images.

Abstract:
Realtime estimation of average treatment effects (ATE) in streaming observational data poses two key challenges: strict memory constraints that preclude storing the full data history, and distributional shifts in both treatment assignment and outcome-generating process. Existing methods either require offline access to the entire dataset for covariate balancing or rely on parametric online models that are vulnerable to model misspecification under such shifts. This paper proposes a novel model-agnostic method for ATE estimation in streaming data, which effectively addresses the above challenges. Based on discrepancy theory, we first compress streaming data into feature coresets that preserve covariate balancing objectives over a rich nonparametric function class, enabling linear-time updates with bounded memory. Then, by directly learning balancing weights and bypassing parametric propensity score estimation, we enhance the model's robustness against the shift in treatment assignment, while by balancing over an expressive function space we make the model more adaptive to the shift in the outcome-generating process. Theoretically, we establish convergence guarantees with explicit bounds on memory usage and computational complexity. Empirically, extensive experiments on both synthetic and real-world datasets show the effectiveness and robustness of the proposed method, consistently outperforming existing techniques.

Abstract:
Watermarking methods embed imperceptible and verifiable signals into text generated by large language models (LLMs). Existing approaches include zerobit schemes for distinguishing synthetic text from human writing and multi-bit schemes for embedding metadata. However, current multi-bit watermarking methods do not allow selective disclosure: verifying any part of the watermark requires revealing the entire embedded message. This lack of control leads to unnecessary information exposure and raises privacy concerns. We propose Hierarchical Vocabulary Routing, a watermarking framework that enables selective disclosure of embedded metadata. The method recursively partitions the vocabulary and distributes watermark information across hierarchical layers, so that different verifiers can decode only the portions of the payload corresponding to their access level. We show that the proposed scheme preserves the unbiasedness of the underlying sampling process and thus maintains text quality. Experiments demonstrate that our framework supports fine-grained access control while achieving high detection accuracy and low latency.

Abstract:
We study a linear mixing model with dependent latent components, assuming multiple data domains. Most existing models assume that the components are independent or at least uncorrelated, in line with independent component analysis (ICA). Some recent work allows for dependent components, but then makes specific assumptions such as parametric forms of dependencies, multiview settings, or interventions. In contrast, we consider a multi-domain setting in which domains differ through domain-specific scalings of the components, while the distribution of the underlying latent components is the same across domains. This approach can model data collected, for example, from different sensors measuring the same process, different laboratories conducting the same experiment, different experimental conditions, or different subjects that might differ in biological or physiological factors. We show that, under sufficient domain variability, latent variables and mixing functions can be identified from second-order statistics alone. We propose theMulti-DomainCovarianceMatching (MuDo-CoM) algorithm that generalizes previous methods of joint diagonalization. MuDo-CoM is validated on simulated data and a real-world fMRI dataset.

Abstract:
Distributionally Robust Optimization (DRO) is widely used to improve model robustness, with existing methods addressing either geometric perturbations (e.g., input shifts) or statistical contamination (e.g., heavytailed noise and outliers) effectively. However, these uncertainty sources often co-exist. Coupling them through a single divergence or optimal transport constraint conflates geometric displacement with loss-based outlierness, which often leads to the discarding of informative high-leverage samples. We introduce nested DRO, a bilevel formulation that combines optimal transport with an outer \\phi-divergence constraint to decouple geometric smoothing from statistical robustness. We prove that this structure naturally induces a geometry-invariant, loss-based reweighting mechanism that separates outlier suppression from transport-induced regularization. We derive a tractable strong dual for the resulting non-convex problem and show its equivalence to variance-regularized risk minimization, providing a rigorous theoretical justification for reweighting gains as a natural consequence of dualization. Empirical results on synthetic and real datasets demonstrate that nested DRO consistently outperforms geometry-coupled DRO baselines, particularly under heavy-tailed contamination where preserving high-leverage structure is crucial.

Abstract:
Generating periodic datasuch as fractional atomic coordinates in crystal structures and phase patterns in compressive light-field (CLF) displays---is challenging because wrap-around boundaries complicate probabilistic modeling and learning. While Bayesian Flow Networks (BFNs) offer a powerful generative framework with strictly additive accuracy in Euclidean space, existing periodic adaptations typically sacrifice additivity and become sensitive to schedule heuristics. We introduce \emphPeriodicBFN, which embeds each periodic scalar into a two-dimensional unit-circle representation and performs Gaussian Bayesian updates in the resulting Cartesian space, thereby restoring strictly additive accuracy. To address invariance in periodic generative modeling, we further derive a Rao--Blackwellized objective that analytically marginalizes global periodic translations, producing a translation-invariant target with reduced gradient variance. Experiments on crystal structure prediction and multi-layer phase synthesis for CLF displays demonstrate improved training stability and strong performance. To our knowledge, this is the first work to extend periodic-data generative modeling to phase synthesis for modern glasses-free 3D display systems.

Abstract:
Continual learning seeks to develop models capable of acquiring new tasks sequentially while retaining prior knowledge. A central challenge in this setting is managing inherent knowledge conflicts that arise as overlapping or contradictory information is introduced across tasks. While parameterefficient fine-tuning (PEFT) techniques, particularly those based on Low-Rank Adaptation (LoRA), have shown promise by reducing interference through parameter isolation or modular architectures, they often treat conflict as something to avoid rather than address directly. In this work, we propose \underlineSubspace-\underlineAligned \underlineBalanc\underlineed \underlineRecomposition (SABER), a novel method that reframes continual learning as a problem of structured conflict management. SABER introduces a unified subspace alignment framework to support shared task representations, decomposes task-specific knowledge into orthogonal components to preserve distinct information, and recomposes them using an energy-aware balancing mechanism that coordinates contributions without compromising stability. Extensive experiments across multiple continual learning benchmarks show that SABER achieves performance on par with or surpassing state-of-the-art methods, offering a principled approach that directly addresses the root cause of forgetting by managing representational conflict.

Abstract:
Upcycling, a strategy that initializes Mixtureof-Experts (MoE) by replicating pre-trained feed-forward or MoE networks to expand model capacity, has become a popular method in continual learning due to its effectiveness in mitigating catastrophic forgetting. However, existing paradigms rely on indiscriminate expansion prioritize performance at the cost of extreme inefficiency, introducing parameter redundancy without exploiting the structural heterogeneity essential for counteracting forgetting with architectural economy. To address this, we investigate the determinants of forgetting in training dynamics using Centered Kernel Alignment (CKA) and loss landscape flatness to analyze the behavior of pre- and post-expansion MoE layers, uncovering instability in deep-layer representations and heterogeneous expert sensitivity to new tasks, thereby demonstrating the potential of selective upcycling to eliminate redundancy. Consequently, we propose a dynamic bilevel optimization framework to guide adaptive upcycling, featuring an outer loop employing a Gumbel-Softmax differentiable mask to perform Neural Architecture Search (NAS) for adaptive growth, while an inner loop optimizes weight updates via task objectives and CKA-regularized replay. Experiments on TRACE benchmark demonstrate that our proposed method achieves better average accuracy with 80% forgetting reduction, while effectively eliminating 60% of redundant parameter expansion that standard upcycling would introduce.

Abstract:
Crowdsourced factchecking systems have been widely adopted by major social media companies such as X, Meta, Tiktok, and Google with the aim of combating misinformation at scale without relying on centralized editorial control. These systems have been developed around a common underlying algorithm: a bridging mechanism—based on matrix factorization—that surfaces notes indicating misinformation only when they receive support from diverse ideological groups rather than simple majority support. Although this algorithm is designed to be robust against traditional brigading, we demonstrate an attack showing that coordinated users can strategically fabricate diverse agreement in the system’s latent space to manipulate the scoring algorithm. We validate this attack on real-world production data and find that a surprisingly large number of notes’ scores can potentially be manipulated with a small number (< 10) of coordinated votes, raising the risk that adversaries could surface arbitrary notes on these social media platforms. We complement these findings with a theoretical analysis of voting strategies that surface arbitrary notes, revealing counterintuitive properties of the system: for instance, rating a note as “Not Helpful” can increase its helpfulness score. Finally, we develop a cost model quantifying manipulation effort and discuss potential mitigations. Following a responsible disclosure process, X's Community Notes team acknowledged this attack and has deployed mitigations based on our findings. We hope this work spurs further research into the robustness of crowdsourced fact-checking systems and, more broadly, bridging-based consensus mechanisms.

Abstract:
Selfreflection is a powerful mechanism for credit assignment in human learning, converting sparse outcome feedback into actionable guidance. However, its potential for post-training Large Language Models (LLMs) remains underexplored. We propose Self-Reflective Policy Optimization (SRPO), a framework that internalizes this capability. SRPO enables LLMs to analyze their own completed trajectories, synthesize errors into concise "reflection patches," and use these reflection-conditioned rollouts as high-quality, on-policy distillation targets. This process effectively transforms sparse terminal supervision into dense, token-level learning signals without requiring external critics, separate reward models, or larger teacher models. We demonstrate that SRPO achieves state-of-the-art performance across mathematical reasoning and long-horizon agentic benchmarks with exceptional data efficiency. Using a Qwen3-8B base model, SRPO attains 73.3% on AIME’24 using only 8% (0.08×) of the training FLOPs required by scaled supervised fine-tuning, while significantly improving success rates on WebShop (64.7%), ALFWorld (76.8%), and SWE-Bench-Lite (31.2%).

Abstract:
Interpreting residual architectures as discretizations of Ordinary Differential Equations characterizes the forward pass as a continuous latent flow. Although this framework defines the mechanics of inference, conventional training paradigms primarily constrain the terminal state, leaving the intermediate evolution unregulated. In this work, we formalize the forward pass as a Conditional Discriminative Flow and investigate its intrinsic kinematic laws. Using Tuned Lens analysis, we discover that the accumulation of semantic evidence follows a consistent exponential schedule. This finding confirms that deep models naturally require an extended phase of feature construction prior to a rapid transition toward categorical certainty in the terminal layers. Standard training ignores this latent progression. To resolve this impedance mismatch, we propose Deep Trajectory Supervision, a framework that aligns auxiliary supervision with this intrinsic exponential bias. By rectifying the trajectory of the inference flow, DTS functions as a critical physical inductive bias. Empirical evaluations on ImageNet1K and various benchmarks demonstrate that DTS significantly accelerates convergence and improves terminal performance.

Abstract:
Existing approaches for multimodal variational autoencoders (VAEs) face a tradeoff between generative quality and coherence—i.e., they struggle to generate realistic and diverse samples that, at the same time, are semantically consistent across modalities. A recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the SOTA MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first implementation ofHölder pooling without any approximationfor multimodal VAEs; (ii) an extended architecture that modelsdistinct shared and private(i.e., modality-specific) representations (Hölder+); and (iii)hierarchical inferencethat further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves the generative quality-coherence trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.

Abstract:
Verification of model outputs is rapidly emerging as a key primitive for both training and realworld deployment of large language models (LLMs). In practice, this often involves using imperfect LLM judges and reward models since ground truth acquisition can be time-consuming and expensive. We introduce Fully Unsupervised Score Ensembling (FUSE), a method for improving verification quality by ensembling verifiers without access to ground truth correctness labels. The key idea behind FUSE is to control conditional dependencies between verifiers in a manner that improves the unsupervised performance of a class of spectral algorithms from the ensembling literature. Despite requiring zero ground truth labels, FUSE typically matches or improves upon semi-supervised alternatives in test-time scaling experiments with diverse sets of generator models, verifiers, and benchmarks. In particular, we validate our method on both conventional academic benchmarks such as GPQA Diamond and on frontier, unsaturated benchmarks such as Humanity's Last Exam and IMO Shortlist questions.

Abstract:
Multistep prompt injection attacks on LLM agents present a fundamental detection challenge because malicious intent emerges only after workflows complete, while individual actions remain legitimate in isolation. Existing defenses, including input sanitization, output validation, and instruction hierarchy, operate on individual actions or content patterns and cannot capture this sequential structure. We present \textttCausalTrace, a detection system that reframes prompt-injection defense as causal inference. It constructs Structural Causal Models from agent trajectories with typed edges capturing data dependency, trust transfer, and state enablement, then applies Pearl's do-calculus to answer a counterfactual question, namely whether the harmful outcome would have occurred if the injection had been blocked. This formalization enables a principled distinction between attacks that depend on injections and benign workflows that share surface-level features. Evaluation on a dataset spanning crowdsourced traces, LLM agent benchmarks, and semi-real and real scenarios demonstrates strong detection performance, outperforming content-based baselines while requiring minimal LLM inference cost. Bidirectional slicing recovers complete attack chains, providing interpretable explanations that trace exploitation to its causal origins.

Abstract:
Aligning large language models (LLMs) to diverse human preferences is fundamentally challenging since criteria can often conflict with each other. Inferencetime alignment methods have recently gained popularity as they allow LLMs to be aligned to multiple criteria via different alignment algorithms at inference time. However, inference-time alignment is computationally expensive since it often requires multiple forward passes of the base model. In this work, we proposeinference-aware meta-alignment(IAMA), a novel approach that enables LLMs to be aligned to multiple criteria with minimal computational overhead at inference time. IAMA trains a base model such that it can be effectively aligned to multiple task optima via different inference-time alignment algorithms. To solve the non-linear optimization problems involved in IAMA, we proposenon-linear GRPO, which provably converges to the optimal solution in the space of probability measures.

Abstract:
We study the \maxecc\ problem, where given an edgecolored hypergraph with k colors and edge size r, we seek to color the vertices of the graph in order to maximize the number of satisfied edges (edges having the same color as their extremities): this is an effective mechanism for clustering (coloring) objects based on their multi-way interactions with one another in a system, providing significant applications in machine learning, clustering, and data mining. We exponentially improve upon the approximation ratio of an existing algorithm, by Crane et al. present in ICML 2025, to \frac1r+1, present another novel dependent-rounding algorithm with an approximation ratio of 1/ \lceil \frack2\rceil, and modify the initial algorithm via analytical scaling techniques in order to achieve an approximation factor of (1-e^-r)/r. We then apply our scaling algorithm to graph \maxecc\ and improve the best-known approximation factor for all hypergraphs: in particular, our algorithm provides an approximation factor of 0.43 as opposed to the previously-known 0.38 factor for graphs.

Abstract:
In this paper, we address the standard Karmed multi-armed bandit (MAB) with M heterogeneous data sources, each exhibiting unknown and distinct noise variances, \sigma_j^2. We propose SOAR (Source-Optimistic Adaptive Regret Minimization), a novel algorithm that adaptively balances exploration and exploitation by jointly constructing upper confidence bounds for arm rewards and lower confidence bounds for data source variances. Our theoretical analysis establishes that SOAR achieves a regret bound of \tildeO\left(\sigma^^2 \sum_i=2^K \tfrac1\Delta_i\right), along with a preprocessing cost that depends only on the problem parameters \\\sigma_j\\_j = 1^M, K, and grows at most logarithmically with the horizon T; where \sigma^\^2 is the minimum source variance, and \Delta_i denotes the suboptimality-gap of the i-th arm reward. The \tilde O(.) notation hides the polylogarithmic factors in these problem parameters. This near-optimal instance dependence regret analysis of SOAR underscores its effectiveness in dynamically managing heteroscedastic noise without incurring significant overhead. Experiments on synthetic problem instances and a real dataset (MovieLens 25M) demonstrate that our method significantly outperforms baseline bandit algorithms in terms of regret performance. Our work opens a new direction for adaptively leveraging multiple heterogeneous data sources, extending beyond traditional bandit frameworks.

Abstract:
We address the challenge of unsupervised coreset selection for fewshot in-context learning (ICL). The goal is to select a small subset of examples under a fixed annotation budget to yield effective prompts for large language models. Existing geometry-based methods often yield coresets that suffer from a skewed distribution, due to the oversampling of peripheral examples and high local redundancy. To address these issues, we propose a process-aware framework for coreset selection. It jointly optimizes the diversity and representativeness of selected samples via a submodular objective. It ensures representativeness by selecting samples based on local density awareness, while promoting diversity by imposing a redundancy penalty relative to the evolving selected set. Thus, it performs progress-aware balancing of representativeness and diversity based on the selection context. Extensive experiments on 7 NLP datasets demonstrate that our method consistently outperforms state-of-the-art coreset selection methods in downstream ICL performance. Further analysis validates that our approach better balances diversity and representativeness throughout the selection process, while retaining the theoretical guarantees of submodular optimization.

Abstract:
Diffusionbased generative image steganography converts the input single secret image into noise, and generates the stego image with it serves as the initial noise. Nevertheless, existing methods exhibit three severe limitations: (1) the fixed hiding space constrains their capacity to one secret image; (2) severe inter-secret interference arising from substantial information divergence among multiple secret images while concealing them within a shared hiding space; (3) security risks owing to the absence of the receiver-side verification mechanism. To systematically address these issues, this paper proposes a novelReceiver Authenticable Generative Image Steganography frameworkbased on diffusion models. We introduce aDynamic Cover Selection and Optimization Engineto adaptively allocate suitable hiding spaces for different secret images. This design permits the concealment of disparate secret images (or fragments of a single image) into separate spaces, enabling dynamic multi-image concealment while effectively preventing inter-secret interference and expanding capacity through the enlarged hiding spaces. Furthermore, aSignature Authentication Controllercryptographically signs the secret container after concealing and verifies it before extraction, ensuring secure receiver isolation and precise localization of the secret data container. Experiments demonstrate that the proposed framework achieves superior secure multi-receiver isolation and high-performance generative image steganography with large capacity.

Abstract:
Longcontext LLM inference faces a fundamental conflict: head-adaptive compression algorithms (e.g., Top-p nucleus sampling) offer superior accuracy by dynamically fluctuating memory budgets, yet modern inference engines (e.g., vLLM) demand rigid, static memory patterns to leverage CUDA Graphs and PagedAttention. We resolve this ``Static-Dynamic'' mismatch with HARD-KV, a unified framework that that bridges dynamic selection with rigid system constraints. HARD-KV introduces a Cascade Cache hierarchy, managing the token lifecycle across dense, sparse, and condensed tiers. Crucially, we propose a Logits Calibration mechanism that normalizes diverse importance metrics into a unified probability space, enabling consistent Top-p budgeting across heterogeneous heads. To bridge the efficiency gap, we offer a system-level solution, which rewrites fragmented, dynamic indices into contiguous physical layouts compatible with high-performance inference engine. Extensive experiments on math-reasoning benchmarks (AIME, U-Math) verify that HARD-KV achieves up to 2× throughput improvement over static baselines while maintaining high-fidelity generation in 10k+ token scenarios. Our code will be made publicly available.

Abstract:
Crossdomain class-incremental learning (CD-CIL) requires models to continuously acquire new classes across shifting domains while retaining previously learned knowledge. Existing approaches often entangle what to update with how to update, resulting in unstable adaptation and severe forgetting under domain shifts. Inspired by the hippocampal learning mechanism that separates rapid adaptation from stable consolidation, we propose Parameter-Masked Decoupled Optimization (PMDO) that disentangles what knowledge is adapted from how learning proceeds in cross-domain class-incremental learning. Specifically, we introduce a domain-aware knowledge decoupler that selectively adapts domain-relevant shared parameters, constraining incremental updates while preserving prior representations. To regulate how learning proceeds, we further design a stability-aware trajectory regulation that guides optimization along transferable and stable optimization trajectories, thereby reducing interference across domain transitions. As a result, PMDO enables effective cross-domain adaptation while mitigating catastrophic forgetting and maintaining long-term learnability. Extensive experiments across multiple benchmarks demonstrate the effectiveness of PMDO and its superiority over state-of-the-art methods.

Abstract:
Many visual explanation methods in computer vision highlight pixel importance but struggle to link these lowlevel cues to semantically meaningful concepts, limiting their interpretability and trustworthiness. We introduce Concept-based Explanations (ConEx), a novel framework that bridges saliency visualization with concept-based reasoning to provide both localized and global interpretability. ConEx automatically discovers class-specific concepts and represents them through concept activation vectors (CAVs), learned without manual supervision using an architecture-specific masking mechanism that reduces noise introduced by the segmentation masks to enhance concept purity. Locally, ConEx generates saliency maps that reveal where each concept appears in the image and how it contributes to the prediction; globally, it identifies the most influential concepts for each class. To evaluate the reliability of these learned concepts, we propose two complementary metrics, Vector-Concept Match (VCM) and Concept-Class Match (CCM), that quantify concept alignment and enable direct comparison with existing methods. Extensive experiments across diverse datasets and architectures demonstrate that ConEx achieves state-of-the-art performance on faithfulness, segmentation, and concept-quality benchmarks. Human studies further confirm that the discovered concepts are interpretable, distinctive, and aligned with human understanding. Overall, ConEx advances the field toward truly interpretable and concept-grounded explanations in vision models.

Abstract:
Large Language Models (LLMs) frequently memorize sensitive training data, creating significant privacy and copyright risks. We present a novel unlearning framework rooted in the principle that learning is easier than forgetting. We first introduce Divergence Decoding (DD), a mechanism that uses small, efficiently trained auxiliary models to steer the logits of the LLM away from specific data during inference. We then demonstrate this steered distribution can be trivially distilled back into the base model. Our method decisively outperforms stateof-the-art (SOTA) baselines on TOFU and MUSE benchmarks, and we find evidence of generalization in the domain of images. \hrefhttps://anonymous.4open.science/r/targetedunlearningicml2026/Code is available at this anonymous link.

Abstract:
Finetuning large language models (LLMs) is compute-intensive and error-prone: model performance depends sensitively on data quality and hyperparameter choices, and naïve runs can even degrade model performance. This raises a fundamental question: Can we predict fine-tuning performance before training begins? We present TuneAhead, a lightweight framework for pre-hoc prediction of fine-tuning performance. TuneAhead encodes each fine-tuning run as a meta-feature vector that combines static dataset descriptors with dynamic probe features from a short simulated run. A gradient-boosting predictor maps these features to performance predictions, while SHAP-based attributions provide interpretable diagnostics that reveal which specific features are driving performance. Across 1,300+ fine-tuning runs on Qwen2.5-7B-Instruct, TuneAhead consistently outperforms strong baselines such as ProxyLM and Early-Stop Extrapolation. On a held-out test set of 370 runs, by defining ‘success’ as exceeding a performance threshold, it accurately predicted 89.4% of successful runs (110/123) and 91.0% of failure runs (225/247), enabling practitioners to proactively avoid costly unsuccessful runs before training begins. This leads to computational savings of 58.4% in total.

Abstract:
Automating User Interface (UI) generation substantially improves productivity and accelerates development by reducing engineering time and manual effort. Despite recent progress of Large Language Models (LLMs) in UIto-Code, most existing approaches focus on a single HTML/CSS form and fail to systematically incorporate front-end frameworks such as React, Vue, and Angular. Moreover, their outputs are often verbose and hard to reuse at the component level. To address those issues, we propose Deterministic Component Mining (DCM) method, a multi-stage pipeline that couples MLLM prompting with a compact intermediate representation to enable multi-framework and component-oriented generation. Firstly, a lightweight structure model predicts the representation of DOM tree in JSON format capturing the coarse layout from webpage screenshot. Subsequently, we formulate deterministic rules to normalize the predicted DOM tree and mine reusable components and repetitive patterns via hashing and clustering, thereby yielding a portable intermediate representation. Finally, we employ a framework-conditioned LLM prompt governed by a binding specification and a file-block protocol to emit HTML/React/Vue/Angular code with explicit component props and repeat constructs. Experiments demonstrate that DCM significantly outperforms baselines on automatic evaluation metrics and component-level reuse, while delivering consistent gains in multi-framework portability and code structural quality.

Abstract:
Unlearnable examples (UEs) protect training data by injecting imperceptible perturbations so that models fail to extract exploitable representations. In this paper, we reveal that existing UEs exhibit a critical failure once lowpass filtering is applied, indicating that the effective perturbation signals for unlearnability concentrate predominantly in high frequencies. Hence, we argue that reliable UEs should remain effective across the full spectrum. To this end, we proposeFull-spectrumUnlearnable Examples viaSpectralEqualization (FUSE), which aims to generate spectrum-agnostic perturbations by equalizing the contributions from different bands and enforcing cross-band consistency. Specifically, FUSE adopts a Random Spectral Masking (RSM) strategy during generator training, which randomly removes a contiguous frequency band, forcing the remaining bands to maintain unlearnability. In addition, FUSE further integrates Cross-Band Guidance (CBG), which enforces mutual consistency between high- and low-frequency components, thereby further enhancing low-frequency unlearnability and regulating high-frequency perturbations to preserve the semantic fidelity of images. Extensive experiments across multiple datasets, architectures, and spectral filtering demonstrate the strong protection achieved by FUSE.

Abstract:
The progressive scaling of large language models (LLMs) has consistently enhanced multimodal understanding and advanced reasoning capabilities, but has substantially increased computational and hardware execution overhead. In this paper, we present SQuant, a novel post-method that compresses only model weights. We partition each weight tensor into fixed-size blocks and assign a single seed to each block. The seed drives a hardware-friendly Linear Feedback Shift Register (LFSR) generator that dynamically produces multiple basis matrices. Each block is then reconstructed as a linear combination of these basis matrices, with block-specific coefficients, which substantially reduces the amount of stored data, increases the data-transfer efficiency between memory and compute units, and consequently speeds up memory-bound inference for large language models. Experimental results on different LLM models ranging from 7B–70B parameters show that S-Quant attains state-of-the-art performance when weights are compressed to approximately 3-bit or 4-bit. We also design a dedicated ASIC accelerator that achieves a 4× speed-up for memory-bound LLM inference.

Abstract:
Multiview clustering (MVC) is a fundamental task in heterogeneous data analysis, where anchor-based graph methods are widely adopted for their computational efficiency. However, existing approaches typically utilize static, single-layer anchors, failing to capture the multi-granularity nature of complex data. Drawing inspiration from hierarchical human cognition, we propose a hierarchical anchor graph learning method, termed HAG-MVC, a novel framework that organizes multi-view data as a multi-level pyramid. Unlike conventional one-shot anchor generation methods, HAG-MVC introduces a multi-level co-evolution mechanism, where anchors and graph structures are iteratively refined together to capture semantics from fine-to-coarse granularities. Moreover, HAG-MVC offers a transparent abstraction architecture as an alternative to black-box deep clustering: by maintaining all anchors within the original feature space, it enables explicit inspection of the abstraction process, ensuring inherent interpretability. Extensive experiments on benchmark datasets demonstrate that HAG-MVC consistently outperforms state-of-the-art methods. Beyond MVC, this work provides a scalable and trustworthy paradigm for hierarchical knowledge representation in broad machine learning tasks.

Abstract:
Autofocus for spike cameras is challenging because their sparse binary measurements do not provide reliable instantaneous gradients, and noise or illumination drift often breaks the unimodal assumptions behind conventional focus measures. We show that during a focus sweep, the stable sensorobservable cue is a persistent migration of spectral energy in the frequency domain: energy shifts outward toward higher frequencies when approaching focus and recedes under renewed defocus. Building on this observation, we propose CEN (Centroid-based Energy Navigation), a frequency-domain autofocus method that measures spectral migration via a bounded spectral centroid computed on accumulated spike blocks, without image reconstruction or explicit edge extraction. To handle multi-peak and irregular responses in real scenes, CEN further performs structure-consistent response identification, selecting the frequency bound whose curve exhibits a clear, localized, interior extremum, followed by robust peak localization using a weighted near-maximum centroid. Experiments on spike-camera dataset demonstrate that CEN achieves the best overall accuracy and response discriminability across diverse scenes, motion types, and illumination variation patterns.

Abstract:
Infrared and visible video fusion is essential for achieving comprehensive perception in dynamic scenes. However, maintaining temporal consistency remains a formidable challenge. Conventional methods relying on optical flow often suffer from geometric rigidity and ghosting artifacts. Moreover, standard diffusionbased fusion models typically operate in a frame-by-frame manner; when extended to autoregressive settings, they lack intrinsic temporal constraints and are prone to severe error accumulation and drifting, where minor artifacts amplify over time. To address these limitations, we propose a drift-resilient video fusion method that reformulates the task as history-conditioned motion generation. We introduce Stabilized History Guidance and Soft Temporal Anchoring to reframe temporal consistency as spectral filtering, implicitly aggregating motion dynamics without rigid alignment. Furthermore, our Decoupled Structure-Motion Adaptation strategy bridges pre-trained priors and structural constraints via two-stage training and latent refinement. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both fusion quality and temporal stability.

Abstract:
Despite the success of LVLMs, general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal:it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with the expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a selfconditioned generation process. Based on this, it integratesmulti-dimensional rewardswithvicinal geometric shapingvia variational reinforcement learning, thereby facilitating reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, particularly setting new SOTA records on V Bench (90.6%) and MME-RealWorld-lite (67.0%).

Abstract:
Direct Preference Optimization (DPO) has become a predominant approach for aligning large language models with human preferences. Recent work has used perplexity differentials to identify unreliable preference labels, but these methods apply uniform calibration strategies across all samples. We observe that the reliability of perplexity signals varies substantially across task types: perplexity differentials strongly correlate with preference quality for factual tasks but provide weak signals for creative tasks where novelty is valued. Based on this observation, we propose TaskAware Preference Calibration (TAPC), which learns task-conditioned calibration functions that adapt to the characteristics of different prompt types. TAPC employs a task encoder to extract prompt representations and learns task-specific slope and bias parameters for mapping perplexity signals to confidence targets. Through meta-learning on a small reference dataset, TAPC discovers how to weight perplexity signals appropriately for each task category. Experiments on Llama-3-8B and Qwen2-7B demonstrate that TAPC outperforms existing methods across multiple benchmarks, with particularly large improvements on creative and open-ended tasks where uniform calibration strategies fail.

Abstract:
In reinforcement learning, human preference feedback is emerging as a viable alternative to expertdesigned reward functions, which can be difficult to engineer in real-world problems. However, despite the growing importance of preference feedback, how to effectivelyelicitpreferences remains a fundamental open problem. This work focuses on thecompatible reward identificationtask. The aim is to derive, starting from preference feedback, a reward function compatible with the observed preferences and accurate across theentirestate-action space, ensuring higher transferability, safety, and interpretability. Indeed, the most commonreinforcement learning from human feedbackobjective is to learn theoptimal policy, requiring accuracy only in the portion of the state-action space that the agent visits. However, this goal cannot provide the same guarantees as compatible reward identification. First, we discuss commonalities and differences between the two goals. Then, we consider deterministic preferences, deriving the minimum number of interactions needed to identify the set of compatible rewards, and showing that using fewer queries may lead to arbitrarily large suboptimality. Finally, we focus on stochastic preferences generated via the Bradley-Terry (BT) model. We introduce the concepts ofquery basisand itsindex, relating them to the problem complexity. Upon this, we discuss the connection between the index of a basis and the BT model, as well as the limitations that the model induces in this setting. Additionally, we devise an algorithm to identify a nearly-optimal query basis with polynomial human query complexity.

Abstract:
This work investigates several fundamental tasks, including \mathsfMaxSum, \mathsfMinSum, \mathsfMaxSelect, and \mathsfMinSelect, in the continual release model under differential privacy. Previous research has demonstrated that any algorithm for these tasks must admit a large purely additive error. We show that the error can be substantially reduced if a relative error term is allowed, provided that the input stream is generated nonadaptively. However, when input data records can be selected adaptively, we prove that a large error is inevitable for the task of selecting an attribute with a small cumulative sum, whereas small error bounds remain achievable for other tasks. This reveals a significant separation between non-adaptive and adaptive streams. We also complement our algorithms with nearly matching lower bounds.

Abstract:
Autonomous agents for machine learning engineering have advanced rapidly, yet comparing their effectiveness remains difficult. Existing systems combine different techniquesmulti-agent decomposition, iterative refinement, memory management, and planning---in varying configurations, making it unclear which components actually drive performance. Complicating evaluation, existing benchmarks rely on historical competitions whose data likely contaminates LLM training corpora and whose static baselines reflect outdated human performance. To address this, we conduct over 4,000 controlled experiments systematically ablating architectural components, alongside K-live, a new benchmark of 25 active competitions that provides a contamination-free, dynamic evaluation environment. Our findings challenge common design assumptions: iterative feedback contributes more than architectural complexity, and multi-agent coordination can hurt as often as it helps. These results provide concrete guidance for practitioners building ML engineering agents.

Abstract:
The proliferation of recursive synthetic data training promises to alleviate data scarcity but introduces the existential risk of model collapse, wherein recursive training on synthetic data erodes distributional tails and homogenizes outputs. Current literature identifies data selection as a pivotal solution, employing verifiers to prune datasets in pursuit of synthetic samples that approximate the true data manifold. However, this approach hinges on the fragile and often unrealistic assumption that a perfect verifier possesses global distributional knowledge. In realworld scenarios characterized by data silos, such as fragmented healthcare consortia or proprietary financial institutions, this assumption is invalidated by the inherent fragmentation of knowledge. We theoretically prove that such siloed selection accelerates model collapse, driving diversity decay governed by a power law. To bridge this gap, we propose an automated filtering criterion that synergizes the sensitivity theorem with Wasserstein geometry. Specifically, multiple parties collaboratively compute geodesic interpolations and the Wasserstein Barycenter as proxy measures, without exchanging raw data. These proxies serve as a collective reference, enabling multiple parties to score synthetic data rather than relying on a single biased perspective in a data silo. Empirical results show the failure of the baseline on skewed distributions, whereas our methods effectively prevent collapse. Code available atAnonymous Github.

Abstract:
Generative sequence modeling faces a fundamental tension between the expressivity of Transformers and the efficiency of linear sequence models. Existing efficient architectures are theoretically bounded by shallow, singlestep linear updates, while powerful iterative methods like Test-Time Training (TTT) break hardware parallelism due to state-dependent gradients. We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form. We employ a Write-Forget Decoupling strategy that isolates non-linearity within the injection operator. To bypass the serial dependency of explicit solvers, PRISM utilizes a two-stage proxy architecture: a short-convolution anchors the initial residual using local history energy, while a learned predictor estimates the refinement updates directly from the input. This design distills structural patterns associated with iterative correction into a parallelizable feedforward operator. Theoretically, we prove that this formulation achieves Rank-L accumulation, structurally expanding the update manifold beyond the single-step Rank-1 bottleneck. Empirically, it achieves comparable performance to explicit optimization methods while achieving 174x higher throughput. Codes are available in \hrefhttps://anonymous.4open.science/r/msir-F607/anonymous.4open.science.

Abstract:
By synergizing graph topology with the global expressive power of the attention mechanism, Graph Transformers (GTs) have emerged as a dominant architecture for node classification. However, existing models primarily focus on diverse topology injection mechanisms, specifically scorelevel and representation-level designs, yet lack a unified theoretical foundation to characterize how these mechanisms shape the representation propagation. To bridge this research gap, this paper unifies these designs under a common Graph Signal Denoising framework, revealing that denoising efficacy (i.e., representation quality) is fundamentally dictated by the block-diagonal structure of the propagation operator. To instantiate this prior efficiently, this paper introduces a novel Block-Diagonal GT architecture, named \textscBDFormer, which enforces a block-diagonal constraint via spectral-regularized cross-attention on latent anchors. Specifically, by routing global interactions through these anchors, \textscBDFormer imposes the spectral block-constraint directly on the anchor-level affinity. Crucially, the learned global affinity guides the pruning of local heterophilous edges, ensuring that both scales synergistically adhere to the target distribution. Extensive evaluations on benchmark datasets demonstrate the scalability and robustness of \textscBDFormer.

Abstract:
VisionLanguage-Action (VLA) models have shown strong potential for robotic manipulation, yet adapting pretrained models to novel tasks typically relies on substantial task-specific demonstrations, limiting scalability. Current VLA methods mostly focus on action imitation, which ignores the richer structure contained in trajectories. In contrast, motion dynamics governing how actions evolve over time are more informative and transferable, making them better suited for few-shot adaptation. Motivated by this idea, we propose DynVLA, a few-shot adaptation system that reformulates VLA learning from action imitation to trajectory-level motion dynamics modeling. Specifically, we propose Motion Dynamics Mechanism (MDM), which distills latent physical regimes from trajectories via flow-matching inversion, yielding compact representations that capture dynamics. We further design Dynamics-Constrained Modeling (DCM). DCM projects these inferred representations onto a Dynamics Bank, which stores prior motion knowledge pretrained from diverse demonstrations. By grounding action generation in these learned priors, the system enables interpolating between existing action paradigms to represent novel dynamics modes. Experiments on 13 real-world tasks demonstrate that DynVLA outperforms existing SOTA systems by 19% in average success rate with only 10-20 demonstrations, highlighting its adaptation capabilities in real-world scenes.

Abstract:
We study a stylized social learning dynamics where selfinterested agents collectively follow a simple multi-armed bandit protocol. Each agent controls an "episode": a short sequence of consecutive decisions. Motivating applications include users repeatedly interacting with an AI, or repeatedly shopping at a marketplace. While agents are incentivized to explore within their respective episodes, we show that the aggregate exploration fails: e.g., its Bayesian regret grows linearly over time. In fact, such failure is a (very) typical case, not just a worst-case scenario. This conclusion persists even if an agent's per-episode utility is some fixed function of the per-round outcomes: e.g., \min or \max, not just the sum. Thus, externally driven exploration is needed even when some amount of exploration happens organically.

Abstract:
Despite advances in Preference Alignment (PA) for Large Language Models (LLMs), mainstream methods like reinforcement learning with human feedback face notable challenges. These approaches require highquality datasets of positive preference examples, which are costly to obtain and computationally intensive. The LLM unlearning technique presents a promising alternative by directly removing the influence of negative examples. However, current research has primarily focused on empirical validation, lacking systematic quantitative analysis. To bridge this gap, we propose a framework linking PA with LLM unlearning. Through bi-level optimization, we first quantify how unlearning specific negative examples impacts PA performance. Our analysis reveals that these effects vary substantially across negative examples. Building on this insight, we pose a crucial question: how can we optimally select and weight negative examples for unlearning to maximize PA performance? To answer this, we propose Unlearning to Align (U2A), which leverages bi-level optimization to efficiently select and unlearn examples for optimal PA performance. We validate the proposed method through extensive experiments, with results confirming its effectiveness. Our code is available at https://anonymous.4open.science/r/U2A-9E75.

Abstract:
Incremental object detection (IOD) aims to extend detectors to new categories while retaining previously acquired knowledge. Existing methods often adopt a class incremental learning perspective, separating feature spaces to sharpen decision boundaries. However, this paradigm conflicts with the inherent object symbiosis in detection, where objects naturally cooccur or occlude one another, creating spatial and semantic dependencies that necessitate a shared feature space. Ignoring these dependencies distorts the shared representations, exacerbates confusion between old and new classes, and accelerates catastrophic forgetting. To address this, we propose Symbiosis-Inspired Knowledge Distillation (SIKD), which explicitly leverages object symbiosis at two complementary levels. Spatial Symbiosis Distillation (SpSD) focuses on symbiotic regions where the old model responds with high overlap to objects in the new task. It preserves generalizable old class cues, suppresses class-specific bias and redundancy, and distills the refined evidence to the new model at matched spatial locations with slot-aligned supervision. Semantic Symbiosis Distillation (SeSD) maintains class level structure by forming confidence weighted prototypes for old classes and aligning their inter class soft ranks over the old class logits, which stabilizes the semantic topology during adaptation. Extensive experiments demonstrate the effectiveness and superiority of the proposed method. Code is available in the supplementary material.

Abstract:
TokenPruning accelerates Vision-Language Models by removing redundant visual tokens, yet its safety implications remain underexplored. In this work, we present the first comprehensive safety evaluation of Token-Pruning mechanism and find that: Most pruning strategies significantly degrade safety as pruning ratios increase, whereas Query-based Compression shows the opposite, with extreme pruning (up to 99.8%), unexpectedly improves model safety. This sharp contrast prompts a key question: How do different Token-Pruning strategy reshape model's safety behavior, and is it possible to enhance safety without sacrificing acceleration? To answer this, we identify an unrecognized mechanism, termed Pruning-Induced Malicious Amplification, where removal of background tokens triggers a side effect: forcing model's attention to collapse onto a few retained malicious anchors within foreground, inadvertently amplifying their toxic semantics under jailbreak. To address that, we propose an inference-time and plug-and-play Safety-Aware Pruning (SAP) mechanism that counteracts such dominance via three steps: (1) identifying malicious anchors, (2) restoring pruned benign tokens, and (3) reallocating excessive attention from malicious anchors to benign tokens. Extensive experiments across 3 safety and 4 utility benchmarks demonstrate that SAP mitigates pruning-induced vulnerabilities, i.e., reducing ASR by up to 62%, without compromising efficiency or utility.

Abstract:
Traditional textbased person retrieval approaches typically rely on single-shot textual queries, which are generally incomplete or vague in real-world scenarios. Recently, chat-based person retrieval methods enable iterative query refinement via question-answering interactions between the system and users. However, these methods fall short of direct user interaction with retrieved candidates during conversation, making it challenging to effectively refine the retrieval results. To address these limitations, we propose multimodal interactive person retrieval (MInterPR), a new retrieval paradigm that allows users to iteratively refine retrieved candidates by providing feedback on visual differences from the target person. To support this task, we establish MInterPEDES, a multimodal conversational dataset constructed by augmenting existing question-answering dialogues with synthesized visual feedbacks. Furthermore, to tackle the challenge of accurate and efficient semantics understanding in multimodal dialogues, we propose a multimodal conversational memory-enhanced framework MNEMO, which incorporates an atomic turn encoding (ATE) module to model each dialogue turn independently, and a dialogue memory aggregation (DMA) module to capture the fine-grained information and cross-turn dependencies. Extensive experiments demonstrate that MNEMO achieves substantial improvements in both retrieval accuracy and generalization ability, highlighting its promising potential in real-world scenarios. The code and dataset will be released to facilitate future research.

Abstract:
Diffusion models have shown remarkable performance on diverse generation tasks. Recent work finds that imposing representation alignment on the hidden states of diffusion networks can both facilitate training convergence and enhance sampling quality, yet the mechanism driving this synergy remains insufficiently understood. In this paper, we investigate the connection between selfsupervised spectral representation learning and diffusion generative models through a shared perspective on perturbation kernels. On the diffusion side, samples (e.g., images, videos) are produced by reversing a stochastic noise-injection process specified by Gaussian kernels; on the spectral representation side, spectral embeddings emerge from contrasting positive and negative relations induced by random perturbation kernels. Motivated by this, we propose a self-supervised spectral representation alignment method to facilitate diffusion model training. In addition, we clarify how joint spectral learning can benefit diffusion training from a geometric perspective. Furthermore, we find that the optimization of the spectral alignment objective is in an equivalent form of diffusion score distillation in the representation space. Building on these findings, we integrate a spectral regularizer into diffusion training objectives to improve the performance of diffusion models on multiple datasets. Experiments across images and 3D point clouds show consistent gains in generation quality.

Abstract:
Diffusion models offer a powerful framework for sampling from complex probability densities by learning to reverse a noising process. A common approach involves solving for the timereversed stochastic differential equation (SDE), which requires the score function of the evolving sample distribution. The logarithm of this distribution's density is governed by a Hamilton-Jacobi-Bellman (HJB) type partial differential equation (PDE). However, current methods for solving this PDE, such as PINNs or trajectory-based techniques, often suffer from long training times and significant sensitivity to hyperparameter tuning. In this work, we introduce a novel and efficient solver for the underlying HJB equation based on the functional tensor train (FTT) format. The FTT representation leverages latent low-rank structures to efficiently approximate high-dimensional functions, enabling both model compression and rapid computation. By integrating this efficient representation with a backward-in-time iterative scheme derived from backward stochastic differential equations (BSDEs), we develop a fast, robust and accurate sampling method. Our approach overcomes primary bottlenecks of existing techniques, enabling high-fidelity sampling from challenging target distributions with improved efficiency.

Abstract:
Inaccurate Visual Anomaly Detection (VAD) can lead to critical failures in safetysensitive domains, including autonomous navigation and industrial surveillance. With the increasing abundance and rapid proliferation of VAD algorithms, their reliable evaluation has become increasingly important and challenging. Commonly used evaluation metrics often fail to capture practically relevant aspects of model behavior, yielding inconsistent or misleading assessments by overlooking errors such as redundant detections and the spatial distribution of false positives. In this paper, we formalize the requirements for VAD evaluation by introducing a set of axiomatic, verifiable properties that an evaluation metric should satisfy. Through a systematic analysis of state-of-the-art evaluation methods, we show that none satisfies all proposed properties. To address this gap, we introduce SAAM-ALARM, a novel evaluation metric that satisfies these properties. Our results show that SAAM-ALARM provides a more nuanced and theoretically sound assessment, establishing a stronger standard for performance benchmarking in VAD.

Abstract:
Graph Neural Networks (GNNs) are increasingly adopting the "pretraining, adaptation" paradigm, which first pre-train GNNs on large-scale unlabeled graph data and then adapt them to specific downstream tasks. As a common pattern, graph prompting adapts to the frozen encoder by modifying the input graph structure, rather than fine-tuning the model parameters. However, most existing graph prompting approaches optimize the continuous and weighted adjacency structure in the adaptation phase, while requiring a hard discretization at inference time. This difference causes a train-test mismatch which is particularly harmful in few-shot regimes. To address the issue, we proposeDiscretePrompting forGraphs, a discrete prompting framework that directly learns task-specific topology prompts in the combinatorial space. DiP-G operates on multi-hop local candidate subgraphs to ensure scalability, generates hard (k)-sparse prompts through a perturbed Top-(k) solver, and optimizes the discrete structures using an I-MLE gradient estimator. To improve the efficiency of backward pass, we further introduce an adaptive active-set screening rule that accelerates the target solve and can provably maintain the accuracy of the solution. Extensive experiments conducted on multiple benchmark datasets have validated the effectiveness of our proposed method. Our main code is available in the supplementary materials.

Abstract:
We show how an adversarial model trainer can plant backdoors in a large class of deep, feedforward neural networks. These backdoors are statistically undetectable in the whitebox setting, meaning that the backdoored and honestly trained models are close in total variation distance, even given the full descriptions of the models (e.g., all of the weights). The backdoor provides access to invariance-based adversarial examples for every input, mapping distant inputs to unusually close outputs. However, without the backdoor, it is provably impossible (under standard cryptographic assumptions) to generate any such adversarial examples in polynomial time. Our theoretical and preliminary empirical findings demonstrate a fundamental power asymmetry between model trainers and model users.

Abstract:
Transparent objects pose a fundamental challenge for depth estimation and 3D reconstruction due to their violation of Lambertian assumptions, leading to severe geometry degradation in downstream tasks. To address this, we propose a novel geometryguided preprocessing framework GHOST that leverages visual foundation models to transform transparent regions into opaque, structurally consistent representations without requiring downstream model retraining. Specifically, our pipeline utilizes (1) TransDINO and (2) TransDecomp to disentangle masks and transparency physical properties, while (3) DAF-Net recovers surface normal priors to encode geometric curvature. Subsequently, (4) GeoSemTransNet integrates these multi-modal cues to synthesize a texture-rich opaque RGB image that preserves the transparent object's 3D structure. Extensive experiments demonstrate that our method significantly enhances the accuracy of state-of-the-art depth estimation and reconstruction models on transparent objects by restoring essential photometric cues.

Abstract:
The Neural Tangent Kernel is a theoretical framework for understanding the training dynamics of neural networks. However, standard NTK and its variants fail to properly depict the finetuning of foundation models, as they neglect the preconditioning effects of adaptive gradients. To bridge this gap, we propose the Optimizer Aware Kernel (OAK), which incorporates the optimizer's influence into standard NTK framework by a preconditioner estimation technique. Furthermore, we conduct an analysis to answer: when and why kernel regime fails in finetuning. We derive explicit error bounds showing that the collapse of kernel regime is primarily due to the cumulative training effects and the task discrepancy between pretraining and finetuning. Theoretically, we justify OAK's preconditioner estimation by bounding its error term. Empirically, experiments on various model architectures show both the effectiveness of the OAK method and validity of our arguments on kernel regime collapse.

Abstract:
Pretrained Vision-Language Models (VLMs) such as CLIP achieve strong zero-shot generalization, but their performance degrades sharply under adversarial perturbations. Existing test-time adaptation methods typically rely on sample-level confidence heuristics, overlooking the intrinsic distributional structure of the data. This sample-centric approach limits robustness, as it fails to distinguish confident adversarial mispredictions from true semantic consistency. In this work, we observe that adversarial distortion is structurally brittle: while holistic representations are corrupted, semantic integrity is often preserved in the distribution of augmented views. Motivated by this insight, we propose \textttRITA, a Robust test-tIme prompT Adaptation framework that shifts from sample-level estimates to distribution-level alignment. Specifically, \textttRITA employs optimal transport to align the distribution of augmented visual features with textual prototypes, mitigating adversarial outliers and rectifying cross-modal semantic misalignment. Furthermore, we introduce a dynamic cache to progressively accumulate reliable cues from the test stream for online refinement. Extensive experiments demonstrate that \textttRITA significantly improves adversarial robustness without compromising clean accuracy.

Abstract:
Graph Neural Networks (GNNs) on meshes have emerged as promising surrogates for computational mechanics, but standard local message passing struggles to propagate information across unstructured meshes, leading to large errors in regions with complex physics (e.g., shocks, wakes, boundary layers). Existing approaches enlarge connectivity with longrange edges chosena priorivia geometric heuristics or random sampling, which lack a mechanism to prioritize high-error regions and often introduce redundant communication. We proposeMiSe-GNN, a dual-head architecture that adaptively augments graph connectivity using model-predicteda posteriorierrors. MiSe-GNN jointly predicts physical fields and a node-wise error indicator; the predicted error is periodically converted into a hierarchy of additional edges via an adaptive tree that links high-error nodes to spatial pivots at multiple scales. This error-guided connectivity concentrates message passing where the surrogate is uncertain while keeping the graph sparse elsewhere, yielding a transparent and physically interpretable graph-space analogue of adaptive mesh refinement. Across industrial CFD and CSD benchmarks, MiSe-GNN consistently improves accuracy and accuracy–compute trade-offs over strong baselines, and qualitative analyses show that it routes communication toward physically challenging regions. These results establisherror-guided edge augmentationas a robust and general design principle for long-range message passing in physics-aware GNNs.

Abstract:
Audiovisual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, human-centric scene understanding. This capability is essential for real-world applications such as intelligent video editing, surveillance, and human–computer interaction. However, existing datasets are largely limited to simple or homogeneous audio-visual scenes with coarse annotations. Such oversimplified settings bias evaluation toward static audio–visual co-occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross-modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human-centric audio-visual instance segmentation (AVIS) dataset designed for dynamic real-world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes. Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human-centric audio-visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research.

Abstract:
Generalization under distribution shift remains a core challenge in modern machine learning, yet existing learning bound theory is limited to narrow, idealized settings and is nonestimable from samples. In this paper, we bridge the gap between theory and practical applications. We first show that existing definition of concept shift breaks when the source and target supports mismatch. Leveraging entropic optimal transport, we propose a key notion: γ-concept shifts, and derive a general error bound unifying covariate and γ-concept shifts, which applies to broad loss functions, label spaces, and stochastic labeling. We further develop estimators for these shifts with concentration guarantees, and the DataShifts algorithm, which can quantify distribution shifts and estimate the error bound in most applications - a rigorous and general tool for analyzing learning error under distribution shift.

Abstract:
Learning from continuoustime trajectories requires modeling multivariate sensor measurements generated by underlying physical or dynamical processes. Under extreme data compression and heterogeneous sampling, directly optimizing synthetic signals as discrete sample values becomes fundamentally misaligned with the underlining \emphcontinuous-time physical processes, often producing high-frequency, non-physical artifacts that overfit specific models and break reuse across architectures and sampling rates. We propose CHESS, a \emphfunction-first synthesis framework shifts optimization from discrete samples to underlying continuous-time signal trajectories. CHESS injects physics-induced structure by jointly enforcing low-rank spatial coherence and piecewise Chebyshev polynomial temporal parameterization, constraining synthesis to a physically meaningful function manifold. We provide theoretical analysis establishing explicit smoothness and stability guaranties. Experiments on diverse sensor testbeds under the dataset distillation protocol demonstrate CHESS consistently outperform state-of-the-art methods with a compression ratios up to 133× for each synthetic sample. Furthermore, CHESS exhibits strong cross-architecture generalization and enables zero-shot adaptation across different sampling resolutions.

Abstract:
While existing methods for reconstructing hand–object interactions have made impressive progress, they either focus on rigid or partwise rigid objects—limiting their ability to model real-world objects (e.g., cloth, stuffed animals) that exhibit highly non-rigid deformations—or model deformable objects without full 3D hand reconstruction. To bridge this gap, we present PhysHandi (Physics-based Reconstruction of Hand and Deformable Object Interactions), a framework that enables full 3D reconstruction of both interacting hands and non-rigid objects. Our key idea is to physically simulate object deformations driven by forces induced from densely reconstructed 3D hand motions, ensuring that the reconstructed object dynamics are both physically plausible and coherent with the interacting hand movements. Furthermore, we demonstrate that such simulation of object deformations can, in turn, refine and improve hand reconstruction via inverse physics. In experiments, PhysHandi outperforms the state-of-the-art baseline across reconstruction and future prediction.

Abstract:
Spectral clustering, a widelyused technique for graph-based data partitioning, faces a severe computational bottleneck due to its O(n^3) time complexity. While anchor-based approximations reduce the complexity to O(nm^2) (m \ll n), they often yield degenerate solutions in the absence of explicit cluster-size control. To address this limitation, we propose SC-FAGC (Size-Constrained Fast Anchor Graph Clustering), a unified formulation that integrates entropy regularization and bilateral cardinality constraints within an anchor-based spectral clustering framework. Our model simultaneously promotes cohesive clusters and enforces soft lower and upper bounds on cluster sizes, thus avoiding trivial or highly unbalanced partitions. To solve the resulting non-convex optimization problem efficiently, we develop an Iteratively Re-weighted (IRW) optimization scheme, which sequentially linearizes the objective and solves each subproblem via a Double-Bounded Optimal Transport (DB-OT) solver based on the Sinkhorn--Knopp algorithm. This approach guarantees convergence while maintaining scalability. Extensive experiments on benchmark datasets demonstrate that SC-FAGC consistently achieves state-of-the-art performance in terms of accuracy, purity, and recall, while strictly satisfying the prescribed cluster-size constraints. Thus the proposed method offers a principled and scalable solution for large-scale graph clustering with controllable partition structure.

Abstract:
Multimodal object Re-Identification (ReID) aims to retrieve the same object across different modalities by exploiting their complementary visual information. Recent advances leverage Multi-modal Large Language Models (MLLMs) to generate descriptive textual annotations as auxiliary supervision. However, existing approaches usually adopt these generated texts directly, overlooking the varying correspondence degrees between visual and textual modalities. Such neglect may lead the model to treat strong- and weak-correspondence image–text pairs equally, limiting its ability to learn discriminative associations and hindering effective optimization. To overcome this limitation, we propose a Correspondence Cognitive Learning (CCL) framework that explicitly models the correspondence degree and facilitates a progressive learning process from easy to hard pairs. CCL is composed of two synergistic modules. The Correspondence-Guided Semantic Refinement (CGSR) module dynamically refines visual representations using text semantics according to the correspondence difficulty estimated from the previous epoch, thereby enhancing feature alignment under imperfect associations. The Cognitive-Driven Dynamic Optimization (CDDO) module presents a self-paced weighting mechanism that adaptively adjusts the optimization focus by emphasizing easy pairs at the early stage and gradually integrating harder ones as training evolves. Together, these modules enhance feature-level alignment and optimization adaptivity, yielding robust and discriminative multi-modal representations. Extensive experiments on three multi-modal object ReID benchmarks demonstrate the superior performance of our method.

Abstract:
This paper challenges the assumption that visionlanguage models (VLMs) require fixed patch-based 2D vision features. Analyzing fine-tuned vision encoders, we find that representations become increasingly abstract and less spatially coherent during VLM training. Notably, models trained with image-text alignment (such as SigLIP2) develop a small number of specialized tokens that effectively summarize global image content. Building on this, we introduce RADIO1D, which compresses images into a compact, variable-length 1D token sequence using multi-teacher knowledge distillation and an autoencoder design. The resulting representations exhibit strong hierarchical summarization, enabling accurate scene understanding–even with a single token–and support improved composition-aware image retrieval. In VLMs, RADIO1D provides flexible accuracy-efficiency tradeoffs through adjustable token counts, delivering competitive performance on diverse multimodal benchmarks with lower computational overhead and better accuracy. We release our models under a permissive license.

Abstract:
Existing referring video object segmentation (RVOS) is largely built on passive perception and assumes the target is already visible in the observed video, which limits realworld use when queries refer to objects beyond the current view. To address this gap, we introduceReferring Video Active Exploration and Segmentation(RVAS), a new task that focuses on reasoning about exploration policy and then locating and segmenting the object according to an input referring expression. To support RVAS, we build a large-scale dataset with manually annotated exploration actions and reference reasoning traces, enabling supervised training and evaluation. We benchmark representative RVOS and related video understanding baselines and find that they struggle to perform active target search and incur substantial overhead when coupled with online decision making. Motivated by these challenges, we propose LESA, a baseline framework that introduces a state controller and hierarchical memory for efficient streaming processing and sparse MLLM reasoning. LESA substantially reduces inference cost while maintaining competitive planning quality, and consistently improves segmentation accuracy on the RVAS dataset.

Abstract:
We argue that highquality motion data can steer tracking policies toward better optimization trajectories early in training. In this work, we introduce LIMMT (Less Is More for Motion Tracking). To our knowledge, this is the first data-centric study for physics-based humanoid motion tracking. We go beyond simply removing erroneous clips. We define motion data quality through three dimensions: physics feasibility, diversity, and complexity. We show that training with under 3% of AMASS yields better tracking performance than training with the full dataset. Extensive experiments and analyses validate the effectiveness of our framework. We will release our code and curated data on GitHub.

Abstract:
Classincremental semantic segmentation learns new classes while retaining old ones without access to past data. Although existing methods alleviate catastrophic forgetting on old classes, new-class performance remains limited. We identify the key bottleneck arises from low-margin regions, where the logit of the ground-truth class is close to that of the most competitive non-ground-truth class. Our theoretical analysis suggests that optimization in these regions is characterized by high curvature and a small stability radius, making learning prone to class confusion. Based on the above analysis, we propose Learnability-Driven Knowledge Assimilation (LDKA), which targets low-margin learning via three complementary optimization strategies: (i) Progressive Margin Learning continuously reallocates pixel-wise optimization budget in a threshold-free manner, shifting emphasis from high-margin to low-margin regions; (ii) Smooth Knowledge Distillation applies curvature damping and perturbation stabilization to suppress high-frequency updates and increase stability radius; (iii) Misclassification-Aware Decoupling measures inter-class confusion with a competition matrix and decouples highly competitive class representations. Experiments show that LDKA improves mIoU on new classes while preserving performance on old classes across 9 incremental protocols.

Abstract:
Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing where collecting paired data is prohibitively expensive. We propose a general framework for unpaired training of flow matching editing models. It leverages the base model's knowledge without any external signal. Our approach pairs instructionfollowing cues extracted from the frozen model with cycle-consistency for structure preservation. To make this tractable, we propose to route gradients from downstream losses over clean predictions to noisy training states. We demonstrate state-of-the-art results on challenging data-scarce image and video editing scenarios. Extensive evaluations and user studies show that our method effectively generalizes to unseen domains and outperforms supervised baselines trained on millions of samples. Analysis reveals that our gradient routing bridges the train-inference gap, and extracting semantic cues from a base model provides a robust training signal that obviates the need for external reward models.

Abstract:
Informationtheoretic acquisition functions such as Entropy Search (ES) offer a principled exploration–exploitation framework for Bayesian optimization (BO). However, their practical implementation relies on complicated and slow approximations, i.e., a Monte Carlo estimation of the information gain. This complexity can introduce numerical errors and requires specialized, hand-crafted implementations. We propose a two-stage amortization strategy that learns to approximate entropy search-based acquisition functions using Prior-data Fitted Networks (PFNs) in a single forward pass. A first PFN is trained to be conditioned on information about the optima; second, the α-PFN is trained to predict the expected information gain by training on information gains measured with the first PFN. The α-PFN offers a flexible learned approximation, which replaces the complex heuristic approximations with a single forward pass per candidate, enabling rapid and extensible acquisition evaluation. Empirically, our approach is competitive with state-of-the-art entropy search implementations on synthetic and real-world benchmarks, while accelerating the different entropy search variants across all our experiments, with speed ups over 50x.

Abstract:
Offline Policy Evaluation (OPE) aims to estimate the value of a target policy from historical logged data without interating with the environment, thereby assessing policy performance. In settings with network interference, individuals no longer satisfy the SUTVA assumption: an individual’s outcome is influenced not only by their own treatment but also by the treatments of their neighbors, which makes the definition and estimation of policy value more complex. To capture this interference mechanism, we allow all neighbors to affect individual outcomes through a unified exposure mapping, and we use a decaying higherorder neighborhood aggregation to characterize the influence of more distant neighbors. Moreover, in real-world applications, the target policy and the logging policy often do not fully overlap (non-overlap), so the policy value in non-overlap regions cannot be point-identified. To address this issue, we partially identify the policy value over non-overlap regions and, under a smoothness assumption, formulate the estimation of the lower and upper bounds as a linear program, yielding valid bounds on the offline policy value. Finally, we conduct systematic experiments on semi-synthetic network data to validate the effectiveness and robustness of the proposed method under network interference and limited overlap.

Abstract:
Anchorbased multi-view clustering has garnered wide attention for its ability to reduce the computational complexity of large-scale spectral clustering.However, existing methods mostly adopt a unidirectional optimization paradigm confined to sample-anchor bipartite graphs, treating the construction of the consensus graph and discrete clustering assignments as separate sub-problems to be solved independently. This weakens the information exchange between continuous representation and discrete structure, confining the optimization process to iterative updates within local modules.To address these limitations, we propose a Discretely-Refined Multi-view Clustering(DRMC) via Aligned Anchor Learning. Unlike approaches that directly perform fusion in the anchor space, our method starts from the anchor graph, elevates sample-anchor associations to sample-level similarity graph representations, and thereby enhances both within-cluster similarity and between-cluster separation. Furthermore, we design a discrete feedback module that jointly conducts spectral embedding learning and discrete label assignment by orthogonally aligning the continuous embedding matrix with the discrete indicator matrix. The resulting discrete partition is then fed back into the consensus graph construction, continuously refining the graph structure. Experiments on multiple benchmark datasets demonstrate that the proposed method exhibits significant advantages over existing state-of-the-art approaches.

Abstract:
AggregationFree Federated Learning enables joint training by sharing synthetic data, aiming to eliminate data heterogeneity across clients. However, existing methods fail to explicitly separate the principal and residual components of dataset, leading to biased synthetic data. In this paper, we propose a novel Unbiased Synthetic Data optimization method FedUSD for Aggregation-Free Federated Learning, which is achieved by exploring the High-energy Orthogonal Base (HOB) and variance of dataset in feature space. Our FedUSD is inspired by the discovery that principal component concentrates in HOB while residual component independently reflects in variance, regardless of networks. Based on the observation, we develop a method that mathematically optimizes synthetic data by matching both HOB and variance with those of real data. Besides, we experimentally show the superior effectiveness of leveraging HOB and variance to separately extract the principal and residual components over existing methods. We also theoretically prove that FedUSD achieves unbiased synthetic data and thus convergence. Without introducing any constraints, FedUSD thereby yields significant improvements over the state-of-the-arts in terms of global model performance, under equivalent communicational costs. For example, on the SVHN dataset, FedUSD improves 6.74% to 30.82% which is higher than others with Dirichlet coefficient \alpha=0.01.

Abstract:
Humans facing algorithmic decision systems have been found to ``game'' them by altering their input data (at a cost to them) in order to favorably change the algorithmic outcomes they receive (at a cost to the algorithm). The growing literature on strategic classification seeks to develop robust machine learning algorithms that account for, and reduce, this strategic behavior. A limitation of these existing works is that they assume the cost of strategic behavior to be fixed and independent of the classifier's decision. In practice, however, manipulation costs evolve and depend on past algorithmic decisions: today's decisions influence tomorrow's costs. This paper proposes and analyzes a twostage robust optimization framework with a decision-dependent uncertainty set to capture such dependencies. We highlight that awareness of policy-dependent costs not only reduces uncertainty, but also better curtails gaming of the algorithmic system over time.

Abstract:
In machine unlearning, (\varepsilon,\delta)unlearning is a popular framework that provides formal guarantees on the effectiveness of the removal of a subset of training data, the \emphforget set, from a trained model. For strongly convex objectives, existing first-order methods achieve (\varepsilon,\delta)-unlearning, but they only use the forget set to calibrate injected noise, never as a direct optimization signal. In contrast, efficient empirical heuristics often exploit the forget samples (e.g., via gradient ascent) but come with no formal unlearning guarantees. We bridge this gap by presenting the Variance-Reduced Unlearning (VRU) algorithm. To the best of our knowledge, VRU is the first first-order algorithm that directly includes forget set gradients in its update rule, while provably satisfying (\varepsilon,\delta)-unlearning. We establish the convergence of VRU and show that incorporating the forget set yields strictly improved rates, i.e., a better dependence on the achieved error compared to existing first-order (\varepsilon,\delta)-unlearning methods. Moreover, we prove that, in a low-error regime VRU asymptotically outperforms any first-order methods that ignores the forget set. Experiments corroborate our theory, showing consistent gains over both state-of-the-art certified unlearning methods and over empirical baselines that explicitly leverage the forget set.

Abstract:
The anchorbased multi-view clustering method has received extensive attention due to its efficiency and scalability in large-scale data scenarios. Existing methods still face significant challenges in optimizing the quality of anchors. Current mainstream approaches typically rely on random sampling strategies or orthogonal constraints for anchor selection and learning. Nevertheless, these methods treat anchor learning and cluster assignment as mutually independent processes handled separately, thereby giving rise to issues including redundant anchor coverage and ambiguous cluster boundaries. Unlike existing anchor-based multi-view clustering methods, this paper innovatively proposes a Dual-Calibration Multi-view Clustering via Compact Anchor Learning (DCMC), which effectively improves anchor quality through a dual-space alignment mechanism. Specifically, DCMC initializes view-specific anchors to capture the underlying data distribution, and then enforces bidirectional consistency between the anchor space and the clustering space to jointly optimize both the sample-to-anchor assignments and the cluster assignments. The alternating optimization process derived from this objective effectively enhances cross-view semantic consistency while preserving the discriminative characteristics of each view. Experimental results demonstrate that DCMC outperforms state-of-the-art methods across multiple benchmark tests, confirming its effectiveness and reliability.

Abstract:
Despite advances in longcontext inference, large language models (LLMs) remain fundamentally limited by the key-value (KV) caching mechanisms that are necessary for stable computation. Management techniques, such as selective token eviction and pruning, have vastly mitigated the issues that have arisen, but often discard potentially useful information to manage the growing memory requirements of the cache. In this paper, we build upon these approaches to propose Attention with Routed Memory ARM, a novel KV caching structure that introduces a fully differentiable, fixed-size memory system organized as a hierarchical routing structure that learns to select memory slots via Gumbel-Softmax and performs sigmoid-gated updates that softly combine new and stored information, avoiding hard eviction and thereby reducing information loss. By combining this with a policy to dynamically select varying amounts of memory at inference, ARM adapts its accesses for simple contexts and expanding retrieval for inputs that require deeper reasoning, enabling more scalable and effective retrieval on both short and long contexts. Experimental results on standard commonsense and long-context reasoning benchmarks demonstrate that ARM achieves superior performance and efficiency compared to fixed KV-caching approaches, while remaining efficient and scalable in terms of both memory and generation latency.

Abstract:
As Large Language Models (LLMs) are increasingly deployed to serve openended, multi-turn interactions, evaluating conversational quality at human scale has become a central challenge. Existing evaluation frameworks built for summarization, translation, or short-form QA tasks fall short of adequately measuring the consistency of human-scale dialogue, especially when derivation and validation of these metrics themselves often rely on synthetic rather than human sources. We fill the gap by introducing UPHELD (Utility & Planning Human-Scale Evaluated Long Dialogues), a large, reference-full benchmark for evaluating human-scale conversational ability beyond factual correctness. UPHELD consists of hundreds of complete human-to-human dialogues authored by professional script writers, with realistic turn densities and 36,000+ per-turn human annotations across 10,000+ expert-generated dialogue turns. Using UPHELD, we systematically evaluate classical automatic metrics and reference-free LLM-as-a-judge approaches, and find them unreliable when correlated with expert human judgment. Building off this analysis, we use UPHELD to develop a Mixture-of-Judges framework that combines multiple evaluative signals and improves correlation with human assessments by approximately 30%. Overall, UPHELD provides a robust, human-grounded foundation for evaluating long, human-scale conversational intelligence that fills a crucial gap in the pre-existing LLM dataset landscape

Abstract:
We investigate the contextual slate bandit problem with generalized linear rewards under limited adaptivity. At each round, the learner is presented with N sets of items and constructs a slate by selecting one item per set; the resulting slate yields a scalar reward sampled from a Generalized Linear Model (GLM). We propose algorithms under two limitedadaptivity paradigms: (a) batched and (b) rarely-switching settings. For the batched setting, we introduce B-SlateGLinCB, which partitions the time horizon into O(\log\log T) batches such that each batch's policy relies only on data from previous batches. For the rarely-switching setting, we propose RS-SlateGLinCB, which adaptively performs only O(d\log T) parameter updates. Under a diversity assumption on the item sequences, we prove that B-SlateGLinCB and RS-SlateGLinCB achieve regret bounds of O(Nd^3/2\sqrtT) and O(Nd\sqrtT), respectively. Notably, both bounds are independent of the non-linearity parameter \kappa that is typically found to scale the regret of GLM bandit algorithms. Our algorithms are computationally efficient, requiring only \textpoly(N) time per round despite 2^\Omega(N) possible slates. Simulations show our algorithms outperform existing batched baselines and remain competitive with Slate-GLM-OFU, a fully adaptive state-of-the-art algorithm. Notably, a slightly modified B-SlateGLinCB empirically matches this baseline. Finally, we demonstrate strong performance in a practical in-context example selection task for language models.

Abstract:
Federated Graph Learning (FGL) facilitates privacypreserving collaborative training of graph neural networks, yet homophily heterogeneity across subgraphs triggers optimization conflicts that degrade model generalization. Most existing solutions rely on multi-channel architectures to mitigate such conflict, which increase the burden on edge devices and lack theoretical convergence guarantees. To overcome these limitations, we propose FedGCM, a novel FGL framework with Group-oriented Conflict Mitigation, which aligns inconsistent optimization objectives via a tailored gradient surgery scheme. Specifically, FedGCM first divides clients into distinct groups based on their homophily levels, a strategy that precludes exhaustive client-to-client conflict assessments. To resolve inter-group interference, we develop RPGrad, a gradient surgery mechanism based on residual projection, which integrates synergistic knowledge while filtering inter-group conflicts. The refined updates are then transmitted in a group-wise fashion, effectively alleviating optimization conflicts induced by homophily heterogeneity without augmenting the client-side burden. Furthermore, we provide a formal theoretical analysis establishing the convergence of FedGCM. Extensive experiments on both homophilous and heterophilous graphs demonstrate that FedGCM consistently achieves advanced performance.

Abstract:
Federated Incremental Learning (FIL) aims to learn streaming tasks across distributed clients without catastrophic forgetting while preserving privacy. Most existing methods focus on samplebased replay techniques, which mitigate forgetting by replaying historical data samples. However, such methods often face challenges related to data privacy risks and significant resource overheads, making them impractical and difficult to deploy on edge devices with limited resources. To address this challenge, we propose a novel and \underlineLightweight \underlineFederated \underlineIncremental \underlineLearning framework called Li-FIL that leverages dense features synthesized by a secure generator on the server to enable efficient feature-based replay on decoupled local models. More specifically, each client extracts high-confidence features from the new task, applies mixup to obtain a dense feature representation, and then privatizes these features before uploading them to the server, which reduces both storage and communication overhead. A generator is deployed on the server to learn the distributions of different clients and generate global features for replay. Moreover, to enable clients to better learn from these dense features, we decouple the local model into two components: a feature extractor and a classifier. This design allows feature replay and the alignment between new and previous features to be conducted separately and more effectively. Extensive experiments demonstrate that Li-FIL outperforms other state-of-the-art methods by up to 10.14 in terms of accuracy on both old and new tasks with superior resource efficiency.

Abstract:
Longcontext large language models (LLMs) are increasingly deployed in real-world applications, yet self-attention remains a major efficiency bottleneck -- especially during decoding -- due to the necessity of repeatedly processing ever-growing key-value (KV) caches. Existing sparse attention reduce computation by attending to fewer KV pairs, but often suffer from substantial accuracy degradation, require additional training, or rely on expensive hashing. In this work, we present BinaryPC, a training-free, data-aware hashing-based sparse attention for long-context LLMs. BinaryPC constructs compact binary hash codes and corresponding hash function by computing binary principal components of data. Unlike Locality-Sensitive Hashing (LSH) with data-independent random projections or learned non-linear hashing methods that require per-model optimization, BinaryPC constructs binary codes that explicitly preserve the structural information of data without requiring gradient-based training. Comprehensive experiments across multiple model families and long-context benchmarks show that BinaryPC preserves accuracy relative to full attention while achieving superior performance among sparse and hashing-based baselines. On modern GPUs, BinaryPC improves end-to-end decoding throughput by 3.56× over the FlashAttention kernel.

Abstract:
This paper introduces a new family of adaptive, distributionfree independence tests for multivariate random vectors based on binary expansion coefficients, supported by a tractable asymptotic theory. Our first key contribution establishes a general equivalence between independence testing and testing cross-covariances among exponentially many binary expansion interaction coefficients, applicable to broad sample spaces and not limited to kernel-induced representations. While this exponential interaction structure makes naive construction and computation infeasible, we overcome this challenge by reformulating the proposed tests as a class of U-statistics and deriving an explicit kernel representation that enables scalable and efficient computation. Exploiting the multiscale nature of binary expansions, the proposed framework automatically adapts to unknown dependence structures by selectively truncating higher-order interactions, yielding both strong power and clear interpretability. To further enhance power and computational efficiency, we introduce an adaptive weighted aggregation procedure, termed wa-dCoBET, which combines a baseline Covariance Binary Expansion Test (CoBET) with a distance-measure–based CoBET. Extensive simulations and a real-data application demonstrate that wa-dCoBET consistently matches or outperforms HSIC and distance covariance, particularly in higher-dimensional and non-monotone settings, while maintaining accurate type I error control.

Abstract:
Autoregressive (AR) models based on nextscale prediction are rapidly emerging as a powerful tool for image generation, but they face a critical weakness: information inconsistencies between patches across timesteps introduced by progressive resolution scaling. These inconsistencies scatter guidance signals, causing them to drift away from conditioning information and leaving behind ambiguous, unfaithful features. We tackle this challenge with Information-Grounding Guidance (IGG), a novel mechanism that anchors guidance to semantically important regions through attention. By adaptively reinforcing informative patches during sampling, IGG ensures that guidance and content remain tightly aligned. Across both class-conditioned and text-to-image generation tasks, IGG delivers sharper, more coherent, and semantically grounded images, setting a new benchmark for AR-based methods.

Abstract:
Reasoning tasks are characterized by data that lie on a tree. The root represents a prompt, and the groundtruth answer is one of the leaves. Each edge in the tree represents a plausible next reasoning step. We show that Chain of Thought (CoT)-based reasoning is most effective at predicting the answer to a query when this tree has a roughly equal degree at each level. Directly predicting the answer from the prompt is effective only when the tree has a small number of leaves. CoT-based predictors have been observed to perform well on deeper trees, i.e., they reason for an extended number of steps (they ``think''). We identify a critical threshold for the degree, below which such extended reasoning is detrimental, and above which there exists an optimal depth that minimizes error. It is impossible to surpass this minimal error by increasing the depth of thinking.

Abstract:
Recent OmniMLLMs are driving a paradigm shift in multimodal emotion recognition from label-only prediction towardMultimodal Emotion Reasoning(MER), where models output both emotions and textual explanations grounded in visual, acoustic, and linguistic signals. However, we show that current emotion-oriented Omni-MLLMs still lackreliable omni-modal perception: they (i) underutilize multimodal cues in their reasoning trajectories and (ii) exhibit unfaithful behavior, often hallucinating modality-specific statements from other modalities. Building on these insights, we proposeOPPO(Omni-PerceptionPolicyOptimization), a reinforcement learning framework that explicitly optimizes multimodal perception. First, an Omni-Perception Reward decomposes ground-truth reasoning into fine-grained visual, acoustic, and emotion cues and rewards trajectories that semantically recover these cues. Second, an Omni-Perception Loss compares the policy under full and unimodally masked inputs, applying a KL penalty only to modality-specific evidence tokens to suppress cross-modal hallucination. We further introduceMEP-Bench, a diagnostic benchmark that quantifiesutilizationandfaithfulness. Experiments show that OPPO achieves state-of-the-art performance on MER-UniBench and substantially improves utilization and faithfulness scores on MEP-Bench, highlighting the importance of sufficient and faithful omni perception for multimodal emotion reasoning. The code is provided in the Supplementary Materials.

Abstract:
Aligning large language models (LLMs) with diverse user preferences is a critical yet challenging task. While posttraining methods can adapt models to specific needs, they often require costly data curation and additional training. Test-time scaling (TTS) presents an efficient, training-free alternative, but its application has been largely limited to verifiable domains like mathematics and coding, where response correctness is easily judged. To extend TTS to preference alignment, we introduce a novel framework that models the task as a realignment problem, since the base model often fails to sufficiently align with the stated preference. Our key insight is to decompose the underlying reward function into two components: one related to the question and the other to preference information. This allows us to derive a REAlignment Reward (REAR) that selectively rescales the proportions of these two reward terms. We then show that REAR can be formulated as a linear combination of token-level policy log-probabilities, making it computationally efficient and easy to integrate with various TTS algorithms such as best-of-N sampling and tree search. Experiments show that compared to other test-time baselines, REAR not only enables scable test-time realignment for preference alignment tasks under diverse user requirements, but also generalizes to mathematical and visual tasks under appropriate preference settings.

Abstract:
Decentralized Unlearning (DU) aims to remove the influence of specific clients from a collaboratively trained global model. However, existing methods suffer from strong reliance on static, problemspecific hyperparameters or restrictive convexity assumptions, limiting their general applicability. To overcome these limitations, we proposeTRAjectory-awareCErtifiedDecentralizedUnlearning (TRACE-DU), a generic unlearning framework for decentralized training.TRACE-DUintroduces a fine-grained sensitivity analysis that leverages local SGD updates and decentralized training dynamics, thereby eliminating the need for convexity assumptions and reducing dependence on manually tuned parameters. By integrating strategic checkpoint selection with calibrated noise perturbation, the proposed framework enables efficient certified unlearning. Moreover, we exploit historical model trajectories to extend this framework, enabling it to naturally support sequential unlearning requests from an arbitrary number of clients. We provide theoretical guarantees for certified unlearning and derive sensitivity bounds under both convex and non-convex loss functions. Experimental results demonstrate that our framework outperforms state-of-the-art baselines across diverse metrics.

Abstract:
Advances in data collection are producing growing volumes of temporal count observations, making adapted modeling increasingly necessary. In this work, we introduce a generative framework for independent component analysis of temporal count data, combining regimeadaptive dynamics with Poisson log-normal emissions. The model identifies disentangled components with regime-dependent contributions, enabling representation learning and perturbations analysis. Notably, we establish the identifiability of the model, supporting principled interpretation. To learn the parameters, we propose an efficient amortized variational inference procedure. Experiments on simulated data evaluate recovery of the mixing function and latent sources across diverse settings, while anin vivolongitudinal gut microbiome study reveals microbial co-variation patterns and regime shifts consistent with clinical perturbations.

Abstract:
We study the problem of learning a drifting concept in the presence of Massart noise. In this framework, an online learner has access to a history of independent samples whose labels are noisy versions of a target concept that may change from round to round. The goal is to output, in each round, a hypothesis with small prediction error. We study the complexity of this learning problem for the fundamental class of marginseparable linear classifiers (halfspaces). On the positive side, we give a computationally efficient learner achieving error \eta + \tilde O(\Delta^1/3/\gamma), where \eta upper bounds the Massart noise rate, \Delta is the drift rate, and \gamma is the margin. Interestingly, in the realizable setting, an adaptation of our techniques yields an efficient learner with an improved error rate over prior work. On the lower-bound side, we provide formal evidence of an information-computation tradeoff, strongly suggesting that our algorithm's performance is essentially optimal. Specifically, while the information-theoretically optimal error scales with \Delta^1/2, we prove that \Delta^1/3-scaling is unavoidable for low-degree polynomial tests, even in the special case of random classification noise.

Abstract:
ANNto-SNN conversion offers energy-efficient inference but faces a fidelity-latency trade-off due to open-loop error accumulation. While conversion-aware training mitigates this, it sacrifices the generality of using off-the-shelf ANNs. We propose a closed-loop fine-tuning framework that calibrates these errors without altering the source model. Our approach employs a Dual Alignment Mechanism, utilizing global Kullback-Leibler divergence for output distillation and introducing an adaptive local Centered Kernel Alignment constraint, weighted by initial conversion loss, for feature alignment. We uncover a critical time-dependent dynamic: local constraints are essential for stabilizing representations in low-latency regimes (e.g., T=8) where global gradients are unstable, whereas global alignment drives fidelity at higher time steps. Experiments on CIFAR-10 demonstrate that our method achieves over 99% of source ANN accuracy at T=32 (e.g., ResNet-18: 96.38% vs.\ 96.39%). Furthermore, this fine-tuning acts as a regularizer, yielding SNNs with input noise robustness that matches or exceeds the source ANN.

Abstract:
Recent research has made growing efforts to leverage large language models (LLMs) for computeraided design (CAD), a domain that demands advanced geometric and spatial reasoning across long operation sequence. However, existing studies remain limited in addressing complex modeling tasks that necessitate step-by-step reasoning, primarily due to the scarcity of high-quality CAD datasets and the absence of fine-grained evaluation frameworks. In response to these challenges, we introduce Op-CAD, the first large-scale, multi-modal dataset for operation-oriented CAD generation, encompassing four operation types and five modalities. Furthermore, we introduce a novel CAD parsing module together with a geometry-guided hierarchical annotation pipeline, which decomposes modeling sequences into discrete operations and substantially improves the annotation accuracy of Vision-Language Models (VLMs). Based on our dataset, we redefine the CAD modeling task by decoupling geometric and spatial perspectives and introduce a novel metric, Chamfer/Fillet Intersection over Union (CF-IoU), to fill the void in assessing chamfer and fillet operations. By comprehensively evaluating eight LLMs on Op-CAD, we establish a benchmark for current models on operation-oriented tasks. Finally, we investigate performance enhancement strategies through fine-tuning on Op-CAD and propose Chain-of-Operation (COOP), a novel prompting strategy that emulates human-engineer reasoning.

Abstract:
Visionlanguage models (VLMs) generate fluent causal explanations for visual scenes, but does this fluency reflect genuine structural understanding? We address this question through a dual-probe methodology that isolates plausibility from faithfulness. The Text-Only Probe measures linguistic quality; the Chain-Text Probe requires models to first generate explicit causal chains before text responses. We define the Abstraction Gap (AG) metric as the normalized performance difference between probes, operationalizing the plausibility-faithfulness distinction from explainable AI research. Applying this methodology to eight VLMs using CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl's causal hierarchy, we find seven models exhibit AG exceeding 0.50: scoring 6--8 on text but below 2.5 on chains, often producing blank outputs. Fine-tuning on 45,000 chain-annotated examples fails to close the gap, indicating that explicit chain supervision cannot instill structural abstraction capability. Current VLMs optimize for plausible language without faithful structural understanding.

Abstract:
Time series forecasting fundamentally involves learning probability distributions over future observations. However, existing loss functions rely on pointwise Euclidean metrics, neglecting the intrinsic geometric structure of probability distributions. This leads to suboptimal alignment between predicted and true distributions, particularly for uncertainty quantification. We propose InfoGeo Loss, a principled loss function grounded in information geometry that measures distributional discrepancies on statistical manifolds. Our approach comprises three key components: (1) a distribution parameterization module that models predictions with learnable sufficient statistics, (2) a Fisher information metric that quantifies intrinsic distributional distance, and (3) a Bregman divergence component that captures asymmetric prediction errors. We further introduce a natural gradient weighting strategy for efficient optimization on statistical manifolds. Theoretically, we prove statistical consistency and establish convergence guarantees. Extensive experiments on seven datasets with five architectures show that InfoGeo Loss consistently outperforms existing losses, achieving average improvements of 6.8% in MSE and 5.3% in MAE.

Abstract:
Continual learning requires models to mitigate catastrophic forgetting of prior knowledge while learning a sequence of tasks. Although existing methods based on orthogonal projection prevent interference by constraining parameter updates, they tend to limit plasticity as the task sequence progresses. The reliance on linear approximation further causes the projected gradients to deviate from the nonlinear manifold. To address these issues, we propose Metabolic Optimization for Continual Learning (MoCL), a rehearsalfree framework that strikes a balance between stability and plasticity. To capture the geometric manifold of prior knowledge, MoCL introduces a factorized subspace approximation that avoids expensive explicit matrix inversion. Given the heavy-tailed distribution of the Fisher Information Matrix, we employ a metabolic gating based on Tsallis entropy to suppress updates that conflict with historical knowledge. Theoretical and empirical analyses reveal that MoCL enables the model to converge to a shared low-loss region across sequential tasks. Extensive experimental results across multiple benchmarks demonstrate that MoCL outperforms state-of-the-art methods in both classification performance and efficiency.

Abstract:
Vision–Language–Action (VLA) models enable generalpurpose robotic control via large-scale multimodal pretraining, yet their effectiveness under few-shot imitation learning remains limited. We conduct a systematic stress test of state-of-the-art VLA models and show that performance degrades sharply as demonstrations are reduced, revealing a key weakness of existing adaptation strategies. To address this, we introduce FOCA, a future-oriented conditioning framework for data-efficient VLA adaptation. FOCA combines explicit prediction of task-grounded future interaction embeddings with implicit alignment to future goal observations, enabling long-horizon reasoning in latent space without pixel-level prediction. This formulation naturally supports action-free co-training with synthetic videos from video world models and can be interpreted as learning a future-conditioned value-like representation. Extensive experiments demonstrate FOCA achieves 95.7% success with 20 demonstrations on LIBERO, improves 7–12% on RoboCasa, and delivers up to 26% absolute gains on real robots, establishing a new state of the art in few-shot VLA adaptation.

Abstract:
Likelihoodbased training is the dominant paradigm in survival prediction. Under independent censoring, we can factorize the likelihood and optimize only the terms related to event modeling, effectively treating the censoring mechanism as incidental. This is justified when censoring isnon-informative, i.e., when the censoring process shares no parameters with the event-time model. However, this may not hold in practice, and ignoring censoring contributions may discard useful signals for learning representations that can help to effectively estimate event distributions. Motivated by this, we argue that explicitly modeling censoring can improve representation learning and time-to-event estimation, particularly when event and censoring processes are coupled. We introduce a latent decomposition view that partitions covariates into four disjoint factors: those affecting only the event process, only the censoring process, both, or neither. We then learn decomposed representations for the first three categories to guide a better estimation of the event distribution. We instantiate our method on four popular deep-learning survival models and evaluate on 10 datasets (2 semi-synthetic and 8 real-world), showing consistent gains over strong baselines and multiple SOTA methods.

Abstract:
Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a precollected preference dataset, which makes them vulnerable to preference poisoning attack. We study label flip attacks against log-linear DPO. We first illustrate that flipping one preference label induces a parameter-independent shift in the DPO gradient. Using this key property, we can then convert the targeted poisoning problem into a structured binary sparse approximation problem. To solve this problem, we develop two attack methods: Binary-Aware Lattice Attack (BAL-A) and Binary Matching Pursuit Attack (BMP-A). BAL-A embeds the binary flip selection problem into a binary-aware lattice and applies Lenstra-Lenstra-Lovász reduction and Babai's nearest plane algorithm; we provide sufficient conditions that enforce binary coefficients and recover the minimum-flip objective. BMP-A adapts binary matching pursuit to our non-normalized gradient dictionary and yields coherence-based recovery guarantees and robustness (impossibility) certificates for K-flip budgets. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset validate the theory and highlight how dictionary geometry governs attack success.

Abstract:
In this work, we propose Oneshot Entropy Minimization (EM), a simple and fully unsupervised post-training approach that significantly improves reasoning and generation performance using only a single unlabeled data and approximately ten gradient steps. To avoid data contamination, we pretrain a 7-billion-parameter language model from scratch with strictly decontaminated data. Despite its extreme simplicity, one-shot EM yields substantial performance gains and improves reasoning abilities across a broad range of domains, including mathematical reasoning, logical reasoning, and coding. We further show that entropy minimization induces a characteristic right-skewed logit shift, amplifying high-probability tokens while suppressing low-probability tails, in contrast to reinforcement learning. Our findings suggest that entropy minimization primarily acts as a distribution shaping mechanism rather than a conventional learning process, offering an efficient and practical algorithm for post-training large language models.

Abstract:
Reinforcement Learning from Human Feedback (RLHF) is critical for aligning Large Language Models (LLMs) with human preferences. However, its efficacy is often compromised by the inherent inconsistency and subjectivity of human annotations. Existing preference optimization frameworks, such as Direct Preference Optimization (DPO), typically treat ambiguous pairs with high annotator disagreement identically to those with unanimous consensus, forcing models to overfit to inconsistent supervision signals and leading to suboptimal alignment. In this work, we propose ReliabilityGuided Preference Optimization (RGPO), a robust framework designed to mitigate the impact of inconsistent human feedback. RGPO estimates annotator reliability and infers latent ground truth labels from noisy human feedback to identify robust preferences. Furthermore, we introduce a reliability-aware consistency optimization that dynamically modulates the training objective based on the consensus level of annotations, ensuring the model prioritizes high-consensus supervision signals. Extensive experiments on LLM alignment benchmarks demonstrate that RGPO effectively reduces inconsistency and noise in training data and achieves superior performance compared to widely adopted RLHF baselines.

Abstract:
Despite considerable progress in the development of machinetext detectors, the ease with which machine-text can be manipulated to evade detection has led to suggestions that the problem is inherently intractable. In this work, we investigate the limits of such evasion strategies. We demonstrate that while current attacks, ranging from prompt engineering to detector-guided optimization can effectively degrade performance of standard detectors, they fail to erase the underlying stylistic "fingerprints" of machine text. We show that few-shot detectors that utilize the stylistic feature space are robust to these evasion attempts, reliably detecting samples even from models explicitly tuned to prevent detection. This raises the question: does style represent a universal defense against machine-detection attacks? We demonstrate that the answer is "no" by introducing a novel paraphrasing approach that simultaneously optimizes for undetectability and adherence to specific human styles. We show that unlike prior methods, this attack effectively evades all considered detectors, including those that utilize writing style. However, we find that this evasion is not absolute: as the number of documents available for analysis grows, the human and machine distributions become distinguishable again. Overall, our findings suggest that reliable machine-text detection requires moving beyond single-document analysis to multi-document analysis.

Abstract:
We propose a frequencyaware perceptual optimization framework for low-complexity image compression, realized as a Realism-enhanced Region-based Implicit Codec (Re2IC). Re2IC models visual perception via saliency-guided region partitioning and local–global perceptual modulation. To enhance realism under complexity constraints, we introduce wavelet–Wasserstein distortion (WA-WD), a frequency-decomposed perceptual distortion that balances fidelity and realism through subband-aware modeling and provides a more reliable approximation than standard Wasserstein distortion. Together, these designs enable fine-grained spatial–spectral optimization, allowing Re2IC to achieve superior rate–perception trade-offs, outperforming generative codecs such as HiFiC while using less than 1% of their decoding cost. Extensive experiments demonstrate state-of-the-art perceptual performance among overfitted codecs. Beyond compression, WA-WD serves as a standalone, tunable perceptual metric with strong alignment to human preference (Pearson 94.6%, Spearman 92.3%) and competitive performance across multiple IQA benchmarks.

Abstract:
Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: visiononly architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters, including people, landmarks, and everyday objects, grouped based on the degree of localizability, we reveal systematic gaps in spatial accuracy and show that textual supervision enhances the learning of geospatial representations. Our findings suggest the role of language as an effective complementary modality for encoding spatial context and multimodal learning as a key direction for advancing geospatial AI.

Abstract:
Adaptive Moment Estimation (Adam) is one of the most popular and often the default stochastic optimizers for deep neural network training. Using firstand second-moment estimation, Adam provides adaptive learning rates for each parameter, significantly outperforming Stochastic Gradient Descent (SGD). However, as deep neural networks become larger, estimating the first and second moments consumes substantial memory. It motivates various methods to reduce memory usage for adaptive optimizers. In this paper, we propose to rethink the first and second moment estimation from a gradient computation perspective. The gradient of the weight matrix is the multiplication of the input and the gradient of the output. Instead of finding low-rank approximations of the first and second moments, as in previous work, we propose tracking the input and output gradients to efficiently estimate moments. We provide analyses of the similarities and differences between our proposed method, the widely used Adam optimizer, and previous memory-efficient optimizers designed to reduce memory usage. We conduct experiments to verify the effectiveness of our method, which reduces memory usage by up to 30% while preserving similar performance or even improving the performance of Adam.

Abstract:
In the field of integrated circuits, leaf cells are the basic units, serving as the fundamental building blocks (e.g., standard cells) that are widely reused in various VLSI designs, forming the basis for more complex circuits. Therefore, the design quality of leaf cell layouts significantly impacts the PPA (Power, Performance, and Area) of the final VLSI designs. To automatically design leaf cell layouts that are close to expert designs, we propose GenLeaf. GenLeaf first utilizes a supervised, performanceaware embedding model to represent layouts and automatically calculate their similarity scores. Since there are expert-designed layouts but no corresponding scripts, we implement Bayesian optimization to generate a layout-script dataset for LLM training. With subsequent supervised fine-tuning and further preference optimization, GenLeaf can generate leaf cell layouts through scripts whose performance closely resembles that designed by human engineers. Experiment results demonstrate that GenLeaf outperforms expert-designed golden layouts across key performance metrics.

Abstract:
Program termination is a fundamental liveness property in software verification. Proving termination of a given program is a formidable challenge due to the undecidability of the problem. In this paper, we propose LIFT, a termination verification framework that leverages LLMs to generate loop bounds within a guessand-check workflow. LIFT couples this generation with a sound formal validation procedure that both guarantees all reported terminations and refutes invalid loop bounds via violation analysis. Experiments on publicly accessible termination benchmarks show that LIFT significantly outperforms existing termination verification tools.

Abstract:
Designing sparse attention for 2dimensional image data in diffusion models and vision-language models requires reconciling spatial locality with hardware-efficient execution, a fundamental trade-off that existing methods struggle to resolve. Prior approaches preserve 2D structure through handcrafted sparsity patterns, but often incur uncoalesced memory access, limiting practical speedups on modern GPUs. We present HilbertA, a 2D-aware and GPU-efficient sparse attention mechanism, and show that Hilbert curves provide a hardware-aligned inductive bias for sparse attention over 2D data. By reordering image tokens along Hilbert curves, HilbertA preserves local spatial neighborhoods while inducing a contiguous memory layout aligned with efficient GPU execution. To enable global information flow without uncoalesced access, HilbertA further employs a layer-wise sliding schedule, allowing long-range interactions to emerge progressively across depth. In addition, a small central shared region facilitates cross-tile communication and enhances positional awareness. Implemented in Triton, HilbertA achieves substantial acceleration while maintaining or improving model quality across both diffusion models and VLM. On Flux.1-dev, HilbertA delivers up to 4.17× at 2048×2048, with image quality comparable to baselines. On Qwen3-VL-8B, HilbertA achieves over 2.08× attention acceleration with 1.55× improvements in Time-To-First-Token, while maintaining competitive model performance.

Abstract:
Merging multiple lowrank adapters (LoRA) provides a practical route to scaling multi-task learning and deployment more efficiently than full-model weight merging, while avoiding reliance on task-specific training data. However, most existing approaches either treat LoRA updates as dense weight deltas or depend on expensive subspace factorizations, making the merge step a primary latency bottleneck. To address this issue, this paper establishes a theoretically positive relationship between merging quality and the effective rank of the matrices being merged. Motivated by this insight, we proposeCircuMerge, a sketch-based framework for low-rank model merging built on shared circulant transforms. Especially, this approach treats each adapter as a pair of low-rank matrices and applies a shared circulant transform to align all tasks in a common coordinate system. This alignment enables more efficient sampling, allowing us to generate compact sketches that effectively summarize the interactions between tasks. These compact sketches enable applying the merging rules directly to them and reconstructing a standard low-rank adapter, preserving the essential information while significantly reducing computational overhead. Across a broad multi-task LoRA benchmarks covering both vision and language settings, extensive empirical results demonstrate that CircuMerge reduces the overall merging time by at least 44% compared to the state-of-the-art approaches, with accuracy matching or exceeding the optimal level.

Abstract:
Data heterogeneity is a fundamental challenge in Federated Learning (FL), where induced model drift often results in "negative gains" for global models on dataabundant clients, with performance falling below that of local training. To address this issue, we propose FedGain, a novel framework that optimizes collaborative client clustering to mitigate the negative gain. We are the first to develop a modified Scaling Law to quantify the "reduction" in data utility caused by heterogeneity and define Effective Federated Capacity to align clients with the highest potential collaboration gains. Extensive experiments demonstrate that our modified SL strictly adheres to the power-law learning discipline in non-IID scenarios. FedGain effectively suppresses negative gains to a negligible level across various FL algorithms and outperforms other Clustered FL methods.

Abstract:
A standard selfattention layer consists of two interacting circuits: the query-key circuit that governs attention allocation, and the output-value circuit that maps attended representations to predictions. Collapsed and factorized parameterizations of the query-key and output-value circuits lead to qualitatively different attention patterns. In particular, some parameterizations give sharper attention to task-relevant tokens, at a similar training loss. We analyze how the parameterizations of these circuits shape the parameter trajectories in single-layer self-attention models trained on next-token prediction. Through gradient-flow analysis, we show that factorization induces implicit rescaling of the two circuits' learning rates. We derive closed-form dynamics showing that output-value and query-key parameters move along a line, with relative speeds determined by their learning rates. Faster query-key learning relative to output-value learning thus produces sharper attention, as the model compensates for slower output-value learning by increasing attention mass on relevant tokens. Experiments show that differences in the relative learning rates of the two circuits govern attention concentration. This improves attention interpretability while maintaining comparable predictive performance.

Abstract:
Adapting large language models (LLMs) to lowresource domains remains challenging due to the scarcity of domain-specific data. While in-domain data is limited, there exists a vast amount of general-domain data that shares similar question–answer formats and reasoning patterns with domain tasks. This observation raises an important question: can useful general-domain data be mined to improve low-resource domain adaptation? Our initial findings show that general-domain chain-of-thought data contains useful auxiliary signals for domain adaptation, even without careful selection. This observation motivates a new paradigm for domain adaptation beyond exclusive reliance on domain-specific data. To systematically identify the most beneficial general-domain samples, we propose NTK-Selector, motivated by the Neural Tangent Kernel’s ability to capture alignment in training dynamics. Since directly applying NTK to pretrained LLMs is impractical, we introduce a Jacobian-free NTK approximation and empirically demonstrate stable NTK-like behavior during fine-tuning. Extensive experiments across medical, financial, legal, and psychological domains demonstrate that NTK-Selector consistently outperforms domain-only fine-tuning and existing data selection baselines. In particular, NTK-Selector achieves gains of +8.7 and +5.1 points on Llama3-8B-Instruct and Qwen3-8B, respectively, compared to only +0.8 and +0.9 points from domain-only fine-tuning.

Abstract:
In openworld settings, thermal infrared (TIR) image degradations continuously emerge and evolve, while most existing all-in-one restoration methods are built on a closed-set assumption and struggle to continually adapt to novel degradations. To address this, we propose ECMRNet, an Expandable, Compressible, and Mineable Restoration Network for open-world TIR restoration from the perspective of continual degradation learning. Conceptually, ECMRNet unifies continual degradation learning as an ``expand–compress–mine'' closed-loop process, enabling sustained adaptation to new degradations with controllable evolution. Structurally, ECMRNet decomposes intermediate representations into group-isolated subspaces, and achieves strict parameter isolation and fast adaptation to new degradations by freezing historical groups and isomorphically expanding new ones. To curb model growth as tasks accumulate, we present Structural Entropy Pruning, which identifies and removes redundant channel groups via two-dimensional structural entropy minimization, achieving information contribution–driven adaptive compression. Moreover, we design a Sub-degradation Knowledge Mining module that dynamically retrieves and recombines transferable components from historical representations to improve restoration under compound degradations. Experimental results demonstrate that ECMRNet achieves superior overall performance across diverse single and compound degradations while using fewer parameters and lower computational cost, highlighting its scalability and efficiency in open-world TIR restoration.

Abstract:
This paper proposes a powerenhanced panel Granger causality test (PE-PGCT) for assessing the Granger non-causality in heterogeneous and potentially high-dimensional panel data. Building on any existing panel Granger non-causality test, we show, both theoretically and empirically, that the proposed PE-PGCT boosts its power substantially. The power gains are particularly significant in situations of high-dimensional panels when the cross-sectional dimension exceeds the time dimension, as well as under sparse alternatives when the signals are sparsely distributed across panel units. We establish rigorous theoretical guarantees on the asymptotic behavior of the proposed power enhancement component, demonstrating attractive power enhancement properties that it induces negligible size distortion under the null hypothesis while delivering significant power gain under the alternatives. The empirical performances are illustrated via extensive simulation studies, as well as a real-world application.

Abstract:
Deep neural networks with repeated blocks, such as transformers and ResNets, often exhibit closely related representational structure across layers that emerges with training. Motivated by this observation, we introduceGradient Smoothing, a general training paradigm that couples gradient updates across blocks and admits a natural interpretation as a preconditioning method. Our framework applies structured smoothing operators to layerwise updates, such as weighted averages and exponential moving averages, with minimal computational overhead. We evaluate Gradient Smoothing across a range of architectures and training regimes, including RL post-training of LLMs on reasoning tasks, as well as diffusion and classification with Vision Transformers. Across these settings, Gradient Smoothing consistently improves generalization performance, in addition to promoting structured representation evolution across layers. These results suggest that gradient smoothing is a simple and broadly applicable technique for improving training in modern deep networks.

Abstract:
RetrievalAugmented Generation (RAG) systems are increasingly deployed to provide query-based access to large knowledge bases, thereby introducing concrete privacy risks whereby the underlying corpus may be partially or fully extracted through the deployed service. Existing extraction attacks typically rely on locally driven search strategies, in which newly extracted content is inferred or expanded based on previously recovered fragments. However, real-world knowledge bases are often multi-source and heterogeneous, with pronounced semantic discontinuities across domains. Such gaps can trap extraction methods that rely on local semantic continuity in local optima, severely limiting large-scale corpus reconstruction. In this paper, we introduce an extraction framework (GeoEx) designed to navigate and reconstruct heterogeneous RAG knowledge bases without any prior knowledge. The framework plans extraction directly in the embedding space of a proxy retrieval model to improve global coverage, and employs an embedding inversion module to convert latent vectors into executable queries. We further propose a composite geometric strategy that combines orthogonal query synthesis for cross-domain exploration with local embedding perturbations for dense extraction within discovered clusters. Experiments on mixed corpora spanning eight diverse domains and multiple retrievers and LLMs show that GeoEx significantly outperforms baselines in both extraction coverage and query efficiency.

Abstract:
While Vision Foundation Models (VFMs) excel at predictive tasks on satellite imagery, their performance can arise from visual correlations rather than underlying structural invariants, making certain perceptionbased out-of-distribution accuracy a poor proxy for scientific utility. As a result, models may look correct without reasoning correctly—a discrepancy we term thePerception–Physics Paradox. To address this gap, we introduceScientific Alignmentas an implicit objective for representation learning in scientific domains. We study a principled, testable aspect of scientific alignment throughStructural Isomorphism, which requires latent representations to uniquely identify physical systems up to a linear reparameterization. This perspective induces a hierarchy of necessary conditions and yields a systematic probing protocol for physical and causal interpretability. To operationalize this framework, we releaseTC-Atlas, a foundational global dataset and automated construction pipeline for tropical cyclone research, and show that current VFMs rely on visual shortcuts that collapse in extreme regimes, indicating that scientific alignment does not arise as a natural byproduct of visual scaling alone.

Abstract:
Noisy evolution strategies commonly mitigate ranking uncertainty by improving pergeneration fidelity—for example, by allocating budget to resampling candidates or using robust aggregation to stabilize the within-generation ordering. Under strict fixed evaluation budgets, however, any additional intra-generation querying directly reduces the number of generations the algorithm can execute, shortening the optimization trajectory. This dynamic can be characterized as prioritizing fidelity over depth. We propose a paradigm shift in fixed-budget regimes toward depth over fidelity, arguing that the cumulative progress from a long sequence of noise-smoothed updates often outweighs that of a short sequence of rigorously denoised ones. We operationalize this principle via probabilistic elite membership, replacing hard truncation with conditional expected rank weights that integrate over ranking uncertainty. This shifts noise handling from the evaluation stage to the selection stage: rather than repeatedly reevaluating candidates to denoise their objective values, we directly smooth the selection signal driving the update. We instantiate this approach using residual bootstrapping: we perform sparse reevaluations near the selection boundary, store standardized noise residuals in a reusable pool, and generate bootstrap rankings to estimate expected weights. Recognizing that residual pool mismatch constitutes a potential statistical risk, we derive a falsifiable error decomposition and provide runtime diagnostics to ensure estimator validity. To prevent oversmoothing in low-noise regimes, we introduce an adaptive probe-and-switch mechanism that leverages a low-cost rank disagreement metric to dynamically select between standard CMA-ES and our bootstrap-based updates. Extensive evaluations across the COCO bbob-noisy suite and diverse external tasks—including RL policy search and noisy HPO—demonstrate consistent gains. Specifically, in high-misranking regimes constrained by strict budgets, our residual-bootstrap approach achieves substantially steeper progress curves than both uncertainty-handling CMA-ES and fixed-k resampling baselines. These results substantiate a testable thesis: when budgets are limited and ranking uncertainty is high, integrating uncertainty at the selection stage is more sample-efficient than reducing it at the evaluation stage.

Abstract:
Highcurvature regions in 3D point clouds encapsulate critical fine-grained geometric semantics yet exhibit a distinct long-tail sparsity in their spatial distribution. The inherent limitations of polynomial volume growth in Euclidean space frequently render these intricate geometric features challenging to adequately resolve within a uniform-scale feature space. Consequently, these regions are frequently overshadowed by smooth global features dominated by low-curvature regions, thereby limiting the discriminative capacity of the network. To address this issue, we propose PointCHR, a curvature-aware hyperbolic rectification (CHR) for point cloud analysis. Utilising the property of exponential volume expansion in the vicinity of hyperbolic manifolds, CHR presents a learnable curvature-guided radial rectification mechanism. By adaptively projecting high-curvature points towards boundary regions endowed with larger effective embedding capacities, PointCHR effectively mitigates the representation crowding problem inherent in Euclidean settings. Extensive experimentation has demonstrated that PointCHR significantly enhances the ability of backbone to capture fine-grained geometric details, achieving state-of-the-art performance across multiple benchmarks.

Abstract:
Accurate prediction in survival analysis with competing risks is challenged by rare event rates and limited effective sample sizes. Knowledge distillation offers a promising way to transfer information from an external teacher to improve a local student, but existing methods are overwhelmingly developed for uncensored outcomes and do not directly extend to survival analysis, where censored observations provide only partial information. Moreover, prior work often assumes that teacher and student share identical outcome definitions, whereas in competing risks settings, they may differ in outcome granularity and event definitions, further complicating knowledge transfer. To address these gaps, we propose DiSKD (Discrete Survival Knowledge Distillation), a deep learning framework for discretetime competing risks that integrates teacher predictions via a cause-specific, time-dependent Kullback--Leibler divergence. DiSKD enables flexible and privacy-preserving transfer without requiring raw data sharing, remains robust to model misspecification or outcome-definition heterogeneity, and adaptively weights teacher guidance by emphasizing compatible teachers while down-weighting less relevant ones. Simulation studies and real-world applications demonstrate improved discrimination and calibration.

Abstract:
Textto-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision. We present \textttPISCES, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, \textttPISCES uses OT to bridge text and video embeddings at both distributional and discrete token levels, enabling reward supervision to fulfill two objectives: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic, spatio-temporal correspondence between text and video tokens. To our knowledge, \textttPISCES is the first to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that \textttPISCES outperforms both annotation-based and annotation-free methods on VBench across Quality and Semantic scores, with human preference studies further validating its effectiveness. We show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.

Abstract:
Diffusion models (DMs) are a stateof-the-art generative method to approximately sample from an unknown distribution. Their training and evaluation primarily rely on an Evidence Lower Bound (ELBO), which relates the Kullback-Leibler (KL) divergence of model samples to the score matching loss along the path, which serves as a tractable surrogate. The difference between sample quality and the score matching loss produced by this bound leads to the score matching gap, which is known to be tight in the worst-case but not descriptive of sample quality in general. In this work, we provide a theoretical analysis of this gap, developing tighter bounds for three metrics: KL divergence, reverse KL divergence, and Wasserstein distance, effectively exploiting the regularity of the class of score estimators. Our results suggest that the quality of the score approximation has more impact on closing the score matching gap for low noise scales. To obtain these bounds, our key technical insight is to exploit the contraction properties of the backward processes. In particular, we rely on entropy flows, logarithmic Sobolev inequalities and reflection couplings, rigorously linking the ergodicity of the Langevin diffusion to the score matching gap problem.

Abstract:
LLMbased deep research agents are largely built on the ReAct framework. This linear design makes it difficult to revisit earlier states, branch into alternative search directions, or maintain global awareness under long contexts, often leading to local optima, redundant exploration, and inefficient search. We propose Re-TRAC, an agentic framework that performs cross-trajectory exploration by generating a structured state representation after each trajectory to summarize evidence, uncertainties, failures, and future plans, and conditioning subsequent trajectories on this state representation. This enables iterative reflection and globally informed planning, reframing research as a progressive process. Empirical results show that Re-TRAC consistently outperforms ReAct by 15–20% on BrowseComp with frontier LLMs. For smaller models, we introduce Re-TRAC-aware supervised fine-tuning, achieving state-of-the-art performance at comparable scales. Notably, Re-TRAC shows a monotonic reduction in tool calls and token usage across rounds, indicating progressively targeted exploration driven by cross-trajectory reflection rather than redundant search.

Abstract:
Positional encodings (PEs) enhance the power of graph neural networks (GNNs), both theoretically and empirically. Two of the most popular families of PEsspectral (e.g., Laplacian eigenspaces, effective resistance) and random walk (polynomials of the adjacency matrix)---are theoretically equivalent in expressive power, and both are known to lie between the 1-WL and 3-WL tests in terms of expressivity. However, this equivalence assumes the GNN uses the "complete'' version of these PEs, which requires O(n^3) time and space complexity. Practitioners therefore commonly use truncated variants of these encodings (e.g., the first k eigenspaces or powers of adjacency matrix). However, the theoretical properties of these truncated PEs are unknown. In this work, we initiate the study of these truncated PEs. Theoretically, we show that, under truncation, several families of PEs are fundamentally different in expressive power. As a corollary, we show that truncated spectral PEs are no longer stronger than the 1-WL test. We also study a family of spectral PEs, the k-harmonic distances, to highlight the differences in expressive power of even closely related truncated PEs. Finally, we experimentally show that a mix of truncated PEs is preferable to any single family on real-world datasets.

Abstract:
Finetuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning, yet the internal mechanisms governing these attacks remain a black box. Previous research on interpretability for LLM safety tends to focus on alignment, jailbreak, and hallucination, but overlooks backdoor mechanisms, making it difficult to understand and fully eliminate the backdoor threat. In this paper, aiming to bridge this gap, we explore the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework. We first introduce the Backdoor Probe that proves the existence of learnable backdoor features encoded within the representations. Building on this insight, we further develop Backdoor Attention Head Attribution (BAHA), efficiently pinpointing the specific attention heads responsible for processing these features. Our primary experiments reveals these heads are relatively sparse; ablating a minimal ～ 3% of total heads is sufficient to reduce the Attack Success Rate (ASR) by over 90%. More importantly, we further employ these findings to construct the Backdoor Vector derived from these attributed heads as a master controller for the backdoor. Through only 1-point intervention on single representation, the vector can either boost ASR up to ～ 100% \uparrow on clean inputs, or completely neutralize backdoor, suppressing ASR down to ～ 0% on triggered inputs. In conclusion, our work pioneers the exploration of mechanistic interpretability in LLM backdoors, demonstrating a powerful method for backdoor control and revealing actionable insights for the community.

Abstract:
Interpretable methods such as Vision–Language Models (VLMs) have advanced affect analysis by aligning images with textual descriptions. However, relying on text as an intermediate proxy faces critical limitations: linguistic templates are inherently discrete, making them fundamentally incompatible with continuous Valence–Arousal (VA) regression, while also acting as a bottleneck for finegrained visual nuances. Cognitive psychology suggests that human affective perception is not mediated by linguistic translation, but is grounded in direct perceptual resemblance to internalized Visual Archetypes. Motivated by this, we propose AURA, a archetype framework that replaces brittle linguistic proxy with a self-organizing archetype manifold. By adaptively allocating representational density based on data complexity, AURA enables precise continuous regression and reshapes affective taxonomies, decomposing coarse labels into interpretable, geometrically coherent visual primitives. This paradigm offers a transparent, perceptually grounded decision trail, achieving state-of-the-art performance across discrete and continuous tasks.

Abstract:
Generative video diffusion models have shown strong robustness to large motion and occlusions for video frame interpolation (VFI). However, their inference efficiency lags significantly behind learningbased methods due to the structural redundancy of pairwise inference and the procedural latency of multi-step iterative denoising. To address these limitations, we propose SpeedVFI, a one-step diffusion framework that achieves dual efficiency improvements by interpolating the entire video sequence in a single forward pass to eliminate pairwise overhead, and distilling the generation trajectory into a one-step denoising process to bypass iterative latency. To support this high-efficiency architecture, we introduce temporal RoPE alignment to ensure temporal consistency across the unified sequence, and noise-centric partial attention to reduce computational overhead while preserving global context. Extensive experiments demonstrate that SpeedVFI accelerates diffusion-based VFI by orders of magnitude while maintaining competitive quantitative and visual quality.

Abstract:
The human brain is a biologically instantiated ondevice neural system that integrates both learning and inference in a unified architecture, which enables rapid and flexible learning on-the-fly. This extraordinary capability is achieved through non-BP learning mechanisms, whereas BP is computationally and memory intensive that unsuitable for on-chip edge learning. Zeroth-order (ZO) optimization methods, which resemble biologically plausible perturbation-based learning, offer a promising alternative that enables learning with only forward passes and hence can significantly reduce the complexity of on-chip hardware implementation. However, in this work we show that applying ZO methods to spiking neural networks (SNNs) is non-trivial due to the step-function nature of spiking activation. We analyze the challenges posed by the spiking activation, and reveal a variance amplification effect of it. Based on this insight, we propose a subspace-based ZO (SZO) method that leverages the intrinsic low-dimensional structure of the SNN optimization trajectory. By learning in a low-dimensional subspace, SZO substantially enhances ZO learning efficacy, achieving accuracy comparable to first-order (FO) methods with faster learning speed than full-space BP. We evaluate SZO on model training from scratch, continual training, and unsupervised adaptation. Experimental results demonstrate that SZO closely approaches FO training performance for the first time while offering fast learning speed.

Abstract:
Determining the complexity, or Intrinsic Dimension (ID), of data is fundamental to efficient and interpretable representation learning. This is particularly challenging in multimodal settings when trying to learn disentangled representations for shared and private information. Existing techniques leave a critical gap: they are often static, uni-modal, or in the case of contrastive methods, adapt only to the shared ID implicitly. We introduce Fidelity-Guided Rank Optimization (FiGuRO), a framework for learning the complete ID structure of uni- and multi-modal data. FiGuRO learns the dimensions of low-rank projections using truncated singular value decomposition and an algorithm that determines when to reduce or increase dimensionalities and in which latent spaces. Disentanglement of shared and private information arises as an emergent property of this optimization, eliminating the need for complex auxiliary loss functions. We demonstrate that FiGuRO outperforms existing ID estimation techniques and is more robust to hyperparameter changes. Across simulations and real-world data, FiGuRO captures distinct ID scales and varying subspace ratios, and decomposes shared and private information successfully. Furthermore, we show that FiGuRO can be applied to modern uni-modal pretrained models, enabling efficient, post-hoc disentanglement of multi-modal representations.

Abstract:
Current VisionLanguage-Action (VLA) models excel at robotic manipulation but often struggle with non-Markovian tasks requiring long-term memory and reasoning due to their reliance on immediate observations. Existing solutions face a frequency-competence paradox, where high-performance models are too slow for real-time control, while faster models lack sufficient reasoning capabilities. To resolve this architectural misalignment, we proposeHiMe, a Hierarchical Embodied Memory framework that decouples embodied intelligence into a high-frequency Executor for execution, a Sentry for working memory, and a Planner for long-term strategy. We also introduce a dynamic knowledge system based on cross-modal semantic schemas and active management mechanisms, allowing robots to maintain memory plasticity through "Add, Update, and Delete" operations. This hierarchical design effectively balances the conflict between real-time execution and slow thinking planning, significantly improving success rates in long-horizon tasks. Experiments demonstrate that this approach not only outperforms flat memory baselines but also exhibits the novel ability to self-correct its internal knowledge based on human preferences.

Abstract:
Testing conditional independence is a fundamental yet inherently difficult challenge, as controlling Type I error is impossible in general. The recently popular "ModelX" paradigm offers a solution by relying on a perfectly known conditional distribution. In traditional "one-shot" testing regimes, slight deviations from perfect knowledge are sometimes allowable, but existing work in more realistic online settings has required exact adherence to Model-X. We propose a new approach for sequential testing of conditional independence that is far more robust to estimation errors in the conditional distribution. Our method, based on online optimization of the Kernel Conditional Independence statistic, introduces a novel normalization and "truncate-and-shift" calibration strategy to the testing-by-betting paradigm. This framework greatly improves validity with estimated conditionals while still providing high power across high-dimensional synthetic benchmarks and real-world fairness tasks.

Abstract:
This paper studies simple transformers on a highorder Markov chain, where the model must incorporate knowledge from multiple past positions, each with different statistical importance. We show that transformers learn the task incrementally, with each stage induced by the acquisition or copying of information from a subset of positions via a sparse attention pattern. Notably, the learning dynamics transition from competitive, where all heads focus on the statistically most important attention pattern, to cooperative, where different heads specialize in different patterns. We explain these dynamics using a set of simplified differential equations, which characterize the stage-wise learning process and analyze the training trajectories. As transformers progress through these stages, they climb a complexity ladder defined via simpler misspecified hypothesis classes until reaching the full model class. Overall, our work provides theoretical explanations for how transformers learn in stages even without an explicit curriculum and provides insights into the emergence of complex behaviors and generalization, with relevance to applications such as natural language processing and algorithmic reasoning.

Abstract:
Researchers have shown that neural similarity among humans predicts social closeness and cooperative success, whereas innovation often emerges from interactions among dissimilar individuals. We investigate whether these principles extend to artificial intelligence by examining interactions between large language models. In our experiments, 276 model pairs interact across eight games spanning both cooperation and novelty. We find that pairs with more similar representation spaces achieve significantly higher cooperation but exhibit reduced novelty and creativity. The effects of representational similarity on cooperation and novelty remain robust even after isolating other factors such as performance disparity and model size. We also find that similarity in the early layers consistently exhibits the strongest effect across games, compared to the middle and later layers. This suggests that a central factor underlying the observed trend is the extent to which the two models share lexical and semantic grounding. These findings suggest that representational similarity can be an important consideration in multiagent system design.

Abstract:
Current datadriven scientific modeling struggles with a functional dichotomy: neural operators exhibit spectral bias in high-frequency regimes, while physics-constrained paradigms suffer from optimization pathologies. To bridge this gap, we propose Adaptive Physics-Informed Computing (APIC), a neuro-symbolic meta-architecture designed with structural reconfigurability to encode diverse domain priors. Crucially, APIC integrates a gradient-isolated interaction strategy that mechanistically decouples the optimization paths of parameter identification and residual correction, effectively mitigating gradient interference/conflicts. By instantiating this framework for nonlinear dissipative systems, we derive the Generalized Kuramoto-Sivashinsky-Cahn-Hilliard (G-KSCH) kernel, providing a unified representation for sparse dynamic identification. Extensive experiments demonstrate that APIC establishes new benchmarks in 3D compressible supersonic shock wave prediction, surpassing diverse architectures (e.g., CNNs and Transformers) by 20% to 70% in predictive accuracy. Notably, APIC achieves Pareto-optimal performance, delivering superior precision with reduced computational overhead compared to SOTA models, while exhibiting robust cross-domain generalization across meteorological and urban traffic datasets.

Abstract:
Continual training of deep neural networks under nonstationarity often leads to a progressive loss of plasticity, eventually limiting further learning. We define plasticity loss as the network’s diminishing ability to make reliable progress under gradient descent updates in all output-space directions, and identify departures from dynamical isometry (i.e., drift of the layer-wise Jacobian singular values away from one) as a key mechanism driving this loss. We first revisit a class of networks that are isometric almost-everywhere while remaining universal Lipschitz function approximators, demonstrating that isometry is compatible with expressive nonlinear function classes. Turning to more general architectures, we study an efficient isometry-promoting regularization scheme for continual learning. We analyze its interaction with common activation functions, and reveal a mechanism by which it can reactivate dead ReLU units. To integrate this regularization with adaptive optimization, we propose AdamO, an Adam-style optimizer that decouples isometric regularization from gradient updates, analogous to AdamW. Finally, we evaluate our methods in supervised and reinforcement learning settings designed to induce plasticity loss and show that they effectively preserve plasticity while also yielding strong performance.

Abstract:
Multitask learning (MTL) aims to construct a joint model for multiple tasks by sharing a common representation across domains. To achieve this goal, existing optimization-centric methods either balance task gradients or modify the shared architecture. However, as these approaches remain agnostic to the content of the shared representation, they fail to disentangle task-relevant structure from spurious context, leading to negative transfer and poor generalization. To overcome this limitation, we propose Causal Orthogonal Representations for Multi-Task Learning (CORE-MTL) , a representation-centric framework that structurally disentangles the shared representation into semantic and residual streams, concentrating task-relevant structure in the semantic stream while relegating nuisance variation to the residual stream. We instantiate this framework in the visual domain by leveraging physical priors for structured scenes and statistical constraints for attributes. Theoretically, our method enjoys a tighter out-of-distribution generalization bound than optimization-centric methods and reduces task gradient interference without explicit gradient projection or reweighting. Empirically, CORE-MTL consistently outperforms existing methods on visual multi-task benchmarks in both in-distribution and out-of-distribution settings.

Abstract:
Identifying and estimating causal relationships from data is an important component of the scientific research process because it enables researchers to understand how variables affect one another. While large language models (LLMs) show potential for assisting research workflows, their ability to perform causal inference in empirical studies remains underexplored, despite the importance of causality in domains such as medicine and public policy. To address this gap, we introduce CauSciBench, a benchmark for evaluating LLMs' ability to perform endto-end causal inference autonomously to answer causal questions that arise in empirical research. CauSciBench contains over 300 evaluation tasks derived from real-world studies across multiple disciplines, synthetic scenarios, and textbook datasets. Prior causal inference benchmarks primarily evaluate whether LLMs can implement user-specified methods. In contrast, CauSciBench evaluates performance across the full causal analysis pipeline, including variable selection, method selection, causal effect estimation, and statistical interpretation. We evaluate seven frontier models using several test-time scaling strategies, including Chain-of-Thought, Program-of-Thought, and ReAct. Results show a clear performance gap between real-world and synthetic settings, highlighting limitations in current agentic capabilities for data-driven causal analysis.

Abstract:
Sparse Mixtureof-Experts (MoE) architectures enable scaling LLM parameters under a fixed inference budget by activating only a small subset of experts via top-k routing. While this preserves causality and suits autoregressive language models, the discrete top-k operator is not differentiable, forcing a fixed number of active experts per input and resulting in inefficient use of computation. We propose SoftMoE, which replaces discrete routing with a truncated soft top-k LapSum relaxation, allowing gradient-based optimization of expert routing. We further parameterize the mean number of active experts per layer and impose a global budget constraint, enabling the model to learn how to allocate expert capacity across layers. SoftMoE remains fully compatible with autoregressive modeling and achieves performance comparable to or better than sparse MoE on language modeling and downstream tasks, while activating significantly fewer experts. Notably, the learned allocation is highly non-uniform, with later layers activating more experts.

Abstract:
Large language models (LMs) are typically posttrained via RL to produce a single best answer per query, implicitly optimizing for modal correctness. While effective for benchmark accuracy, this approach is unideal for many applications of interest such as in medical diagnosis, which would benefit from models generating a set of plausible answers (ideally paired with uncertainty estimates).This paper describes a multi-answer reinforcement learning (RL) approach for enabling LMs to do this, where we modify the RL objective to train models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model’s generative process. We instantiate this approach through Multi-Answer Reinforcement Learning with Verifiable Rewards (Multi-RLVR), which generalizes ordinary RLVR to the multi-answer case with a set-level reward. We further extend this approach to Multi-Answer Reinforcement Learning with Calibrated Rewards (Multi-RLCR) which adds a set-level Brier score-based calibration objective to enable LMs to output calibrated uncertainty estimates associated with each answer in the output set. Multi-answer training promotes explicit representation of alternative hypotheses rather than repeated generation of the dominant mode. Across question-answering and medical diagnostic benchmarks, we observe improved diversity, recall, and set-level calibration scores compared to single answer-trained baselines. We further observe that models trained with our approach are more token-efficient, requiring fewer tokens to generate multiple answers than competing approaches. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling.

Abstract:
Reinforcement learning has driven the mass adoption of large language models by unlocking unexpected capabilities, yet this approach remains largely underexplored for generative DNA models. We investigate whether similar posttraining techniques can induce emergent biological realism in DNA language models, using plasmid generation as a testbed due to plasmids' relative simplicity, well-characterized functional constraints, and ubiquity in biotechnology. Using Group Relative Policy Optimization with a reward function based on constraints from engineered biology, our model achieves a 77% quality control pass rate compared to 5% for the pretrained baseline. Remarkably, beyond explicitly optimized features, the model exhibits surprising biological parallels: generated sequences match natural plasmids in thermodynamic stability, codon usage patterns, and ORF length distributions, properties not explicitly optimized in the reward function. These results suggest that RL post-training can steer DNA language models toward biologically coherent regions of sequence space, analogous to how such techniques unlock unexpected capabilities in natural language models, particularly in verifiable domains.

Abstract:
Neural operators have shown promise in accelerating PDE solvers, yet they remain unreliable for the sparse linear systems induced by discretization due to limited generalization across physical parameters and insufficient accuracy, and hybrid neural iterative schemes face stagnation as the residual distribution evolves over iterations. To address these limitations, we propose GRANS (Generalizable Residual-Aware Neural Solver), a neuralized iterative paradigm that performs residual-aware subspace corrections by mapping the residual to the matrix graph, generating multi-scale correction subspaces via a residual-aware basis generator, and applying projected updates through a differentiable subspace projection. G-RANS is trained fully self-supervised via progressive bootstrap with multi-stage residual distributions. On sparse FEM systems from a representative suite of second-order elliptic PDEs (Poisson, advection--diffusion, reaction--diffusion, and Helmholtz), G-RANS is robust to severe coefficient shifts (up to \pm70%) and shows strong cross-equation generalization. G-RANS reaches relative residuals on the order of 10^-5, substantially outperforming end-to-end neural operators, and attains the same target residual with 2--4× less wall-clock time than classical Krylov solvers.

Abstract:
Crossarchitecture migration of high-performance libraries dictates ecosystem readiness on emerging hardware. The challenge is twofold: disentangling library-scale dependencies and performance-critical kernels with ISA-specific SIMD intrinsics, often trading migration speed for peak performance. While LLM-based agents offer a promising approach, are confined to function-level tasks or scalar code, failing to assess agents’ capabilities and limitations in realistic, library-scale migration. We present CLAM-Bench (Cross-architecture Library-scale Agent Migration benchmark), featuring 85 critical kernels from widely used libraries, including OpenCV, libjpeg, and NCNN. It supports comprehensive evaluations of compilability, correctness, and performance across major transitions: ARM→RISC-V, x86→ARM, and ARM→LoongArch. Evaluation of 12 SOTA agent-LLM combinations on CLAM-Bench reveals that, due to the lack of library-level navigation and hardware-aware optimization, agents regress to superficial pattern matching, yielding only 20.88% correctness and 0.83x speedup for libjpeg. Motivated by these findings, we further propose FSCM, a multi-agent framework incorporating hardware-aware global reconfiguration and performance optimization. FSCM improves OpenCV correctness to 71%. The benchmark and code are available at https://anonymous.4open.science/r/clam_bench-D8EB/.

Abstract:
Shapley values are a principled attribution measure widely used in interpretable machine learning, but their exact computation scales exponentially with the number of players, motivating a wide range of approximation methods based on valuefunction evaluations of sampled coalitions. This raises the question of whether approximation accuracy can be improved by adaptively selecting coalitions for evaluation based on previous outcomes. This is particularly relevant in settings where the value function is costly, and the number of evaluations is severely limited, such as retraining-based feature importance, data valuation, and hyperparameter importance. For this purpose, we propose \textttShaplEIG, a Bayesian experimental design approach that approximates the expensive value function via a Gaussian process surrogate and adaptively selects coalitions based on their expected information gain about the Shapley values. Since Shapley values are a linear function of the value function, we show that the expected information gain is available in closed form and efficiently computable. In extensive experiments across diverse costly applications, our method consistently improves estimation accuracy over state-of-the-art baselines.

Abstract:
We introduce Finslerian Reinforcement Learning (FiRL), an RL framework that makes directional costs explicit and improves robustness to tail risk. FiRL incorporates a Finsler metric into the locomotion cost, expressing effort as F(x,v) that depends on the state x and motion v, so it can capture uphill versus downhill asymmetry, lateral slip, and other directiondependent effects. To handle rare but catastrophic outcomes, FiRL optimizes a Conditional Value-at-Risk objective. We derive the corresponding risk-sensitive Bellman equation and show that the resulting CVaR–Finsler Bellman operator is a \gamma-contraction. This guarantees a unique fixed-point value function, which induces a quasi-metric structure that satisfies a triangle inequality despite directional asymmetry. We then develop a FiRL actor–critic algorithm to learn policies under this anisotropic, risk-averse objective. Across MuJoCo and Isaac Sim locomotion benchmarks, FiRL consistently learns safer and more energy-efficient behaviors than strong baselines such as risk-neutral PPO. For instance, on a 12^\circ sloped Hopper task, FiRL reduces worst-case impact forces by over 35% and total energy cost by 15%, while also improving success rate.

Abstract:
Full finetuning of large language models (LLMs) incurs prohibitive computational and storage costs. Parameter-efficient fine-tuning (PEFT) addresses this limitation, with Low-Rank Adaptation (LoRA) gaining widespread adoption due to its simplicity and zero inference overhead. However, LoRA and its variants typically rely on uniform rank allocation or a single importance metric such as gradient magnitude or output sensitivity to guide rank distribution. This approach fails to recognize that gradient magnitude and output contribution are decoupled properties, leading to suboptimal allocation where critical layers are under-provisioned while less important ones waste capacity. To address this challenge, we propose COBRA, a principled framework integrating dual importance factors for adaptive rank allocation. COBRA operates in three stages: (1) layer conductance attribution quantifies each layer's contribution via path-integral attribution; (2) dual-factor aggregation combines contribution with adaptation demand, producing the TA-LC distribution; and (3) Bayesian rank allocation translates this distribution into optimal heterogeneous ranks via variational optimization. Layer conductance provides layer-level interpretability by explicitly quantifying how much each layer contributes to predictions without redundancy, directly aligning with the granularity of rank allocation decisions and enabling principled cross-layer comparison for rank distribution. Experiments across diverse architectures and tasks demonstrate that COBRA consistently outperforms existing methods, achieving up to 1.6 points improvement on GLUE and 6.6% average gain in high-rank regimes under comparable parameter budgets.

Abstract:
Autoformalization—translating mathematical problems from natural language into formal proof assistant code—is essential for rigorous machine reasoning. However, existing evaluation frameworks provide only opaque binary verdicts or scalar scores, offering no interpretable insight into where or why translations fail. This opacity severely limits both human understanding and automated system improvement. To bridge this gap, we introduce FormalRX, a comprehensive diagnostic evaluation framework that transforms autoformalization assessment from blackbox judgments into actionable feedback. At its core is \textscSci Error Taxonomy, a hierarchical classification scheme decomposing autoformalization errors into 28 distinct categories with strict priority ordering. Building on this taxonomy, FormalRX provides four critical diagnostic capabilities: (1) alignment verdicts, (2) error categorization, (3) error localization, and (4) correction. Our diagnostic model % specialized diagnostic models and evaluate them across all tasks. Our model FormalRX-8B, achieves F1-scores of 0.88 (verdict) and 0.71 (categorization), along with accuracies of 0.75 (localization) and 0.73 (correction), substantially outperforming both general-purpose LLMs and specialized baselines. By connecting evaluation with actionable insights, FormalRXenables systematic diagnosis and improvement of autoformalization systems.

Abstract:
Causal inference in modern largescale systems faces growing challenges, including high-dimensional covariates, multi-valued treatments, massive observational (OBS) data, and limited randomized controlled trial (RCT) samples due to cost constraints. We formalize treatment-induced structural non-overlap and show that, under this regime, commonly used weighted fusion methods provably fail to satisfy randomized identifying restrictions.To address this issue,we propose a constrained joint estimation framework that minimizes observational risk while enforcing causal validity through orthogonal experimental moment conditions. We further show that structural non-overlap creates a feasibility obstruction for moment enforcement in the original covariate space.We also derive a penalized primal–dual algorithm that jointly learns representations and predictors, and establish oracle inequalities decomposing error into overlap recovery, moment violation, and statistical terms.Extensive synthetic experiments demonstrate robust performance under varying degrees of non-overlap. A large-scale ride-hailing application shows that our method achieves substantial gains over existing baselines, matching the performance of models trained with significantly more RCT data.

Abstract:
Accurate mechanical properties (or materials) Young's modulus (E), Poisson's ratio (\nu) and density (\rho) are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate dense spatiallyvarying (E, \nu, \rho) for input 3D objects across representations, improving the resolution, accuracy, and memory efficiency over the state-of-the-art. The foundation of our technique is a sparse and adaptive voxel structure SAV that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution 16^3× higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with lesser test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.

Abstract:
Mixture of Experts architectures have recently advanced the scalability and adaptability of Large Language Models for continual multimodal learning. However, extending these models to accommodate sequential tasks remains challenging. As new tasks arrive, naive model expansion leads to rapid parameter growth, while modifying shared routing components often causes catastrophic forgetting, undermining previously learned knowledge. To address these issues, we propose CoPE, a continual learning framework for LLMs that requires no replay data of previous tasks and ensures both parameter efficiency and robust knowledge retention. Our approach introduces the ProbeGuided Knowledge Extension mechanism, which uses probe experts to dynamically determine when and where new experts should be added, enabling adaptive and minimal parameter expansion tailored to task complexity. To support inference without task labels, we further incorporate a Probabilistic Task Locator that dynamically matches inputs to the correct task-specific components. To handle the practical issue that task labels are unknown during inference, we leverage a VAE-based reconstruction strategy to identify the most suitable router by matching input distributions, allowing automatic and accurate expert allocation. This design mitigates routing conflicts and catastrophic forgetting, enabling robust continual learning without explicit task labels. Extensive experiments on the CoIN benchmark, covering eight diverse VQA tasks, demonstrate that CoPE delivers strong continual learning performance with a compact model size, significantly reducing forgetting and parameter overhead compared to prior methods. These results showcase the effectiveness and scalability of our approach for parameter-efficient continual learning in large language models. Our code will be open-sourced soon.

Abstract:
Load Balancing has emerged as a critical problem in expertparallel distributed inference of Mixture-of-Experts (MoE) models. As routing distributions are typically skewed across experts, devices hosting lighter-loaded experts must idle to wait for the heaviest during expert computing, leading to inefficiency. Existing load-balancing approaches primarily rely on expert replication or migration within each layer, which introduce additional overhead and limit their flexibility and scalability. To address this problem, we propose EasyBalance, a cross-layer load balancing strategy for expert-parallel MoE inference. EasyBalance requires no modifications to the expert-device mapping, enabling instant adaptability and incurring essentially no additional overhead. Our key insights are that (1) experts from other layers can be viewed as naturally redundant, and (2) expert workloads of multiple layers, if from different micro-batches, can be jointly executed. Based on these observations, EasyBalance greedily schedules subsets of cross-layer workloads at each expert-computation stage, while deferring the remaining workloads for future balancing opportunities. Extensive experiments across different models, tasks, and parallelism configurations demonstrate that EasyBalance consistently accelerates expert-parallel inference, reducing GPU idling by uniformly over 40%.

Abstract:
Existing reasoning segmentation approaches typically finetune multimodal large language models (MLLMs) using image-text pairs and corresponding mask labels. While recent efforts leverage reinforcement fine-tuning to further enhance reasoning ability, they often suffer from overthinking and produce uniformly verbose reasoning chains irrespective of task complexity. To address this problem, we propose PixelThink, a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty to regulate reasoning generation within a reinforcement learning paradigm. The model learns to compress reasoning length in accordance with scene complexity and predictive confidence. To support comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark with annotated reasoning references and difficulty scores, along with a suite of metrics designed to assess segmentation accuracy, reasoning quality, and efficiency jointly. Experimental results demonstrate that the proposed approach not only improves segmentation performance but also significantly reduces inference latency by 30.4%, cutting token usage by 48.2%. The code and model will be publicly available.

Abstract:
Large language models (LLMs) trained on natural language data are capable of translating between languages, predict chess moves, and write poetry. Performance on a given task depends on directly relevant training data, yet confounders abound: data in related languages has been shown to help lowresource languages, and training on code has been shown to improve reasoning capabilities in natural language generation. Formal languages have become a common tool for understanding the learnability of language model architectures and their limitations---we argue that they should also be treated as multi-task learners when studying the learnability of a given \emphtask. This means that to understand the learnability of a given property of a formal language, confounders from other tasks need to be considered. We propose a causal graphical model and an efficient sampling mechanism for probabilistic finite-state automata that gives full control over the occurrences of a given task while maintaining other language properties. To enable targeted evaluation, we derive task-specific decomposed KL-divergences. These tools allow us to know the \emphcausal relationship between how often a task appears and its true learnability. Our experiments confirm that the correlation between task occurrences and learnability does not recover the accurate relationship---for this, the causal analysis and machinery is necessary.

Abstract:
Visuallanguage action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAM) learn latent action representations from visual dynamics to provide additional supervision for VLA learning. However, LAM and VLA are typically trained separately, leaving LAM ungrounded during VLA training and VLA models constrained by frozen LAM representations. To address these issues, we propose Latent Action Representation Alignment (LARA), a plug-and-play framework that jointly optimizes LAM and VLA via representation alignment. This enables reciprocal benefits where LAMs learn with action trajectories to avoid spurious visual changes, while VLAs are regularized by forward dynamics learned within LAMs to reduce hallucinations of functionally ineffective trajectories. We demonstrate LARA's versatility and effectiveness for pre-training, post-training enhancement of pre-trained VLA models, and LAM refinement, achieving an average of ~10%, ~5%, and ~15% improvement over 3 simulation and 1 meticulously designed real-world robotic manipulation benchmarks.

Abstract:
Federated Learning (FL) often suffers from degraded generalization under statistical heterogeneity, where client updates systematically deviate from the global objective. While recent SharpnessAware Minimization (SAM) methods promote locally flat solutions, they implicitly assume that local flatness transfers to the global model, which generally does not hold under heterogeneous data distributions. This mismatch gives rise to a flatness discrepancy induced by misaligned loss landscapes. To address this issue, we propose FedScar, a federated optimization framework that explicitly corrects heterogeneity-induced geometric inconsistency. FedScar maintains a history-accumulated geometric bias to capture persistent curvature skew across clients, and employs a variance-aware injection mechanism to steer local updates toward regions that are flat with respect to the global objective. We provide a theoretical interpretation of FedScar as a Split-Dual ADMM formulation, which jointly enforces parameter consensus and geometric alignment. Extensive experiments under severe heterogeneity demonstrate that FedScar consistently reduces flatness discrepancy and improves generalization over state-of-the-art methods, without incurring additional communication overhead.

Abstract:
Speculative decoding (SD) addresses the high inference costs of large language models (LLMs) by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draftverify methods use binary decisions: accept or fully recompute. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing, instead of the full verifier. This motivates our slim-verifier to handle tokens requiring moderate verification resources, reducing expensive large-model calls. We proposeVerification viaIntra-Model Routing forSpeculativeDecoding (VIA-SD), a multi-tier framework using a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for uncertain cases. Across summarization, translation, reasoning, QA, and coding tasks on encoder-decoder and decoder-only model families, VIA-SD consistently lowers rejection rates (0.1–0.22) and achieves 10–20% speedup over state-of-the-art SDs. Compared to decoding without drafting, VIA-SD provides 2.5-3× acceleration while improving output quality. Moreover, VIA-SD is compatible with existing SD frameworks without modifying their training procedures. Our results establish multi-tier SD as a general paradigm for scalable and efficient LLM inference. Our code will be publicly available.

Abstract:
Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zerosource constraint, where no model internals or external references are available, and detection must rely solely on the textual query–answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is an Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensures robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection.

Abstract:
Relational Deep Learning aims to learn directly on multitable databases, yet current methods face a fundamental tension: Transformers' quadratic complexity prohibits the large contexts relational data demands, while GNNs sacrifice global context for efficiency. We introduce Ramba, the first selective state-space model for relational databases. Our approach features two innovations: (1) Topology-Aware Linearization, which processes cells via global columnar serialization in O(L) complexity while recovering relational structure through sparse entity and foreign-key attention masks; and (2) Schema Dynamic Gating, which modulates SSM state transitions based on semantic alignment between the currently scanned attribute and the prediction target, enabling cross-table relevance filtering without relying on value distributions. Together, these enable Ramba to ingest vast relational contexts while selectively retaining semantically relevant information, a combination existing architectures cannot achieve. Experiments demonstrate state-of-the-art performance with linear scalability across diverse relational benchmarks.

Abstract:
We study the corruptionrobustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT, Lee et al., 2023). To address the challenge of reward poisoning attacks targeting the DPT, we propose a novel adversarial training framework, called Adversarially Trained DPT (AT-DPT). Our method simultaneously trains a population of attackers to minimize the true reward of the DPT by poisoning environment rewards, and a DPT model to infer optimal actions from the poisoned data. We evaluate the effectiveness of our approach against standard bandit algorithms, including robust baselines designed to handle reward contamination. Our results show that AT-DPT significantly outperforms them in bandit settings under a learned attacker, and generalizes to more complex environments such as adaptive attackers and MDPs. It shows promise in ICRL as a meta-RL approach to learning effective corruption-robust algorithms.

Abstract:
For many NPhard optimization problems, strong theoretical inapproximability results exist. However, in practice, heuristics regularly outperform these pessimistic worst-case results on real-world datasets. Assessing the quality of these algorithms' outputs is often difficult since we lack good lower bounds on the optimal solution. In this paper, we present efficient algorithms for computing lower bounds on the optimal solutions for correlation clustering, which is a popular problem in social-network analysis. Our lower bounds allow us to provide empirical certificates that bound the solution quality of practical algorithms by obtaining instance-specific approximation ratios. Our main technical contribution is an algorithm that approximates an LP relaxation of a related triangle covering problem in near-linear time on sparse graphs; the algorithm is based on the multiplicative weights update framework and runs on graphs with millions of edges in a few minutes. For the concrete problem of correlation clustering, our lower bounds certify that state-of-the-art heuristics achieve almost optimal approximation ratios of 0.94 for the agreement version and 1.97 for the disagreement version (averaged over 7 real-world datasets). We also show similar results for the fundamental max-cut problem.

Abstract:
We study a broad family of momentum Linear Minimization Oracle (LMO) methods that includes normalized SGD with momentum, signbased (Adam-like) directions, and Muon (spectral) updates. Our focus is on subsampling regimes where the classical uniformly-bounded-variance model can be fragile even for finite-sum objectives on unbounded domains. To obtain subsampling-faithful guarantees, we analyze this LMO family under expected smoothness (ABC condition), which captures common sampling schemes. We establish a unified nonconvex convergence theory via a new self-bounding closure that handles the history-coupling induced by momentum under ABC. Our bounds recover known bounded-variance results as a special case and simplify in strong-growth regimes. Specializing to \tau-nice sampling, we derive explicit batch-size scaling laws, predicting that the optimal momentum must increase with the batch size to maximize sample efficiency. We further identify a theoretical optimal batch size that minimizes total sample complexity. Experiments on linear and matrix regression corroborate these predictions, showing a distinct diagonal shift in the optimal momentum-batch landscape that matches our theoretical scaling.

Abstract:
Large language models (LLMs) are trained at significant computational and data cost, making them valuable intellectual property (IP). Existing IP verification methods primarily rely either on invasive watermarking that degrades model utility, or on superficial behavioral signatures disrupted by finetuning and model merging. This apparent trade-off between model utility and IP protection has constrained practical deployment. We challenge this trade-off and propose CircuitPrint, a non-invasive IP fingerprinting framework that enables robust verification through standard model queries by leveraging stable internal computational circuits of LLMs. We show that these circuits function as a persistent computational backbone across model derivatives, allowing them to serve as stable fingerprints for distinguishing LLMs. Building on this stability, CircuitPrint constructs IP signatures by identifying mechanistically essential supernodes that causally produce specific predictions within these circuits. Specifically, trigger queries are synthesized to replicate the internal suppression of these supernodes, thereby inducing distinctive and observable output shifts. Experimental results demonstrate that CircuitPrint substantially outperforms existing baselines while remaining robust under aggressive fine-tuning and model merging, effectively resolving this trade-off without altering model parameters.

Abstract:
Reasoning posttraining with GRPO is typically built onstatic uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this wastes compute on already-solved patterns while under-training the long tail of hard problems. We cast GRPO post-training astwo independent GDRO games(not coupled) overdynamic difficulty groupsdefined online by pass@8: adata adversarythat reshapes prompt sampling and acompute adversarythat redistributes rollouts.Prompt-GDROapplies multiplicative-weights reweighting over bins (with an EMA-debiased difficulty score) to upweight persistently hard groups without frequency bias.Rollout-GDROallocates rollouts across bins under a fixedmeanbudget via a shadow-price controller, improving gradient information efficiency on high-uncertainty groups while remaining compute-neutral. Our approach is principled and theory-driven: we provide no-regret guarantees for the Prompt-GDRO game (via an entropy-regularized GDRO surrogate) and a variance-proxy analysis that yields a square-root optimal compute allocation for Rollout-GDRO. On DAPO 14.1k with Qwen3-Base (1.7B/4B/8B), each controller improves pass@8 by 9-13% over GRPO, and diagnostics reveal an emergent curriculum that tracks the evolving reasoning frontier.

Abstract:
Recent studies have revealed two intriguing phenomena in large language models: massive activations, characterized by a small number of activations exhibiting abnormally large magnitudes, and attention sink, where a disproportionate amount of attention is consistently allocated to specific tokens regardless of their semantic relevance. However, the coemergence and co-existence of these two phenomena remain poorly understood. In this work, we revisit the prevailing view that massive activations are the primary mechanism responsible for concentrating attention on sink tokens, and provide a more nuanced interpretation of their relationship. Through both theoretical analysis and empirical evidence, we demonstrate that massive activations and attention sink jointly act to prevent excessive token mixing in self-attention. Specifically, attention sink suppresses mixing among non-sink tokens, whereas massive activations suppress mixing between sink tokens and non-sink tokens. Furthermore, our theory provides a principled explanation of how the location of massive activations depends on the placement of layer normalization, and why KV-biases and gating mechanisms can remove massive activations while largely preserving attention sink. We further conduct intervention analyses and find that removing the value vector of the sink token can recover attention sink even when massive activations are entirely suppressed. Overall, this work provides a mechanistic perspective on how massive activations and attention sink interact under normalization and self-attention, offering new insights into their functional roles in Transformer models.

Abstract:
We study hybrid Reinforcement Learning (RL) in adversarial Markov Decision Processes (MDPs), where the learner simultaneously receives onpolicy feedback from the executed policy and off-policy feedback from a fixed behavior policy, and loss functions can change arbitrarily over time. On-policy feedback allows exploration and ensures the worst-case guarantee against any comparator policy, while off-policy feedback provides coverage-dependent guarantee that scales with the "mismatch" between the behavior and comparator policies (called coverage ratio) and can be sharper than on-policy results whenever this ratio is small. We propose a new hybrid RL framework that accommodates adversarial losses and unknown transitions, preserving off-policy guarantees while ensuring non-trivial worst-case performance.

Abstract:
Mixtures of Bernoulli product distributions are a simple and widely used latentvariable model, with applications in e.g.\ recommendation systems, crowdsourcing, and medical data analysis. We consider the problem of reconstructing the mixture parameters from oracle access to its probability generating polynomial (PGP), for instance represented by a probabilistic generating circuit (PGC). We show that the parameters are uniquely identifiable for almost all mixtures, and give a randomized algorithm that exactly recovers the mixture weights and component marginals for mixtures of r Bernoulli product distributions over n variables using only O(nr^2) oracle queries. The algorithm repeatedly applies restrictions to O(r) variables, extracts low-degree coefficients, and then recovers the parameters using a moment-based tensor decomposition. To the best of our knowledge, this is the \em first exact reconstruction algorithm in this PGP oracle model with query complexity linear in n and polynomial in r.

Abstract:
Unsupervised Domain Adaptation (UDA) person search aims to transfer a model trained on a labeled source domain to an unlabeled target domain without using target annotations. However, existing UDA methods frequently neglect the issue of scale inconsistency between the source and target domains. These inconsistency arises from different variations in Camera height, tilt angle, and focal length change. To address this challenge, we propose a ScaleAware Consistent Alignment Learning (SCALE) framework. Specifically, we propose a Scale-aware Domain Harmonization (SDH) adaptively harmonizes semantic and structural scales through cross-path interaction and consistency refinement to alleviate cross-domain scale inconsistency. To further improve the pseudo-label inaccuracies, we introduce a Bidirectional Cluster Regularization (BCR) strategy, which obtains more reliable pseudo-labels by refining the results a second time. By collaboratively alleviating the impact of scale misalignment and enhancing pseudo-label reliability, our approach achieves state-of-the-art performance on two benchmark person search datasets, with 82.3% mAP and 84.0% top-1 on the CUHK-SYSU dataset, 41.7% mAP and 82.4% top-1 on the PRW dataset.

Abstract:
Weight space learning aims to learn representations of neural network (NN) weights, enabling different downstream tasks. Existing approaches show promising performance, but lacking a way to shape these weightspace representations using information about the datasets the models were trained on, thus limiting downstream applications. We propose to learn a dataset-aligned latent space for neural networks, where datasets information is induced during training. The NNs are encoded as latent representations using an autoencoder, while dataset samples are encoded using a dataset encoder. The two representations are aligned using a contrastive objective, effectively reshaping the weight-space representations according to the datasets. We demonstrate that such representations can be used for different downstream tasks, including mapping dataset information to a weight-space representation that decode to strong models. In addition, we introduce a latent refinement process for generating models that outperforms standard fine-tuning. Overall, our results demonstrate that explicitly incorporating dataset information improves what can be achieved with weight-space representations across retrieval, generation, and refinement.

Abstract:
Evaluation benchmarks play a central role in assessing vision–language models (VLMs). However, most existing multimodal benchmarks are static, making them increasingly vulnerable to data contamination, temporal staleness, and high construction costs. In this work, we introduce MMBenchLive, a multi-agent-driven dynamic multimodal benchmark that supports continuous updates without human in the loop. MMBench-Live is maintained through an end-to-end automated pipeline that integrates structured benchmark description, real-time data acquisition, and verifiable question–answer (QA) generation, enabling scalable, live, and low-cost benchmark updates. To ensure reliable evaluation across versions, we further propose a distribution-consistent updating strategy based on semantic task interpretation and feedback-driven data collection and filtering. Based on MMBench-Live, we conduct systematic evaluations of multiple open-source VLMs and analyze their performance, cross-version consistency, and data contamination, providing empirical evidence for the effectiveness of the proposed dynamic benchmark updating framework.

Abstract:
Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on shorthorizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are designed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tools management, and interaction with environments. Under the heaviest scale setting, trajectories average 200k+ tokens and 400+ tool calls, whereas in standard configurations they still exceed 35k tokens and involve more than 60 tool calls on average. Our extensive experiments reveal that agents powered by state-of-the-art LLMs consistently underperform in these settings, whereas human participants achieve much higher scores, underscoring a persistent gap in agents' long-horizon exploration abilities. We also observe that simple scaling fails in our task. To better illustrate the failure of agents, we conduct an in-depth analysis of collected trajectories. We identify eight types of errors and attribute them to two primary causes: in-context locking and functional fundamental capability gaps.

Abstract:
Bioacoustic foundation models rely on largescale citizen science platforms like Xeno-Canto for geographically and ecologically diverse data. Recent work has shown that supervision alone can produce SotA species detection models when trained on this large-scale data---however, there remains unutilized potential in the form of recording metadata readily available within these community-driven data hubs. In this work, we explore the use of metadata---such as location and time---as auxiliary supervision signals, allowing the model to leverage species-metadata correlations in its learned representation. Auxiliary metadata losses provide additional information beyond vocalizations alone that can encourage a richer, more robust representation that generalizes better to species distribution and acoustic domain shifts---important challenges for deployment in real-world passive acoustic monitoring (PAM) settings. We introduce MetaBio, a new foundation model that achieves strong species identification performance across multiple challenging domains and present an extensive empirical study of the effects of 9 diverse metadata sources on 17 bioacoustic datasets.

Abstract:
Executiongrounded LLM agents offer a promising paradigm for learning from tool feedback, but such feedback is often expensive and slow to obtain, making online reinforcement learning (RL) impractical. High-coverage hardware verification exemplifies this challenge due to its reliance on industrial simulators and non-differentiable execution signals. We propose LLM4Cov, an offline agent-learning framework that models verification as memoryless state transitions guided by deterministic evaluators. Building on this formulation, we introduce execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling to enable scalable learning under execution constraints. We further curate a reality-aligned benchmark adapted from an existing verification suite through a revised evaluation protocol. Using the proposed pipeline, a compact 4B-parameter model achieves 69.2% coverage pass rate under agentic evaluation, outperforming its teacher by 5.3% and demonstrating competitive performance against models an order of magnitude larger.

Abstract:
Gaussian Splatting has achieved significant improvements by incorporating warpingbased techniques. These approaches enhance synthesis quality by warping images from source views into the target viewpoint to compensate for missing or residual pixels. However, such methods suffer from pixel-level inaccuracies due to uncertain geometry. This uncertainty leads to spatial misalignments in the warped images, which disrupt residual learning used in warping-based methods and fundamentally limit the gains of correction, particularly on thin structures and high-frequency details. Driven by our insight that useful visual cues are not lost but locally preserved under slight displacement, we propose Geometry-Aware Deformable Aggregation (GADA). This method introduces an iterative refinement module with deformable offsets to actively correct spatial misalignments and recover these displaced cues. Furthermore, to address the limitations of standard pipelines where visibility checks (i.e., thresholding) often discard valid pixels and multi-view warped image fusion relies on naive mean aggregation, our module is coupled with an implicit confidence weighting mechanism that selectively suppresses unreliable evidence. Consequently, our approach outperforms prior warping-based Gaussian Splatting, preserving high-frequency quality while achieving 2.13times faster FPS.

Abstract:
Recent years have witnessed significant progress in developing effective diffusion models. Parallel sampling is a promising recent approach that reformulates the sequential denoising process as solving a system of nonlinear equations, and it can be combined with other acceleration techniques. However, current progress is limited by the tradeoff between high fidelity and computational efficiency. This paper addresses the challenge of scaling to high-dimensional, multi-modal generation. Specifically, we present ROPA (Robust Parallel Diffusion Sampling), which takes into account the properties of the denoising process and solves the linear system using adaptive local sparsity to achieve stable parallel sampling. Extensive experiments demonstrate ROPA’s effectiveness: it significantly accelerates sampling across diverse image and video diffusion models, achieving up to 2.9× speedup with eight core, an improvement of 52% over baselines without sacrificing sample quality. ROPA enables parallel sampling methods to provide a solid foundation for real-time, high-fidelity diffusion generation.

Abstract:
Speculative decoding (SD) has emerged as a key solution to accelerate the inference of autoregressive models. However, in the field of image generation, it faces the challenge of low acceptance rates, and directly relaxing its criteria leads to degradation in image quality. In this paper, we propose a novel contentaware speculative decoding algorithm, termed CSD, which integrates an entropy-based probability relaxation mechanism with an optimal resampling strategy to enhance the inference efficiency for autoregressive image generation. By leveraging the informational uncertainty inherent in different regions of an image, CSD dynamically adjusts the acceptance probability of candidate tokens, increasing the acceptance rate in low-detail areas to accelerate generation. Moreover, a distribution alignment filter is introduced to ensure the output distribution to be aligned with the target model, which significantly improves the generative quality. Experiments conducted on Lumina-mGPT and Janus-Pro demonstrate that the superiority of the proposed CSD. Our source codes are released in Supplementary Material.

Abstract:
PositiveUnlabeled (PU) learning is a weakly-supervised paradigm that trains a binary classifier from labeled positive and unlabeled instances. In PU risk estimation, the empirical risk consists of an unlabeled term and a positive term. In this paper, we observe that when labeled positives are scarce, the risk deviation is dominated by the generalization bound of the positive term, which is composed of a complexity term governed by Rademacher complexity and a concentration term governed by the uniform range bound, leading to estimator instability. Based on this observation, we theoretically derive the minimal sufficient learning threshold, defined as the smallest number of labeled positives required to achieve a target excess risk with high probability, and reveal its explicit dependence on both components. Inspired by this insight, we propose ScalePU, which incorporates variance regularization to induce a restricted sub-hypothesis space with reduced Rademacher complexity, and geometric regularization to encourage compact clustering of positive samples with a tighter effective range. Theoretical analysis demonstrates that both mechanisms effectively lower the threshold through improvements to different components of the bound. Experiments on eight benchmark datasets validate the effectiveness of ScalePU, with significant improvements under extreme label scarcity.

Abstract:
Multilingual documents encapsulate rich regional cultures, scientific discoveries, and historical records. Parsing this content into structured, machinereadable formats is critical for unlocking global knowledge. However, existing benchmarks predominantly focus on high-resource languages like English and Chinese, creating a significant evaluation blind spot concerning model performance on the vast spectrum of other languages. While recent Vision-Language Models (VLMs) claim support for hundreds of languages, the lack of comprehensive ground truth makes it impossible to empirically verify these capabilities. To bridge this gap, we introduce MORE, a large-scale, linguistically comprehensive benchmark designed for rigorous multilingual document parsing evaluation. MORE distinguishes itself through three key dimensions: (1) Unprecedented Scale: It covers 149 languages, making it the most linguistically diverse benchmark to date; (2) Structural Complexity: Unlike previous works, it extends evaluation beyond plain text to include complex structural elements such as code blocks, tables, and catalogs; and (3) Data Authenticity: All samples are curated from real-world documents via a rigorous model-assisted, human-refined annotation pipeline. We conduct an extensive evaluation of state-of-the-art models using MORE, establishing new performance baselines for long-tail languages and validating the benchmark's effectiveness in diagnosing model capabilities in realistic, diverse scenarios.

Abstract:
Leveraging the unlabeled stream is crucial yet challenging in SemiSupervised Continual Learning (SSCL) under continual class expansion. Existing SSCL methods typically enforce dense pseudo-label consistency and indiscriminate distillation on unlabeled data, which can reinforce errors and intensify base–novel interference. To address these issues, we propose Discrete-anchored Incremental Learning (DiL) to ground continual updates on reliable discrete anchors that remain stable under noisy pseudo-labels. DiL introduces Discrete Contrastive Distillation (DCD), which discretizes the distillation pathway and performs anchor-referenced selective distillation to curb error reinforcement. Meanwhile, Class-Aware Channel-Chunked Encoding (CACE) learns channel-chunked representations and exploits the confusion matrix induced by the discrete anchors to separate novel from confusable base classes. Extensive experiments on multiple datasets show that DiL achieves state-of-the-art performance across diverse SSCL protocols.

Abstract:
Diffusion models serve as generative priors for dataset distillation, yet existing pipelines rely on persample update rules that evolve each synthetic image independently, limiting their ability to optimize collective set-level objectives. We propose Set-Coupled Guidance (SCG), a plug-and-play auxiliary controller that shifts from per-image to group (IPC-at-once) sampling by injecting set-symmetric feedback at each diffusion step. SCG combines spectral set-point regulation, which aligns set-level statistics to real data via empirical characteristic function matching, with cooperative kernel coupling that stabilizes joint trajectories under noisy feedback. All computations operate on lightweight descriptors extracted from predicted clean latents, adding low overhead to the base method. We provide theoretical analysis including Lyapunov descent and input-to-state stability for distributional tracking. Experiments on ImageNette, ImageWoof, ImageNet-100 and ImageNet-1K show consistent accuracy gains across multiple diffusion-based baselines.

Abstract:
Offline goalconditioned reinforcement learning (GCRL) provides a practical framework for obtaining goal-reaching policies from fixed datasets. However, learning a reliable goal-conditioned value function in long-horizon tasks remains challenging. In this paper, we identify erroneous generalization in goal-conditioned value functions as a fundamental bottleneck, and demonstrate that appropriate inductive bias in the value function is crucial for addressing the bottleneck. Building on these findings, we propose Latent-Aligned Value Learning (LAVL), an offline GCRL algorithm that integrates latent-representation-based value generalization with hierarchical planning in a unified framework. Extensive numerical experiments on OGBench demonstrate that LAVL consistently outperforms existing offline GCRL methods, achieving the highest performance on20out of 22 datasets. Notably, LAVL exhibits strong performance in long-horizon tasks and trajectory stitching datasets, where prior methods suffer significant performance degradation.

Abstract:
The detection of synthetic images has traditionally been framed as a binary classification problem. However, we argue that this formulation overlooks a fundamental structural property of generative datasets: synthetic images are not independent samples, but are implicitly paired with real images sharing the same semantic source. Existing methods treat real and fake images as independent instances, failing to capture generationinduced relational discrepancies in real–fake pairs. Moreover, models tend to rapidly overfit to seen fake patterns, leading to poor generalization to unseen ones. To overcome these challenges, we propose a novel detection framework that explicitly mines real–fake pairs by constructing source-guided mappings or leveraging nearest-neighbor relationships in the CLIP embedding space. We then introduce pair-wise discrepancy learning that explicitly enlarges generation-induced deviations and discrepancy inversion to mitigate overfitting. Moreover, to preserve pretrained semantic representations while improving generalization, we adopt a bias-only fine-tuning scheme that restricts model capacity during adaptation. Extensive experiments show that our approach achieves superior generalization across unseen fake patterns.

Abstract:
Multiagent debate (MAD) systems are often evaluated using binary correctness or peer agreement, yet such evaluations obscure how individual agents revise their answers during social interaction. We study the microscopic dynamics of answer revision in large language models (LLMs) using ConceptARC, a 2D grid-reasoning benchmark that admits quantitative distance measures between candidate solutions. By exposing a target model to controlled configurations of peer answers, we analyze how revision likelihood and direction depend on both social context and the distance between answers and the ground truth. We find that agents are more likely to revise when their answers are farther from the correct solution, and that revisions of incorrect answers exhibit a systematic contraction toward the ground truth, even when the final answer remains incorrect. Conversely, correct answers can be overturned by social pressure, particularly when wrong peers are near-correct. Together, these results show that multi-agent interaction induces structured, distance-aware movements in solution space that are invisible under binary correctness, clarifying when social reasoning leads to improvement, stability, or gradual regression in solution quality.

Abstract:
Language models acquire syntax and world knowledge together, entangling the two in ways that limit efficiency and controllability. We show that syntax can be learned while suppressing semantic plausibility and world‑knowledge cues, yielding more efficient and controllable models. We train tiny LMs on grammatical nonsense — syntactically wellformed text with semantic content ablated via constrained relexicalization (SAMBAL). Models trained on this data perform comparably to standard pretraining on syntactic benchmarks (BLiMP, SyntaxGym) while scoring at chance on world knowledge probes (EWoK). On targeted grammar-plausibility conflict probes, content-neutral models prefer grammaticality where standard models prefer plausibility, and their representations show more syntactic vs lexical alignment. On efficiency, disentanglement yields substantial sample and parameter gains: in low‑resource regimes, a 5M‑parameter model matches a 30M‑parameter baseline at the same data budget. On controllability, content-neutral models adapt rapidly to a new domain with minimal exposure, suggesting the feasibility of modular post‑hoc knowledge specialization.

Abstract:
Virtual screening of billionscale molecular libraries based on 3D shape similarity remains computationally prohibitive, requiring expensive conformational sampling and alignment, as done by established tools like ROCS. Here, we introduce SAND (Shape-Aware Neural Descriptor), a method that can retrieve shape similar molecules from their 2D graph alone. Our approach makes two key contributions: (1) a rank-preserving contrastive learning framework using differentiable Spearman correlation that results into representations where similarity strongly correlates with 3D molecular shape overlap (R=0.86), and (2) an end-to-end learned quantization-aware training scheme that jointly optimizes the encoder with a two-level IVF-PQ discretization step, achieving approximately 4× better compression than post-hoc quantization at equivalent retrieval quality. We demonstrate that SAND enables searching over 10 billion molecules in less than a second on a single GPU node - a speedup of >10^8 × compared to traditional methods. We release open-source code and trained weights to facilitate adoption.

Abstract:
Unsupervised Reinforcement Learning (URL) aims to pretrain scalable, skill-conditioned policies without extrinsic rewards, serving as a foundation for downstream control tasks. Despite recent progress, we argue that current off-policy URL methods are limited by two critical, overlooked bottlenecks: (1) non-stationarity skill semantic and (2) brittle generalization. To address these challenges, we propose GenDa (Generalizable Data-efficient Agent), a unified framework for robust unsupervised reinforcement learning. First, we introduce a skill relabeling mechanism to mitigate non-stationarity and significantly improve sample efficiency for pretraining. Second, we propose a Complementary Information Bottleneck(CIB), encouraging the learned skill policy to focus on ego-centric features and become robust to distribution shifts for downstream tasks. Through various experiments, we demonstrate that GenDa significantly enhances the scalability of URL with superior generality and sample efficiency. Our source codes are available in the supplementary materials.

Abstract:
Textto-image models have advanced high-fidelity content generation, but their inability to maintain subject consistency hampers realistic applications. Existing training-based methods rely on heavy computation and large datasets; while training-free approaches demand excessive memory or complex auxiliary modules. In this paper, we first reveal a key property overlooked in prior works that the identity-relevant signals, termed Identity-Preserving Embeddings (IPemb), are implicitly encoded in textual embeddings of frame prompts. To address the consistent T2I generation with theIPembembedding, we propose Boost Identity-Preserving Embedding (BIPE), a training-free yet plug-and-play framework that explicitly extracts and enhances theIPemb. Its core innovations are two complementary techniques: First, Adaptive Singular-Value Rescaling (adaSVR) applies singular-value decomposition to the joint embedding matrix of all frame prompts, amplifying identity-centric components while suppressing frame-specific noise. Second, Union Key (UniK) further reinforces consistency by aligning the T2I backbone’s image-text attention across the entire generation sequence. Experiments on theConsiStory+benchmark demonstrateBIPEoutperforms existing methods in both qualitative and quantitative metrics. To address the gap in evaluating a broader range of scenarios with diversified prompt templates, we introduce aDiverStorybenchmark to further confirm our scalability.

Abstract:
Diffusion Transformers (DiTs) have achieved stateof-the-art generative performance, yet their iterative denoising process remains computationally expensive and energy-intensive. Spiking Neural Networks (SNNs) offer a promising neuromorphic alternative for energy efficiency; however, the non-differentiable nature of spiking neurons makes direct training difficult, positioning ANN-to-SNN conversion as a more practical, training-free solution. In this paper, we identify a critical challenge unique to converting DiTs: standard fixed-scale spiking neurons fail to accommodate the highly dynamic activation ranges inherent across denoising steps. This mismatch leads to cumulative errors that significantly degrade generation fidelity. To resolve this, we propose a novel conversion framework featuring Multi-Threshold (MT) neurons and a Membrane Potential Error-Feedback (MPEF) mechanism. MT neurons expand the expressive capacity of discrete spikes by employing a multi-level firing strategy. Concurrently, MPEF exploits the temporal correlation between successive denoising steps to recycle residual membrane potential, effectively compensating for information loss and mitigating distribution shifts without retraining. Extensive experiments on ImageNet demonstrate that our framework achieves competitive generative quality with superior energy efficiency, establishing a new performance benchmark for spiking Diffusion Transformers.

Abstract:
Deep attribute graph clustering aims to learn discriminative node representations by leveraging both node attributes and graph topology to partition nodes into distinct clusters. Although substantial progress has been made in attributegraph clustering in recent years, two key challenges remain: noisy edges in the original adjacency matrix degrade the quality of information propagation, and redundant feature information across different feature views hampers the learning of discriminative representations. To address these issues, we propose a self-supervised attribute graph clustering method based on topological reconstruction and correlation decorrelation. First, we reconstruct the graph topology by computing intersections between k-nearest neighbors and the original adjacency relationships, while simultaneously leveraging global semantic information from K-means clustering to filter out noisy nodes. This reconstructed topology effectively mitigates information redundancy during feature aggregation in Graph Neural Networks. Second, unlike existing augmentation-based contrastive methods, we treat the feature representations from an auto-encoder (AE) and a graph auto-encoder (GAE) as two complementary natural views. We then apply mutual information minimization and a decorrelation constraint to suppress redundant information between views, yielding more discriminative node representations. Extensive experiments on four widely-used graph datasets—ACM, DBLP, CITE, and AMAP—demonstrate that our method consistently outperforms six state-of-the-art baselines.

Abstract:
Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as "overthinking". Existing methods to mitigate this issue either rely on static difficulty estimates or require taskspecific training, and thus fail to adapt to the dynamic complexity during reasoning. In this work, we empirically show that the problem difficulty evolves dynamically throughout the reasoning process and is linearly encoded in the LRM’s step-level embeddings. Building on this insight, we propose DyCon, a training-free framework that leverages latent step-level representations to explicitly model the evolving task difficulty, enabling the dynamic control of reasoning depth to mitigate the overthinking issue. Extensive experiments conducted on four models ranging from 4B to 32B, and across twelve benchmarks in math reasoning, general question answering, and coding tasks demonstrate that DyCon significantly enhances reasoning efficiency by reducing redundant steps without sacrificing accuracy or generalization. Code and models will be made publicly available.

Abstract:
World models enable learning policies via latent imagination, offering benefits such as history compression and sample efficiency. The primary challenge in applying world models to multiagent tasks is that modeling multi-agent dynamics in latent space requires integrating information from different agents, often creating spurious correlations between their latent states. Existing methods either reconstruct the observation for each agent or employ communication to maintain correlation during execution, failing to learn disentangled latent states that are crucial for effective decentralized control. To address this, we present the Disentangled Multi-Agent World Model (DMAWM). It facilitates learning decentralized policies in the latent space through a novel architecture comprising independent agent modules and a shared environment module. During real-environment execution, agent modules independently process local information to form a factorized latent representation. The environment module is then trained to mirror the factorized structure generated by the agent modules, effectively disentangling individual latent states from the interaction dynamics. Consequently, imaginary rollouts generated by the environment module more faithfully simulate decentralized execution dynamics, facilitating the transfer of policies from imagination to decentralized execution. Empirically, DMAWM outperforms existing model-based and model-free approaches in convergence speed and final performance, with additional visualization demonstrating its efficacy in capturing agent interactions.

Abstract:
Artificial intelligence has achieved remarkable breakthroughs in fields such as text, image, and video analysis, with deep learning serving as the mainstream paradigm widely adopted across applications. Trained deep learning models can be integrated into various applications either through finetuning or without any modification. While this practice promotes the advancement of artificial intelligence, it also raises concerns regarding intellectual property protection and information security risks. Therefore, it is necessary to propose relevant methods to measure the similarity between models. Existing code clone detection techniques are insufficient to address this issue. In this paper, we formalize deep learning models as weighted graph objects defined by both computational structure and parameter distribution. Drawing inspiration from code clone analysis, we provide the first definition of model cloning and design a method for model similarity detection. The framework characterizes model topology at the structural level based on normalized computational graphs, and at the weight level, it employs a method that does not require explicit parameter alignment to measure the statistical similarity of weight parameters. Experiments on a synthetic model clone benchmark dataset and real-world open-source models demonstrate that the proposed method can accurately detect similar models. The experimental analysis results align with the expected similarity changes during model fine-tuning and derivation processes. This method provides a unified and extensible quantitative foundation for model lineage analysis, model retrieval, and intellectual property protection of models.

Abstract:
We introduce Convex Distance Operator Transport (CDOT), the first convex optimal transport framework that aligns distributions across heterogeneous domains by jointly preserving feature correspondence and intrinsic geometric structure. Specifically, CDOT employs an operatorbased regularization that aligns aggregated distance structures by introducing distance and conditional expectation operators. Consequently, the proposed regularization improves the robustness to local geometric variations. We further prove that the resulting CDOT discrepancy is a valid pseudometric on the space of attributed compact metric-measure spaces. In addition, we characterize the relationship between CDOT and Gromov--Wasserstien (GW) through a new notion of dispersion gap, formally elucidating the geometric source of non-convexity in GW compared to the convexity of CDOT. In the finite-sample regime, we derive a non-asymptotic risk bound decomposed into optimization and statistical errors, establishing risk consistency under a globally convergent Frank--Wolfe algorithm. Experiments on synthetic point clouds, brain connectomes, and graph classification benchmarks demonstrate better performance over existing methods, with stable and reliable behavior in practice.

Abstract:
Depth estimation for transparent objects remains a fundamental challenge, as RGBbased cues often fail in regions affected by refraction and light transmission. Polarization provides physically grounded information related to surface orientation and material properties, offering reliable geometric cues even in the absence of texture. In this work, we introduce PolarDepth, a monocular framework that incorporates both RGB and polarization inputs, including the degree and angle of linear polarization (DoLP and AoLP), to estimate dense depth and localize transparent regions. PolarDepth injects polarization-derived physical priors by estimating the refractive index, zenith angle, and azimuth angle from polarization measurements and embedding them into an implicit geometric representation that constrains depth inference in ambiguous transparent regions. To support model development and evaluation, we introduce PTOD, a dataset with synchronized RGB, polarization, and depth data and manually annotated transparent region masks. Experimental results demonstrate that PolarDepth achieves state-of-the-art performance in transparent object depth estimation. The findings highlight the effectiveness of embedding polarization-derived physical priors into learned representations for robust perception in complex visual environments.

Abstract:
Recent large language models (LLMs) have demonstrated strong capabilities across challenging tasks, enabling their widespread adoption in agentic systems that interact with external tools. In such deployments, however, LLMs are typically trained with generalpurpose tokenizers designed for broad language coverage, while their usage is dominated by narrow, structured tool-calling interactions. This training–deployment mismatch leads to inefficient tokenization, where repetitive structural patterns and frequent semantic units in function calls are fragmented into long sequences of low-level tokens, increasing decoding overhead. To address this gap, we introduce AgentVocab, a structure-aware vocabulary adaptation framework for efficient LLM agents. AgentVocab derives specialized vocabulary entries from real tool-calling traces and adapts the model vocabulary to better reflect structural and semantic regularities, without task-specific schema engineering. Experiments on \tau and \tau^2-bench show that AgentVocab significantly improves decoding efficiency, reducing latency by approximately 15-25% relative to the vanilla baseline, while preserving tool-calling performance. Our approach is orthogonal to existing fine-tuning and agent-training methods and integrates seamlessly into standard agent pipelines. Source code and models will be available at https://anonymous.4open.science/r/AgentVocab-28CC.

Abstract:
Representing the past in a compressed, efficient, and informative manner is a central problem for systems trained on sequential data. TheHiPPOframework, originally proposed by Gu & Dao et al., provides a principled approach to sequential compression by projecting signals onto orthogonal polynomial (OP) bases via structured linear ordinary differential equations. Subsequent works have embedded these dynamics in state space models (SSMs), where HiPPO structure serves as an initialization. Nonlinear successors of these SSM methods such as Mamba are state of the art for many tasks with longrange dependencies, but the mechanisms by which they represent and prioritize history remain largely implicit. In this work, we revisit the HiPPO framework with the goal of making these mechanisms explicit. We show how polynomial representations of history can be extended to support capabilities of modern SSMs such as adaptive allocation of memory, and input-dependent state updates, and associative memory, while retaining direct interpretability in the OP basis. We introduce a unified framework comprising five such extensions, which we collectively refer to as a "HiPPO zoo." Each extension exposes a specific modeling capability as an explicit modification of the underlying measure or dynamics governing the polynomial coefficients, rather than as an opaque learned transformation. The resulting models adapt their memory online and train in streaming settings with efficient updates. We illustrate the behaviors and advantages of these extensions through a range of synthetic sequence modeling tasks, highlighting how explicit polynomial memories can recover and clarify mechanisms implicit in SSMs.

Abstract:
Richformat documents are essential for everyday operations yet costly to author, motivating the need for automated generation to enhance productivity. To this end, we present FormAct, an agentic system that generates professional rich-format documents from scratch. FormAct operates on an HTML source representation and performs iterative source refinement with anediting agentthat invokes a suite of tools, including a syntax-aware source editor and a template retriever, and areview agentthat critiques rendered pages to guide refinement. Additionally, we incorporate edit-triggered context compression to maintain a bounded working context and keep multi-round editing efficient. To support development and evaluation, we introduce RichDocBench for end-to-end generation, and RichDocFuzz to evaluate formatting-error recognition for reviewer agents. Through extensive automated evaluation and blind human-preference studies, we show that FormAct consistently outperforms strong baselines, including Codex-CLI, with particularly strong improvements in generating error-free, professional rich-format documents.

Abstract:
Signbased optimization methods, such as SignSGD and Signum, have become essential for modern Deep Learning due to their 1) high performance 2) low memory footprint and 3) communication efficiency. Despite their success, these methods suffer from distinct limitations in the terminal phase of training: they decouple update mechanisms from gradient magnitudes and fail to account for parameter heterogeneity, often leading to oscillation rather than convergence. While switching to SGD represents a potential remedy, a naive "hard" switch is poorly useful due to learning rate mismatches, momentum buffer suboptimality, and the assumption of uniform parameter dynamics. In this work, we propose SoftSignum, a novel optimization method that implements a principled, smooth transition mechanism from sign-based updates to SGD, which adapts to individual parameter sensitivities. We provide a generalized theoretical framework guaranteeing convergence in stochastic non-convex settings relevant to Deep Learning and demonstrate empirically that SoftSignum effectively handles parameter heterogeneity, yielding superior convergence across diverse tasks, including LLM pretraining, compared to standard sign-based baselines.

Abstract:
Tokenization is the first point of contact between large language models (LLMs) and text data, yet it has not been viewed by many as a component of LLMs worth accelerating. During inference, tokenizers typically rely on simple dictionary lookups and are executed on CPUs as standard processes. This approach, however, introduces significant overhead from scheduling delays, core selection, data copying, and other systemlevel costs. These inefficiencies become problematic in latency-sensitive applications such as embedding, small language models, and agentic AI. In this paper, we present the Pinned Tokenizer (PinTok), a novel tokenizer architecture that reduces redundant hardware, operating system, and networking overhead through three key innovations: core and memory pinning, scheduling and context switch avoidance, and duplicate network packet copy and processing avoidance. Our implementation of PinTok can serve as a drop-in replacement for existing tokenizer deployments, delivering latency reductions of up to 95% (average), 97% (P50), 94% (P90), and 87% (P99) along with throughput improvements of up to 2,084%.

Abstract:
We consider minimizing highdimensional smooth nonconvex objectives using only noisy pairwise comparisons. Unlike classical zeroth-order methods limited by the ambient dimension d, we propose Noisy-Comparison Random Search (NCRS), a direct-search method that exploits random line search to adapt to the intrinsic dimension k \le d. We establish a novel nonconvex analysis for approximate stationarity: under a uniform-margin oracle with advantage p, NCRS attains \epsilon-stationarity with complexity \mathcalO(k/(p^2\epsilon^2)), explicitly replacing ambient dependence with the intrinsic dimension. Furthermore, we introduce a general tie-aware noise model where comparison quality degrades near ties; for this setting, we prove that a majority-vote variant of NCRS achieves \epsilon-stationarity with complexity \mathcalO(k^2/\epsilon^4).

Abstract:
Bytelevel tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese. We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by a roughly a factor of two: perplexity stabilizes after 2.1B tokens, but UTF-8 validity requires 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations. Through experiments, we observed that reliable UTF-8 generation is a distinct capability requiring evaluation beyond perplexity.

Abstract:
We introduce Equilibrium State Estimation (ESE), a novel paradigm for simultaneous prediction, where multiple interacting systems require separate yet coordinated forecasts. Such scenarios often arise in realworld such as economics and healthcare modeling. Unlike existing approaches that predict one system at a time, ESE forecasts all systems in a single pass. It first estimates the equilibrium state across systems, then generates holistic forecasts based on the difference between the current state and the estimated equilibrium. Extensive experiments on synthetic and real-world datasets, including currency exchange and COVID-19 spread modeling, demonstrate that ESE is at least as accurate as state-of-the-art (SOTA) methods while being significantly faster. In addition, ESE integrates seamlessly with conventional predictors, combining their accuracy with its exceptional efficiency and delivering a 10–70× speedup. With linear-time complexity, ESE scales far better than SOTA methods as the number of systems increases. Moreover, it remains accurate under diverse perturbations, establishing ESE as a fast, generalizable, robust, and scalable multi-prediction method. Source code and data are available at https://anonymous.4open.science/r/ESE-C339.

Abstract:
In realworld scenarios, data collection for reinforcement learning (RL) is often constrained by safety concerns and high costs, resulting in limited data availability. Diffusion models (DMs) have recently demonstrated remarkable capabilities in capturing complex distributions, making data augmentation a promising approach. However, existing DM-based data augmentation methods still suffer from the limited quality of synthesized data for downstream RL tasks. To overcome this limitation, we propose a novel method called episodic memory-guided controllable experience synthesizer (EMCES). EMCES incorporates an episodic memory-based controllable DM with informative yet concise conditions constructed by episodic memory (EM). To guide the synthesis toward high-quality data, we propose an EM-prioritized condition sampling strategy that leverages EM-based temporal-difference errors to focus generation on data most helpful for RL. Furthermore, we introduce a hashing-based state representation for EM to improve its efficiency and further boost the quality of synthetic data. To the best of our knowledge, EMCES is the first work to incorporate EM into controllable DMs and to leverage EM for guiding data synthesis in RL. Experimental results across multiple environments demonstrate that EMCES significantly improves the quality of the synthetic data, thereby improving the performance of several state-of-the-art RL algorithms.

Abstract:
Reinforcement learning with massively parallel simulations has become an emerging trend; however, most existing approaches still rely on simple Gaussian policy parameterizations. Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems, yet most diffusionbased RL methods are designed for offline or off-policy training. In this work, we ask whether diffusion policies can be trained effectively in the massively parallel, on-policy regime. To this end, we introduce Trust-region Diffusion Policies (TruDi), which enables diffusion policies for on-policy RL with massively parallel simulations. This setting is particularly challenging because the data distribution changes quickly across updates, making stable training with complex policies difficult. TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory, rather than only at the final denoising step. Empirically, we evaluate TruDi on a diverse set of 4 massively parallel RL benchmarks comprising a total of 73 tasks. Across these tasks, TruDi consistently outperforms or is on-par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks, establishing a strong new baseline for massively parallel on-policy RL.

Abstract:
Ensuring scalable inputto-state stability (sISS) is critical for the safety and reliability of large-scale interconnected systems, especially in the presence of communication delays. While learning-based controllers can achieve strong empirical performance, their black-box nature makes it difficult to provide formal and scalable stability guarantees. To address this gap, we propose a framework to synthesize and verify neural vector Lyapunov-Razumikhin certificates for discrete-time delayed interconnected systems. Our contributions are three-fold. First, we establish a sufficient condition for discrete-time sISS via vector Lyapunov-Razumikhin functions, which enables certification for large-scale delayed interconnected systems. Second, we develop a scalable synthesis and verification framework that learns the neural certificates and verifies the certificates on reachability-constrained delay domains with scalability analysis. Third, we validate our approach on mixed-autonomy platoons, drone formations, and microgrids against multiple baselines, showing improved verification efficiency with competitive control performance.

Abstract:
Recent work shows that even safety aligned large language models (LLM) can be pushed into unsafe behavior by carefully crafted jailbreak prompts. Existing jailbreaking attack methods often rely on disfluent or incoherent prompts, which limit their success and make them easy to detect. We introduce SJA, a structured jailbreak attack built around two ideas. First, inspired by the logic of Spilsbury puzzle, SJA decomposes a harmful query into a sequence of harmless subquestions and reconstructs the original answer by combining the sub-question responses. Second, by leveraging the theory of Hamiltonian dynamics on hyperbolic space, we propose a hyperbolic Hamiltonian dynamics-based sub-question generation framework that effectively captures the structural and temporal dependencies. We provide a theoretical analysis of how each sub-question evolves along the trajectory and show that the hyperbolic Hamiltonian system effectively captures the underlying semantic structure. Finally, we propose a hyperbolic narrative fusion mechanism built on fractional embedding and Möbius fusion. This mechanism integrates coherent narratives into sub-questions while preserving geometric consistency and improving stealth performance. We theoretically validate that the combination of the generated harmless sub-questions, guided by the stealthy narrative, can effectively preserve the contextual semantics of the original harmful question.

Abstract:
Denoising diffusion sampling (DDS) is an emerging approach for generating new samples that have the same distribution as some training samples. However, it is vulnerable to adversarial attacks by even a Gaussian perturbation. In this work, we propose a complete set of adversarial attack and defense methodology for DDS. In the attack side, we propose to inject a perturbation to the sampling stage, which significantly worsen the performance of sample generation. In the defense side, we propose a local variation based regularization model for the potential function minimization, which effectively tolerates the adversarial perturbations. Moreover, we develop a conjugate gradient algorithm to solve the defense model, which integrates with a recentlydeveloped zeroth order rejection sampling method that saves computational cost. Experimental results show that the proposed attack significantly worsen the existing state-of-the-art methods, but can be defended by the proposed local variation regularization.

Abstract:
The massive vocabulary sizes of large language models, often exceeding 100k tokens, impose a computational bottleneck on the final linear projection layer during speculative decoding. Existing vocabulary pruning solutions rely on static or coarselygrained sub-vocabularies that necessitate large active sizes (～30k) to maintain draft quality. We propose NanoSpec, a novel training-free approach that breaks this trade-off by dynamically constructing a minimalist, context-aware active vocabulary for each generation step. Leveraging the inherent temporal locality of language generation, NanoSpec achieves high coverage while slashing the average vocabulary size by over 40× (to <3k tokens) without requiring any auxiliary trained parameters. To realize the theoretical benefits of such high sparsity on modern hardware, we introduce a system-algorithm co-design that overcomes the inefficiencies of sparse memory access through asynchronous gathering and GPU-resident state management. As a complementary plug-and-play module, NanoSpec cuts draft inference time by an average of 51.6%, delivering a 1.12-1.32× end-to-end speedup over the state-of-the-art speculative decoding method EAGLE-2 across diverse benchmarks and outperforming complex training-based pruning baselines.

Abstract:
Recent works introduce Rotary Position Embeddings (RoPE) into vision transformers (ViTs) to enhance their extrapolation capability, i.e., maintaining performance when inference is conducted on higher resolution images. RoPE encodes positions via rotating phases whose change is controlled by frequency components. Strandard 2D RoPE does not generalize well to input resolution changes as it only applies axial frequencies separately along each individual axis. To solve this issue, MixRoPE combines xy‑axis frequencies, such that it can model position relations in diagonal direction. However, in practice, we observe that the learned 2D frequencies become anisotropic in their direction distributions due to the axial spectral bias in image features, limiting the extrapolation ability of ViTs. Motivated by this observation, we propose Compass‑RoPE. We replace the xy cartesian coordinates with a polar parameterization that explicitly decouples frequency scale and angle. By initializing the angle vectors uniformly over [0,2π), it ensures the isotropic direction coverage. Besides, we further introduce discrete Fourier transform (DFT) mixing for the angle vectors, allowing each transformed individual angle vector element to nest multipule angles and thus to enrich angular expressiveness. Extensive experiments on multi-resolution classification and dense prediction tasks show that our Compass-RoPE achieves more stable extrapolation performance under large-scale resolution changes.

Abstract:
Driven by the imperative to leverage privacysensitive data scattered across decentralized devices, federated fine-tuning has emerged as a vital paradigm for adapting large language models without compromising data privacy. Yet, its practical efficacy is bottlenecked by severe client resource heterogeneity. Existing truncation-based methods typically couple the transmitted rank with the trainable rank, which (i) under-utilizes bandwidth on communication-rich but compute-limited clients and (ii) exacerbates truncation-induced gradient drift. To address this, we propose FedHera, a resource-decoupled framework that explicitly differentiates information reception from gradient optimization. FedHera employs a spectrum-preserving allocation strategy to maximize the transfer of global knowledge (via high-rank singular values) within bandwidth limits, irrespective of training constraints. Furthermore, we introduce a prefix-gating mechanism that utilizes the downloaded high-capacity basis as a frozen reference to guide local updates, thereby minimizing the optimization gap caused by aggressive truncation. Extensive experiments under different heterogeneous settings show that FedHera improves stability and accuracy over state-of-the-art baselines.

Abstract:
Although peptides are well suited for flexible and shallow binding interfaces, their intrinsic flexibility induces a strongly coupled sequence–structure relationship that current fixedgeometry latent models cannot simultaneously model with conformational diversity and physical feasibility, ultimately limiting design quality. To overcome this bottleneck, PepFGLD is proposed as a receptor-conditioned, flexibility-aware framework for full-atom peptide design. The framework is motivated by a systematic analysis of existing limitations: geometry shifts driven by interfacial flexibility are not well captured by standard equivariant encoders; the static combination of sequence information and 3D geometry cannot represent their dynamic interactions; and diffusion models without timely geometric feedback tend to drift away from physically reasonable energy landscapes. In PepFGLD, FlexEGNN is used to improve the sensitivity of geometric representations to local flexibility, a coherent and adaptable latent conformational manifold is formed through bidirectional sequence–structure interaction and nonlinear latent mapping, and a time-dependent energy-guided diffusion mechanism is incorporated to balance exploration and convergence during diffusion so that sampling trajectories are continuously guided toward physically feasible full-atom structures. PepFGLD yields improved binding affinity and design success across multiple peptide design tasks.

Abstract:
We propose a new ``bimetric'' framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions: a ground-truth metric that is accurate but expensive to compute, and a proxy metric that is cheaper but less accurate. In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves the accuracy of the expensive metric, while only using a limited number of calls to both metrics. Our theoretical results instantiate this framework for two popular nearest neighbor search algorithms: DiskANN and Cover Tree. In both cases we show that, as long as the proxy metric used to construct the data structure approximates the ground-truth metric up to a bounded factor, our data structure achieves arbitrarily good approximation guarantees with respect to the ground-truth metric. On the empirical side, we apply the framework to the text retrieval problem with two dissimilarity functions evaluated by ML models with vastly different computational costs. We observe that for almost all the large data sets in the BEIR benchmark, our approach achieves a considerably better accuracy-efficiency tradeoff than the alternatives, such as retrieve-then-rerank.

Abstract:
Cooperative multiagent reinforcement learning (CMARL) policies are vulnerable to action hijacking even when only a few timesteps are compromised. Recent adversarial attacks and adversarial training methods have been explored, but under an explicit attack budget, existing attacks often fail to accurately expose critical coordination weaknesses and incur substantial training cost. We propose Budgeted Hierarchical Efficient Attack (BHEA), a budgeted hierarchical adversarial attack that separates decisions on when and which agents to hijack from action replacement, enabling more precise vulnerability discovery under limited attack opportunities. We further show that training cooperative policies against BHEA substantially improves robustness to limited-step action hijacking while reducing training overhead. Experiments on the StarCraft Multi-Agent Challenge (SMAC) demonstrate stronger attacks under the same attack budget and improved robustness. Code is available at https://anonymous.4open.science/r/BHEA-068D.

Abstract:
Video Diffusion Transformers (VDiTs) represent the stateof-the-art in video generation but are fundamentally constrained by the quadratic computational complexity of self-attention. To accelerate this critical computation, we analyze the pre-softmax matrix (QK^T) and reveal two key insights: (1) dense attention patterns inherently reside on a global low-rank manifold characterized by rapid singular value decay; and (2) real motion manifests as hybrid spatio-temporal patterns rather than rigid "spatial vs. temporal" classifications. Guided by these insights, we propose Video-SVD. As a plug-and-play acceleration method that requires no alteration to the original network parameters, it extracts universal bases via offline SVD and employs a dynamic subspace projection strategy at inference, thereby bypassing the expensive full QK^T matrix computation entirely. To ensure high fidelity, we deploy layer-shared dual-stream MLPs to synthesize fine-grained textural details and recover high-frequency RoPE information. Video-SVD achieves significant end-to-end speedup while maintaining high visual quality, reaching 1.92× on HunyuanVideo, 1.75× on Wan2.1-1.3B, and 1.79× on Wan2.1-14B.

Abstract:
Proximal Policy Optimization (PPO) is widely used in continuouscontrol tasks, yet its performance is often highly sensitive to training dynamics when neural networks approximate the policy and value functions. This paper introduces SPPO, a drop-in augmentation that preserves PPO’s clipped objective and network architecture while stabilizing actor-critic geometry via three mechanisms: (i) a CKA-based constraint on critic representations, (ii) a no-flip regularizer on actor updates, and (iii) KDE-driven advantage shaping. Theoretical analysis shows that these mechanisms tighten bounds on one-step bootstrapping error, improve expected directional alignment of action updates, and ensure non-decreasing occupancy mass over high-novelty regions. Experiments on standard continuous-control benchmarks demonstrate consistent gains over PPO and recent PPO stabilization methods. Ablation studies further quantify the contribution and complementary effects of each component. Additional training-dynamics analyses indicate that SPPO reduces instability and oscillations in both actor and critic updates, improving training stability and final performance.

Abstract:
With the rapid emergence of multibehavior learning in recommender systems, leveraging auxiliary user behaviors has proven effective for mitigating target-behavior data sparsity. Yet auxiliary behavior graphs frequently contain noisy or irrelevant interactions that do not align with the target task, impeding the learning of accurate user and item embeddings. Moreover, the scarcity of direct supervised from the target behavior complicates the extraction of informative collaborative signals. In this paper, we introduce GCIB Graph Contrastive Information Bottleneck, a novel framework that denoises auxiliary behavior information and enriches target behavior representations at both the structural and feature levels. At the structural level, GCIB employs a Graph Information Bottleneck (GIB) objective to maximize mutual information between the denoised auxiliary graph and the target-behavior graph while minimizing mutual information with the original auxiliary graph. This formulation preserves task-relevant structural patterns and suppresses spurious interactions. At the feature level, we propose a cross-behavior Graph Contrastive Learning (GCL) scheme in which denoised auxiliary features and target-behavior features serve as complementary views for both users and items. By contrasting these views, GCIB enriches sparse target-behavior representations with semantics distilled from auxiliary behaviors. Extensive experiments on multiple real-world datasets demonstrate that GCIB outperforms state-of-the-art baselines, highlighting its ability to learn noise-resilient and target-aware representations for multi-behavior recommendation.

Abstract:
Large reasoning models (LRMs) achieve strong performance by explicitly generating chainof-thought (CoT) reasoning, but this reasoning process can be manipulated by adversarial prompts. Inference-time CoT interventions offer a simple and lightweight approach to improving safety, yet existing methods typically apply static heuristics that ignore the dynamic nature of reasoning, leading to an inherent trade-off between robustness and over-refusal. This paper introduces SafeCompass, a plug-and-play framework for dynamically steering chain-of-thought reasoning using inference-time safety signals extracted from internal states. At different reasoning positions, SafeCompass derives a latent safety direction through contrastive analysis of internal representations and uses this direction to quantify the model’s current safety state. These signals enable selective intervention, allowing the model’s reasoning trajectory to be modified only when and where it becomes unsafe. Extensive experiments demonstrate that SafeCompass significantly improves robustness, reducing the average attack success rate up to 10× compared to the best baseline, while preserving general reasoning performance and minimizing over-refusal rates.

Abstract:
This work proves that an ndimensional hybrid system can be embedded into an m-dimensional Euclidean space equipped with a continuous vector field on its embedded image whenever m>2n. This result suggests that an intrinsically discontinuous hybrid system generically admits a continuous extrinsic representation that is well-posed for differentiable optimization. Building on this existence theorem, we show that a latent Neural ODE with consistency loss in both the latent and state space can accurately recover the flow of hybrid systems. Extensive experiments suggest the proposed method outperforms the existing method in learning hybrid systems with varying geometries from only time series data.

Abstract:
We present Creat3r, an iterative nextbest-view (NBV) selection framework for efficient, high-quality 3D reconstruction. Starting from a small seed set of image-pose pairs, Creat3r repeatedly selects the most informative next camera pose. After each pose is chosen, the corresponding image is acquired and added to the multi-view set to update a 3DGS reconstruction. To guide selection, Creat3r constructs an intermediate point cloud and estimates reconstruction reliability via a novel 3D confidence field, which is projected to candidate poses through Gaussian projection to produce 2D confidence and exploration maps. These maps balance exploitation of reliable regions and exploration of uncertain or unseen areas under computational constraints. Experiments with standard 3DGS show that Creat3r consistently outperforms baselines in novel view synthesis and surface reconstruction, achieving higher SSIM and F1 scores with fewer views.

Abstract:
Mixtureof-Experts (MoE) scales model capacity efficiently by selectively routing inputs to a specialized subset of experts. However, input-expert specialization, the core motivation of MoE, critically depends on whether the router is actually aware of input structure. In practice, MoE routing is typically implemented as a shallow linear projection with limited awareness of input representation, which often leads to unstable and imbalanced routing. We propose STAR, a STructure-Aware Routing that rethinks MoE routing as a subspace learning problem by augmenting standard learnable routing with an evolving principal subspace that tracks dominant input structure via Generalized Hebbian Algorithm (GHA). By aligning routing decisions directly with input structure along with the task-supervision from learnable gate, STAR enables stable and balanced expert specialization without relying on auxiliary load-balancing losses. We evaluate STAR on controlled synthetic setup and large-scale language and vision tasks, where it consistently improves routing quality and downstream performance over strong MoE baselines. Moreover, optional test-time subspace updates further enhance routing robustness under distribution shifts.

Abstract:
Meta BlackBox Optimization (MetaBBO) has emerged as a promising paradigm by employing meta learning to automatically optimize the configurations of low-level black-box optimizers. Despite its potential, the generalization of MetaBBO remains significantly constrained when facing unseen, complex objective landscapes. We identify that this bottleneck stems from a restricted performance upper bound inherent in current training mechanisms: by learning from scratch in a self-supervised or unsupervised manner, meta optimizers are never exposed to advanced, high-quality optimization behaviors, forcing them to converge on suboptimal strategies. In this paper, we propose \textttMetaDistill, a general MetaBBO training framework designed to lift the strategy ceiling through pretraining and test-time fine-tuning. In the pretraining stage, we represent high-quality strategies from classical algorithms as expert optimization trajectories and utilize them for diversity-preserving distillation, enabling the learnable optimizer to internalize advanced optimization behaviors. In the optional fine-tuning stage, we perform self-supervised fine-tuning as a warm-start procedure to further refine the distilled knowledge on unseen tasks. We evaluate our \textttMetaDistill framework on the BBOB test suite and three control tasks. The results demonstrate that \textttMetaDistill significantly improves the generalization ability of various learnable optimizers compared to their original training paradigms.

Abstract:
Analog circuits constitute the indispensable interface between physical reality and digital computation, underpinning safetycritical systems from autonomous driving to medical implants. Consequently, verification correctness is paramount; yet, it remains the critical bottleneck in hardware design, consuming over 50% of engineering cycles due to a heavy reliance on the manual interpretation of unstructured, heterogeneous specifications. While Large Language Models (LLMs) offer automation potential, their probabilistic, autoregressive nature is structurally misaligned with the strict determinism required for analog verification tasks. Specifically, generic LLMs struggle to resolve semantic dispersion, latent causal dependencies, and numerical precision. To bridge this gap, we introduce AnalogVerifier, a neuro-symbolic framework that automates end-to-end testbench generation by decoupling semantic translation from logical enforcement. We propose a four-stage architecture: (1) Context-Aware Task Serialization transforms complex specifications into atomic tasks via an agentic workflow; (2) Graph-Symbolic Scheduling satisfies analog design constraints through Port Dependency Graphs (PDG) for correct-by-construction sequencing; (3) Numerical-Symbolic Grounding mitigates numerical hallucination by delegating threshold derivation to a deterministic symbolic oracle; (4) Closed-Loop Repair enables correctness and completeness of the generated testbenches by simulation feedback. Evaluation on five industrial analog circuits demonstrates that AnalogVerifier achieves 82.3%--100% functional pass rate, establishing a new paradigm for reliable, automated analog verification. The code and data are publicly available at \urlhttps://anonymous.4open.science/r/ICML26--AnalogVerifier-72EE/.

Abstract:
Longtailed multi-label text classification is often treated as a data scarcity problem, addressed by re-sampling or fixed re-weighting. We argue that a central failure mode is dominant coalition capture: frequent labels, amplified by spurious co-occurrences, form dominant coalitions that dominate shared representations and gradient allocation during optimization. As a result, rare labels are learned via superficial shortcuts, yielding brittle generalization under distribution shifts. We propose NBCG, a Nash-Bargained Causal Game that reformulates multi-label learning as a cooperative bargaining process among label coalitions. NBCG first leverages Neural Structural Equation Models to learn a directed dependency structure, inducing causally coherent coalitions---rather than random partitions---and coalition-specific communication masks. We then optimize a Nash bargaining objective over coalition utilities relative to an adaptive disagreement point, which serves as a principled credit-allocation mechanism: it adaptively prioritizes under-served coalitions while maintaining a Pareto-efficient trade-off among all players.

Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding (VTG) with the introduction of reinforcement learning (RL) for generating reasoning paths. However, existing models often produce superficial reasoning, such as providing generic video descriptions, which offer limited guidance for precise temporal localization. This limitation stems from (1) inefficient random exploration in RL, and (2) reward functions that focus solely on the answer correctness while ignoring reasoning quality. To address these issues, we propose TaRO (Temporal-Aware Reasoning Optimization), a framework that explicitly enhances the model’s ability of thinking with time. First, we introduce a Constructive Reasoning Exploration that leverages pre-generated dense captions to construct reasoning paths grounded in explicit visual cues and timestamps, enabling efficient exploration of high-quality time-aware reasoning. Second, to evaluate reasoning quality, we design a Temporal-Sensitivity Reward. We postulate that high-quality reasoning should be anchored to specific events and timestamps. If the event boundary under thinking is disrupted (e.g., via frame shuffling), such reasoning should become invalid, leading to a drop in the logit of the reasoning path. We utilize this drop as a critique of reasoning quality. Finally, TaRO follows a progressive curriculum, which starts by utilizing this reward to select better constructed reasoning paths, and evolves to a free exploration phase where the model autonomously generates effective reasoning. Extensive experiments demonstrate that TaRO improves temporal reasoning and achieves state-of-the-art performance on VTG benchmarks.

Abstract:
Longhorizon dialogue agents suffer from latent state drift: what an agent says, what it internally represents, and what it stores in memory can diverge silently across turns. This creates asymmetric rupture risk—many locally coherent exchanges undone by a single high-cost contradiction. We propose BRIDGE (Behavioral Reasoning through Integrated Dynamic Gated Evolution), which performs triangular fixed-point refinement to explicitly couple Observable (\mathcalO), Latent (\mathcalL), and Memory (\mathcalM) before decoding each response. We prove that under mild conditions, the refinement operator converges to a unique fixed point, providing a theoretical guarantee that the agent’s internal state remains self-consistent before each response. Empirically, BRIDGE achieves the highest scores on both PersonaGym (4.59 avg., surpassing Claude-3.7-Sonnet) and CoSER (59.5% avg., +3.1 over Claude-3.7-Sonnet), with gains concentrated on persona-specific metrics (+8.0 Character Fidelity over Qwen2.5-32B-Instruct)—while updating only 0.85% trainable parameters of the frozen backbone. We also provide a Lyapunov-style uniform drift bound for tiered memory updates, grounding bounded persona evolution in long-horizon interaction.

Abstract:
Sparse attention has emerged as a vital technique for longcontext inference in Large Language Models (LLMs), effectively accelerating memory-bound decoding by reducing memory access for non-essential keys. However, the assumption that decoding attention is memory-bound has been shattered. The proliferation of Multi-head Latent Attention (MLA) and Multi-Token Prediction (MTP) architectures has effectively rendered the process compute-bound. We observe that, in MLA, Q-heads exhibit a degree of sparsity even when attending to the same key; consequently, traditional sparse attention algorithms introduce significant computational inefficiency in this new regime by rigidly computing interactions between all associated Q-heads and the retrieved keys. To address this, we propose TileSparse, an arithmetic-intensity-aware (a.i.-aware) algorithm for efficient attention in compute-bound settings. We first introduce a cost model that emphasizes compute budget (compute tile size) rather than memory budget (fetched tokens) when evaluating sparse methods. Next, QK 2D Sparsity prunes unnecessary Q-head--key computations and uses the freed compute to retrieve more semantically important tokens. Because Q-head sparsity differs across keys, we further propose Tiered QK 2D Sparsity and an AutoTuner to choose the best pattern. Experiments show that under tight budgets our method improves accuracy by 40% over state-of-the-art dynamic K-only sparse methods. It also preserves 99% of full-attention accuracy while cutting attention compute by 40.8%, outperforming prior sparse attention approaches.

Abstract:
The learnware paradigm aims to establish a market of learnwares, each of which is a welltrained model combined with a specification that describes its functionality without leaking data privacy. The market enables users to efficiently reuse relevant models based on specifications on their own tasks instead of training models from scratch. Recent works have attempted to generate specifications using Reduced Kernel Mean Embedding (RKME), which maps input distributions into Reproducing Kernel Hilbert Space (RKHS) while ignoring the output space, causing models trained on similar input spaces to yield similar specifications, even when their functionalities differ. Many labeled-RKME improvements attempt to address this by indirectly modeling the input-output conditional distributions, but they remain limited to classification tasks and lack clear theoretical explanations. In this work, we propose Reduced Neural Conditional Mean Embedding (RNCME), a novel specification generation method that directly models input-output conditional distributions via Conditional Mean Embedding (CME). Our RNCME method has a clear theoretical understanding based on CME and is applicable to both regression and classification tasks. Empirical experiments demonstrate the effectiveness and efficiency of our RNCME method.

Abstract:
Electrocardiogram (ECG) representation learning via ECGreport alignment is often hindered by the inherent structural and statistical divergence between signals and natural language. Existing methods struggle to bridge this gap with simple contrastive objectives, but struggle with distribution dependencies between heterogeneous features. To address this, we proposeSGERA(Stein-GuidedECG-ReportAlignment), which leverages the unique properties of Stein kernels to provide a more rigorous geometric alignment in the latent space:instance-levelalignment via a Stein-RBF kernel enforces pairwise consistency between ECG and report embeddings anddistribution-levelalignment via a Stein-Score kernel captures higher-order interactions for global alignment. Furthermore, we introduce an ECG-Report matching task with a Hard Sample Mining strategy to refine discriminative boundaries. Experiments across three public datasets demonstrate that SGERA significantly outperforms state-of-the-art SSL methods in zero-shot classification, linear probing, and transfer learning, proving the superiority of Stein-guided alignment in handling complex medical modalities. Code is available at supplementary materials.

Abstract:
Diffusion language models (DLMs) offer substantial speed advantages through parallel decoding, but the lack of token dependencies limits generation quality compared to autoregressive (AR) models. Recent progress attempts to bridge the gap via importance sampling, with DLM being the proposal and AR being the target. However, due to the huge gap between their probability space, the sampling requires a large number of particles and thus expensive computation. In this paper, we introduce PoEBridge, a novel decoding framework that drastically improves generation speed and accuracy by introducing an intermediate distribution to bridge the gap. The distribution is constructed as a Product-of-Experts (PoE) of the DLM proposal and the AR target. With the intermediate distribution, we first conduct multi-token sampling with the DLM and then apply rejection sampling using the PoE to retain only the verified tokens. The generated chunks are then evaluated by the AR target via importance sampling to produce the final faithful generation. We further propose several improved techniques, including mixed-temperature sampling for enhanced diversity and elastic rejection windows for reducing wasted verification. Empirically, PoE-Bridge achieves significantly improved accuracy with 5× speedup over the standard DLM decoding approach, and recovering at least 95% of the target AR model's performance, efficiently advancing most of the quality gap on challenging mathematical reasoning and coding tasks.

Abstract:
Recent Large Language Model–based approaches for clarifying visual design largely focus on selecting questions that better uncover user intent, but often overlook the cognitive burden imposed on users—i.e., the effort required to interpret and answer these questions—which is crucial for effective human–agent interaction. We proposeAgentic Model Predictive Questioning Control (AMPQC), a test-time framework that reduces user interaction burden while improving visual design alignment by formulating multi-round clarification as trajectory optimization with receding-horizon replanning, allowing the agent to revise its questioning strategy as feedback arrives. We further introduce lookahead question plans to reduce ambiguity early, and a lightweight respond-or-reject surrogate reward to steer questions toward lower-burden formats (e.g., yes/no). Experiments on webpage and ad banner generation benchmarks show that A-MPQC not only produces better designs but also achieves lower user-interaction cost across diverse baselines—including fixed-format strategies (e.g., multiple-choice and open-ended) and a retrieval-augmented baseline—without retraining. Overall, our work explicitly formulates and optimizes human cognitive burden jointly with final design alignment, opening new opportunities for advancing human–agent interaction in design.

Abstract:
Graph Mixtureof-Experts (Graph-MoE) offers a way to scale GNNs via adaptive capacity allocation, with the goal of allowing different experts to capture diverse graph patterns. Its effectiveness heavily depends on the coordination between routing decisions and expert specialization. However, through extensive empirical study, we identify two critical phenomena. First, discrimination loss occurs on both the expert and routing sides, where GNN experts become highly homogenized and the router collapses to a small subset of experts, failing to reflect diverse graph semantics. Second, routing uncertainty is prevalent, as existing routers produce uncertain expert assignments for most nodes, and such uncertainty exhibits a strong negative correlation with model performance. To address these issues, we propose C^2GMoE, a novel Graph-MoE framework featuring Contrastive routing and Confidence-aware fusion. We introduce a group-wise contrastive routing strategy that provides explicit guidance for routing optimization by aligning node-level routing decisions with semantic clusters while satisfying load-balancing constraints. Moreover, through a theoretical analysis of generalization error, we develop a confidence-aware fusion mechanism that adaptively reweights expert predictions according to their confidence. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our proposed C^2GMoE.

Abstract:
Large language models can solve algorithmic problems either through direct natural language (NL) reasoning or by generating executable code delegated to an external solver. However, little progress has been made on understanding why. Comparing NL reasoning and solverbased pipelines directly is ill-posed: they differ simultaneously in representation space and execution mechanism. We introduce a three-route framework that makes this comparison tractable by introducing an intermediary step---code generation with LLM-based execution. This enables our empirical analysis, which shows a statistically significant gap supporting code > NL by +28.9% across 48 different algorithmic tasks and 6 models. A statistical analysis indicates that natural-language reasoning does not provide additional decision-relevant information beyond what is already captured by code representations. Consequently, replacing NL traces with code traces incurs minimal performance loss while enabling deterministic execution. A systematic comparison of LLM-based reasoning and external execution further shows that execution, rather than trace representation, is the primary performance bottleneck.

Abstract:
We find multilayer perceptrons and transformers both universally learn an instantiation of the same divideand-conquer algorithm that requires only a logarithmic number of neural representations to solve dihedral multiplication. Clustering neurons based on similar activation behaviour reveals remarkably clear structure: each neural representation corresponds to a Cayley graph. To our knowledge, this is the first work that fully characterizes and describes all neural representations that are learnable on a dataset, while prior work on group multiplications studied neuron-level behavior, or preliminarily investigated cluster behavior. Thus, we can understand the algorithm networks universally learn at three levels of abstraction: 1) Neurons activate on coset or approximate coset structure of the dihedral group. 2) Groups of neurons together form neural representations that act to divide the dataset into different subproblems, being Cayley graphs, where the equivalence class of the answer is computed. 3) The global algorithm then linearly combines each neural representation (subproblem) together at the logits. This work provides a deep case study and provides the community with a very well understood toy model for interpretability, as well as makes steps toward proving the conjecture that DNNs will divide and conquer all group multiplication tasks.

Abstract:
Simulation plays a central role in scientific discovery. In many applications, the bottleneck is no longer running a simulator—it is choosing among large families of plausible simulators, each corresponding to different forward models/hypotheses consistent with observations. Over large model families, classical Bayesian workflows for model selection are impractical. Furthermore, amortized modelselection methods typically hard-code a fixed model prior—or complexity penalty—at training time, requiring users to commit to a particular parsimony assumption before seeing the data. We introduce PRISM, a simulation-based encoder-decoder that infers a joint posterior over both discrete model structures and associated continuous parameters, while enabling test-time control of model complexity via a tunable model prior that the network is conditioned on. We show that PRISM scales to families with combinatorially many (up to billions) of model instantiations on a synthetic symbolic regression task. As a scientific application, we evaluate PRISM on biophysical modeling for diffusion MRI data, showing the ability to perform model selection across several multi-compartment models, on both synthetic and in-vivo neuroimaging data.

Abstract:
Large language models (LLMs) incur substantial inference latency due to autoregressive decoding, in which each token requires a full forward pass through all transformer layers. Earlyexit methods that terminate computation at intermediate layers offer a promising remedy, yet existing approaches suffer from fundamental limitations. Confidence-based methods rely on evaluating the full LM head at every layer, introducing considerable overhead that can negate the expected speedup. Schedule-based methods avoid this cost through predetermined exit schedules, but their monotonically decreasing layer allocation collapses to shallow layers, thereby constraining the maximum generation length. Learned exit predictors further require costly task-specific training and are vulnerable to distribution shifts in unseen domains. We propose Window-Aware Vocabulary-Efficient Early-Exit (WAVE), a training-free framework that addresses these challenges through two key innovations. First, exit window scheduling identifies an optimal layer range for early-exit decisions via offline calibration, preventing premature convergence to shallow layers while substantially reducing the number of exit checks. Second, a proxy LM head constructs a lightweight vocabulary subset at the window’s starting layer, reducing per-layer exit overhead by 87% relative to full LM head. WAVE requires no gradient-based training and enables immediate deployment with only a brief calibration phase. Experiments on Llama-2 7B demonstrate up to 1.4× average speedup while preserving output quality, with full compatibility with W4A16 quantization, establishing WAVE as a practical early-exit framework for accelerating LLMs inference without retraining.

Abstract:
Unlike diffusionbased models that operate in continuous latent spaces, autoregressive unified multimodal models produce images by sequentially predicting discretized visual tokens. These tokens are derived from a codebook that maps embeddings to quantized visual patterns. The language-like architecture enables unified multimodal models to effectively capture text conditional information for generation, making them promising for text-to-image tasks. This also raises an interesting question: how safe are the images generated in such an autoregressive way? Existing methods that ensure safe generation by operating on diffusion continuous representations fail to generalize well to discrete representations. In this work, we propose iterative self-improving codebooks for safe autoregressive generation. We leverage the understanding and judgment capabilities of the unified multimodal model itself to identify unsafe generated images without human annotation. Subsequently, the inherent representations in the codebook are fixed to eliminate harmful mappings. Our method comprises two steps: first, we use the unified model to identify unsafe generations and construct corresponding harmful and safe image-text pairs. These pairs are used to construct the Harmful Space and guide updates to the codebook, thereby eliminating harmful outputs. Second, we perform adaptive fine-tuning on the codebook within the harmless space using safe image-text pairs to ensure the quality of generated images. These two steps are repeated until no further improvement is observed, producing a safety-enhanced model codebook. Extensive experiments are conducted to verify the effectiveness of our method on five unified multimodal models. Without additional external feedback, the safety of models is improved iteratively.

Abstract:
Grammarconstrained decoding is essential for enabling large language models (LLMs) to efficiently generate structured outputs in applications, such as JSON objects for parameter passing. Existing approaches typically execute grammar constraint masking on the CPU, while LLM inference is performed on the GPU. This execution mismatch introduces frequent grammar-induced CPU \rightarrow GPU control and data synchronization, leading to substantial overhead in large-batch inference. In contrast, we propose Gram2Token, which preprocesses grammar constraints into token-level representations that can be executed natively on GPUs at run time, thereby reducing decoding overhead. Specifically, Gram2Token first converts the input grammar into a pushdown automaton and aligns the automaton with tokenizer outputs via a trie. Through this alignment, pushdown stack configurations are encoded into a finite set of augmented grammar states, and tokens are categorized according to the grammar states in which they are valid. We further design a GPU-native grammar-constrained decoding pipeline that replaces complex run-time grammar parsing with O(1) table lookups and eliminates run-time grammar-induced CPU \rightarrow GPU control dependencies. Experimental results on large-batch JSON and SQL generation tasks show that, compared to state-of-the-art implementations, Gram2Token improves decoding throughput by 1.5×–2.3×. These results demonstrate that GPU-native grammar-constrained decoding is an effective and scalable approach for structured LLM generation.

Abstract:
Terminal embeddings have emerged as a powerful tool for dimension reduction. Given a set of points P\subset \mathbbR^d, a terminal embedding is a mapping f:\mathbbR^d\rightarrow \mathbbR^t that preserves the pairwise distance between any pair of points p\in P and q\in \mathbbR^d up to small distortion under this mapping. Terminal embeddings have been particularly fruitful for constructing kmeans and k-median coresets, where the objective is to find a typically weighted subset \Omega of P such that for any candidate solution, the cost of the clustering objective on \Omega approximates the cost of the clustering objective on P up to small distortion. Unfortunately, these techniques have not been extended to more complicated structures such as clustering time-series data under common straight-line interpolation between measurements. The main issue is that terminal embeddings, arguably the central technique in this line of research, cannot be linear and are thus not immediately suitable to preserve linear structures. In this work, we develop a generalization of terminal embeddings to affine line-segments that overcomes this issue. We showcase their applicability by using our lines-preserving terminal embeddings to obtain the first dimension-free coresets for clustering time-series under the Fréchet distance. The underlying dimension reduction uses Johnson-Lindenstrauss embeddings, and our experiments indicate that they compare favorably against PCA for real-world time-series.

Abstract:
In the data selection problem, the objective is to choose a small, representative subset of data that can be used to efficiently train a machine learning model. Sener and Savarese [ICLR 2018] showed that, given an embedding representation of the data and suitable geometric assumptions, heuristics based on kcenter clustering can be used to perform data selection. This perspective was further explored by Axiotis et. al. [ICML 2024], who proposed a data selection approach based on k-means clustering and sensitivity sampling. However, these methods rely on the assumption that the dataset exhibits intrinsic geometric structure that can be effectively captured by clustering, whereas many modern datasets instead possess global algebraic structure that is better exploited by low-rank approximation or principal component analysis. In this paper, we introduce a new data selection framework based on low-rank approximation and residual-based sampling, formulated through the lens of row subset selection and loss-preserving coreset construction. Given an embedding representation of the data satisfying mild regularity conditions, which can be interpreted as algebraic or angular notions of Lipschitz continuity, we show that it is possible to select a weighted subset of \tildeO\left(k + \frac1\varepsilon^2\right) data points whose average loss approximates the average loss over the full dataset within a (1+\varepsilon) relative error, up to an additive \varepsilon \Phi_k term, where \Phi_k denotes the optimal rank-k approximation cost of the embedding matrix. We complement these theoretical guarantees with empirical evaluations, demonstrating that on a range of real-world datasets, our data selection approach achieves improved performance over prior strategies based on uniform sampling or clustering-based sensitivity sampling.

Abstract:
In this work, we extend distributional reinforcement learning (RL) to develop a risksensitive multi-objective RL framework, with applications to domains such as finance and robotics. We achieve this by adopting vector-risk measures and approximating them via Knothe-Rosenblatt (KR) quantile regression. This approach directly extends the IQN framework to the multi-objective setting, aligns with the axiomatic definition of vector-risk measures, and guarantees that critics converge under the distributional Bellman operator. To mitigate the artificial ordering imposed by the KR map, we employ a transformer architecture without positional encoding, and introduce MO-TQC for training stability. We demonstrate improved performance on MO-Gymnasium benchmarks and use our framework to study risk-sensitive policies in multi-objective tasks.

Abstract:
Neural operators have become a central tool in scientific machine learning for learning discretizationconsistent solution operators, achieving strong results on partial differential equation (PDE) benchmarks. Physiological time series, however, are highly nonstationary and dominated by localized transient events, properties that can challenge both PDE-oriented neural operators and conventional deep models. We propose NeurOCNN, a neural-operator-based model for physiological signals that learns a function-to-label mapping while exhibiting discretization invariance. NeurOCNN integrates continuous-time, spline-parameterized convolutions with Fourier projection pooling and an attention-based task head, thereby enabling robust inference under sampling-rate shifts. Empirically, NeurOCNN outperforms standard neural-operator baselines, achieves performance comparable to state-of-the-art methods, and maintains stable accuracy under zero-shot evaluation across multiple previously unseen sampling rates. Code is available at: https://github.com/dcoder444/NeurOCNN.git

Abstract:
While LiDAR NeRFs excel in static environments, synthesizing dynamic scenes remains challenging as moving objects break multiview consistency, causing conflicting supervision and ghosting artifacts across frames. Existing methods typically suffer from optimization difficulty from the start, struggling to disentangle valid geometry from motion noise when initial motion priors are unreliable. To address this, we propose MAC-NeRF, a novel LiDAR NeRF framework enhanced by motion-aware curriculum learning for high-fidelity dynamic scene synthesis. First, we propose Rectified Temporal Consistency to resolve temporally conflicting supervision. By filtering out erroneous supervision via forward-backward geometric verification, it creates a curriculum that prioritizes trustworthy temporal correspondences before handling challenging motions. Second, we propose Confidence-Modulated Frequency Regularization (CMFR) to eliminate geometric ambiguity. It adaptively modulates the frequency regularization bandwidth, progressively transitioning from strict low-frequency constraints for artifact suppression to full-spectrum modeling for fine-grained detail preservation. Extensive experiments on KITTI-360 and nuScenes demonstrate that MAC-NeRF significantly outperforms state-of-the-art methods in rendering quality. Our code will be available upon acceptance.

Abstract:
While existing textto-speech (TTS) models exhibit high expressiveness, fine-grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi-agent closed-loop framework designed for intent-faithful expressive control of composite instructions. First, in our framework, an adversarial disentanglement module mitigates speaker-emotion leakage through gradient reversal and cross-covariance regularization. Next, a Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast–Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic–acoustic mismatches using high-level perceptual critique. Experiments on a composite-instruction benchmark and public test sets show that AgentSteerTTS yields consistent and significant improvements to the baselines, demonstrating the effectiveness of the proposed method. Our code and models will be released.

Abstract:
Recent advances have empowered large language models (LLMs) with remarkable finegrained instruction-following capabilities in text generation tasks. However, embedding methods typically rely solely on the hidden state of the input's last token, limiting their ability to capture complete semantic signals distributed across the full output tokens. Moreover, existing discrete-to-continuous re-encoding approaches introduce semantic discontinuity. To address these limitations, we propose InstEmb, a novel instruction following embedding framework. InstEmb jointly optimizes two key aspects: (1) Input-Intrinsic semantic information, achieved by employing contrastive learning focused on the representation of the last input token, and (2) Output-Aware semantic information, captured through representation self-distillation leveraging learnable look-ahead tokens without introducing additional decoding latency. Additionally, we introduce Dual-Anchor Alignment Pooling (DAAP), explicitly aligned with our dual training objectives. Extensive experiments demonstrate that InstEmb achieves state-of-the-art performance across multiple instruction following benchmarks without benchmark-specific supervised data.

Abstract:
Machine unlearning aims to remove the influence of specific training samples due to privacy, copyright or bias concerns. Multiobjective unlearning seeks to ensure the effective forgetting of such samples while preserving the utility of the unlearned model. Existing multi-objective unlearning methods typically optimize a weighted combination of the objectives. They provide no guarantee that any of the objectives can achieve a required performance and do not consider the similarity between the forget data and the remaining retain data. In this work, we quantify how hard it is to reconcile the conflicting objectives arising from overlapping data and provide conditions under which collateral forgetting is unavoidable, that is, when improving forget quality forces retain utility degradation. Utilizing this hardness measure, we propose a hardness-aware multi-objective unlearning algorithm (HAMU) that adapts the unlearning updates based on the per-iteration hardness. Our algorithm is applicable to non-convex models and is easily parallelizable, making it readily deployable in real-world scenarios. We empirically demonstrate HAMU's superior performance over baselines on both image and text datasets using large-scale models.

Abstract:
While ObjectCentric Learning has shown great promise in modular perception, its extension to Continual Learning remains underexplored. In this work, we observe that Slot Attention exhibits a distinctive behavior: it organizes latent representations into small and separated regions, each of which preserves identical factor states, crucially emerging not only in the current task but also across sequential tasks with novel factors. Thisinter-task separationoffers significant advantages in continual learning, which typically suffers from severe object-wise forgetting. We refer to this phenomenon asFactor-Wise Homogeneity, and show that this intrinsic inter-task separation is crucial, serving as a key mechanism to prevent catastrophic forgetting in Continual Object-Centric Learning. However, despite its strong robustness, factor-wise homogeneity alone is insufficient due to the bottleneck in exploiting this separation at the decoder. To overcome this limitation and demonstrate the significance of our findings, we show that a minimal strategyDecoder-only Post-Replay, which freezes the factor-wise homogeneous representations and employs decoder-only fine-tuning, is sufficient. This work serves as a fundamental basis for understanding and leveraging the intrinsic dynamics of Slot Attention, offering essential insights for advancing object-centric systems.

Abstract:
Learning to rank can be cast as a probabilistic modeling problem over permutations, where the goal is to estimate the likelihood of an observed total ordering of items. This formulation naturally involves fullorder probabilities of the form \mathbbP(\mathrmz_1 < \cdots < \mathrmz_n), whose exact computation and optimization are intractable due to the factorial growth of the permutation space with respect to list size. We introduce the Full-Order Bound (FOB), a tractable lower bound on the probability of an observed ordering, constructed from a subset of ordering constraints that factorizes across items while avoiding low-dimensional surrogate objectives and preserving order-reversal invariance. FOB induces a convex inner tightening problem over latent cut points, which we solve efficiently during training using a safe-region gradient ascent (SRGA) procedure. Experiments on synthetic ranking tasks and large-scale learning-to-rank benchmarks show that FOB consistently improves performance over pairwise and listwise surrogates, Plackett--Luce style sequential factorization models, and differentiable sorting baselines. Code is available at [https://anonymous.4open.science/r/FOB_2026-46C4](https://anonymous.4open.science/r/FOB_2026-46C4).

Abstract:
Largescale dedicated application of LLMs in diverse scenarios increasingly demands specialized model inference behavior under strict constraints of accuracy, latency, and memory. However, the heterogeneous and long-tailed nature of real-world specialized scenarios makes it difficult to obtain training data and optimize models. We study a practical inference-time specialization setting: given an LLM base, we compile a reusable, budget-bounded pathway/subnetwork within a specific scenario. Our approach is motivated by an empirical coupling phenomenon: input scenario sets aligned with similar representation subspaces (e.g., domain) in embedding space tend to activate a consistent and sparse set of internal reasoning pathways in model parameter space. To build the bridge between them, we propose probe-based SubspacePath Pruner with two core components: (1) Domain-Basis Synthesis (DBS) constructs a quasi-orthogonal basis of domain axes in embedding space, serving as a stable coordinate system. (2) Probe-based Scenario Pruning (PSP) uses efficient layer-wise linear probes to estimate axis alignment and compute budgeted head-wise pathways for a specific scenario. Experiments on LLaMA-2-13B show 29.3 average Recall on cross-domain tests (vs. 24.7 dense) and 21.6 on cross-dataset tests (vs. 25.5 dense) with 1.27x speedup at ~30% pruning ratio.

Abstract:
Why do adversarial examples exist, and why do they transfer between models? Existing explanations appeal to highdimensional geometry, non-robust patterns in the input, and decision boundary structure, but none provides a representation-level mechanism that explains why specific perturbations succeed and why attacks transfer between models. In this paper, we show that adversarial vulnerability can stem fromefficientinformation encoding in neural networks. Specifically, vulnerability can arise fromsuperposition-- the phenomenon where networks represent more concepts than they have dimensions, forcing non-orthogonal representation and thus interference. This interference causes perturbations targeting one representation to affect others, creating vulnerabilities determined by interference patterns. In synthetic settings with precisely controlled superposition, we establish that superpositionsufficesto create adversarial vulnerability. The resulting attacks are predictable: PGD-discovered perturbations align with theoretically optimal perturbations derived from the interference geometry. Models trained on similar data develop similar interference patterns, explaining attack transferability. We then show that successful attacks on vision transformer classifiers trained on CIFAR-10/100 exhibit the structure predicted by our theory. These findings reveal adversarial vulnerability can be a byproduct of networks' representational compression, complementing existing explanations based on data properties or architectural factors.

Abstract:
Selecting a pretrained large language model (LLM) to finetune for a task-specific dataset can be time-consuming and costly. With several candidate models available to choose from, varying in size, architecture, and pretraining data, finding the best model for a specific task often involves extensive trial and error. In addition, the "best" model may not necessarily be the one with the lowest test loss, as practical considerations such as deployment costs, inference throughput, and limited search budgets might also play crucial roles. To address this, we introduce LAMPS (LAnguage Model Pareto Selection), a novel and open-source multi-objective AutoML framework that meta-learns a resource allocation policy to efficiently identify (or approximate) the Pareto front of candidate LLMs for a task-specific dataset. It is based on two key ideas: (1) landmark fine-tuning, which generates early performance indicators of the candidate models, and (2) meta-learning via reinforcement learning, which learns an effective selection policy from historical performance data (a meta-dataset). Our results show that, on held-out datasets, LAMPS reduces search time by an average of 73% compared to exhaustive search, while still covering more than 99% of the optimal target space hypervolume.

Abstract:
Estimating the difference of two Stein's score functions is a fundamental problem in generative modeling. In particular, score differences arise naturally in transfer learning, where the score difference provides the mechanism for adapting a pretrained model to a new target distribution, and in diffusion model-based post-training methods such as discriminator guidance. Existing estimators for score differences in these settings either lack of statistical consistency or are difficult to scale up in high-dimensions. We propose a statistically consistent and scalable estimator for score differences based on Sobolev regularization, which plays a crucial role in ensuring consistency and stablizing the training in the small-sample regime. Mathematically, we establish a convergence rate of \tilde\mathcalO(n^-\fracs-1d+2s-2) where d is the dimension and s denotes the smoothness of the underlying densities, and provide a minimax lower bound of \tilde\Omega(n^-\frac2(s-1)d+2s) (in mean-squared error). Empirically, our estimator exhibits significantly improved stability in small-sample regimes compared to existing methods. We demonstrate its effectiveness on real-world tasks, including transfer learning for ECG signal generation, where it substantially outperforms non-regularized score difference estimators in downstream classification performance.

Abstract:
Visionlanguage models (VLMs) combining reinforcement learning (RL) ignite remarkable progress in multimodal reasoning, yet still struggle with medical images, which typically exhibit extremely sparse visual evidence to inform clinical decision-making. We recognize that pruning visual tokens outside the grounding region greatly enhances medical reasoning. However, a united RL framework for active visual token pruning (VTP) and medical multimodal reasoning remains unestablished. Here, we propose a dual-stream RL framework, ViToS, to fulfill token pruning and question answering. ViToS trains one policy model with two task branches, where one focuses on grounding while the other conducts token-sparse reasoning after VTP. Furthermore, we solve the coupled policy learning problem by introducing the cross-feedback sequential optimization, avoiding gradient conflict and facilitating convergence of the shared policy model. Evaluated on seven medical benchmarks, our method reduces visual tokens to 77% of the original sequence length while achieving a 108.27% relative performance on Lingshu-7B and 104.16% relative performance on HuatuoGPT-Vision-7B. Overall, ViToS delivers superior performance and inference speedup, establishing an efficient paradigm for medical multimodal reasoning.

Abstract:
Poster generation is a complex task demanding a harmonious integration of visual aesthetics and information hierarchy. While recent textto-image models have advanced visual synthesis, they remain non-editable and struggle with precise text rendering. Conversely, existing layout-generation methods offer structure but typically rely on static, one-shot predictions, lacking the mechanism for self-correction essential to professional design. Inspired by the iterative workflow of human designers, we introduce PosterAgent, a novel framework that reformulates poster creation as an agentic workflow involving initial drafting followed by iterative refinement. To effectively train this multi-turn capability, we propose Stage-Aware Reinforcement Learning (SARL), which decouples the optimization into draft-specific and refinement-specific phases, ensuring precise credit assignment for both initial drafting and incremental refinement actions. Extensive experiments demonstrate that PosterAgent significantly outperforms strong baselines, validating the potential of agentic systems in graphic design.

Abstract:
Analyzing attention weights has become a standard approach for interpreting the information flow of Large Language Models (LLMs). However, this approach has significant limitations as it neglects the geometric properties of the value vectors being aggregated. To address this gap, we introduce \emphContribution Weights, a projectionbased metric that quantifies a token's influence by accounting for it's attention weight, value magnitude, and directional alignment with the layer output. We demonstrate that contribution weights provide a more faithful measure of token importance, consistently outperforming attention-based metrics in identifying semantically critical tokens across different models, tasks, and datasets. Further, our metric enables novel mechanistic analysis of \emphattention sinks. While previous work has characterized sinks as passive repositories for excess attention, we reveal they serve an active functional role, suppressing information through a convex relationship between sink rate and output norm, stabilizing representations by opposing the semantic drift of low-confidence tokens.

Abstract:
3D Gaussian Splatting (3DGS) has been widely adopted for scene reconstruction, where training inherently constitutes a highly coupled and nonconvex optimization problem. Recent works commonly incorporate geometric priors, such as LiDAR measurements, either for initialization or as training constraints, with the goal of improving photometric reconstruction quality. However, in large-scale outdoor scenarios, such geometric supervision is often spatially incomplete and uneven, which limits its effectiveness as a reliable prior and can even be detrimental to the final reconstruction. To address this challenge, we model partially observable geometry as a continuous energy field induced by geometric evidence and propose EnerGS. Rather than enforcing geometry as a hard constraint, EnerGS provides a soft geometric guidance for the optimization of Gaussian primitives, allowing geometric information to steer the optimization process without directly restricting the solution space. Extensive experiments on large-scale outdoor scenes demonstrate that, under both sparse multi-view and monocular settings, EnerGS consistently improves photometric quality and geometric stability, while effectively mitigating overfitting during 3DGS training.

Abstract:
Large Multimodal Models (LMMs) integrate unimodal encoders with Large Language Models (LLMs) to execute complex multimodal tasks. Despite progress in the field, understanding the internal representations of these models through interpretable logic remains an open problem. To address this, we present a framework utilizing a HumanInspired (Neuro-fuzzy) approach for learning token representations. In this method, we leverage fuzzy rules to compute activation firing strengths, which are subsequently defuzzified to extract distinct concepts. This mechanism allows for the interpretation of learned representations directly through explicit logic. Consequently, we derive "multimodal concepts" that are both semantically coherent and interpretable. We validate our approach through rigorous qualitative and quantitative experiments, demonstrating the utility of these concepts in interpreting test samples. Additionally, we evaluate the disentanglement of the learned concepts and the efficacy of their grounding in both visual and textual domains.

Abstract:
Diffusionbased camouflaged object detection (COD) has recently shown great potential. In contrast to existing approaches that rely on multiple sample steps to refine the predicted masks, we propose CODiff, which reformulates the diffusion process to enable one-step mask prediction while maintaining competitive accuracy. Specifically, we first establish the theoretical feasibility of one-step sampling for COD. Based on this, we design a dedicated network for one-step inference with a global semantic guidance mechanism to guide the denoising process globally and hierarchical condition integration blocks to provide fine-grained structural semantics. In addition, we design a straight-forward regularization to learn better intermediate features by bridging the representation gap between the condition backbone and the diffusion model. Extensive experiments demonstrate that CODiff achieves state-of-the-art performance across multiple benchmarks, improving MAE by over 22% on the challenging COD10K dataset. The code will be released upon publication.

Abstract:
AudioVisual Large Language Models (AV-LLMs) grapple with the prohibitive computational costs of processing massive, redundant audio and video tokens. Existing unimodal compression techniques fail to capture the heterogeneous and mutually influential information density of joint audio-visual signals. Furthermore, we identify a fundamental and previously overlooked theoretical bottleneck in sparse token reduction: positional aliasing. We demonstrate that aggressive sparse sampling on standard position-encoded sequences violates the Nyquist limit relative to the effective token interval, causing phase-wrapping collisions that corrupt temporal monotonicity. To address this, we introduce EchoingPixels, a framework for aliasing-resistant joint token reduction. First, our Cross-Modal Semantic Sieve performs extractive selection on the synergistic audio-visual stream, learning to dynamically allocate budgets based on joint-modality saliency rather than fixed ratios per modality. Second, to resolve the aliasing issue, we derive Sync-RoPE, a mechanism that acts as a spectral low-pass filter for Rotary Positional Embeddings. By adapting the encoding bandwidth to the sparse sampling rate, Sync-RoPE preserves monotonic temporal relationships in the reduced stream. Extensive experiments show that EchoingPixels achieves performance comparable to full models using only 5-20% of original tokens, validating that a theoretically-grounded approach to sparse learning offers a robust solution for efficient AV-LLMs.

Abstract:
We study sequential decisionmaking in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavior depends on the learner's strategy, making standard regret notions inadequate. We develop an epoch-based optimistic, model-based framework and show it achieves policy regret \tildeO\!\left(H(m+\sqrt\beta)\sqrtd_E T\right), where H is the horizon, m the adversary's memory bound, d_E the Eluder dimension of the joint model class, and T the number of episodes. Our algorithm selects one policy per geometrically growing epoch using confidence sets built cumulatively from past data. We also prove a matching \Omega(\sqrtd_E T) lower bound (optimal up to log factors), and extend the framework to fading-memory adversaries and horizon-adaptive variants. These results give the first tight characterization of learnability in POMGs against adaptive opponents.

Abstract:
Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusionbased methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object’s 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable precise geometric alignment. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and target-integration guidance from the background image. We also introduce an automated data construction pipeline to improve training diversity and visual realism. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.

Abstract:
Neural speech codecs based on VectorQuantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Instead of modifying the quantizer or increasing model capacity—common approaches that complicate downstream language modeling—we introduce self-guidance, a simple yet general training principle that enhances the decoder's robustness to quantization error. Our core idea is to align the decoder's internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. It generalizes across codebook sizes, quantizer types, and network architectures, demonstrating value as a universal codec enhancer. Notably, it enables a 4× codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.

Abstract:
Chunked prefill has become the dominant scheduling strategy for large language model (LLM) inference, interleaving prefill and decode operations to improve GPU utilization. However, this approach does not universally outperform exclusive batching: on bandwidthconstrained GPUs, mixed batches can intensify memory-bandwidth contention and degrade decode throughput. Through controlled experiments on H200 (4.8\~TB/s) and RTX PRO 6000 (1.8\~TB/s), we find that mixed batching begins to underperform exclusive decoding at 80% decode ratio on H200, but at merely 20% on the bandwidth-constrained RTX PRO 6000. We present \sys, a scheduling framework for exclusive batching that derives closed-form, asymptotically optimal phase-switching thresholds under stochastic output-length models, along with memory-safe batch sizing. Experiments on real-world workloads demonstrate that \sys achieves up to 15.3% higher throughput than chunked prefill on bandwidth-constrained hardware while maintaining competitive latency.

Abstract:
RL with verifiable rewards can substantially improve LLM reasoning, yet standard GRPOstyle training often uses uniform sampling and near-uniform weighting, leading to inefficient computation allocation. We study GRPO by tracking token log-probabilities, group-normalized advantages, and induced token-level update weights. This reveals three recurring dynamics: probability inflation, advantage contraction as accuracy rises, and hierarchical convergence, where easy questions quickly saturate while hard questions remain discovery-limited due to rare correct rollouts. These findings imply that the benefit of each update depends strongly on both question difficulty and the model’s current competence. Motivated by this, we propose Confidence and Difficulty-adaptive Policy Optimization (CoDaPO), which assigns each question a bounded value from rollout confidence and empirical difficulty, then uses it to reweight policy updates and resample high-value questions within minibatches to increase discovery under a fixed compute budget. Across seven benchmarks, CoDaPO consistently improves accuracy over other RL methods.

Abstract:
Multiview classifiers typically fuse all observed views into a single representation, which becomes fragile when some views are missing or corrupted.We propose a prototype-anchored fusion module based on an entropically regularized unbalanced optimal transport (UOT) barycenter.Each view is summarized into a small set of learned atoms and is matched to a shared prototype support; fusion outputs a probability measure over prototypes with fixed dimension.By relaxing marginal constraints with a generalized KL penalty, the UOT objective can leave a fraction of view mass unmatched when matching is geometrically costly, yielding a simple differentiable trimming mechanism without hand-tuned thresholds.We provide a basic theoretical result showing that discarding an arbitrary subset of atom mass incurs a penalty bounded by its total mass, independent of transport distances.Experiments on multi-view action recognition benchmarks under simulated missing views, missing-rate shift, and feature-space corruption demonstrate consistently improved stability under severe missingness with modest overhead on top of strong backbones.

Abstract:
Reliable medical audio diagnosis demands models that are not only accurate but also honest about their uncertainty. However, finetuned models based on small, imbalanced datasets often become overconfident due to norm bias, whereby they rely on feature magnitude rather than semantic alignment. As a theoretical optimum for class-separating geometric structures, Equiangular Tight Frame (ETF) is effective for class-imbalanced and calibration tasks because of its maximal angular separability and geometric fairness. Yet, existing ETF-based methods perform weak when deal with noisy medical data; specifically, the gradient-based rotation results in instability, while the fixed ETFs fail in adapting to drifting prototypes. To solve this, We propose Spherical Procrustes Alignment (SPA), the first method combining spherical constraints with dynamic ETF alignment for medical audio. The SPA includes two branches: 1) the Spherical branch, which normalizes features and weights to eliminate the norm bias, and 2) the Geometric branch, which adapts features, tracks prototypes, and uses Dynamic Procrustes Alignment to align the fixed ETF with the prototypes, generating stable logits. Then a self-alignment mechanism fuses the two branches to jointly optimize the logits. Experiments on ICBHI 2017 and CirCor DigiScope datasets show that the SPA achieves new state-of-the-art results, turning large pre-trained models into reliable and efficient clinical tools without extra inference costs.

Abstract:
The rapid development of Large Language Models has driven MultiAgent Systems (MAS) growth, but constructing efficient MAS still requires labor-intensive manual design. Current automation methods often generate templated agents, rely on monolithic optimization, and ignore task complexity gradients. This paper presents Evolutionary MAS (EvoMAS), a biologically inspired framework that addresses these limitations through three interconnected dimensions: (1) dynamic and diverse evolutionary strategies with six biologically inspired operators (3 exploration, 3 exploitation) and adaptive strategy selection; (2) role-level evolution that dynamically optimizes agent specialization and collaboration patterns; and (3) curriculum-guided evolution that partitions tasks by difficulty and evolves sequentially from simple to complex under cross-stage stability constraints. To bridge the inefficiency of pure evolutionary search and the rigidity of manual design, we introduce the Cyber Creator, a meta-control system that combines dynamic rule formulation with reflective updates. Experiments show EvoMAS consistently outperforms existing methods across multiple domains while remaining cost-efficient, with agent roles evolving from homogeneous actors to specialized reasoning ensembles.

Abstract:
Effective DNA modeling demands the integration of complex patterns such as local motifs, longrange dependencies, and periodic signals. Yet, architectures like CNNs, Transformers, and SSMs are hindered by static or time-domain-exclusive designs, which limit their representational flexibility. To address this, we introduce theSynergistic Plasticity Unit (SPU), a scalable architecture that achieves multi-level plasticity through three synergistic layers. Specifically, SPU integrates aLocus Plasticity Layer(LPL) to capture fine-grained local motifs via token-specific convolution operations, while utilizing aDomain Plasticity Layer(DPL) to form multi-domain global features by concurrently modeling sequential (time) and spectral (frequency) patterns. Furthermore, it incorporates aSaliency Plasticity Layer(SPL) to optimize information flow through dual-axis saliency scoring. Supported by theoretical analysis, extensive empirical validation, and in-depth biological interpretation, this unified design enables SPU to achieve state-of-the-art performance with quasi-linear complexity, establishing a robust and principled paradigm for DNA modeling. Code will be available upon acceptance.

Abstract:
This paper introduces a novel crossphysiology translation task: synthesizing sleep electroencephalography (EEG) from respiration signals. To address the significant complexity gap between the two modalities, we propose a waveform-conditional generative framework that preserves fine-grained respiratory dynamics while constraining the EEG target space through discrete tokenization. Trained on over 28,000 individuals, our model achieves a 7% Mean Absolute Error in EEG spectrogram reconstruction. Beyond reconstruction, the synthesized EEG supports downstream tasks with performance comparable to ground truth EEG on age estimation (MAE 5.0 vs. 5.1 years), sex detection (AUROC 0.81 vs. 0.82), and sleep staging (Accuracy 0.84 vs. 0.88), significantly outperforming baselines trained directly on breathing. Finally, we demonstrate that the framework generalizes to contactless sensing by synthesizing EEG from wireless radio-frequency reflections, highlighting the feasibility of remote, non-contact neurological assessment during sleep.

Abstract:
Federated causal discovery for time series is becoming increasingly important in many application domains. In practice, intervention policies on each client often change over time, causing the local underlying causal mechanisms to drift rather than remain fixed. Moreover, different sampling frequencies across clients yield incompatible time scales in the observed data, making the resulting local causal graphs naturally heterogeneous and difficult to aggregate consistently. Accordingly, we propose FedCAW, a Credibility-Aware Weighting Federated causal discovery framework for time series. Specifically, we define edge-level credibility scores that quantify per-edge reliability by summarizing (i) within-client temporal stability across windows and (ii) cross-client temporal consistency after mapping heterogeneous sampling frequencies onto a unified time scale. We then aggregate privatized edge statistics under differential privacy, treating credibility scores as weights to recover a global causal graph while preserving personalized local structures without sharing raw data. Theoretically, we demonstrate the rationale for the unified time scale mapping and establish rigorous differential privacy guarantees. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of our proposed method.

Abstract:
We study optimization over nonconvex constraint sets that are homeomorphic to a ball, encompassing important problem classes such as star-shaped sets that frequently arise in machine learning and engineering applications. We propose Hom-PGD^+, a learning-based and projection-efficient first-order method that efficiently solves such problems without requiring expensive projection or optimization oracles. Our approach leverages an invertible neural network (INN) to learn the homeomorphism between the non-convex constraint set and a unit ball, transforming the original problem into an equivalent ball-constrained optimization where projections admit closed-form solutions. We establish that Hom-PGD^+ achieves an \mathcalO(\epsilon^-2) convergence rate to an (\epsilon + \mathcalO(\sqrt\epsilon_\rm inn))-approximate stationary solution, where \epsilon_\rm inn denotes the homeomorphism learning error. This rate significantly improves upon existing methods for optimization over non-convex sets, while maintaining a per-iteration complexity of only \mathcalO(W) for W INN parameters. Experiments on chance-constrained optimization problems in power systems demonstrate that Hom-PGD^+ achieves convergence rates comparable to state-of-the-art methods while delivering speedups of up to one order of magnitude.

Abstract:
Stability is a property of fundamental importance in realworld systems. Although it has been widely studied and well understood incontrol theory(CT) for deterministic systems, it is largely overlooked in stochastic systems such asMarkov decision processes(MDPs). In this paper, we aim to translate the steady-state control problem, well established in CT, where the goal is to synthesize a controller with prescribed asymptotic stability properties, into the MDP framework. To this end, we propose the novelfast-mixing steady-state(FMSS) problem. Given an ergodic MDP and a target steady-state distribution, the objective is to synthesize a Markovian policy that induces this distribution with the fastest possible convergence rate. Addressing this problem requires controlling the spectral properties of the inducedMarkov chain(MC) transition matrix, which generally leads to non-convex programs. Thus, we derive a tractable surrogate objective that leads to a convex program, whose properties we study in terms of approximation quality, feasibility, and computational complexity. We then move to the learning setting and propose an "offline" sample-based algorithm for FMSS (FMSS-SV), designed for tabular MDPs, in which the environment’s transition model is estimated from data. We quantify the impact of transition model estimation errors on both the objective value and the learned policy, and provide a finite-sample complexity analysis.

Abstract:
LowRank Adaptation (LoRA) achieves parameter-efficient fine-tuning by constraining model updates to a low-rank subspace and has been widely used in practice. However, LoRA typically employs a shared low-rank update across tokens, which limits its ability to fully exploit the adaptation subspace for tokens from different sequences. To address this issue, we propose an adaptive utilization of Low-Rank Adaptation (U-LoRA), which employs conditioned gating to explicitly learn effective token-level utilization of the limited low-rank adaptation subspace. Specifically, U-LoRA generates utilization coefficients along low-rank directions for each token and jointly coordinates and constrains them using sequence-level contextual information, thereby inducing more consistent adaptive patterns within a sentence. To further enhance training stability, we introduce a bias-corrected exponential moving average (EMA) historical prior that calibrates utilization signals across optimization steps, suppressing noise caused by batch-to-batch fluctuations. The effectiveness of our method arises from a better utilization of the existing low-rank subspace via input-conditioned strategies, rather than from expanding the subspace. Experiments on mathematical reasoning and natural language understanding benchmarks demonstrate that U-LoRA achieves competitive performance under comparable parameter budgets when with strong LoRA baselines and recent variants.

Abstract:
Monocularto-stereo conversion synthesizes stereoscopic content from 2D videos for immersive 3D experiences. Modern Depth-Image-Based Rendering (DIBR) approaches identify stereo inpainting of disocclusions as the critical bottleneck. Training-based methods achieve superior quality but rely on scarce stereo pairs or synthetic data with domain gaps. We address this through the first self-supervised framework learning from monocular videos via cycle consistency. Our key contribution is theGeometric Reciprocity Theorem (GRT): the disocclusion mask when synthesizing a target view exactly equals the mask of pixels lost when warping back from target to source, enabling analytical computation of test-time disocclusion masks directly from monocular images. This achieves exact train-test consistency, enabling self-supervised learning from unlimited monocular videos and substantial improvements over training-free and supervised state-of-the-art methods.

Abstract:
Subject Customization is a foundational task in modern image generation. By providing a few reference images and a text prompt, users can generate images of a specific object in any desired scene. However, existing methods still struggle to achieve effective pose control for customized subjects. In practice, they often exhibit inaccurate poses or inconsistent crosspose appearances. These limitations suggest that understanding objects in a volumetric manner remains a significant challenge for 2D-native backbones. To address this challenge, we propose Pose-ICL, a tuning-free framework that leverages 3D-aware In-Context Learning (ICL) to directly adapt to new subjects through multiple paired image-pose references. Its core mechanism, Surface-Anchored Position Embedding (SAPE), equips the model with explicit 3D awareness by anchoring image tokens to the surface coordinates of a volumetric bounding box. Dedicated refinements ensure its seamless compatibility with existing DiT models. Extensive evaluations on both 3D assets and real-world subjects demonstrate that Pose-ICL significantly outperforms current methods in both pose accuracy and identity consistency.

Abstract:
Large language models (LLMs) often employ RetrievalAugmented Generation (RAG) to improve factuality. However, this also increases the risk of sensitive private information leakage. Differential Privacy (DP) has therefore been integrated into LLM inference and is widely regarded as a standard safeguard; yet most studies focus narrowly on the privacy–utility trade-off, leaving the trustworthiness of DP outputs underexplored. To assess trustworthiness, we employ the confidence gap (CG), which quantifies an LLM’s internal knowledge conflict. We show that CG correlates with both hallucination and exposure of personally identifiable information (PII). Building on this insight, we propose PEARL, a CG‑guided, entropy‑aware private decoding framework. PEARL adaptively allocates the privacy budget across tokens and sentences based on CG, concentrating protection on PII-bearing spans while stabilizing low-confidence, hallucination-prone regions. In experiments, PEARL improves both trustworthiness and robustness against PII extraction attacks. Notably, while applying DP alone significantly increases hallucination, our framework demonstrates that it is possible to preserve privacy while reducing hallucination.

Abstract:
Existing proxybased hashing methods optimize samples toward independently learned proxies using isolated similarity constraints. Although efficient, this design overlooks the fact that proxies are learned jointly but lack explicit relational or competitive interactions during optimization. Consequently, proxy responses to a sample are often accumulated rather than contrasted, leading to weakly defined decision regions and limited discriminative structure in the Hamming space. In contrast, our method organizes multiple proxies into sample-specific relational structures, enabling proxies to interact and compete when responding to each sample. Through structure-guided learning, these interactions explicitly contrast positive and negative proxy responses, thereby shaping clearer and more discriminative decision boundaries. Extensive experiments on standard cross-modal benchmarks demonstrate that this structured discrimination consistently improves retrieval accuracy and embedding separability.

Abstract:
Large language models (LLMs) are increasingly used to provide automated assessment signals for evaluating modelgenerated outputs. However, practical deployment faces three persistent challenges: heterogeneous reliability across models, substantial latency and token costs, and the absence of principled strategies for allocating evaluation resources. We introduce R^2Eval, a routing-aware automated assessment framework that formulates evaluation as a resource allocation and aggregation problem rather than relying on a single monolithic evaluator. R^2Eval combines difficulty-aware routing with reasoned evaluation signals to dynamically select evaluator models on a per-instance basis under explicit accuracy, latency, and cost constraints. Our study makes three contributions. First, we construct six difficulty-aware datasets spanning both reasoning-intensive (mathematics, logic, code) and non-reasoning (knowledge, roleplay, writing) tasks, with human-annotated reference assessments. Second, we provide a systematic empirical analysis of how reasoning traces produced by different evaluator models correlate with assessment outcomes, revealing substantial variance and systematic mismatches across difficulty regimes. Third, we develop and evaluate both offline and online routing strategies that adaptively allocate evaluation queries, achieving substantially improved accuracy–efficiency trade-offs compared to static baselines. Experiments across 19 language models demonstrate that R^2Eval significantly reduces evaluation cost and latency while maintaining close alignment with human assessments. These results highlight the importance of routing-aware automated assessment and establish R^2Eval as a scalable and reliable framework for large-scale model evaluation.

Abstract:
Kleinberg and Mullainathan recently proposed a formal framework for studying the phenomenon of language generation, calledlanguage generation in the limit. In this model, an adversary gives an enumeration of example strings from an unknown target language, and the algorithm is tasked with correctly generating unseen strings from the target language within finite time. Refined notions of nonuniform and uniform generation were later introduced by Li, Raman, and Tewari (2025), and a noisy model was introduced by Raman and Raman (2025), which allows the adversary to insert extraneous strings. A natural question in the noisy model is to quantify the effect of noise, by studying the impact of each additional extraneous string. We show two complementary results in this setting. We first show that for both uniform and non-uniform generation, a single noisy string strictly reduces the set of collections that can be generated, thus answering an open question in Raman and Raman (2025). Then, we show for both uniform and non-uniform generation that generation with a single noisy string is equivalent to generation with any finite amount of noise, sharply contrasting with the strict hierarchy for noisy generation in the limit shown by Bai, Panigrahi, and Zhang (2026). Finally, we leverage our previous results to provide the first known characterization for non-uniform noise-dependent generatability.

Abstract:
Sequential recommendation models, while powerful, have limited flexibility in responding to immediate user requests, making it difficult to adapt their recommendations to the user's timely interests. Unfortunately, existing user request adaptation methods often incur high computational overhead due to either 1) retraining the entire backbone network or 2) leveraging the inference ability of large language models (a.k.a. prompt engineering), limiting their applicability in largescale recommendation services. This paper presentsControl viaRequest-AwareMasking forEditingRecommenders (CRAMER), a framework that takes users' natural-language requests to immediately change sequential recommendation models' behavior. Specifically, inspired by the model control theory, CRAMER treats user requests as control signals to modulate frozen backbone parameters through masking, achieving instant adaptation to diverse requests while avoiding costly retraining. Experiments on multiple large-scale benchmark datasets show that CRAMER outperforms four state-of-the-art request-aware baselines across multiple recommendation metrics while achieving minimal overhead. Moreover, the proposed framework exhibits enhanced controllability and cross-domain adaptability, establishing a new paradigm for request-aware sequential recommendation.

Abstract:
Existing contrastive multiview clustering methods rely on a pre-defined cluster number, limiting their flexibility in real-world scenarios lacking prior knowledge. To address this, we propose GROK, a novel framework driven by a cluster decision agent for unknown-K multi-view clustering. It pioneers the adaptation of group relative policy optimization (GRPO) —a reinforcement learning strategy for LLM reasoning— into the unsupervised domain to autonomously determine the optimal K. Specifically, the agent orchestrates the clustering process through three synergistic phases. First, in the state perception phase, we employ a structure-aware adaptive backbone to aggregate multi-view data, providing the agent with consistent and discriminative consensus observations. Second, in the group decision phase, we introduce an action space divide-and-conquer strategy and an adaptive reward function. Equipped with these mechanisms, the agent performs group sampling and relative advantage estimation within the discrete action space of candidate K values, autonomously searching for the optimal K via reward maximization. Finally, via geometric feedback, geometric clustering guidance mechanism transforms the agent's structural hypotheses into explicit differentiable constraints to reshape feature manifolds, thereby closing the perception-decision-feedback loop. Experimental results demonstrate that GROK achieves superior clustering performance in unknown-K scenarios by autonomously exploring the underlying cluster structure.

Abstract:
Autonomous imageediting agents powered by multimodal large language models (MLLMs) improve transparency and controllability by translating high-level instructions into tool-mediated edit sequences, but training such agents with reinforcement learning often relies on dense proxy rewards (e.g., incremental image-quality score gains) to compensate for sparse human feedback. When these proxies overvalue small local changes, the resulting optimization signal can be dominated by numerically measurable yet perceptually negligible edits, biasing policy gradients toward proxy artifacts rather than meaningful progress. We propose B-Spar, a reward-centric Reinforcement Learning framework for perceptually aligned image retouching under sparse feedback that combines prior-guided trajectory sampling to reduce inefficient exploration, Bayesian reward modeling to densify sparse binary feedback into a stable training signal, and anchor-regularized policy optimization to steer updates toward high-reward regions while preventing early mode collapse. Experiments on public benchmarks demonstrate that B-Spar improves perceptual quality and metric alignment with stable training and competitive inference efficiency over strong prompt-based and training-based baselines. Notably, it outperforms AIGC-based baselines by over 95% in perceptual quality, achieving an improvement of approximately 33.5% over the state-of-the-art.

Abstract:
Naive application of tokenwise temperature scaling alters the maximum a posteriori (MAP) estimate at the sequence level, degrading model performance. This issue is exacerbated in MDMs, which estimate sequence-level likelihoods with high variance under different unmasking orders. In this paper, we address the challenge of reliable temperature scaling with a novel fine-tuning procedure and introduce a new metric to measure effective temperature scaling without requiring the partition function. Our method adapts a context-dependent sequence-level temperature scaling method to any-order generative models, such as MDMs. And introduces two new, more stable learning objectives. We achieve this by computing the expected probability of a given sequence under different unmasking orders. Our experiments on language models (bd3lm) show that this leads to more consistent generation, with lower perplexity and lower generation variance.

Abstract:
Federated Learning (FL) enables collaborative model training across decentralized clients, but its performance often degrades under nonIID data distributions, particularly in the presence of data skew. Existing approaches mitigate this issue by estimating client skew via kernel density estimation over neighboring model updates, which preserves privacy and reduces communication costs. However, such approaches suffer from two fundamental limitations: bias toward skewed majority clients due to fixed neighborhood structures, and vulnerability to noise-induced perturbation in kernel space. To address these challenges, we propose FedVeer, a skew-aware FL framework based on self-adaptive kernel density estimation with k-free neighborhoods. FedVeer dynamically determines the neighborhood size via max-margin learning to mitigate majority-client bias, and further incorporates Kalman filtering to stabilize margin estimation under noisy updates, with a high-probability theoretical guarantee on margin deviation. Extensive experiments on real-world datasets demonstrate that FedVeer consistently outperforms four baselines, achieving up to 6.36% accuracy improvement and reducing noise-induced degradation by up to 6.01%.

Abstract:
4bit quantization reduces the memory footprint and latency of large language model inference, but its aggressive precision reduction can severely degrade accuracy. Prior methods address this by decomposing each weight matrix into two components (e.g., via singular value decomposition) and quantizing them separately, assigning the bulk of values to a low-precision residual component while handling outliers with a high-precision low-rank component. However, such decompositions are designed to minimize the real-valued energy of the residual, rather than the post-quantization error of the residual and low-rank components. We propose TwinQuant, a 4-bit quantization framework that learns quantization-friendly decomposed subspaces and jointly reshapes both the low-rank and residual components. TwinQuant learns component-specific transformations via a joint optimization over the Stiefel and general linear manifolds, flattening their distributions and reducing dynamic-range imbalance. To enable efficient end-to-end execution, we further design a fused dual-component kernel that pipelines the two-stage low-rank computation on-chip and merges both components with a single epilogue, avoiding intermediate global-memory traffic. Across LLaMA3 and Qwen3 models, TwinQuant preserves near-FP16 accuracy and delivers up to 2.11× end-to-end speedup over an FP16 baseline.

Abstract:
Prompt learning for visionlanguage models (VLMs) primarily follows end-to-end or decoupled routes to balance base and new task performance, but suffers a fundamental bottleneck: sample-wise optimization within task-specific feature spaces traps models in local optima, hindering global optimality. To address this, we identify a key insight that VLMs can be prompted within a Coupled Prompt Field-a shared space where base and new tasks are mutually constrained-and present AlignedNorm, which enforces the field coupling. By dynamically aligning the norms of prompts to VLMs' native scale, our method enables joint optimization of both tasks. Without complex designs, our method matches leading decoupled approaches on 15 datasets across 4 experimental settings, offering both a new perspective and a practical solution to the local-optima dilemma in prompt learning.

Abstract:
Synthesizing authentic phonocardiograms (PCG) from ubiquitous electrocardiograms (ECG) is a critical task for accessible cardiac monitoring. Existing generative models, however, struggle to capture the heart's complex electromechanical coupling, failing to meet the dual requirements of temporal precision and physiological fidelity essential for clinical diagnosis. We introduce the TemporallyAware VAE-Diffusion model, a synergistic hybrid architecture that resolves this trade-off. Our architecture enforces tight physiological coupling through an Enhanced Condition Fusion mechanism and explicitly models long-range cardiac dynamics via Temporal Attention Blocks. On the EPHNOGRAM benchmark, our model sets a new state-of-the-art, achieving a Pearson correlation of 0.910\pm0.008, 95.95% S1 detection accuracy, and a precise 12.0 ms timing error, significantly outperforming leading diffusion and Transformer baselines.Crucially, our work presents a rigorous demonstration of successful zero-shot generalization for this task. evaluated on the unseen PhysioNet/CinC 2016 dataset, our model maintains high fidelity even on challenging pathological recordings, establishing a validated foundation for robust, accessible cardiovascular diagnostics.

Abstract:
It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of VisionLanguage-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under real-world visual disturbances. The code will be made publicly available.

Abstract:
We present a principled framework for uncertainty quantification in computed tomography (CT) reconstruction. Based on the sequential likelihood mixing framework (Kirschner et al., 2025), we establish the first confidence regions with theoretical coverage guarantees for deep learningbased CT reconstructions. In particular, we consider a realistic forward model following the Beer-Lambert law, i.e., a log-linear forward model with Poisson noise, closely reflecting clinical and scientific imaging conditions. The framework is general and applies to both classical algorithms and deep learning reconstruction methods, including U-Nets, U-Net ensembles, and generative Diffusion models. Empirically, we demonstrate that deep reconstruction methods yield substantially tighter confidence regions than classical reconstructions, without sacrificing theoretical coverage guarantees. Our approach allows the detection of hallucinations in reconstructed images and provides interpretable visualizations of confidence regions. This establishes deep models not only as powerful estimators, but also as reliable tools for uncertainty-aware medical imaging.

Abstract:
Knowledge distillation is crucial for compressing large Vision–Language Models (VLMs) into efficient architectures. While prior VLM research has primarily focused on reasoning tasks like visual question answering, multimodal embedding learning, a key component for largescale retrieval, has received comparatively less attention. Existing distillation methods typically align static global representations, overlooking hierarchical feature structure and fine-grained cross-modal interactions. This leads to a structural gap where student models fail to inherit object-level semantics and spatial relationships from teachers. To address this limitation, we proposeHieRD, a Hierarchical Representation Distillation framework that preserves hierarchical structure within and across modalities throughout the distillation process by leveraging clustered visual tokens and multi-granular alignment with phrase-level text. Experimental results on multimodal embedding and downstream tasks show that HieRD consistently outperforms strong baselines, reflecting the effectiveness of its fine-grained semantic and spatial modeling, while enabling compact and efficient embedding models.

Abstract:
Learned world models are emerging as a powerful paradigm in robotics, offering a promising path toward task generalization, longhorizon planning, and flexible decision-making. However, prevailing approaches often operate on 2D video sequences, inherently lacking the 3D geometric understanding necessary for precise spatial reasoning and physical consistency. We introduce aStructured 4D Latent World Model, which predicts the evolution of a scene’s 3D structure in a structured latent space conditioned on observations and textual instructions. Our representation encodes the scene holistically and can be decoded into diverse 3D formats, enabling a more complete and physically consistent scene understanding. This structured 4D latent world model serves as a planner, generating future scenes that are translated into executable actions by a goal-conditioned inverse dynamics module. Experiments demonstrate that our model generates futures with superior visual quality, physical consistency, and multi-view coherence compared to state-of-the-art video-based planners. Consequently, our full planning pipeline achieves superior performance on complex manipulation tasks, exhibits robust generalization to novel visual conditions, and proves effective on real-world robotic platforms. Our website is available at https://icml2026-18617.github.io/.

Abstract:
Black box auditing of language models is an essential predeployment tool, but it may miss subtle forms of misalignment and hidden information. To better elicit hidden information during an auditing process, we introduce overthinking: the process of using reasoning task vectors to amplify the chain-of-thought faithfulness of reasoning models. Given the parameters of a base instruct model M and reasoning-distilled model R, we define the overthinking model as \mathcalO_\alpha = M + \alpha(R - M), where \alpha > 1 amplifies reasoning beyond the pure reasoning model R. Additionally, we introduce layer-wise attenuation strategies that selectively amplify reasoning without losing quality and coherence of model outputs. We demonstrate that overthinking models are more likely to reveal hidden information across four experimental settings, across 2B-32B models. Our findings suggest that reasoning amplification may surface secrets or unintended behaviors acquired during training up to 10 times more frequently than the original reasoning model.

Abstract:
While Large VisionLanguage Models (LVLMs) achieves remarkable success, hallucinations remain a significant barrier to their reliable deployment. Recent studies primarily attribute these defects to cross-modal attention imbalances, with most solutions focusing on re-weighting visual tokens or suppressing language priors. Such approaches often overlook the spectral characteristics of the visual information flow and frequently rely on Contrastive Decoding (CD), which doubles the inference time. Instead of following conventional approaches, we identify two distinct hallucination patterns—Perceptual-Semantic Dissociation and Localized Fixation—and accordingly develop FLASH (Frequency-Localized Attention SHaping), a training-free and CD-free framework. FLASH utilizes a Spectral Vortex Score to detect visual heads within multi-head attention layers, applying adaptive spectral modulation to rectify the visual information flow during the decoding phase. Empirical results demonstrate that FLASH offers a superior balance between performance and efficiency compared to SOTA methods.

Abstract:
Masked diffusion language models enable parallel token generation and offer improved decoding efficiency over autoregressive models. However, their performance degrades significantly when generating multiple tokens simultaneously, due to a mismatch between tokenlevel training objectives and the need for joint sequence consistency. In this paper, we propose ME-DLM, an edit-based refinement framework that augments diffusion generation with a lightweight post-generation editing step. After producing an initial complete response, the model refines it through minimal edit operations, including replacement, deletion, and insertion, conditioned on the full sequence. Training supervision is derived from edit distance, providing a deterministic supervision signal under a fixed canonicalization scheme for learning minimal corrections. This approach encourages sequence-level consistency through minimal, globally conditioned edits while preserving the efficiency benefits of parallel diffusion decoding, and substantially improves the quality and robustness of multi-token generation. Extensive experiments demonstrate that the proposed approach substantially improves the quality and robustness of multi-token parallel generation. In particular, when built upon LLaDA, our method achieves consistent gains of 11.6% on HumanEval and 33.6% on GSM8K while using one-eighth of the total diffusion steps.

Abstract:
Fairness in machine learning prediction has attracted growing attention in recent years. In this article, we propose a causal–inference–based framework for fair prediction, defined through pathspecific counterfactual interventions. Instead of imposing fairness via constraints on predictive objectives or model parameters, our approach specifies fairness directly at the level of counterfactual prediction semantics. Given a learned causal graph, we construct a predictive distribution for the outcome Y using a structural causal model and generate counterfactual predictions by selectively intervening on causal paths emanating from sensitive attributes. By allowing or blocking the propagation of sensitive information along designated paths, possibly involving multiple sensitive sources, our framework induces a hierarchy of interpretable fairness notions, generalizing standard path-specific causal semantics. Our empirical experiments demonstrate how different fairness levels can be instantiated and compared in practice.

Abstract:
VisionLanguage models (VLMs), such as CLIP, achieve powerful zero-shot classification. However, their predictions remain sensitive to spurious correlations, where contextual cues dominate over semantic content. Earlier solutions typically rely on fine-tuning or prompt engineering, which either undermine the advantages of pre-trained models or are prone to hallucination. In this work, we propose Density-Aware Translation (DAT) that refines image-text similarity scores using a local geometric density term derived from group reference sets. Our approach is motivated by the phenomenon that CLIP embeddings exhibit a modality gap and lie on an anisotropic shell in the feature space: common patterns cluster near the mean, while rare patterns are pushed outward. This geometry creates uneven alignment, where spurious correlations are amplified while semantically meaningful but rare cues are marginalised. To address this, we employ a relative measure to rescale similarities based on embedding density, suppressing overconfident scores in diffuse regions while preserving dense, semantically consistent matches. Experimental results on benchmark datasets demonstrate consistent improvements in worst-group and average accuracy, highlighting density-aware translation as a simple and effective calibration mechanism for reliable zero-shot classification using multimodal models.

Abstract:
This paper studies a constrained linear best arm identification problem with covariate selection in the fixedconfidence setting, where each arm is evaluated across multiple performance metrics. The mean performance of each metric depends linearly on the feature vectors of both arms and covariates. The goal is to identify the arm with the highest expected value of one targeted metric while ensuring that the means of the remaining metrics stay below specified thresholds for each covariate. We first establish an instance-dependent lower bound on the sample complexity, formulated as a multi-level optimization problem that captures both feasibility and optimality. We then prove that this bound is tight by designing an algorithm that asymptotically matches it. Since the original algorithm is computationally intensive, we develop a relaxed version of the bound through a surrogate optimization problem and derive its convex dual. Using this bound, we propose a duality-based decomposition algorithm that is computationally efficient, updating only two coordinates and performing a single gradient step per iteration. We further show that the algorithm achieves the relaxed bound in theory and demonstrates its practical effectiveness through numerical experiments.

Abstract:
Discrete Flow Matching (DFM) presents a promising approach for graph generation; however, existing adaptations often introduce substantial complexity by incorporating taskspecific heuristics, compromising the continuity equation and significantly expanding the hyperparameter space. Moreover, their sampling efficiency remains limited, as the required number of steps is often comparable to diffusion models, diminishing DFM’s practical advantages. To address these limitations, we propose SimGFM, a simplified graph DFM for graph generation. Leveraging characteristic patterns in graph-generation trajectories, SimGFM relies only on the scheduler and rate matrix, eliminating heavy heuristics and hyperparameter tuning, and achieves large step reductions while preserving SOTA results. SimGFM achieves strong empirical results: on QM9, it matches prior models requiring 500–1000 steps with only 10 steps, and on most datasets, its performance at 50 steps matches or surpasses these baselines, demonstrating both efficiency and competitiveness.

Abstract:
Rotary Position Embedding (RoPE) is widely adopted in Transformer models, yet its extension to highdimensional domains lacks a unified theoretical formulation. Most existing approaches either apply rotations independently along each axis or mix frequencies empirically, which limits cross-dimensional interactions and yields direction-dependent representations. To address these limitations, we propose nD-RoPE, a decomposition-free generalization of rotary embeddings to arbitrary dimensions. From a translation-invariant formulation in continuous Hilbert space, we derive a spectral condition for isotropy that requires treating positions and frequencies as coupled n-dimensional vectors. We instantiate this principle with a multi-scale regular simplex wave-vector design that provides uniform directional coverage with maximal symmetry. Experiments across images, videos, and point clouds demonstrate consistent performance gains and improved generalization in high-dimensional settings.

Abstract:
Privacy protection has become a critical requirement in the era of ubiquitous visual data sharing, imposing higher demands on efficient and robust privacy detection algorithms. However, current robust detection models are severely hindered by the lack of comprehensive datasets. Existing privacyoriented datasets often suffer from limited scale, coarse-grained annotations, and narrow domain coverage, failing to capture the intricate details of sensitive information in real-world environments. To bridge this gap, we present a large-scale, fine-grained Visual Privacy Dataset (VPD-100K), designed to facilitate generalized privacy detection. We establish a holistic taxonomy comprising four primary domains: Human Presence, On-Screen Personally Identifiable Information (PII), Physical Identifiers, and Location Indicators, containing 100,000 images annotated with 33 fine-grained classes and over 190,000 object instances. Statistical analysis reveals that our dataset features long-tailed distributions, small object scales, and high visual complexity. These characteristics make the dataset particularly valuable for demanding, unconstrained applications such as live streaming, where actors frequently face unintentional, real-time information leakage. Furthermore, we design an effective frequency-enhance lightweight module consisting of frequency-domain attention fusion and adaptive spectral gating mechanism that breaks the limitations of spatial pixel intensity to better capture the subtle details of sensitive information. Extensive experiments conducted on both diverse image and streaming videos benchmarks consistently demonstrate the effectiveness of our VPD-100K dataset and the well-curated frequency mechanism.

Abstract:
Multimodal dataset distillation (MDD) seeks to compress large-scale multi-modal datasets into a compact set of synthetic pairs. Existing methods employ a dual-trajectory matching framework to align the teacher and student models within each modality. While effective, this paradigm incurs non-negligible memory and computational overhead due to the checkpoint storage and bi-level optimization over synthetic data. To address these limitations, we propose analytic parameter matching (APM), which theoretically derives the analytic parameters of modal projectors to replace the inner-loop optimization, and then aligns the analytic projector parameters of teacher and student models. APM offers two key advantages: (1) it replaces checkpoint-intensive storage with only two cached matrices, significantly reducing memory consumption; and (2) it computes analytic parameters in a single forward pass, thereby avoiding costly bi-level optimization. Empirically, APM achieves up to 65× storage reduction and 9.6× faster distillation, while scaling to 1,000 synthetic pairs. Extensive experiments on image-text and audio-text benchmarks demonstrate the effectiveness of APM in cross-modal retrieval tasks, \eg, 12.8 IR@1 and 17.8 TR@1 in Flickr30k with 100 synthetic pairs. Moreover, APM exhibits notable generalization performance in cross-architecture evaluation and zero-shot classification tasks.

Abstract:
Evidential Deep Learning (EDL) enables trustworthy multiview classification, yet suffers from a critical vulnerability: the Scale Mismatch Problem. We theoretically demonstrate that existing evidential fusion rules erroneously equate logit magnitude with semantic confidence, rendering them susceptible to semantic hijacking by inflated but uninformative views. To resolve this, we propose Scale-Invariant Evidential Fusion (SAEF), a framework utilizing instance-wise standardization to strictly decouple confidence from scale. Instead of relying on magnitude dominance, SAEF aggregates views based on statistical consensus. Theoretically, SAEF guarantees invariance to global scaling and robustness to asymmetric dominance. Experiments on four diverse datasets confirm that SAEF outperforms state-of-the-art baselines in accuracy and robustness to semantic conflicts and noise, ensuring stability against severe scale perturbations.

Abstract:
Evaluating survival models under censoring is inherently challenging, yet standard evaluation practices are often applied without explicitly assessing how censoring distorts metric reliability. Performing a large experimental study, we analyze and quantify how survival evaluation metrics are affected in fundamentally different ways by the censoring rate and the censoring mechanism. Using a controlled semisynthetic framework, we vary both the censoring mechanism (administrative, independent, covariate-dependent) and the censoring rate, and compare standard evaluations based on censored data with oracle evaluations using fully observed event times. This controlled setting enables us to quantify distortions along two complementary axes: numerical bias and preservation of model ranking. Across datasets and metric families, we find that censoring induces systematic, mechanism-dependent distortions. Moderate numerical bias, if not properly addressed, can lead to unreliable model comparison as censoring increases. These findings reveal fundamental limitations of common benchmarking practices and call for more careful interpretation of survival evaluation under realistic censoring.

Abstract:
Learningbased Traffic Signal Control (TSC) achieves satisfactory performance in small networks, but its effectiveness often deteriorates in larger networks under dynamic traffic patterns and intersection heterogeneity. In this work, we propose SLight, a policy-aware grouped MARL-TSC framework that enables scalability and efficiency balance under dynamic and heterogeneous traffic conditions. SLight captures policy-influenced traffic patterns with a policy-aware traffic pattern encoder, learns explicit group-level shared control principles from state–action trajectories, and matches each intersection’s traffic pattern embedding to principle prototypes flexibly through a compatibility-based adaptive assignment module. Experiments on real-world and synthetic networks demonstrate that SLight sustains performance gains as scale increases and outperforms existing rule-based, reinforcement learning, and grouping-based baselines. Code is available at \urlhttps://anonymous.4open.science/r/code-20D3/

Abstract:
Multimodal clustering, which integrates information from diverse sources and feature modalities, has shown great potential in data mining and computer vision. However, existing methods relying on single-granularity relationships often struggle with complex data distributions, leading to limited performance, as fine-grained features are prone to local heterogeneity and redundant perturbations while coarse-grained representations tend to lose local structural information. To address these limitations, we introduce granular-balls (GBs), adaptive multi-granularity hyperspheres that enclose similar samples, and propose the Structure-aware Granular-Ball based Information Bottleneck (SGB-IB) algorithm. This method initializes the dataset as a single GB and recursively splits GBs based on a purity metric, which quantifies the average mutual information between sample features and K-means-derived pseudo-labels across all modalities. It also balances local structure preservation and global redundancy suppression through a structure-aware objective function. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches, validating the effectiveness of fusing GB structures with information-theoretic principles.

Abstract:
Machine unlearning (MU) aims to expunge a designated forget set from a trained model without costly retraining, yet the existing techniques overlook two critical blind spots: “over‑unlearning" that deteriorates retained data near the forget set, and post‑hoc “relearning” attacks that aim to resurrect the forgotten knowledge. Focusing on classlevel unlearning, we first derive an over-unlearning metric, \operatornameOU@\varepsilon, which quantifies collateral damage in regions proximal to the forget set, where over-unlearning mainly appears. Next, we expose an unforeseen relearning threat on MU, i.e., the Prototypical Relearning Attack, which exploits the per-class prototype of the forget class with just a few samples, and easily restores the pre-unlearning performance. To counter both blind spots in class-level unlearning, we introduce \textttSpotter, a plug‑and‑play objective that combines (i) a masked knowledge‑distillation penalty on the nearby region of forget classes to suppress \operatornameOU@\varepsilon, and (ii) an intra‑class dispersion loss that scatters forget-class embeddings, neutralizing Prototypical Relearning Attacks. \textttSpotter achieves state-of-the-art results across CIFAR, TinyImageNet, and CASIA-WebFace datasets, offering a practical remedy to unlearning’s blind spots.

Abstract:
Controlling partial differential equations (PDEs) with learningbased policies remains fundamentally limited by fixed-dimensional representations: policies trained for a specific sensor, actuator, or agent configuration typically fail when the configuration changes. This limitation is particularly severe in multi-agent PDE control, where policies do not scale across population sizes without retraining. We address this challenge by reformulating PDE control as an operator learning problem that maps state fields to continuous control functions and trains them end-to-end through differentiable PDE solvers, yielding policies that naturally adapt to varying sensor and actuator configurations. Remarkably, policies trained on small swarms exhibit cardinality invariance, allowing for zero-shot transfer to significantly larger populations as well as robustness to partial agent failure. This scalability arises from agents sharing a common policy and coordinating through their physical environment, which produces an emergent self-normalization effect. To explain this phenomenon, we provide a theorem grounded in mean-field theory demonstrating that policy gradients computed from finite-agent systems converge to those of a continuous control limit. Empirically, we validate the framework on tracking, stabilization, and density transport across linear, nonlinear, chaotic, and turbulent PDEs.

Abstract:
Differential privacy (DP) imposes fundamental tradeoffs between privacy and statistical fidelity in synthetic data generation. While access to public data has been shown to improve these trade-offs empirically, existing approaches exploit public data only indirectly, through pre-processing (e.g., using pre-trained generative models) or post-processing steps (e.g., matching target statistics estimated from public datasets), while relying on domain-agnostic DP mechanisms. In this work, we lay the theoretical framework to study the principled incorporation of public data into DP mechanisms themselves. We consider normalized histograms as distribution estimators and characterize the \emphtheoretically optimal domain-aware privacy mechanism within a class of mixing-based DP mechanisms. We introduce \textscPubMix, a public-data-aware DP mechanism that can be used in histogram-based data synthesis pipelines. Our experiments demonstrate that, when public data is available, \textscPubMix significantly improves synthetic data generation quality across tasks without compromising privacy.

Abstract:
We study the incontext universal approximation and compositional generalization of softmax Transformers. We prove an in-context universality result: a fixed-weight softmax Transformer approximates a broad class of continuous sequence-to-sequence functions. Building on this universality, we establish a composition theorem: by concatenating prompts associated with simple ``subprograms,'' the same fixed Transformer executes their composition, and thereby synthesizes more complex programs on-the-fly. These results support a principled view of prompts as programs and fixed-weight Transformers as program interpreters. Moreover, we provide a concrete mechanism by which GPT-style models both execute and assemble algorithms in context.

Abstract:
We introduce LazyStack, a method for efficient model ensemble inference. The core idea is intuitive: after each model executes, we check whether accumulated evidence is sufficient to exit confidently. Sometimes one model suffices; other times we aggregate predictions from several models via trained metalearners before reaching confidence. Two insights make this work. First, most inputs follow only 3 to 8 execution trajectories. This reduces the training problem from exponential to linear: we learn aggregators only for these common paths, not all possible model combinations. Second, we formulate trajectory selection as an MDP and use value iteration to compute the optimal routing policy, which reveals counterintuitive model orderings. On intrusion detection, starting with a moderately expensive model outperforms starting with the cheapest, because its higher confidence enables earlier overall exit. Across vision, text, tabular, and LLM tasks, we achieve up to 38x speedup at 97%+ accuracy retention compared to a complete ensemble. The result: ensemble-quality predictions at cascade-level cost.

Abstract:
Submodular maximization subject to a knapsack constraint (SMK) is a fundamental problem in discrete optimization, with wideranging applications in machine learning and related fields. As these applications increasingly involve sensitive individual data, there is a growing need for high-utility algorithms that provide formal privacy guarantees. In this work, we study the SMK problem under differential privacy, considering both monotone and non-monotone objective functions. For monotone objectives, we propose a differentially private algorithm that achieves the optimal (1-1/e)-approximation ratio while significantly improving both additive error and query complexity over prior work. We also present a more efficient variant attaining a 1/2-approximation. For non-monotone objectives, we introduce, to our knowledge, the first differentially private algorithm with provable guarantees, achieving a 1/4-approximation in expectation and an additive error comparable to the best known for monotone objectives.

Abstract:
In many prediction problems, we have extra information during training (for example, measurements that are expensive or slow to collect) that will not be available when the model is deployed. A common strategy is to first train a model that uses all training information, then use its predictions on unlabeled examples to train a second model that only uses the inputs available at test time. However, when the extra trainingonly information is weak or noisy, this two-step approach can mislead the deployable model and even hurt accuracy. We propose a joint training method that learns the two models together, so the deployable model can benefit from the extra information only when it actually helps, instead of inheriting its mistakes. We provide guarantees that describe when joint training improves prediction accuracy and analyze a simple alternating training algorithm for large, high-dimensional models. Experiments on synthetic data and medical prediction tasks show that our approach avoids these failures and consistently outperforms standard two-step baselines.

Abstract:
Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods perform spatial reasoning in a single inference pass, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Rereason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench demonstrate that ReRe consistently boosts open-source MLLMs to rival proprietary state-of-the-art performance.

Abstract:
Understanding and predicting how social beliefs evolve in response to events—from policy changes to scientific breakthroughs—remains a fundamental challenge in social science. Given LLMs’ commonsense knowledge and social intelligence, we ask:Can LLMs model the dynamics of social beliefs following social events?In this work, we introduce the concept of the Social World Model (SWM), a general framework designed to capture how social beliefs evolve in response to major events. SWM learns statetransition functions for social beliefs by mining temporal patterns in social data and optimizing evidence lower bound, without the need for explicit human annotations that link events to belief shifts or expensive census data. To evaluate SWM, we introduce a benchmark, SWM-Bench, derived from real-world prediction market data from both Kalshi and Polymarket. SWM-Bench includes over 10k datapoints for social belief prediction tasks spanning diverse domains such as politics, sports, cryptocurrency, and elections. Our experimental results show that SWM significantly outperforms time-series foundation models, achieving RMSE reductions of 8.4% and 11.2% on Polymarket and Kalshi respectively, while offering interpretable insights into the underlying mechanisms of social belief dynamics.

Abstract:
Causal softmax attention is the algorithmic foundation of modern large language models. Inspired by linear attention, recent work has sought to enhance it by incorporating forget gates. However, these efforts, such as FoX, have been limited to coarse, scalar gates. While finegrained, element-wise gates are shown to be more effective than scalar ones in linear attention, their direct integration into softmax attention is non-trivial due to algebraic constraints. In this work, we introduceFine-grained Forgetting Transformer(FiX), a novel architecture that successfully enables element-wise forget gates in softmax attention. Our core insight is that the softmax denominator becomes mathematically redundant under a subsequent RMSNorm layer, allowing us to reformulate the forgetting mechanism as a direct element-wise multiplication on the value vectors. This formulation makes FiX the first positional encoding applied to value-output (VO) pairs, designed to be complementary to existing query-key (QK) encodings like RoPE. We systematically address implementation challenges including numerical precision, computational efficiency, and inference memory consumption. Extensive experiments showFiXachieves lower training loss and superior performance on both short-text common sense benchmarks and long-context tasks, opening a new path for building more powerful transformers.

Abstract:
Discovering heterogeneous catalysts tailored for specific reaction intermediates remains a fundamental bottleneck in materials science. While traditional trialand-error methods and recent generative models have shown promise, they struggle to capture the intrinsic coupling between surface geometry and adsorbate interactions. To address this limitation, we propose CatFlow, a flow matching-based framework for de novo design and structure prediction of heterogeneous catalysts. Our model operates on a primitive cell-based factorized representation of the slab-adsorbate complex, reducing the number of learnable variables by an average of 9.2x while explicitly encoding the surface orientation of the slab-adsorbate interface. Experiments on the Open Catalyst 2020 dataset demonstrate that CatFlow significantly improves the structural fidelity of generated catalysts compared to autoregressive and sequential baselines. Further experiments show that the generated structures accurately capture the adsorption energy distributions of physically plausible interfaces and lie closer to thermodynamic local minima.

Abstract:
Accurate survival modeling in biomedical studies is often hindered by rare events, limited effective sample sizes, and settings with limited or partially observed information (e.g., covariates of interest that are difficult or expensive to collect, highlystructured sampling designs, or nuisance parameters omitted by conditioning). Knowledge distillation can leverage external predictive information without sharing individual-level data, but existing approaches are largely built for fully specified likelihoods or probability-based survival models and do not extend to settings where outcome distributions are only partially specified. To address this challenge, we propose a knowledge distillation framework based on a composite-likelihood Kullback–Leibler divergence that aligns teacher and student models within components. Our key insight is that, although composite likelihoods do not define a global outcome distribution, each likelihood component induces a well-defined probability model on its restricted outcome space, enabling a principled KL divergence. Simulation studies and biomedical case studies show improved discrimination and estimation efficiency in rare-event, heterogeneous settings without requiring access to external individual-level data.

Abstract:
Measuring statistical dependency between highdimensional random variables is a fundamental task in data science and machine learning. Neural mutual information (MI) estimators offer a promising avenue, but they typically require costly test-time training for each new dataset, making them impractical for real-time applications. We present InfoAtlas, a foundation model-like architecture that eliminates this bottleneck by directly inferring MI in a single forward pass. Pretrained on large-scale synthetic data with rich dependence patterns, InfoAtlas learns to identify diverse dependence structures and predict MI directly from the dataset. Comprehensive experiments demonstrate that InfoAtlas matches state-of-the-art neural estimators in accuracy while achieving 100× speedup, can flexibly handle varying dimensions and sample sizes through a single unified model, and generalizes effectively to complex, real-world scenarios. By reformulating MI estimation from an optimization problem to an inference task, InfoAtlas establishes a foundation for real-time dependency analysis.

Abstract:
This paper studies the problem of recovering a hidden vertex correspondence between two correlated graphs when both edge weights and node features are observed. While most existing work on graph alignment relies primarily on edge information, many realworld applications provide informative node features in addition to graph topology. To capture this setting, we introduce the featured correlated Gaussian Wigner model, where two graphs are coupled through an unknown vertex permutation, and the node features are correlated under the same permutation. We characterize the optimal information-theoretic thresholds for exact recovery and partial recovery of the latent mapping. On the algorithmic side, we propose QPAlign, an efficient method based on a quadratic programming relaxation, and demonstrate its strong empirical performance on both synthetic and real datasets. Moreover, we also derive theoretical guarantees for the proposed procedure, supporting its reliability and providing convergence guarantees.

Abstract:
This paper investigates the problem of generating synthetic tabular data with differential privacy (DP) guarantees, enabling data sharing in sensitive domains. Despite extensive study, stateof-the-art methods often focus on minimizing low-order marginal query errors and overlook the challenges posed by high-order correlations. To address this gap, we extend the Private Evolution (PE) framework, originally developed for DP-compliant image and text synthesis, to tabular data. We introduce Tab-PE -- an algorithm for synthetic tabular data generation under DP constraints. Tab-PE iteratively improves a candidate dataset via an evolutionary process that leverages tabular-specialized operators to produce variations, privately scores them, and selects the highest-quality samples to retain and propagate. In contrast to the original PE, which relies on large foundation models, Tab-PE employs heuristic operators with significantly lower computational costs, makes PE more practical and scalable for tabular data. Through extensive experiments on real-world and simulation datasets, we demonstrate that Tab-PE substantially outperforms prior baselines on datasets exhibiting high-order correlations. Compared to the best baseline -- AIM, Tab-PE improves classification accuracy by up to 10% while running 28× faster.

Abstract:
RetrievalAugmented Generation (RAG) increasingly relies on hierarchical indexing, yet existing frameworks are bottlenecked by the high cost and information loss of recursive, LLM-based summarization. We propose SeRAG, a novel token-free hierarchical indexing framework that replaces textual summaries with an information-theoretic knowledge taxonomy. SeRAG first transforms a corpus into a multi-perspective graph capturing semantic, logical, and sequential dependencies, then minimizes structural entropy to induce a topologically-faithful encoding tree. To bridge the gap between abstract themes and granular facts, we introduce localized structural weight-based vector aggregation for token-free community consolidation. Extensive experiments demonstrate that SeRAG significantly reduces indexing overhead while outperforming state-of-the-art methods in complex multi-hop reasoning tasks.

Abstract:
Incremental Learning (IL) for Openended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we proposeEfficientContinualAlignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks. To address this, ECA employs three core mechanisms: aMixtureofQuery (MoQ) module that adapts task-specific query tokens, aFisherDynamicExpansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary withDictionaryReplay (DR) to retain past knowledge. To evaluate ECA's performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Benchmarks are available at https://anonymous.4open.science/r/ECA-ToS-Benchmarks-FB17.

Abstract:
A promising recipe towards intelligent robotic decisionmaking is the finetuning of pretrained generative control policies, which can summarize offline experience effectively through behavior cloning, with reinforcement learning techniques to adapt them to online experience. In this work we present Diffusion Filtered Exploration via Ensembles (DF-ExpEnse), an exploration technique that meaningfully improves the quality of online experience collection, thus increasing the sample efficiency of the finetuning procedure. DF-ExpEnse first leverages the multimodal modeling capability of the generative control policy to create an expressive and tractably evaluatable candidate set. Then, it utilizes an ensemble of critics to identify an action with high exploration interest that best balances quality with uncertainty. When instantiated in a parallelized fleet, DF-ExpEnse further utilizes cross-agent communication to facilitate collaborative exploration as a group. As it is only used for online experience collection, DF-ExpEnse can be seamlessly integrated on top of existing techniques that seek to finetune pretrained generative control policies via reinforcement learning. We experimentally validate consistent sample-efficiency benefits when using DF-ExpEnse for exploration over both manipulation and locomotion tasks, compared to default finetuning and alternative action selection schemes.

Abstract:
Physicsbased motion imitation is central to humanoid control, yet current evaluation metrics(e.g., MPJPE) only quantify imitation outcomes, not their underlying causes. This conflation obscures a critical diagnostic question: when imitation error occurs, does it stem from policy limitations or the intrinsic learning difficulty of the target motion? To resolve this ambiguity, we propose the Torque Variation Score (TVS), a physics-grounded metric that quantifies the inherent learning difficulty of a motion independently of any policy's performance. TVS measures the magnitude of torque variation required to correct small pose perturbations, directly capturing how dynamical properties shape the reinforcement learning landscape. We establish that high-TV motions induce flat reward landscapes and vanishing policy gradients, explaining persistent imitation failures. Extensive experiments with state-of-the-art methods (UHC, PHC+) confirm TVS strongly correlates with imitation error and enables principled error attribution: high error on low-TV motions indicates policy deficiency, while high error on high-TV motions reflects fundamental learning constraints. Beyond error diagnosis, TVS facilitates three practical applications: Maximum Imitable Difficulty (MID) for policy capability assessment, Difficulty-Stratified Joint Error (DSJE) for granular performance profiling, and Flawed Motion Detection for identifying segments with abnormally high learning difficulty to support mocap data curation and quality control. TVS provides a rigorous lens to distinguish policy-induced errors from motion-inherent challenges and enhances motion dataset reliability.

Abstract:
We study a cooperative form of multiagent reinforcement learning with state space dynamics and agent interaction controlled by an underlying graph. Each agent has a local state and action, the evolution of the local state depends only on the states and actions in the 1-hop neighborhood defined by graph. Structured dynamics of this type arise in various applications, including network resource allocation, co-operative games, epidemic control, and wireless scheduling. The global state-action space scales exponentially in the number of agents, so that computing global optimal policies is intractable in the worst-case. We study conditions under which it is possible to approximate the optimal policies by an m-hop local policy for each agent, depending only on its m-hop neighborhood. By controlling the propagation of influences via a Dobrushin-type stability matrix, we establish that global optimal policies are sharply approximated by m-hop local policies whose sub-optimality gap decays exponentially in m.

Abstract:
Highfidelity 3D generation remains difficult. Although some methods have proposed converting raw meshes to SDFs, it remains a lossy process. TripoSF presented a VAE training paradigm based on a rendering loss to circumvent this lossy SDF conversion, achieving high-precision surface reconstruction. However, because the rendering loss cannot supervise all the VAE outputs in the same way as SDF supervision, it limits detail and scalability.We present Focusing, a 3D VAE that improves efficiency by activating only the voxels that matter for a given view. Our key idea is a depth-driven voxel carving performed in the structured latent space: voxels inconsistent with the rendered depth are pruned before decoding. This concentrates learning on locally relevant geometry, reduces attention and decoding costs, and lowers video random access memory (VRAM) usage. To stabilize training and capture fine details, we further introduce an adaptive zooming strategy that adjusts camera intrinsics to keep the number of active voxels within a target range. The VAE is trained with a render-based loss on depth, normals, masks, and perceptual terms, and we add simple regularizers (e.g., sparse-voxel TV and a short warm-up with TSDF supervision) to reduce small holes and speed up convergence. Across standard reconstruction benchmarks, Focusing improves geometric accuracy (CD, F-score) over strong baselines while cutting VRAM consumption, which allows for training the resolution VAE on as little as 50GB of VRAM. These results show that local, view-consistent sparsity is an effective route to higher-resolution, more efficient 3D VAEs.

Abstract:
Multimodal alignment fails under deployment shift because standard objectives entangle crossmodal dependence with marginal-sensitive geometry. Stable Spectral Copula Alignment (SSCA) provides a deployment protocol targeting copula-stable dependence under strictly monotone marginal distortions, with auditable, label-free diagnostics for monitoring and mitigation. SSCA combines (i) clipped soft-rank Gaussianization suppressing marginal effects while tracking tie/approximation errors, (ii) dependence-weighted sliced Wasserstein hub coupling for globally coherent multiway alignment with cycle auditing, and (iii) diagonal-stabilized block-spectral learning with eigengap-normalized Davis--Kahan diagnostics, yielding an actionable subspace-risk inequality. A calibrated gate maps diagnostic proxies to a reliability signal with controlled false-alarm/miss rates, enabling safe activation, budgeted remediation, and conservative fallback for out-of-scope drift. Evaluations on MOSEI/MELD, MSCOCO, and CC3M-500K demonstrate improved performance under perturbation and substantially reduced degradation under both monotone and realistic drifts.

Abstract:
Biomedical regression tasks require predicting continuous targets from heterogeneous and unstructured evidence. While Large Language Models (LLMs) provide a robust interface for reasoning over mixed modalities, they are inherently limited by their discrete tokenization and crossentropy objectives, which lack awareness of numerical proximity. To bridge this gap, we present SPR-RAFT, a parameter-efficient and regression-aware framework that adapts frozen LLMs for high-precision regression. SPR-RAFT introduces a dual-module architecture: a learnable soft prompt that conditions the LLM to route numerical reasoning into a specific latent state, and a lightweight \texttt[REG]-anchored head for numerical reasoning consolidation. Crucially, we align these two modalities via a hybrid objective that combines distribution-based text generation with representation-based robust regression. This ensures the model remains both semantically coherent and numerically calibrated. With only ～0.04% trainable parameters, SPR-RAFT consistently outperforms prompting strategies, standard fine-tuning, and non-LLM baselines across diverse biomedical benchmarks, including clinical trial duration, biological age estimation, and molecular property prediction.

Abstract:
Spiking Neural Networks (SNNs) offer lowpower, brain-inspired computation, but Transformer-based SNNs face deployment challenges on neuromorphic hardware due to complex operations and high communication overhead. We propose hardware-efficient attention mechanisms, Simplified Spiking Attention (SSA) and Ultra-Simplified Spiking Attention (USSA), which replace matrix multiplications with Hadamard products and remove multi-head attention, scaling, and patching. We theoretically show that double masking is redundant and early-spiking gating preserves richer temporal information than late-spiking. On the CIFAR‑10, CIFAR‑100, and DVS‑Gesture datasets, SSA achieves accuracies of 96.38%, 79.45%, and 97.56%, respectively, outperforming baseline Transformer‑SNNs by up to +1.73%, while reducing computational complexity from \mathcalO(N^2D) to \mathcalO(ND) and communication complexity from \mathcalO(N^2) to \mathcalO(ND). USSA further compresses communication complexity to \mathcalO(N) with only marginal accuracy loss.

Abstract:
Latent Factorizationof-Tensors (LFT) models are an effective approach for relation extraction. Existing LFT models assume each mode of the target tensor corresponds to a entity set and the relationships between entity sets are bipartite graphs to explore the relationships among entities within a mode. However, when the topological structure of entities in a mode is known, for example, entities are ordered physical quantities, such as time or coordinates, the relation between such modes forms a more complicated system, i.e., aligned bipartite networks, and existing LFT models cannot accurately capture this structure. This work is the first to recognize and analyze this issue, and proposes an Asymmetric Latent Factorization-of-Tensors (ALFT) model to address it. ALFT can understand aligned bipartite networks in mode pairs of a tensor by imposing constraints between particular mode pairs in the tensor network. Experimental results on real-world datasets demonstrate the existence of this issue and confirm that the proposed ALFT model can effectively resolve it.

Abstract:
Machine learning is transforming molecular sciences by accelerating property prediction, simulation, and the discovery of new molecules and materials. Acquiring labeled data in these domains is often costly and timeconsuming, whereas large collections of unlabeled molecular data are readily available. Standard semi-supervised learning methods often rely on label-preserving augmentations, which are challenging to design in the molecular domain, where minor changes can drastically alter properties. In this work, we show that semi-supervised methods that rely on an ensemble consensus can boost predictive accuracy across a diverse range of molecular datasets, task types, and graph neural network architectures. We find that training with an ensemble consensus objective increases robustness in models and exhibit an effect similar to knowledge distillation; an individual member of an ensemble trained this way outperforms a full ensemble trained in a traditional supervised fashion in almost all cases. In addition, this type of semi-supervised training reduces calibration error.

Abstract:
Benchmarks are widely used to evaluate and compare the performance of artificial intelligence systems. However, some approaches to computing benchmark metrics produce invalid uncertainty estimates or make unrecognized assumptions about the evaluation setting. We leverage statistical modeling to make two contributions to the practice of AI benchmarking. First, we formally distinguish measurements of benchmark accuracy from generalized accuracy (performance on all potential test items similar to those included in the benchmark). Then, in a simulated setting and with largescale evaluation of 22 API-access frontier large language models on 3 popular benchmarks, we show how analysis via generalized linear mixed model can estimate generalized accuracy while more efficiently quantifying uncertainty compared to existing regression-free approaches. We also show how this approach can equip evaluators with important context on evaluation results, including variance decomposition and item difficulty estimates that illuminate important aspects of LLM performance and benchmark construction.

Abstract:
Multimodal models like CLIP have gained significant attention due to their remarkable zeroshot performance across various tasks. However, studies have revealed that CLIP can inadvertently learn spurious associations between target variables and confounding factors. To address this, we introduce \textscLocate-Then-Correct (LTC), a contrastive framework that identifies spurious attention heads in Vision Transformers via mechanistic insights and mitigates them through targeted ablation. Furthermore, LTC identifies salient, task-relevant attention heads, enabling the integration of discriminative features through orthogonal projection to improve classification performance. We evaluate LTC on benchmarks with inherent background and gender biases, achieving over a > 50% gain in worst-group accuracy compared to non-training post-hoc baselines. Additionally, we visualize the representation of selected heads and find that the presented interpretation corroborates our contrastive mechanism for identifying both spurious and salient attention heads.

Abstract:
Biomolecules such as proteins and smallmolecule ligands play a central role in biological systems, arising from the tight interplay between sequence and three-dimensional structure. Recent generative models for biomolecular co-design aim to capture this interplay by jointly modeling coupled modalities. However, existing approaches largely adopt a parallel execution of marginal generative processes, implicitly enforcing fixed synchronous coupling. We argue that a critical but overlooked degree of freedom lies in how these marginal processes are \emphtemporally coupled during training and generation, where inappropriate coupling can introduce high-variance supervision and inconsistent intermediate states, affecting modality consistency. To address this, we introduce GeoCoupling, a systematic framework that optimizes for temporal couplings between heterogeneous modalities. Empirical results across structure-based drug design and unconditional protein design demonstrate the learned couplings consistently outperform synchronous and randomly coupled baselines, yielding biomolecules with improved physical validity and diversity.

Abstract:
Cosine similarity estimation is a core primitive in coarseto-fine retrieval pipelines, where early-stage candidate selection relies on approximate similarity estimates whose errors are amplified downstream. Widely used sign-based sketches arising from extreme quantization of random projections exhibit a structural variance peak near \theta \approx 90^\circ, the near-background region where candidate selection is most difficult. We propose QA-Cos, a query-aware decoder-side estimator that departs from the Hamming-agreement paradigm, treating sign bits as probabilistic observations rather than deterministic votes. Across simulations and BEIR benchmarks, QA-Cos reduces estimation error by up to ～15--20% in the near-orthogonal region and translates these gains into improved candidate selection in two-stage ANN pipelines, improving Hit@K by up to ～30 percentage points at fixed budgets and reducing candidates by up to ～45--50% at fixed recall.

Abstract:
Densitybased anomaly detection methods often provide accurate and interpretable predictions but their performance can be severely degraded by the inherent noise of data, such as changes arising from environmental conditions during data collection or background noise. To deal with such noise, we present noise-robust density estimation (NRDE) for tabular data anomaly detection. We aim to estimate the density of pure data with the influence of noises isolated, which is a non-trivial task since the data-generating process is completely unknown. Specifically, NRDE learns a Jacobian-regularized normalizing flow to estimate the sources of data and categorizes sources into two groups, where one group generates pure data and the other generates noise. After generating pure data, we can use the density of such pure data to detect anomalies caused by the sources of pure data solely. Therefore, NRDE is robust to inherent noise. We provide theoretical results to support the effectiveness of NRDE and compare NRDE with 17 baselines on 47 benchmark datasets under different settings, including vanilla anomaly detection, anomaly detection with anomaly contamination, anomaly detection on noisy data, and transductive outlier detection.

Abstract:
The diversity of robot embodiments and action spaces makes it challenging to build robot world models that generalize across different embodiments. We introduce a Latent Action Conditioned Robot World Model (LACWM), which operates within a learned unified latent action space shared across diverse embodiments. We show how this unified action space improves the world model’s performance when adapted to previously unseen robot embodiments. We compare LAC-WM to a baseline model, Explicit Action Conditioned World Model (EAC-WM) conditioned on explicit motion labels. Our results show that conditioning on explicit labels creates disjoint action spaces across embodiments, limiting downstream task performance when adapting to new robots. We evaluate both models on a dexterous manipulation task. The latent action-conditioned model LAC-WM achieves up to a 46.7% improvement in performance over EAC-WM. Crucially, the unified latent action space allows LAC-WM’s downstream performance to scale positively with the number of embodiments used during pretraining. In contrast, the disjoint action space in EAC-WM leads to decreased performance as the number of pretraining embodiments increases. These results highlights the importance of a unified action space for efficient cross-embodiment learning, addressing a key challenge in robotics.

Abstract:
Prompt optimization is critical for maximizing the performance of large language models (LLMs). However, it often relies on costly labeled data. Selfsupervised methods reduce data dependency, but they suffer from optimization ambiguity or high computational costs. To address these limitations, we propose the Meta-Reasoning Prompt Engineering Agent (MR.PEA), a self-supervised prompt optimization framework that operates with minimal input. MR.PEA leverages meta-reasoning to iteratively build task-specific knowledge, including problem-solving strategies and evaluation criteria, while adaptively retrieving external information to enhance its understanding. This knowledge guides the generation of diverse validation examples, targeted prompt refinement, and comprehensive quality assessments. Experiments on GSM8K and Big-Bench Hard show that MR.PEA outperforms existing baselines, achieving an average performance gain of 7.4% with an optimization cost as low as 0.01 per task.

Abstract:
Segment Anything Models (SAMs) often struggle with certain specialized tasks. A common approach is to finetune models with specific task labels, but this often leads to overfitting, introduces model bias and significantly degrades their generalization ability. To overcome these challenges, we propose SHERPA, a novel framework that leverages a smaller SAM to guide the fine-tuning of a larger SAM via task-relevant features. Specifically, we first leverage the Fisher Ratio Separation (FRS) module to separate high task-relevant features and preserve the ability of the large SAM to perform other general tasks. Then, the Guiding Feature Extraction (GFE) module is used to extract representative guiding features from the fine-tuned small SAMs. We leverage small SAMs tailored for specific tasks (including natural image segmentation, biomedical image segmentation, and video object segmentation) as guidance and then evaluate the SHERPA scheme to fine-tune larger SAM series models. Our experiments demonstrate that SHERPA enhances the retention of generalization ability across those diverse tasks, by up to 11.1%, and improves specific task performance by up to 2.2%.

Abstract:
Malware detection remains largely reactive: machine learning models trained on known samples degrade as threats evolve. Understanding evolutionary relationships among malware families can inform proactive defense, but traditional reverse engineering can take months to years to uncover such lineage relationships. We propose MalTree, a framework that applies bioinformaticsinspired phylogenetic techniques (UPGMA and Neighbor-Joining) at scale to model malware evolution automatically using structural, behavioral, and image-based features. We introduce temporal validation using VirusTotal timestamps to assess whether inferred trees reflect actual evolutionary order. MalTree achieves 87% temporal consistency, indicating that inferred evolutionary relationships closely align with real-world emergence timelines. Our analysis shows that some families mutate over 10 times faster than others, suggesting that detection strategies should be tailored to family-specific evolutionary tempos. Case studies, including the Mirai botnet, confirm that inferred relationships from our phylogenetic tree align with documented threat intelligence. Our framework provides a foundation for shifting malware analysis from sample-by-sample classification toward lineage-aware evolutionary modeling.

Abstract:
In batch multiobjective Bayesian optimization (MOBO), it is often desirable to identify the whole Pareto optimal set, especially when considering the complicated interplay between different design criteria and constraints. This poses unique challenges in acquiring batches of both high quality and diversity to cover the Pareto front. We propose a novel acquisition strategy, Probability of Matching (POM), which evaluates both batch candidate quality and diversity by explicitly capturing the likelihood that all batch points are Pareto optimal, and the probability that they collectively cover the full Pareto set. To estimate the coverage probability and promote diversity, we incorporate non-replacement sampling principles, resulting in our new POM-guided batch MOBO method. Across synthetic benchmarks and real-world tasks, our method consistently outperforms state-of-the-art baselines on standard MOBO metrics as well as a new design-space coverage metric, Expected Minimum Distance (EMD), with comparable computational efficiency.

Abstract:
Training LLMs as interactive agents for multiturn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) methods model LLM agents asflatpolicies operating at a single time scale, selecting an action at each turn. In sparse-reward settings, this forces the agent to infer long-range dependencies solely from distant end-of-trajectory signals, often leading to inefficient learning and unstable behavior in complex environments. We proposeHiPER, a novelHierarchicalPlan–ExecuteRL framework that jointly models and optimizes high-level subgoal planning and low-level action execution for LLM agents to overcome flat RL's brittle long-horizon behavior and weak credit assignment under sparse outcome feedback. By maintaining persistent subgoals across multiple turns and explicitly deciding when to switch between them, HiPER introduces structured intermediate decision points that facilitate learning under sparse feedback, converting implicit multi-turn structure into learnable decisions at different time scales. To enable effective training, we introduce Hierarchical Advantage Estimation (HAE), a two-timescale policy gradient method that assigns credit to both action execution and subgoal transitions and achieves variance reduction relative to flat advantage estimation. Empirically, HiPER achieves state-of-the-art performance on challenging interactive benchmarks, reaching 97.4% success on ALFWorld (+6.6% over the best prior method) and 83.3% on WebShop, with especially large gains on long-horizon tasks requiring multiple dependent subtasks. These results highlight the importance of explicit hierarchical decomposition for scalable RL training of multi-turn LLM agents.

Abstract:
Efficient prediction and planning in structured environments often relies on spectral decompositions of transition operators, yet existing gridcell and successor-representation theories implicitly assume “flat” action geometry where translations commute and a single Fourier eigenbasis suffices. We show that this assumption breaks in the presence of path-dependent effects—e.g., circulation, rotational drift, or topological loops—whose defining signature is nontrivial holonomy. We introduce a theory of directed-action prediction on discrete tori based on twisted translation operators forming a projective representation of the underlying motion group, and prove that the resulting controlled Markov operators admit an exact block-diagonalisation under a twisted Fourier transform: actions share a universal harmonic basis while their effects appear as small matrix-valued spectra rather than scalar eigenvalues. This yields closed-form resolvent expressions for the successor representation, a gauge-invariant transfer principle characterising when two environments admit identical predictive structure, and a curvature-induced lower bound showing that nonzero holonomy provably necessitates internal representational dimension. Together, these results generalise Fourier/grid-based prediction from commutative to curved action geometries, providing a principled foundation for generalisation under directed actions without learning environment-specific eigenvectors.

Abstract:
Fair regression methods typically rely on squared error loss, making them fragile under heavy tailed noise. We propose a general framework for robust regression under demographic parity (DP) that applies to a wide class of Mestimators, including Cauchy, Huber, least absolute deviation, quantile, and Tukey losses. We propose an optimal fair transformation that guarantees DP while achieving the minimum population risk among all rank preserving fair predictors. We also establish convergence rates for the resulting estimators. To balance fairness and predictive accuracy, we develop an interpolation scheme whose risk decreases while unfairness grows linearly with the interpolation parameter. The proposed framework can be further extended to conditional DP to account for legitimate covariates. Extensive simulation studies and real data applications show clear improvements over existing fair regression approaches in both robustness and predictive performance.

Abstract:
Recent work has shown that the computations of Transformers can be simulated in the RASP family of programming languages. These findings have enabled improved understanding of the expressive capacity and generalization abilities of Transformers. In particular, Transformers have been suggested to lengthgeneralize exactly on problems that have simple RASP programs. However, it remains open whether trained models actually implement simple interpretable programs. In this paper, we present a general method to extract such programs from trained Transformers. The idea is to faithfully re-parameterize a Transformer as a RASP program and then apply causal interventions to discover a small sufficient sub-program. In experiments on small Transformers trained on algorithmic and formal language tasks, we show that our method often recovers simple and interpretable RASP programs from length-generalizing transformers. Our results provide the most direct evidence so far that Transformers internally implement simple RASP programs.

Abstract:
TestTime Adaptation (TTA) aims to reconcile model generalization in the presence of distribution shifts. Current TTA methods usually leverage sample uncertainty to select reliable samples for model adjustment via entropy minimization (EM). However, sample uncertainty often relies on a plausible metric and leaves many unreliable samples into EM, potentially leading to model collapse. Importantly, these excluded samples incur incomplete data features of the shifted distribution in TTA. This paper introduces SaTeen, aStructuralAlignment-basedTest-TimeAdaptation(SaTeen) method, by two-fold aligning the structures of test samples with the reliable reference structures. Specifically, the two-fold alignments are 1) Intra-sample structure alignment, where SeTeen maximizes cross-entropy discrepancy between a sample (reference) and its structure-disrupted counterpart, with the assumption of stable dominant features; 2) Inter-sample structure alignment, where SeTeen minimizes the reconstruction error of test samples in the reference subspace spanned by the Incremental PCA (IPCA) on reliable samples, with the assumption of stale intrinsic data manifold. Our extensive experiments demonstrate SaTeen achieves the state-of-the-art performance across various scenarios for both TTA and continual TTA.

Abstract:
Graph classincremental learning (GCIL) has emerged to address the challenge of learning from dynamically evolving graphs, which continuously learns new classes over a sequence of tasks while retaining performance on previously seen classes. However, existing GCIL methods assume a closed-set test distribution drawn only from seen tasks. This fundamentally contradicts real-world open-ended scenarios where future unknown classes inevitably emerge. Empirically, we observe that existing GCIL methods falter in such open-set settings due to severe representation drift and generalized overconfidence. To bridge this gap, we investigate the Open-Set GCIL problem and propose SAFER (\underlineSubspace-\underlineAware \underlineFEature \underlineReshaping), a novel framework that endows GCIL with intrinsic open-set capabilities under a replay-free constraint. Specifically, SAFER performs subspace-aware feature reshaping with drift-free fingerprints, unifying task routing and open-set rejection into a single energy-based metric. Furthermore, we introduce a geometric space-consistency regularization that explicitly improves intra-class compactness and suppresses cross-task representation drift. Extensive experiments on four benchmarks demonstrate that SAFER outperforms state-of-the-art baselines by margins of up to 5.2% in accuracy and 31.3% in open-set AUROC, all while maintaining near-zero forgetting under strict no-replay constraints.

Abstract:
Grokking, a sudden transition from memorization to generalization, has been closely linked to the emergence of lowdimensional representations; yet the mechanism driving this organization remains elusive. Here, we propose that intrinsic task symmetries are the key drivers of grokking, inducing structured geometries in representation space. Our analysis reveals a consistent three-stage training dynamic: (i) data memorization, (ii) intrinsic symmetry acquisition, and (iii) geometric organization. We show that generalization emerges during the symmetry acquisition phase, and subsequently the embedding space organizes into a low-dimensional structured geometry. We validate this mechanism across diverse algorithmic domains, spanning algebraic (modular arithmetic), structural (graph metric completion), and relational (comparison) reasoning tasks. Leveraging these insights, we formulate a symmetry-based criterion for generalization and propose symmetry- and geometry-prompting training strategies that can accelerate generalization. Together, our results establish intrinsic symmetry as a central mechanism enabling neural networks to move beyond memorization and achieve robust algorithmic reasoning.

Abstract:
Without incurring significant computational overhead, trainfree long video generation aims to enable foundation video generation models to produce longer videos. Frame-level autoregressive frameworks, e.g., FIFO-diffusion, offer the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, coupled with the challenge of maintaining long-term consistency, limits the effective utilization of foundation models. To mitigate these concerns, we propose MIGA, a novel infinite-frame long video generation method. Firstly, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. We then introduce an innovative dual consistency enhancement mechanism, where the self-reflection approach corrects early high-noise frames and the long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation, jointly improving temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA.

Abstract:
Computation overhead has emerged as a critical bottleneck in Federated Learning (FL). Coreset selection tackles this challenge by constructing an informative subset to represent the full dataset. However, existing approaches optimize coreset construction solely at the data level and enforce a uniform retention ratio across all clients, ignoring client heterogeneity and introducing detrimental fragmented clients. In this paper, we first introduce a gradient projection Shapley value (GPSV) to evaluate client contributions. GPSV captures both the directional and magnitude information of client updates and enables exact Shapley value calculation with \mathcalO(1) percoalition evaluation. Building on GPSV, we then propose ShapCCS, the first client-level coreset selection strategy for FL. ShapCCS prioritizes clients with high GPSV scores while excluding fragmented clients with negligible or even negative GPSV. As a client-level coreset selection strategy, ShapCCS can be integrated with a data-level selection approach, and additionally reduces communication costs, an advantage unattainable by data-level methods alone. Extensive experiments demonstrate the superiority of ShapCCS on model performance and robustness to noise. The code is available at https://anonymous.4open.science/r/ShapCCS-5CBB.

Abstract:
The fields of AIbased disease fingerprinting, drug discovery and repurposing are currently among the emerging frontiers of machine learning applied to medicine. One major challenge is to obtain robust in-silico modeling of disease progression while accounting for the vastly different time scales of biochemical interactions, from gene expression to protein abundance and metabolic flux. Discrete sequence models inadequately represent such multi-scale interactions, and standard Neural Ordinary Differential Equations (NODEs) often fail to train stably under stiffness (different time scales). To address this, in this paper a Tri-Scale Stiff NODE is introduced, defined by hierarchically coupled latent differential equations that model the causal flow from genes to proteins and metabolites, and optimized using reconstruction error and information-theoretic mutual information. This enables continuous-time modeling of cellular responses to identify not only disease dynamics, but also drug perturbations that act within narrow time windows, often invisible to discrete-time approaches. Lyapunov analysis provides a theoretical guarantee that the modeled trajectories remain stable and well-behaved even under extreme stiffness. The developed modeling methodology is tested upon a public dataset (STATegra B-cell differentiation) and utilized for a proof-of-concept drug repurposing pipeline.

Abstract:
This article addresses online variational estimation in parametric statespace models. We propose a new procedure for efficiently computing the evidence lower bound and its gradient in a streaming-data setting, where observations arrive sequentially. The algorithm allows for the simultaneous training of the model parameters and the distribution of the latent states given the observations. It is based on i.i.d. Monte Carlo sampling, coupled with a well-chosen deep architecture, enabling both computational efficiency and flexibility. The performance of the method is illustrated on both synthetic data and real-world air-quality data. The proposed approach is theoretically motivated by the existence of an asymptotic contrast function and the ergodicity of the underlying Markov chain, and applies more generally to the computation of additive expectations under posterior distributions in state-space models.

Abstract:
We propose an OTguided sampling framework for solving graph combinatorial optimization through exact multilinear relaxation. Graph combinatorial optimization problems can be written as quadratic unconstrained binary optimization(QUBO). Leveraging a classical result in combinatorial optimization, we obtain a continuous multi-linear relaxation of QUBO that is exact, in the sense that it preserves the optimal binary solutions. The challenge is that the resulting energy landscape is highly nonconvex. We address this by treating the objective as an energy function and optimizing via sampling from the induced Boltzmann distribution to escape poor local optima. Viewing sampling as transporting a simple reference distribution to the target distribution, we use optimal transport to characterize more efficient probability flow and derive a stochastic optimal control problem whose solution yields an optimal sampling dynamics. We parameterize the control policy with graph neural networks to approximate the optimal control. Experiments show improved solution quality and efficiency over strong combinatorial and learning-based baselines.

Abstract:
As Large Language Model (LLM) ecosystems grow, routing queries to the most suitable model in a diverse pool has become a critical strategy for building efficient and highperforming AI systems. A common approach is to train a supervised router; however, this requires vast, expensive human-annotated preference data and creates models that are notoriously brittle, failing to generalize when faced with inevitable distribution shifts in user queries. Consequently, developing robust, unsupervised routing methods that adapt without retraining is a crucial research frontier. This challenge is severely amplified by Large Reasoning Models (LRMs), which introduce a dual problem for any label-free method: their outputs have a causal “thinking”→“answer” structure that must be modeled, and a structural imbalance where long reasoning text can dominate the final answer signal. We introduce ReasoningRouter, a novel framework that resolves these issues with a length-balanced embedding strategy and a probabilistic model capturing the thinking-to-answer dependency. Our key theoretical advance, the Causal Triangulation Property, enables the label-free estimation of component qualities and their causal link. Beyond competitive routing accuracy, ReasoningRouter offers unprecedented insights into model behavior, enabling separate quality assessment of reasoning and answer components while maintaining computational efficiency. The code is provided in the supplementary materials.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visionlanguage understanding, yet they still struggle with fine-grained perception in high-resolution images. While existing training-free methods typically rely on attention-based localization or coarse-to-fine search, they are often misled by distractors and fail to locate multiple targets. To address these limitations, our investigation reveals two causes for failed localization: (a)Contextual Dominance, where salient distractors overwhelm target attention, leading to inaccurate localization, and (b) Semantic Bias, where aggregated global semantics cause the model to fixate on the most salient concept, resulting in incomplete localization under multi-object scenarios. Built on these insights, we propose ActiveScope, a training-free framework that enhances MLLMs by actively seeking and correcting perception. ActiveScope features two modules. The Semantic Anchor Localization (SAL) utilizes fine-grained semantics as anchors to independently localize key targets, thereby mitigating semantic bias. The Interference-Suppressed Refinement (ISR) refines localization by suppressing attention on salient distractions, effectively overcoming contextual dominance. Extensive experiments on high-resolution image understanding benchmarks demonstrate that ActiveScope outperforms existing training-free methods(e.g., 96.34% accuracy on V^ Bench), validating the superiority of the active search and self-correction paradigm.

Abstract:
Multimodal Large Language Models (MLLMs) have achieved strong performance on structured visual understanding tasks such as chart and document question answering. However, existing benchmarks typically evaluate these domains in isolation, overlooking realistic settings where numerical evidence in charts must be interpreted through surrounding narrative context. We introduce DocHop, a benchmark for integrated chartcontext reasoning in document-style images. In DocHop, the document narrative specifies multi-step compositional constraints, while charts provide the corresponding data values. Questions are grounded on a semantic reference label defined in the narrative, requiring models to resolve target entities from context before aggregating evidence across multiple charts. To enable systematic evaluation, we construct DocHop via a stochastic logic-first generation pipeline with controllable reasoning depth and visual density, covering 1,876 examples across six task categories. Experiments on a wide range of proprietary and open-sourced MLLMs show a substantial gap to human performance: annotators achieve over 90% accuracy, while the best model reaches only 60.18%. Reasoning-enhanced models consistently show improved results, but the performance degrades as reasoning complexity increases. Overall, DocHop provides testbed for challenging multi-hop document reasoning.

Abstract:
Video anomaly detection (VAD) with multimodal large language models has shown strong potential, yet most existing methods still depend on largescale annotations or expert-designed priors, limiting their ability to acquire anomaly knowledge with as little human intervention as possible. To address this, we propose Linguistic Relative Policy Optimization (LRPO), which distills group-relative semantic advantages from multiple reasoning trajectories into a linguistically expressed anomaly experience prior, and adapts the model by injecting this prior into the context to steer its output distribution without any parameter updates. LRPO builds two complementary experience representations: general experience captures transferable anomaly preferences across scenarios, while scenario experience models context-dependent anomaly rules for targeted refinement. To further improve the learned experience, we introduce an anomaly alignment reward that guides trajectory optimization to match human risk preferences and reinforce temporally grounded reasoning. Extensive experiments on XD-Violence, UCF-Crime, and UBNormal demonstrate that LRPO significantly outperforms existing state-of-the-art methods under tuning-free settings.

Abstract:
The Gaussian graphical model (GGM) is a useful tool to represent relationships of conditional dependence among variables. In many realworld applications, datasets often contain multiple related sub-populations, whose associated GGMs may have common structure, as well as large structural differences. In such cases, it is useful to recover common hub variables, which are the highly connected variables in the GGMs of all sub-populations. In this paper, we propose the Joint Inverse Components for Hub Detection (JIC-HD) method to recover the common hubs across multiple GGMs without the need to estimate all subpopulation GGMs. To this end, we introduce joint minimax eigenspaces, and show that these can be leveraged for the recovery of common hubs. We establish theoretical guarantees for the recovery of common hubs. Additionally, our numerical simulation studies confirm superior performance of our JIC-HD in detecting common hubs compared to the existing methods in the literature. Our method is especially advantageous when the multiple GGMs have both common and individual hubs across sub-populations. Finally, we analyze cancer gene-expression datasets and identify biologically meaningful common hub genes across cancer subtypes.

Abstract:
Large Vision–Language Models (LVLMs) have demonstrated remarkable capabilities across diverse multimodal tasks, yet their high inference costs necessitate lowbit deployment. Existing post-training quantization (PTQ) pipelines primarily adopt methodologies from text-only LLMs by treating multimodal inputs as homogeneous sequences, overlooking the heterogeneous information density inherent in LVLMs. In this work, we present ImpQuant, an importance-aware PTQ framework tailored for LVLMs that mitigates low-bit accuracy degradation via fine-grained token-importance reweighted calibration and outlier-aware activation quantization. Our key insight is that quantization errors on decision-critical tokens disproportionately impact overall model behavior. Accordingly, we reweight the calibration loss using aggregated attention for textual tokens and a contextual redundancy metric for visual tokens, respectively. Across multiple LVLM backbones and diverse multimodal benchmarks, our approach consistently improves accuracy at low bitwidth and reduces quantization-induced object hallucinations compared to state-of-the-art PTQ baselines.

Abstract:
Classical sample complexity bounds for causal structure learning are minimax in nature, characterizing worstcase difficulty without distinguishing between easy and hard instances. We study instance-specific complexity for Markov equivalence class (MEC) recovery in linear Gaussian structural equation models. We introduce the Fisher dimension, defined as the inverse squared minimum partial correlation that must be detected to recover the MEC. We prove that the Fisher dimension governs sample complexity: it provides both a lower bound and an upper bound (tight up to logarithmic factors) for MEC recovery. A key theoretical finding is that under spectrally well-conditioned models, with bounded noise variances, bounded covariance eigenvalues, and constant-order edge coefficients, the Fisher dimension is uniformly bounded regardless of graph structure. Thus, significant instance-specific variation arises from parametric rather than structural features. Empirical validation shows strong correlation between our predictor and observed sample complexity for structured graph families.

Abstract:
Photoshop is a professional image editing software whose complex multilevel menus, fine-grained operations, and layer-based non-destructive editing pose substantial challenges for automated agents. Existing GUI benchmarks and methods primarily target web interfaces and short-horizon, low-complexity tasks, falling short in modeling the multi-step decision-making and semantic understanding required by professional graphic software. We introduce PSBench, the first benchmark specifically designed for image editing in Adobe Photoshop, consisting of 600 human-annotated tasks across three difficulty levels, with tasks drawn from official tutorials and popular real-world workflows. PSBench covers core functionalities such as canvas adjustment, layer manipulation, and filter application, and provides fine-grained evaluation metrics tailored to each task category. Our experiments show that even the state-of-the-art system, Agent S3, achieves a success rate of only 18.09% on difficult tasks, indicating that GUI agents still face considerable challenges in operating complex professional software. Furthermore, human-in-the-loop evaluations reveal that MLLMs, when serving as interactive assistants, can significantly improve novice users’ task completion rates and reduce operation time.

Abstract:
Chainof-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models. However, when applied to multimodal LLMs (MLLMs), existing CoT approaches suffer from a fundamental limitation: they perform reasoning entirely in text without accessing visual features during the reasoning process. After initial visual encoding, image information becomes inaccessible, forcing models to reason based solely on whatever was captured in the initial description, which forms a ``vision-blind reasoning'' paradigm that limits fine-grained visual extraction, error verification, and adaptive attention. We propose Text-Visual Interleaved Chain-of-Thought (TVI-CoT), a framework that enables explicit interleaving of textual reasoning and visual feature access through learnable control tokens (\langle\textThink\rangle, \langle\textLook\rangle, \langle\textAnswer\rangle). These tokens allow dynamic switching between reasoning and visual grounding, attending to relevant image regions conditioned on the evolving reasoning state. Experiments on eight benchmarks demonstrate state-of-the-art results among MLLM-based CoT methods and notable performance boost compared to the baseline: +6.1% on MMMU, +3.8% on MathVerse, +3.4% on MathVista, and +3.4% on ScienceQA. Plentiful visualizations verify that TVI-CoT can perform stepwise reasoning with precise visual grounding abilities.

Abstract:
Recent Uniformstate Diffusion Models (USDMs), initialized from a uniform prior, offer the promise of fast text generation due to their inherent self-correction ability compared to masked diffusion models. However, they still rely on complex loss formulations with additional computational overhead, which hinders scalability. In this work, we explore a simplified denoising-based loss for USDMs that optimizes only noise-replaced tokens, stabilizing training while matching the performance of prior methods with more complex objectives. In addition, we introduce an efficient regularization term to mitigate corruption toward uniform output distributions, which further improves performance. We demonstrate the effectiveness and efficiency of our simple and improved loss formulations by pretraining models on widely used text datasets for USDMs. More importantly, our conclusions scale to larger models, showing strong potential for large-scale training.

Abstract:
Crossdevice Federated Learning (FL) is frequently bottlenecked by the prohibitive computational and communication costs of training deep neural networks on resource-constrained edge hardware. While federated dynamic pruning aims to alleviate these costs by adjusting sparse topologies during training, existing methods rely on magnitude-based heuristics that are fundamentally ill-suited for the non-convergent, heterogeneous environments inherent to FL. To address this challenge, we propose Fedfit, a federated dynamic framework that replaces simple heuristics with optimization-centric criteria for topology adjustment. By leveraging a second-order approximation of the loss landscape via the Fisher Information Matrix, Fedfit enables precise and efficient topology adjustment without the overhead of explicit Hessian computation. Empirical evaluations across computer vision and natural language processing benchmarks demonstrate that Fedfit significantly narrows the sparse-to-dense accuracy gap, outperforming state-of-the-art methods while maintaining high communication efficiency.

Abstract:
Large Language Models demonstrate remarkable syntactic fluency, yet the optimization dynamics governing their acquisition of deep semantic dependencies remain poorly understood. We propose a mechanistic framework that models this learning process as a competition between Surface Statistics and Deep Semantics. Our theoretical analysis identifies a ``Gradient Starvation" phenomenon where the error signals for sparse semantic dependencies are actively suppressed during early optimization. This suppression impedes the learning of structural reasoning and causes its emergence to manifest as a sudden phase transition. Furthermore, this framework offers a mechanistic basis for the effectiveness of Chainof-Thought (CoT) strategies. By externalizing intermediate reasoning steps into concrete tokens, CoT effectively bypasses the suppression regime inherent to implicit reasoning. We validate these findings across scales ranging from toy transformers to production models (Llama-3.1-8B, Qwen2.5-Coder-7B). Finally, guided by this theory, we propose a topology-aligned contrastive objective that explicitly rectifies the gradient geometry. Experiments on variable binding tasks demonstrate that our method achieves an improvement that is over 2× larger than that obtained via standard cross-entropy fine-tuning.

Abstract:
Deep learning (DL) methods have become the stateof-the-art for reconstructing sub-sampled magnetic resonance imaging (MRI) data. However, studies have shown that these methods are susceptible to small adversarial input perturbations, resulting in major distortions in the output images. Various strategies have been proposed to reduce the effects of these attacks, but they require retraining. In this work, we propose a novel approach for mitigating adversarial attacks on MRI reconstruction models without any retraining. Based on the idea of cyclic measurement consistency, we devise a novel mitigation objective that is minimized in a small ball around the attack input. Results show that our method substantially reduces the impact of adversarial perturbations across different datasets, attack types/strengths and PD-DL networks, and qualitatively and quantitatively outperforms conventional mitigation methods. We also introduce a practically relevant scenario for small adversarial perturbations that models impulse noise in raw data, which relates to herringbone artifacts, and show the applicability of our approach in this setting. Finally, we show our mitigation approach remains effective in two realistic extension scenarios: a blind setup, where the attack strength or algorithm is not known to the user; and an adaptive attack setup, where the attacker has full knowledge of the defense strategy.

Abstract:
We investigate which models admit a collapse of the expected information gain (EIG) and its derivative from a doubly intractable to a singly intractable expression. We prove that a sufficient condition is that the posterior distribution belongs to an exponential family (EF) that depends on the experimental design and data only through its natural parameters, and derive corresponding singly intractable and unbiased estimators for the \operatornameEIG and its (reparameterised) gradient. We further derive necessary conditions on the likelihood to obtain an EF posterior of the required form, showing that this does not necessarily require the prior to be conjugate. This is complemented by a theoretical analysis of certain degenerate behaviors that may arise when optimizing the \operatornameEIG for EFmodeled experiments. Finally, we empirically demonstrate the benefits of our singly intractable estimators, showing substantial performance gains over standard nested estimators.

Abstract:
While decoders have rapidly scaled, encoders have remained largely unchanged since BERT. We examine this disparity by revisiting evaluation through the lens of finetuning under frozen backbone and linear probing. As models scale, their representations become increasingly unexploitable by frozen probes, despite improved perplexity. This suggests a misalignment between direct token prediction and the learning of rich, versatile, easily extractable representations. Hence, we propose CrossBERT, a twopart architecture that separates the learning of high-quality encoded representations from the rigid grounding of token reconstruction. This design further enables high masking ratios (\ge 50%) and gradient collection over all token via a Complementary Masking Strategy, respectively increasing throughput by 1.5-2× and sample efficiency by 2×. Overall, CrossBERT demonstrates monotonic scaling and superior performance on MTEB(eng, v2) and frozen GLUE benchmarks.

Abstract:
Softmax attention is increasingly treated as a reusable inference primitive in transformers. Much prior theory covers Gaussian/linear models or assumes one shared quadratic querykey metric, which can fail for non-Gaussian exponential-family likelihoods with varying curvature. This risks overreading learned similarities as a global metric or as proof of Bayes-optimality. For discrete latent symbols with exponential-family observations, we show a single head can implement Bayes posteriors (and posterior means) by setting logits to log prior plus log likelihood, and we characterize single-head posteriors as exactly the log-linear (exponential-family) class. Using convex duality, we rewrite log-likelihoods as negative Bregman divergences on mean/sufficient-statistic space, making Bayes rule a soft nearest-neighbor computation; this yields a sharp boundary for globally shared quadratic metrics and a multi-head curvature-atlas approximation with head-count scaling, and we extend these guarantees to in-context estimation with consistency and finite-sample stability bounds. In synthetic Gaussian and Bernoulli in-context estimation, trained minimal attention models validate these predictions: performance approaches a Bayes-oracle baseline as prompt length grows; learned metrics align with noise precision in the Gaussian case, while Bernoulli retains a gap consistent with curvature variation. Together, these results explain when Fisher geometry should emerge, when a single metric is justified, and when multiple heads are necessary for Bayes-like in-context estimators beyond Gaussians.

Abstract:
We present Mesh Field Theory (MeshFT) and its neural realization, MeshFTNet: a structure-preserving framework for mesh-based continuum physics that cleanly separates the physics’ topological structure from its metric structure. Imposing minimal physical principles (locality, permutation equivariance, orientation covariance, and energy balance/dissipation inequality), we prove a reduction theorem for mesh-based physics. Under these conditions, the physical dynamics admit a local factorization into a port–Hamiltonian form: the conservative interconnection is fixed uniquely by mesh topology, whereas metric effects enter only through constitutive relations and dissipation. This reduction clarifies what must be fixed and what should be learned, directly informing MeshFT-Net’s design. Across evaluations on analytic and realistic datasets, physics-consistency tests, and out-of-distribution validation, MeshFT-Net achieves near-zero energy drift and strong physical fidelity (correct dispersion and momentum conservation) along with robust extrapolation and high data efficiency. By eliminating non-physical degrees of freedom and learning only metric-dependent structure, MeshFT provides a principled inductive bias for stable, faithful, and data-efficient learning-based physical simulation.

Abstract:
Large language models increasingly generate structured outputs, including citationgrounded summaries, multi-step reasoning chains, and tool-augmented responses, where correctness is inherently compositional: a single flawed claim can invalidate an otherwise accurate response. Existing certification methods treat outputs as atomic units, forcing a binary choice between unsafe acceptance and wasteful rejection. We introduce Claim Graph Risk Control (CGRiC), a framework that decomposes responses into dependency graphs of verifiable claims and assigns calibrated per-claim risk bounds via information-lift statistics. By composing these bounds, CGRiC provides explicit guarantees on the probability that any incorrect claim passes verification undetected. When this composed risk exceeds a target threshold, the system triggers localized repairs rather than full abstention, preserving correct content while fixing problematic claims. Our approach explicitly models extraction noise and verifier imperfection, and exploits conditional independence structure for tighter certificates when validated. Empirically, CGRiC achieves target risk levels while reducing abstention by 31% compared to atomic baselines across QA, summarization, and reasoning tasks.

Abstract:
Foundation models provide strong but biased priors for downstream tasks with limited labeled data. We formalize blackbox assisted nonparametric regression where \|f^-f_0\|_L_2(P_X)\le\delta for unknown \delta. We characterize the minimax risk, revealing a phase transition at \delta_c(n)\asymp n^-\beta/(2\beta+d) with optimal rate \min\\delta^2, n^-2\beta/(2\beta+d)\. Our Safe Black-Box Residual Estimator integrates zero-initialization (a strong inductive bias) with a holdout selection mechanism. This approach achieves the minimax rate while avoiding negative transfer. Experiments on synthetic data, CIFAR-100 (CLIP), and AG News (Qwen3-8B) validate the theory, improving sample efficiency (e.g., CIFAR-100 at n=2000: from 59.4% zero-shot to 66.7%).

Abstract:
Meanfield variational inference (VI) is scalable, but its independence assumption can severely limit inference when the posterior is inherently coupled across instances especially for correlated data. Existing structured VI approaches either impose simple dependence patterns or incur substantial cost as dependence becomes richer, leaving efficient higher-order instance-level dependence modeling largely unresolved. We propose Copula-SVI, which augments amortized marginals with an explicit vine-copula posterior and refines joint samples with Stein updates toward the true posterior. The vine construction makes dependence learning and sampling practical by decomposing it into bivariate copula factors, enabling edge-minibatched training with variance-aware level-wise sampling and efficient dependence-aware initialization via a sparse vine built from the same sampled edges. Experiments on constrained clustering and time series modeling show consistent improvements over strong structured VI baselines and demonstrate efficient higher-order instance-level dependence modeling.

Abstract:
Adversarial training breaks down in longtailed settings, exhibiting severe robustness degradation on worst-performing (often tail) classes. We identify a key cause of this failure as a posterior mismatch: coarse-grained absolute labels collapse class posteriors into point estimates, leading to biased class-frequency estimation and an enlarged robust generalization gap, which ultimately amplifies worst-class vulnerability. To address this issue, we propose Posterior-driven Adversarial Training (PAT), which learns a posterior surrogate to provide fine-grained probabilistic supervision for adversarial training, and integrates weight perturbations to encourage a flatter loss landscape. Our theory shows that accurate posterior approximation simultaneously tightens class-frequency estimation error and robust generalization bounds, while a flat weight loss landscape stabilizes sensitivity to posterior approximation errors. Extensive experiments on long-tailed benchmarks confirm that PAT consistently improves robustness, with especially large gains on worst-class.

Abstract:
Incomplete MultiView Clustering (IMVC) is fundamentally challenged by structural degradation induced by missing views, rather than the absence of feature values. Existing graph-based approaches either rely on costly data imputation or adopt first-order linear fusion, which acts as a weak low-pass filter and fails to separate latent consensus structure from structural noise. To address this limitation, we reformulate IMVC from a spectral filtering perspective and propose Contractive Anchor Resolvent Diffusion (CARD), a scalable framework for high-order structural inference without explicit imputation. CARD constructs a unified anchor-induced hypergraph and derives a high-order resolvent diffusion operator that functions as a sharp rational filter to amplify consensus signals while suppressing view-specific noise. We further derive an implicit solver that jointly optimizes similarity learning and clustering without materializing dense matrices, and prove that the resulting process constitutes a local contraction mapping toward the consensus subspace. Extensive experiments on large-scale benchmarks demonstrate that CARD consistently outperforms state-of-the-art IMVC methods with linear complexity. The code for our method is publicly available at \urlhttps://anonymous.4open.science/r/CARD-8CB1.

Abstract:
Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, so far a principled framing of these decision problems remains lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introducetransfer eluder dimensionas a measure to characterize the hardness of LLF problems. We formalize the intuition that information in the feedback governs the learning complexity of LLF problems. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a noregret algorithm, calledHELiX, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show thatHELiXperforms well even when repeatedly prompting LLMs does not work reliably. Our contributions mark an important step towards designing principled interactive learning algorithms from generic language feedback.

Abstract:
Unpaired domain translation remains a challenging task due to the need of finding a balance between faithfulness and realism. In this paper, we propose a method called Regularized Distribution Matching Distillation (RDMD) that combines the best properties of Optimal Transport (OT) and diffusionbased domain translation methods. Instead of the conventional adversarial training, RDMD utilizes diffusion-based distribution matching, addressing the common shortcomings of OT methods and providing a strong initialization for the trained models. RDMD provides efficient one-step inference, explicitly controls the input-output alignment via regularization of the transport cost and maintains high faithfulness similar to the OT methods. We prove that in theory RDMD approximates the OT map and demonstrate its empirical performance on several tasks, including unpaired image-to-image translation in pixel and latent space and unpaired text detoxification. Empirical results show that RDMD achieves a comparable or better faithfulness-realism trade-off compared to the diffusion and OT baselines.

Abstract:
Verbal logic and visual mental simulation are two essential components of human intelligence. Modern Large Language Models (LLMs) have demonstrated strong verbal reasoning capabilities through textual Chainof-Thought (CoT) reasoning. In contrast, current Video Generation Models (VGMs) struggle with visual reasoning tasks such as spatial planning. We attribute this limitation to two fundamental gaps: (i) VGMs are predominantly trained on general-purpose video corpora emphasizing perceptual fidelity over visual reasoning, leaving reasoning abilities underdeveloped; (ii) most VGMs generate videos in a single pass without mechanisms to explore alternative reasoning trajectories and to revise intermediate errors. Motivated by these limitations, we introduceReaForest, a framework that fosters the reasoning capacity of VGMs in spatial planning through both training-time activation and inference-time scaling. ReaForest comprises three key components:(1)ReaGen-27k, a dataset covering diverse spatial planning tasks that require multi-step reasoning, which activates basic reasoning capabilities of VGMs for spatial planning;(2)Reflective Entropy-Aware Test-Time Scaling (ReaTTS), an inference framework that evolves multiple reasoning branches while enabling failure recovery;(3)Hierarchical constraint verification, which provides actionable feedback for ReaTTS based on decomposed constraints. Extensive experiments demonstrate that ReaForest substantially surpasses advanced textual reasoning models (e.g., Gemini-2.5-Pro) and video generation models (e.g., Sora-2). ReaForest exhibits emergent properties including self-correction, parallel thinking, and scalable reasoning, advancing VGMs toward human-like visual mental simulation.

Abstract:
Sample efficiency remains one of the central challenges in modern deep reinforcement learning (DRL). In recent years, world model approaches have significantly improved the sample efficiency of modelbased reinforcement learning (MBRL) by enabling policy learning in imagination, thereby reducing the need for direct interaction with the real environment. However, most existing world model methods are trained independently for each task or perform multi-task learning in offline settings, failing to fully exploit the latent relationships among tasks in online interactive scenarios. To address this limitation, we propose TaskLoom, a knowledge-sharing world model architecture for online reinforcement learning. TaskLoom adopts a grouped two-stage training paradigm: in the first stage, fine-grained knowledge is shared among tasks within each group, while in the second stage, coarse-grained knowledge is exchanged across groups, enabling hierarchical knowledge transfer and reuse. Experimental results show that TaskLoom outperforms baseline methods on widely used benchmarks such as Proprio Control and Visual Control, validating the effectiveness of the proposed knowledge-sharing mechanism for both low-dimensional state and high-dimensional visual inputs.

Abstract:
Autobidding is a core component of real-time advertising systems, where decisions must optimize long-term performance under budget and cost constraints, while online exploration is prohibitively risky. Offline reinforcement learning and, more recently, Transformer-based sequence modeling have shown promise for learning bidding policies from logged data, but their unimodal and purely parametric formulations often collapse multiple effective bidding strategies into suboptimal averaged actions and perform unreliably under sparse or long-tail traffic. To mitigate these limitations, we proposeDRIVE(Distributional and Retrieval-Augmented Bidding with Value Evaluation), a unified Transformer-based framework that decouples candidate action generation from decision making for offline auto-bidding. DRIVE combines distributional action modeling, retrieval-augmented candidate generation from high-quality historical decisions, and value-based evaluation to select the most promising bid at inference time. Extensive experiments on AuctionNet and additional offline reinforcement learning benchmarks demonstrate that DRIVE consistently improves bidding performance and generalizes well across multiple Transformer–based methods.

Abstract:
This paper considers decentralized optimization of convex functions with mixed affine equality constraints involving both local and global variables. Constraints on global variables may vary across different nodes in the network, while local variables are subject to coupled and nodespecific constraints. Such problem formulations arise in machine learning applications, including federated learning and multi-task learning, as well as in resource allocation and distributed control. We analyze this problem under smooth and non-smooth assumptions, considering both strongly convex and general convex objective functions. Our main contribution is an optimal algorithm for the smooth, strongly convex regime, whose convergence rate matches established lower complexity bounds. We further provide near-optimal methods for the remaining cases.

Abstract:
Estimating causal structure in the presence of latent variables is an important yet challenging problem. Recent works have shown that distributional constraints, such as rank deficiency constraints of the covariance matrices, can be exploited to recover the underlying causal structure involving latent variables. However, realworld data often exhibit heterogeneity/nonstationarity, which pose challenges to existing methods. In this work, we develop a principled approach for identifying the structure of partially observed linear causal models from heterogenous/nonstationary data. We first formulate a class of heterogenous/nonstationary, partially observed linear causal models and prove that their distributional constraints are equivalent to those in the homogeneous case. Building on this, we propose a novel rank deficiency test that can efficiently handle heterogenous/nonstationary data, and further establish identifiability results for recovering the causal structure involving latent variables. We also provide a method to identify which variables exhibit distribution shifts, i.e., whose causal mechanisms vary across domains. Experiments on simulated and real-world data validate our theoretical findings and the effectiveness of our method.

Abstract:
Visionand-Language Navigation (VLN) empowered by Multimodal Large Language Models (MLLMs) is promise, yet remains challenged by long-horizon tasks with complex user instructions. Existing approaches that continuously condition on full instructions incur high latency due to abundant visual tokens and exacerbates instruction interference, where irrelevant text noise induces hallucinations. To address these limitations, we propose IDEAL-VLN ( Instruction DEcomposition and Action aLignment ), a novel paradigm that reformulates navigation as a causal inference chain. We decompose the task into two sequential steps: Semantic Anchoring and Action Alignment. We adopt a Think-Before-Act mechanism that first infers the immediate semantic anchor from the global context and then generates actions conditioned solely on this anchor. This design constructs an explicit information bottleneck, suppressing spurious correlations from irrelevant instruction. Moreover, to alleviate cognitive collapse and limited exploration during training, we introduce a hierarchical correction framework that combines semantic-level thought correction with a spatially-aware adaptive intervention strategy. This strategy adjusts expert intervention probability based on geodesic distance, effectively defining a semantic safety boundary. To support this paradigm, we contribute the Instruction-Aligned Navigation Dataset containing 160K image-text pairs. Extensive experiments demonstrate that IDEAL-VLN achieves state-of-the-art performance and robustness across major benchmarks while significantly reducing inference costs.

Abstract:
Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive highquality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius, a collaborative Coach–Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play frameworks, CPMöbius inspired by multi-agent collaboration treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player’s capability and receives rewards based on changes in the Player’s performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player’s mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on the Qwen2.5-Math-7B-Instruct, our method improves accuracy by overall average +4.9 and out-of-distribution average +5.4, which exceed RENT for +1.5 on overall accuracy and R-zero for +4.2 on OOD accuracy.

Abstract:
Pairwise comparisons by humans or judge models, combined with aggregation methods such as Elo or BradleyTerry, have become a central part of evaluating generative models. However, there has been significant debate whether they measure what they intend to measure. Some argue, pairwise comparisons from judges may reward superficial stylistic cues or display judge biases. In a more positive turn, we show that model rankings from pairwise comparisons strongly agree with ground-truth-based accuracy rankings when such ground truth is available for comparison. To make this observation, we convert well-known benchmarks into free-form generative evaluations scored with Elo rankings from pairwise comparisons. We find that Elo rankings show Spearman correlation above 0.9 with accuracy rankings across five established benchmarks. In addition, Elo rankings have significantly more agreement with accuracy than direct evaluation when the judge is weak. Finally, we show that style and judge bias have only minor effects on model rankings. Although style and bias impede absolute measurement, our work demonstrates that model rankings from pairwise comparisons nevertheless reflect accuracy.

Abstract:
Transformers serve as the foundation of most modern large language models. To mitigate the quadratic complexity of standard full attention, various efficient attention mechanisms, such as linear and hybrid attention, have been developed. A fundamental gap remains: their expressive power relative to full attention lacks a rigorous theoretical characterization. In this work, we theoretically characterize the performance differences among these attention mechanisms. Our theory applies to all linear attention variants that can be formulated as a recurrence, including Mamba, DeltaNet, etc. Specifically, we establish an expressiveness hierarchy: for the sequential function compositiona multi-step reasoning task that must occur within a model’s forward pass, an (L+1)-layer full attention network is sufficient, whereas any hybrid network interleaving L-1 layers of full attention with a substantially larger number (2^3L^2) of linear attention layers cannot solve it. This result demonstrates a clear separation in expressive power between the two types of attention. Our work provides the first provable separation between hybrid attention and standard full attention, offering a theoretical perspective for understanding the fundamental capabilities and limitations of different attention mechanisms.

Abstract:
Reinforcement learning is a powerful paradigm for training autonomous agents and has achieved impressive performance in complex environments. However, this success often comes at the cost of interpretability, diminishing trust and complicating efforts to debug and improve agent behavior. To address these challenges, we introduce CausalXRL, a novel framework for explainable reinforcement learning (XRL). A key feature of CausalXRL is its use of causal graph reasoning, which provides transparent, structured, multilevel explanations of agent decision-making. We validate CausalXRL through comprehensive case studies and a two-part evaluation: (1) a quantitative analysis of agent performance and explanation fidelity in benchmark RL environments, and (2) a qualitative expert study assessing interpretability in a real-time strategy (RTS) game. Results show that CausalXRL enhances human understanding and diagnostic insight in multi-agent scenarios, without compromising task performance. By enabling human operators to interrogate RL agents through causal models, CausalXRL supports alignment by making behavior transparent and auditable.

Abstract:
Data cleaning and automated machine learning (AutoML) are both crucial for reliable learning systems, yet are commonly treated as independent or sequential stages. This separation ignores their strong interaction and leads to inefficient use of limited computational budgets. We propose DMCO, a unified framework that jointly optimizes data cleaning and model construction under a fixed resource budget. DMCO reformulates the traditional twostage pipeline into a time-sliced process, where data cleaning and AutoML are interleaved and adaptively scheduled. We introduce a gradient-based data cleaning sampling strategy with theoretical guarantees for minimizing gradient estimation variance, and integrates it with loss-driven sampling and progressive AutoML fitting to continuously leverage intermediate data quality improvements. Experiments on six real-world datasets show that DMCO consistently outperforms standalone data cleaning and AutoML baselines on both classification and regression tasks, as measured by F1 score and MSE. Under limited budgets, DMCO achieves up to 82.19% of the performance of full data cleaning with exhaustive AutoML, while remaining robust across different AutoML frameworks.

Abstract:
Robust aerial target detection for autonomous UAVon-UAV pursuit is severely hindered by continuous scale drift, long-tailed scale imbalance, and flight-induced visual noise, rendering standard empirical risk minimization strategies poorly aligned with real-world deployment. To address these challenges, we propose a scale-aware robust optimization framework that performs group-wise minimax optimization over scale-partitioned data, ensuring balanced robustness across long-, mid-, and close-range engagement regimes. We further introduce an uncertainty-rectified regression loss to suppress noise-driven errors without discarding informative hard examples, complemented by a control-aligned center accuracy penalty that prioritizes the localization precision required for stable flight control. Extensive experiments demonstrate that our method yields substantially improved robustness under visual degradation, with significantly slower decay in detection mAP and center-point accuracy compared to baselines. Validated through both photorealistic simulations and real-world flight tests, our system achievesreal-time performance of 120 FPSon an embedded NVIDIA Orin NX platform, confirming its practical efficacy for high-speed interception.

Abstract:
Large language models increasingly rely on retrievalaugmented generation (RAG) to ground responses in external corpora. Yet, even with strong retrievers, generated statements can remain unsupported, and the resulting citations are often not reliable indicators of evidence. We introduce CiteGuard, a RAG decoding layer that treats sentence-level factuality as a multiple-testing problem and combines conformal calibration with false-discovery-rate control. CiteGuard converts claim–evidence scores into p-values for the null hypothesis "unsupported" and uses BH/BY procedures to decide which claims to keep (with citations) and which to abstain on. On FEVER and Natural Questions, CiteGuard reduces the false-discovery rate among accepted claims from 28–31% (vanilla RAG) to below 10% at α=0.10, while retaining 86–92% of supported claims. This yields a user-controlled risk budget: practitioners can trade off faithfulness and coverage via α, with finite-sample guarantees under standard exchangeability assumptions.

Abstract:
Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they can succeed at benchmarks without any access to target model internals, suggesting that these datasets may not be ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the knowledge of the target LLM whose activations are decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.

Abstract:
Domain transfer (DT) maps source to target distributions and supports tasks such as unsupervised imageto-image translation, single-cell analysis, and cross-platform medical imaging. However, DT is fundamentally ill-posed: push-forward mappings are generally non-identifiable, as measure-preserving automorphisms (MPAs) preserve marginals while altering cross-domain correspondences, leading to content-misaligned translation. Recent work shows that MPAs can be eliminated by jointly transferring multiple corresponding source/target conditional distributions, but supervision signals labeling such conditionals are not always available in practice. We develop an alternative route to DT identifiability. Under a structural sparsity condition on the Jacobian support pattern, we show that distribution matching together with a single paired anchor sample suffices to identify the ground-truth transfer---requiring substantially less supervision than prior approaches. To enable practical high-dimensional learning, we further propose an efficient Jacobian sparsity regularizer based on randomized masked finite differences, yielding a scalable surrogate without explicit Jacobian evaluation. Empirical results on synthetic and real-world DT tasks validate the theory.

Abstract:
We study offpolicy evaluation (OPE) under strategic behavior where decision subjects (or agents) respond to a decision maker's policy by strategically modifying their covariates. Such behavior induces a policy-dependent covariate shift, breaking the standard assumption in existing methods that covariates are exogenous to the policy. Related work addresses this challenge by imposing strong assumptions such as repeated interactions or full knowledge of agents’ response behavior, substantially limiting its applicability to OPE. In contrast, we consider a one-shot OPE setting where the decision maker has only partial knowledge of the agents' response behavior. Our key insight is that disclosing local information through post-hoc explanations reveals agents’ pre-strategic covariates prior to adaptation, mitigating the information loss induced by strategic behavior. Leveraging this structure, we estimate a statistical model for the agents’ responses and construct a doubly robust estimator for policy value. By assuming that the agents' cost sensitivity follows a conditional log-normal distribution, we establish consistency of the proposed estimator and validate our approach empirically. More broadly, our results highlight how interaction design can mitigate information asymmetry by revealing otherwise hidden structure in agents' strategic responses.

Abstract:
Multimodal large language models (MLLMs) are increasingly used for realworld tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attributions in complex multimodal reasoning. We introduce MURGAT (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MURGAT requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segment. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments (r = 0.84). Benchmarking against human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we see a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution

Abstract:
Hyperspectral Image Single Image SuperResolution (HSI-SISR) faces a conflict between computational efficiency and global non-local modeling. Existing Transformers suffer from quadratic complexity, while window-based methods compromise global capture. To address this, we propose the Spatial-Spectral Dual-Clustering-based Network (SSDCN). Our method introduces three innovations. First, we design a Spatial-Spectral Dual-Cluster Block (SSDCB). Replacing expensive point-to-point attention, it uses content-driven clustering to learn low-rank structural bases, achieving global modeling with linear complexity \mathcalO(KN). Second, we propose a pyramid progressive hierarchical architecture with a Feature Reuse Reconstruction Block (FRRB). It reuses the core tensor and spectral factors from coarse levels, updating only spatial factors to minimize redundancy. Third, we propose a Pyramid Hierarchical Reconstruction Joint Loss to supervise intermediate levels, ensuring structural accuracy and preventing error accumulation. Experiments demonstrate that SSDCN surpasses SOTA methods in metrics and visual quality with significantly fewer parameters and FLOPs, achieving an optimal efficiency-performance balance.

Abstract:
Conformal prediction is challenging for time series with the timevarying conditional distributions. Existing sequential conformal methods can yield volatile, non-nested prediction intervals due to noisy tail conditional quantile estimation and quantile crossing issue. To overcome this challenge, we construct the prediction intervals for time series via a novel method called Conditional Quantile Adjusted Conformal Prediction (CQACP), which stabilizes sequential conformal calibration by modeling the conditional quantile curve of nonconformity score. At each time step, CQACP evaluates a base conditional quantile learner on a grid of quantile levels, and fits a Cornish-Fisher approximation parameterized by conditional moments of nonconformity score with monotonicity constraints. Asymptotically, we prove the conditional validity of the prediction interval under serial dependence and show improved conditional quantile estimation accuracy. Experiments on multiple real-world datasets demonstrate that CQACP maintains accurate coverage and produces smooth, narrow, and nested prediction intervals across different significance levels and prediction models.

Abstract:
Generalized Category Discovery (GCD) aims to learn semantically structured representations for discovering novel categories in unlabeled data using supervision from known classes. Most existing methods rely on selfsupervised contrastive learning (CL) with consistency and uniformity objectives. We identify an inherent optimization conflict between these objectives: while uniformity enforces global feature dispersion, it can hinder the formation of class-discriminative and semantically coherent structures. To address this issue, we propose a two-stage framework that decouples representation learning from self-contrastive regularization. The first stage learns category-anchored representations aligned with known class prototypes, while the second stage extends the representation space to novel categories via a consistency objective enhanced with unified semantic regularization. We further introduce a Semantic Exploration Energy mechanism to capture shared semantics across categories and mitigate information loss caused by prototype orthogonalization. The resulting framework, termedConsistency-under-Unified Semantic Regularization(CURE), achieves state-of-the-art performance on multiple benchmarks and substantially reduces the performance gap between known and novel categories.

Abstract:
Integrating large language models (LLMs) into textto-speech (TTS) systems has improved speech expressiveness, yet controllable emotional expression remains challenging. Existing approaches primarily rely on external conditioning or global activation steering, offering limited insight into how emotional variation is represented internally and restricting fine-grained control. In this work, we analyze emotion-related variation in the speech semantic hidden states of LLM-based TTS models. To this end, we leverage sparse autoencoders (SAEs) to map these representations to sparse latent features and examine their emotion-related activation patterns. Our evaluations indicate that emotional variation is distributed across multiple sparse latent features, revealing a more interpretable internal representation. Building on this observation, we introduce a feature-level intervention framework that enables targeted and bidirectional emotion induction and suppression without modifying backbone parameters. We further show that distinct latent features correlate with specific acoustic attributes (e.g., pitch), suggesting that emotional expression arises from coordinated latent contributions rather than a single global shift.

Abstract:
The deployment of Large Language Models (LLMs) with extended context windows is increasingly constrained by the linear growth of KeyValue (KV) cache memory. Vector Quantization (VQ) is a key enabler for pushing KV cache storage toward the sub-1-bit regime; in particular, Residual Quantization (RQ) supports this goal via progressive refinement, sequentially encoding residuals with small codebooks across stages. Yet most VQ methods still rely on standard \ell_2 K-means as the core codebook-learning primitive. We identify a subtle high-dimensional issue: Euclidean centroid averaging can induce centroid shrinkage, and under an \ell_2 objective this shrinkage reduces the influence of angular alignment in the distortion term. This coupling can make directional preservation harder to maintain, hindering KV cache vector quantization methods from pushing into the sub-1-bit regime. To mitigate this coupling, we propose Gain-Shape K-means (GSKM), a drop-in replacement for K-means that improves directional fidelity over standard K-means while matching, and in some regimes improving, \ell_2 distortion. We build Gain-Shape Residual Quantization (GSRQ) by incorporating a weighted extension of GSKM into a RQ pipeline. On LLaMA-3-8B, GSRQ yields substantial improvements over strong KV cache quantization baselines across bit rates. At 1-bit, our method improves the average accuracy across LongBench tasks from 11.34 to 32.26, a gain of 20.92 percentage points over VQLLM.

Abstract:
Large Language Model (LLM) agents have demonstrated remarkable proficiency in solving isolated software engineering tasks. However, existing benchmarks predominantly evaluate static, independent issues, failing to reflect the continuous and sequentially dependent nature of realworld software evolution. We introduce DeepCommit, an automated pipeline that reconstructs verifiable software evolution trajectories from git histories as Milestone DAGs, and DevEvol, a benchmark for streaming evaluation over evolving codebases. This setting requires agents to manage long-term context, architectural consistency, and technical debt. Our evaluation reveals a fundamental performance gap: even frontier models achieve only ～35% Score and ～10% Resolve Rate in continuous environments, driven by a ``snowball effect'' where early errors accumulate and block downstream development. These results demonstrate that strong snapshot performance substantially overestimates real-world agent capability, establishing long-horizon software evolution as a critical unsolved challenge. Our code and dataset are available at https://anonymous.4open.science/r/DevEvol-48A8.

Abstract:
Agents that can reason and plan in the real world must be able to predict the consequences of their actions. World models possess this capability but require action annotations that can be complex to obtain at scale. Latent action models address this issue by learning an action space from videos alone. Our work studies the training of latent action world models on inthe-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While diverse videos enable modeling richer actions, they introduce challenges of environmental noise and lack of a common embodiment across videos. To address these, we carefully study the design and evaluation of latent actions. We find that constrained continuous latent actions are better suited for complex in-the-wild videos, compared to vector quantization. For example, actions specific to in-the-wild videos such as humans entering the room, can be modeled and then transferred across videos. However, in the absence of a common embodiment, learned latent actions are localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface to solve planning tasks on par with action-conditioned baselines.

Abstract:
Optimizing prompts for foundation models like SAM represents a challenging highdimensional black-box optimization problem, fundamentally plagued by the credit assignment ambiguity. To address this, we introduce PromptPilot, a task-agnostic reinforcement learning framework that structurally decomposes the search space into orthogonal semantic and spatial subspaces. Specifically, a centralized manager orchestrates two specialized agents, a feature agent ensuring semantic coherence and a physical agent maximizing spatial coverage, to navigate conflicting optimization objectives. Crucially, our reward mechanism synergizes global segmentation feedback with an efficient approximation of Shapley values, enabling fine-grained attribution of performance gains to individual prompt actions. PromptPilot functions as an inference-time optimization strategy without parameter updates. Extensive experiments demonstrate that our game-theoretic approach significantly improves segmentation performance and generalization, offering a principled solution for automated prompt engineering.

Abstract:
Adopting pretrained Vision-Language Models (VLMs) in Federated Learning (FL) presents a promising avenue for mitigating data scarcity and heterogeneity. However, existing solutions suffer from high computational complexity or ineffective knowledge aggregation. To address these problems, we propose FedSPA (Federated Adaptation via Semantic-Visual Prototype Alignment). On the client side, FedSPA restricts local optimization to visual prototypes, enabling lightweight personalization. On the server side, we introduce a semantic alignment module that leverages client-uploaded prototypes to minimize a contrastive objective, aligning global semantic prototypes with heterogeneous visual distributions and thereby shifting the paradigm from traditional “learning-to-describe" (optimizing static prompts) to ”learning-to-align". Extensive experiments demonstrate that FedSPA significantly outperforms state-of-the-art methods in both personalized and global benchmarks, while substantially reducing computational overhead.

Abstract:
Compared with individual agents, large language model based multiagent systems have demonstrated great capabilities across a wide range of tasks, including code generation, mathematical reasoning, and planning, etc. Despite their impressive performance, the effectiveness and robustness of these systems heavily rely on their communication topology, which is often fixed or generated in a single step. This restricts fine-grained structural exploration and flexible composition, leading to excessive token consumption for simple tasks or performance bottlenecks for complicated ones. To address this challenge, we introduce RADAR, a redundancy-aware and query-adaptive generative framework that actively reduce communication overhead. Inspired by conditional discrete graph diffusion models, we formulate communication topology synthesis as a step-by-step generation process, guided by the effective size of the graph. Comprehensive experiments on six benchmarks demonstrate that RADAR consistently outperforms recent baselines, achieving higher accuracy, lower token consumption, and greater robustness across diverse scenarios. Our source code and data are available at https://anonymous.4open.science/r/RADAR-8430.

Abstract:
Medical diagnosis tasks often rely on heterogeneous information from multiple sources, such as medical images and clinical text. Multimodal fusion is therefore essential for improving classification performance and robustness. However, most existing methods assume a fixed and known modality set, making them less effective when the number or composition of modalities changes. To address this limitation, we propose a modalityagnostic medical multimodal fusion framework that can naturally accommodate an arbitrary number of input modalities. At the coarse-grained modality level, we represent each modality’s estimation of latent semantics as an uncertainty-aware probability distribution, and impose symmetric consistency constraints to achieve global cross-modal semantic alignment. At the fine-grained token level, we further design a token-level consistency constraint based on linear reconstruction. This constraint enables structured mutual verification of local semantics across modalities. Finally, for multimodal fusion, we adopt a multi-view consistency strategy to obtain a unified representation for diagnosis prediction. In particular, each modality is sequentially treated as a conditional view to suppress noise in the remaining modalities and extract shared semantics. Extensive experiments on five public and self-constructed multimodal medical datasets demonstrate the effectiveness and scalability of the proposed approach. Code is available at https://github.com/gjhgjbkg/SMMF.

Abstract:
We study the problem of private hypothesis selection: given samples from an unknown distribution drawn from a finite hypothesis class, the goal is to identify the best hypothesis under the constraint of differential privacy. Existing algorithms for this problem are either computationally expensive or achieve suboptimal statistical rates. We propose new algorithms that achieve near-optimal rates while running in nearly linear time in the number of hypotheses. Rather than applying the exponential mechanism directly with a score function that requires pairwise comparisons between hypotheses, our approach introduces a carefully designed loss function based on a small set of strong hypotheses. This structure allows the score to be evaluated efficiently for most hypotheses, yielding significant computational savings. We further extend our algorithms to the agnostic setting, where the true distribution may not belong to the hypothesis class. As an application, we obtain faster differentially private algorithms for universal statistical estimation in low dimensional settings.

Abstract:
Why does contrastive learning with simple images and augmentations yield useful representations for downstream tasks? We answer this by analytically computing the optimal contrastive learning (CL) weights in simple onehidden-layer CNNs using only dataset statistics. For a range of basic augmentations and any image dataset with stationary statistics, we prove that such CNNs trained with a contrastive loss learn sinusoidal first-layer filters. With augmentations that combine translation and adding noise, the CNN learns partial whitening of the input and measures frequency contrast: differences between power at frequencies with the same expected power. The selected frequencies and their weights can be computed using a simple “waterfilling” algorithm given the dataset’s expected power spectrum. Experiments with eight image datasets show that CNNs trained with SGD empirically learn partial whitening and the predicted frequency contrasts, and the usefulness of the learned representation for recognition depends on both the augmentations and the mismatch between the training and test power spectra.

Abstract:
Pansharpening fuses highresolution panchromatic (PAN) images with low-resolution multispectral (LMS) images. For spatial-spectral fusion, Fast Fourier Transform (FFT)-based methods provide a global receptive field to capture long-range dependencies and naturally separate frequency components. However, most existing approaches directly transplant spatial operators like convolution or self-attention, while disregarding the fundamental structure of the spectrum: a strict spatial correspondence where each coordinate represents a specific frequency component, and a highly non-uniform, radially decaying energy distribution. To address this, we revisit the classical Butterworth filter, a frequencydomain operator defined directly on spectral coordinates that is inherently suited for processing such structured representations. We generalize the standard isotropic Butterworth filter into an anisotropic, learnable frequency-domain gating mechanism, establishing an efficient alternative to self-attention, and propose the Anisotropic Butterworth Fusion Network (ABFNet). Its core is a novel dual-branch gating module that employs learnable anisotropic Butterworth filters to perform adaptive direction-aware feature selection, integrating global context and local details with linear complexity. Extensive experiments show that ABFNet achieves state-of-the-art (SOTA) performance on pansharpening benchmarks with low computational overhead. Furthermore, its superior accuracy on CIFAR-100 classification validates the broader applicability of this frequency-domain learning paradigm.

Abstract:
Delayed outcomes are ubiquitous in online experimentation. When such a temporal dimension is present, treatment influences not only the outcome value but also the outcome timing, which can move in opposite directions. Motivated by the desire to continuously monitor the performance of treatment arms, we develop an anytimevalid approach to inference in the delayed outcome setting. We adopt a design-based framework where both the outcome timing and value are fixed potential outcomes, and randomness is introduced by treatment assignment only. We target the sample cumulative reward as a function of time, a causal estimand that avoids modeling the unobserved future, which would require strong assumptions violated by the nonstationarity and heterogeneity of our setting. We prove that the estimation error for the Horvitz-Thompson (IPW) estimator forms a martingale with respect to a specific single-arm filtration. Conversely, the estimation error for the AIPW estimator fails to be adapted to this filtration. We prove a fundamental negative result for the treatment effect: the estimation error is not a martingale under any filtration, arising from cross-arm covariance induced by randomized assignment. We resolve this using a union bound, showing it yields tighter intervals than the standard variance upper bound when treatment induces asymmetry in outcome arrival rates.

Abstract:
Current deep learning models for Multispectral and Hyperspectral Image Fusion (MS/HS fusion) are typically designed for fixed spectral bands and spatial scales, which limits their transferability across diverse sensors. To address this, we propose SSA, a universal framework for MS/HS fusion with spectralband and fusion-scale agnosticism. Specifically, we introduce Matryoshka Kernel (MK), a novel operator that enables a single model to adapt to arbitrary numbers of spectral channels. Meanwhile, we build SSA upon an Implicit Neural Representation (INR) backbone that models the HS signal as a continuous function, enabling reconstruction at arbitrary spatial resolutions. Together, these two forms of agnosticism enable a single MS/HS fusion model that generalizes effectively to unseen sensors and spatial scales. Extensive experiments demonstrate that our single model achieves state-of-the-art performance while generalizing well to unseen sensors and scales, paving the way toward future HS foundation models.

Abstract:
Generative Flow Networks (GFlowNets) have recently been used to improve diversity and mitigate popularity bias in LLMbased recommender systems, yet most objectives are developed under online-style assumptions. In offline LLM-based recommendation, learning is constrained to a fixed logged dataset, yielding partial support over token transitions on the dataset-induced token-prefix DAG. Naively applying Sub-Trajectory Balance (SubTB) becomes non-identifiable and can arbitrarily allocate probability mass to unsupported regions. We formalize this failure and identify three sources of non-identifiability that induce distributional shift between the dataset-implied policy and the learned policy: (i) flow overestimation, (ii) forward mass leakage, and (iii) backward compensation. To address it, we propose CFlower, which introduces a conservative SubTB objective that explicitly penalizes unsupported forward flow mass, and combines it with dataset-constrained policy learning with on-policy sampling on the dataset-induced DAG for efficient training under offline constraints. Experiments on three Amazon recommendation datasets show that CFlower improves distributional matching and delivers a stronger accuracy--exposure trade-off than prior GFlowNet and SFT baselines, while serving as a more reliable reference policy for downstream RL fine-tuning.

Abstract:
Personalized and continuous interactions are critical for LLMbased conversational agents, yet finite context windows and static parametric memory hinder the modeling of long-term, cross-session user states. Existing approaches, including retrieval-augmented generation and explicit memory systems, primarily operate at the fact level, making it difficult to distill stable preferences and deep user traits from evolving and potentially conflicting dialogues.To address this challenge, we propose RGMem, a self-evolving memory framework inspired by the renormalization group (RG) perspective on multi-scale organization and emergence. RGMem models long-term conversational memory as a multi-scale evolutionary process: episodic interactions are transformed into semantic facts and user insights, which are then progressively integrated through hierarchical coarse-graining, thresholded updates, and rescaling into a dynamically evolving user profile.By explicitly separating fast-changing evidence from slow-varying traits and enabling non-linear, phase-transition-like dynamics, RGMem enables robust personalization beyond flat retrieval or static summarization. Extensive experiments on the LOCOMO and PersonaMem benchmarks demonstrate that RGMem consistently outperforms SOTA memory systems, achieving stronger cross-session continuity and improved adaptation to evolving user preferences.

Abstract:
We investigate the distributionally robust assortment optimization (DRAO) problem under the contextual multinomial logit (MNL) choice model, where the decisionmaker seeks to maximize revenue against worst-case distributional deviations. To address potential distribution shifts relative to the observed data environment, we study DRAO under ambiguity sets defined by three divergences: total variation (TV), Kullback–Leibler (KL), and chi-square (\chi^2). Incorporating robust concerns poses challenges for both algorithm design and theoretical analysis. By leveraging strong duality results from the distributionally robust optimization literature and integrating them into the assortment optimization procedures, we develop tailored polynomial-time algorithms under each divergence. We further provide a theoretical analysis and establish sample complexity bounds for all three robust formulations.

Abstract:
Modeling the learning curve is critical for costeffective data collection in deep learning systems. Most prior approaches assume a specific parametric learning curve, but these can be inappropriate when no reliable parametric form can be assumed for the learning curve. While Gaussian processes offer flexible nonparametric modeling, existing GP approaches that enforce monotonicity typically introduce intractable factors or require derivative observations. To address this, we propose a Monotonic Variational Gaussian Process for Efficient Data Collection (MOVE), which (i) introduces a novel monotonic variational GP formulation with virtual-derivative factors to enable tractable posterior inference, and (ii) develops an expected shortfall based objective for target-driven data collection. Furthermore, our theoretical analysis shows that expected shortfall provides non-vanishing gradient signals that enable reliable gradient-based optimization. Extensive experiments on classification, segmentation, and detection benchmarks demonstrate consistent improvements over the prior method.

Abstract:
Stock markets are one of the most complex systems in the modern world, where prices emerge from billions of decentralized interactions among heterogeneous participants in an everevolving information landscape. While high-fidelity simulation is important for understanding market dynamics, existing approaches face a persistent trade-off between structural and behavioral fidelity. To this end, we propose MarketSim, a large-scale stock market simulation framework with generative agents. MarketSim introduces a hierarchical multi-agent architecture that decouples strategic reasoning from high-frequency execution, enabling LLM agents to operate in a nanosecond-resolution, NASDAQ-like continuous double auction market. Building on this, we simulate over 15,000 heterogeneous market participants whose interactions shape and are shaped by an evolving market environment grounded in more than 12k real-world news articles, policy documents, and earnings reports. To evaluate our proposed MarketSim, we develop a comprehensive benchmark that includes stocks from 8 GICS sectors and 3 representative real-world scenarios, along with 5 stylized facts for market complexity and 5 price-related statistical metrics. Extensive experiments demonstrate that MarketSim not only captures key complexity properties of real-world markets, but also outperforms state-of-the-art baselines in tracking high-frequency price dynamics with an average MAPE of 3.48%, providing a scalable testbed for market analysis.

Abstract:
Multimodal LLMs lack a systematic understanding of visual dynamics in complex human world activities, which requires the model to predict or simulate multiple levels of dynamic constituents, such as the general progression of actions and the associated changes of lowlevel details in the world. To address this challenge, we propose a dynamic visual schema-guided world model, DynaVieW, optimized for visual dynamic prediction and simulation. DynaVieW achieves an in-depth understanding of visual dynamics by learning interleaved state-transition sequences, where states cover broad visual scenes from video keyframes, and transitions capture comprehensive dynamic constituents within a hierarchical schema. DynaVieW jointly models transition prediction and state simulation under a mixture-of-experts architecture, with a cross-expert selective attention and a schema token re-weighted loss, to ensure effective and robust learning. DynaVieW's superior visual dynamic understanding boosts its downstream performances on both visual narrative creation and world simulation, showing improved consistency and controllability of visual generation and better instruction-following ability.

Abstract:
Continual learning (CL), where a model is trained on a sequence of data tasks, is increasingly being adopted across key fields such as large language models and image recognition, yet it remains highly vulnerable to data poisoning that triggers learning divergence or severe generalization loss. Despite these threats, a principled theoretical foundation in CL for understanding attack and defense remains lacking. In this paper, we develop a theoretical framework to analyze strategic attacks and defenses in regularizationbased CL, a cornerstone of recent CL theory. By framing the adversary-defender interaction as an online zero-sum game, we first establish a fundamental performance limit: no defense succeeds when an adversary poisons a linear proportion of tasks via adding unbounded noise or pattern shifts in regularization-based CL. We then analyze two possibly denfensible scenarios: infrequent attacks and bounded noise per attack. For the former regime, we propose a task-to-task verification mechanism to detect data poisoning and reduce cumulative bias for learning convergence. For the latter regime, we derive a robust defense that minimizes the model’s sensitivity to poisoned features, provably accelerating the convergence rate. Extensive experiments on realistic tasks further validate our theoretical results.

Abstract:
Federated Learning (FL) enables privacypreserving collaboration among distributed clients in open-world environments, but its performance often degrades under data heterogeneity and unpredictable distribution shifts. Test-Time Adaptation (TTA) has recently been introduced into FL to leverage unlabeled data from unseen clients for online adaptation. However, most existing federated TTA methods employ local feature statistics, which can be brittle under diverse and severe distribution shifts. In this work, we observe that despite significant variations in feature distributions, the relational structure among class prototypes—termedprototype affinity topology—remains remarkably stable across heterogeneous clients. Building on this insight, we propose FedPAT, aFederated TTA framework that leveragesPrototypeAffinityTopology as a cross-client structural prior. FedPAT learns a global PAT by aggregating class prototypes from source clients, capturing consensus inter-class relationships that are robust to local distribution variations. For unseen target clients, we design a topology-aware mechanism that enhances predictions via diffusion of the global PAT, fuses them with parametric outputs, and performs lightweight optimization for robust test-time adaptation. Extensive experiments demonstrate that FedPAT consistently outperforms advanced federated TTA and classical TTA methods across various distribution shifts.

Abstract:
Formal specification is a powerful tool to guide the learning process and provides significant advantages over adhoc reward shaping: (1) mathematical rigor; (2) expressiveness to specify objectives and constraints, and (3) the ability to define strategies to achieve objectives. However, these benefits remain largely unexplored in the context of MARL. This paper introduces HyPOLE, a novel framework for MARL under partial observability, where learning is guided by the expressive power of the so-called hyperproperties and, in particular, the temporal logic HyperLTL. HyPOLE targets settings in which agents operate under partial observability, modeled as partially observable Markov decision processes (POMDPs). We integrate CTDE techniques with HyPOLE to synthesize decentralized policies, and our evaluation on StarCraft~II and Wildfire benchmark demonstrates clear advantages over vanilla MARL baselines.

Abstract:
Predicting streetlevel socioeconomic indicators from street view imagery is fundamental to urban planning. Existing methods typically extract visual features via pretrained encoders and propagate information through graph-based learning, but they fail to fully exploit the structured, task-relevant, and label-efficient learning signals inherent in urban scenes. We propose MetaStreet, a semi-supervised multimodal framework with three components: (1) a semantic-spatial visual encoder that jointly models object co-occurrence and spatial adjacency at the semantic category level, (2) a task-aware textual encoder that steers LLMs toward prediction-relevant features via task-specific prompts, and (3) a geography-aware graph contrastive learning module that leverages spatial autocorrelation to extend contrastive supervision to unlabeled streets, enabling them to actively participate in representation learning. Experiments on two cities across three socioeconomic prediction tasks demonstrate that MetaStreet consistently outperforms state-of-the-art methods.

Abstract:
Regular expressions (regexes) frequently exhibit superlinear worst-case behavior in regex engines, exposing software to Regex Denial-of-Service (ReDoS) attacks. Detecting such vulnerabilities is challenging, especially for extended features such as lookarounds and backreferences: existing static approaches are efficient but often lack support for extended features, whereas dynamic and hybrid approaches reduce false positives by executing regex matching on real engines, but incur high runtime overhead. To address this trade-off, we propose ReDoS-MotifGNN (RMGNN), a motif-based graph learning framework for ReDoS detection that leverages the low inference latency of graph neural networks (GNNs). RMGNN converts regexes into Heterogeneous Regex Graphs (HRGs) and encodes three ReDoS-related motifs into HRGs to incorporate domain priors, while preserving the syntactic structure and extended features of the input regex. Furthermore, it applies heterogeneous propagation with kernel-guided motif learning to capture multi-scale semantics, which are fused via residual cross-attention for robust prediction. Comprehensive evaluation on four real-world datasets (over 317k regexes) demonstrates that RMGNN outperforms six state-of-the-art baselines in F1-score and achieves an average 244× speedup over the top F1-performing baseline.

Abstract:
Recovering multiple physical parameters from highdimensional optical measurements remains challenging in computational optics. We presentMMPD-Bench, a pioneering benchmark that reframes multi-polarimetric modalities decomposition from Mueller matrix observations as amodality fissionproblem under the multi-modal learning paradigm. By replacing iterative numerical inversion with deep surrogate models, MMPD-Bench provides data, standardized solutions and evaluations to address the multi-physics modalities generation challenge. We benchmark representative architectures to this problem, including state-space models, vision transformers, conditional diffusion models, and neural operators, under a multi-faceted evaluation protocol that jointly assesses perceptual fidelity, physical consistency, robustness, and computational efficiency. Our analysis reveals non-trivial trade-offs between accuracy and robustness in accelerated high-fidelity polarimetric decomposition, highlighting key limitations of existing surrogates. To support reproducible research, we open-source the full codebase, together with a large-scale dataset of 21,412 high-resolution Mueller matrix observations acquired through extensive polarimetric measurements. We invite the community to further advance the intersection of polarization optics and multimodal representation learning.

Abstract:
Transformerbased autoregressive models excel in data generation but are inherently constrained by their reliance on discretized tokens, which limits their ability to represent continuous values with high precision. We analyze the scalability limitations of existing discretization-based approaches for generating hybrid discrete-continuous sequences, particularly in high-precision domains such as logos, layouts, and semiconductor circuit designs, where precision loss potentially leads to visual artifacts, aesthetic degradation, and even functional failure. To address the challenge, we propose a novel unified framework that jointly models discrete and continuous values for variable-length sequences. Our approach employs a hybrid approach that combines categorical prediction for discrete values with diffusion-based modeling for continuous values, incorporating two key technical components: an end-of-sequence (EOS) logit adjustment mechanism that uses an MLP to dynamically adjust EOS token logits based on sequence context, and a length regularization term integrated into the loss function. Additionally, we present ContLayNet, a large-scale benchmark comprising 334K high-precision semiconductor layout samples with specialized evaluation metrics that capture functional correctness, where precision errors significantly impact performance. Experiments on semiconductor layouts (ContLayNet), graphic layouts, and SVGs demonstrate that our approach achieves higher-fidelity hybrid vector representations than discretization-based and fixed-schema baselines, while scaling to high-precision generation across multiple domains.

Abstract:
Evaluating representation similarity is fundamental to representation learning. However, existing metrics suffer from significant limitations: they are difficult to interpret due to shifting baselines, lack robustness to outliers, and are frequently computationally intractable for large datasets, forcing a reliance on heuristic approximations. To address these shortcomings, we develop an ordinalsimilarity framework, instantiated by the Triplet (TSI) and Quadruplet (QSI) Similarity Indices, which measure alignment by quantifying the consistency of ordinal relationships. We provide a theoretical analysis demonstrating that this formulation is inherently interpretable, robust to outliers, and computationally efficient. Finally, we establish a formal equivalence between TSI alignment and the alignment of local neighborhood structures, as measured by Mutual Nearest Neighbors. Through empirical analysis, we validate these properties and show that ordinal similarity offers a scalable, practical approach to measuring alignment, enabling practitioners to better understand and design representations.

Abstract:
Finetuning Large Language Models (LLMs) enables data holders to construct proprietary, task-specific models by leveraging external high-performance computing infrastructure. However, existing paradigms typically address data privacy and model intellectual property (IP) in isolation, failing to simultaneously uphold both constraints. Privacy-prioritized methods compromise model IP by hosting parameters remotely, while IP-oriented collaborative schemes relying on end-to-end gradient flows inherently violate strict data privacy standards. To address these challenges, we presentPISA(Privacy-preserving andIP-protectedSplitAdaptation), a split fine-tuning framework designed to preserve both data privacy and model IP while maintaining high utility. In PISA, we propose three methods: a Manifold Rectification Pre-training (MRP) method to equip the server-side model with intrinsic robustness against privacy-induced distribution shifts; a Dual-Stream Semantic Compensation (DSC) method to recover feature utility using local clean data as priors; and a Utility-Aware Gradient Rectification (UGR) method to adaptively maximize the performance of the parameter-constrained local model. Experiments on GLUE show that PISA ensures dual protection and delivers a substantial 23.0% performance gain over the privacy-prioritized baseline under strict privacy budgets.

Abstract:
Testtime adaptation (TTA) can reduce error on new and different data by updating the model on these inputs during inference. However, these updates raise the issue of privacy w.r.t. the testing data, because the model parameters now depend on all past inputs. To control this privacy risk, we cast multiple popular TTA methods (Tent, EATA, SAR, DeYO, and COME) into differential privacy (DP) forms that apply per-sample gradient clipping and Gaussian noise for all updates. On ImageNet-C, our DP-TTA methods provide adequate privacy at small cost to accuracy, and in the low-privacy regime the clipping mechanism of DP even improves the accuracy and stability of adaptation in the continual setting. These improvements to privacy and accuracy come at only only modest computational overhead. These first results on private TTA raise awareness of the issue, inform the development of more private test-time updates, and identify per-sample clipping as a reliable technique for improving the accuracy and stability of adaptation.

Abstract:
Bias mitigation is particularly challenging for overparameterized machine learning (ML) models. Overfitting of training points not only amplifies data bias induced by spurious correlations, but also causes the failure of usual bias mitigation methods. To provide actionable insights to address this challenge, we propose a precise analysis of fair empirical risk minimization (ERM) in the overparameterized regime. Importantly, we show that even though conventional fair ERM fails on overparameterized models, this approach can be corrected by modifying the equality fairness constraint to allow for bias overcompensation. Moreover, our analysis presents an empirical criterion for strong equalized odds: balanced groupconditional means of representer coefficients, indicating equal average contribution from each sensitive group. Motivated by this result, we provide an estimable search interval that localizes the required overcompensation level for balanced coefficients. Despite the asymptotic nature of our findings, they closely capture the statistical behavior of moderately large ML models.

Abstract:
In recent years, hypergraph contrastive learning methods have gained widespread attention due to their excellent performance in processing highorder structural data. However, traditional hypergraph learning method often assume that neighboring nodes are homogeneous, which can lead to the mixing of heterogeneous information in highly heterogeneous datasets, thereby affecting node feature representation. To address this issue, this paper proposes a heterogeneity-sensitive hypergraph contrastive learning method. In the view enhancement stage, we introduce a heterogeneity-aware mechanism that masks high-heterogeneity nodes using hyperedges as intermediaries for information filtering. This mechanism weakens the interference of heterogeneous nodes on view consistency, enabling the model to focus more on key features. In the encoding stage, a heterogeneity-sensitive hypergraph encoder is designed. It dynamically adjusts the weights of information propagation through hyperedges in two phases:node-to-hyperedge" andhyperedge-to-node". This adjustment allows hyperedges to focus on homogeneous information and feedback the aggregated homogeneous information to the respective nodes. Besides, we provide a theoretical proof that our model is capable of aggregating information based on node heterogeneity using hyperedges as intermediate structures. Extensive experimental results demonstrate that this method effectively reduces the interference of heterogeneous information and improves model performance on multiple benchmark datasets. Our code is availabl at: https://anonymous.4open.science/r/HHCL-F926

Abstract:
Recommendation systems seek to accurately model user preferences from a large set of candidate items. Graph neural networks (GNNs) have emerged as a dominant approach in this domain due to their ability to capture highorder user–item interactions. Recent efforts have aimed to enhance GNN-based representation learning by incorporating the semantic reasoning capabilities of large language models (LLMs). However, existing methods often process graph structural information and LLM-derived semantic knowledge separately, creating a supervisory gap between structural proximity and semantic relevance. To bridge this gap, we propose CCLRec, a consensus-driven contrastive learning framework for recommendation. CCLRec deeply integrates structural and semantic information by identifying consistent signals. Specifically, we first use an LLM to extract semantic representations of items and to sample candidate positive/negative sets in the semantic space. We then introduce a structural–semantic consensus mining strategy that computes the intersection between a node’s structural neighbors in the graph and its semantically similar items. This allows us to identify high-confidence positive pairs endorsed by both collaborative filtering patterns and LLM-based reasoning. By centering contrastive learning on these consensus pairs and applying a weight-aware reinforcement mechanism during training, CCLRec significantly amplifies the contribution of high-quality consensus features during training. Experiments across multiple public benchmarks show that CCLRec consistently outperforms state-of-the-art methods on key metrics, demonstrating the effectiveness of our consensus-aware design.

Abstract:
Testtime adaptation (TTA) adapts pretrained models to test data on-the-fly. Current TTA methods have focused on what to adapt: lightweight domain-aware components (prompts, normalization statistics) updated with consistency-aware self-supervised losses. This work investigates the more fundamental yet underexplored optimization process, providing insights and guidelines on how to appropriately update models for TTA. By analyzing the optimization error during TTA, we identify a pivotal stability-plasticity trade-off: the model should adapt to novel distributions while retaining learned knowledge, which motivates our design of a CONfidence-and-Gradient-Aware scheduler (CONGA) to constrain model learning rate (LR) within an adaptive exploration interval. For each iteration, the lower bound encourages model exploration on informative confident samples, while the upper bound prevents aggressive overfitting to noisy optimization gradients. Based on our theoretical findings, an adaptation-progress-conditioned cosine decay function decides the specific LR within the interval. As an LR scheduler, CONGA is naturally applicable on existing TTA methods as a plug-in module, introducing little computation overheads. Extensive experiments and analysis demonstrate the superiority and validness of CONGA.

Abstract:
Scaling up visual autoregressive models improves generation quality but incurs substantial inference costs. Mixtureof-Experts (MoE) architectures mitigate this issue through sparse activation and have proven effective in large language models. However, training MoE models from scratch remains prohibitively expensive, and dense-to-MoE conversion for visual autoregressive models is still underexplored. To enablelow-cost and high-quality dense-to-MoE conversion, we proposePrism-MoE, an efficient framework for transforming pretrained dense visual autoregressive models into sparse MoE models. Prism-MoE consists of two key components. First, we introduce trajectory-consistent Initialization, which formulates expert initialization as a principled decomposition problem and preserves the generation trajectory of pretrained models. Second, we propose a confidence-adaptive sparse fine-tuning framework that aligns expert specialization with the information density of visual tokens via confidence-aware routing supervision. Experiments show that Prism-MoE achieves dense-to-MoE conversion with less than10%of the standard training budget, while maintaining generation quality comparable to dense baselines with only37.5%active parameters.

Abstract:
We address the problem of 4D reconstruction from partial point cloud sequences, where observations from depth sensors are incomplete, unordered, and lack explicit point correspondence over time. Recovering coherent 4D geometry in this geometryonly setting is challenging due to missing observations and ambiguous dynamics. While recent progress has largely been driven by image-based methods, existing point-based approaches typically focus on single-object scenarios, assume relatively complete inputs, and rely on explicit correspondence. To mitigate these limitations, we propose DynaTok, a point-based framework for correspondence-free 4D reconstruction from partial point cloud sequences that operates without images. DynaTok encodes each frame into compact latent tokens, aggregates incomplete observations over time with a Transformer-based spatiotemporal encoder, and decouples geometry and motion via a residual token design within a single unified model. Conditioned on the aggregated tokens, a point flow-matching decoder reconstructs complete and temporally consistent 4D point cloud sequences using only point cloud supervision. Experiments on object-level and scene-level benchmarks demonstrate improved reconstruction quality and temporal coherence under partial point cloud observations.

Abstract:
Bokeh rendering and depth estimation share a fundamental optical connection, yet existing methods fail to fully exploit this reciprocity. Conventional bokeh pipelines rely heavily on noisy depth maps that inevitably introduce visual artifacts. Conversely, existing monocular depth models typically follow two flawed paradigms. Generative diffusionbased frameworks often lack consistent metric scale. Meanwhile, feed-forward metric depth models frequently fail in textureless or distant regions where defocus blur can provide geometric information. We propose BokehDepth, a two-stage framework that treats synthetic defocus as a supervision-free geometric signal. In the first stage, a physically grounded generative model produces calibrated bokeh stacks from a single sharp input without requiring prior depth input. Subsequently, a lightweight defocus-aware aggregation module integrates these stacks into the encoder of a depth estimation framework. This mechanism allows the model to extract consistent geometric features from the defocus dimension while keeping the decoder architecture unchanged. Experiments demonstrate that BokehDepth achieves superior visual bokeh fidelity compared to depth-dependent rendering baselines and consistently enhances the metric accuracy of state-of-the-art monocular depth models.

Abstract:
Forecasting by generating RGB videos is computationally expensive, often physically implausible, and not directly actionable, since it requires translation into decisionmaking signals. Direct modality forecasting (e.g., predicting future segmentation) produces directly actionable outputs but fails to scale due to the need for labels. Vision Foundation Model (VFM) features offer the best of both worlds: they contain actionable semantic and geometric information that can be easily decoded from the predicted features, while requiring no labels on the downstream task for training. However, almost all existing VFM feature forecasting methods regress future features from fixed number of input frames, with evaluation predominantly on short horizons matching the training setup. We firstly show that existing regression methods struggle with forecasting from partial observations because they average over multiple plausible futures, failing to capture uncertainty in the future given the past. Interestingly, naively replacing deterministic forecasting with generative flow matching does not match the sample quality of the regression model, despite being a mathematically appropriate formulation of the forecasting task. In this work, we explain why this is the case, and we show how to optimally generate foundation model features. Our key insight is that generative modeling of VFM features requires (auto)encoding into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used alternatives, such as uncompressed feature diffusion or PCA-based compression, both for forecasting and other applications, such as image generation. Our results suggest that conditional generation of (compressed) VFM features offers a promising and scalable foundation for future scene forecasters.

Abstract:
Decisionbased image attacks commonly rely on zeroth-order (ZO) Monte Carlo probing to estimate decision-boundary normals and iteratively refine adversarial perturbations to minimize the \ell_2 norm. We theoretically analyze and empirically demonstrate an intrinsic inefficiency arising from heterogeneous input sensitivity, where only a small subset of coordinates strongly affects the target model’s predictions, while most others have a negligible effect. Empirically, with one-bit feedback and a limited query budget, updates on low-sensitivity coordinates are overwhelmed by initialization and sampling noise, preventing their perturbations from exhibiting consistent improvement. By modeling ZO refinement as a stochastic dynamical system, we formally characterize its asymptotic behavior: the perturbation aligns (in expectation) with the normal and its coordinate-wise magnitudes encode a local sensitivity ranking. However, this stationarity does not generally yield \ell_2-optimal perturbations under nonlinear boundaries. Building on this observation, we propose a novel and effective algorithm, Sensitivity-Aware Rescaling (SAR), that leverages this sensitivity signal to infer an importance map from the current best perturbation, then progressively suppresses low-importance regions through a coarse-to-fine schedule to reduce the \ell_2 norm. Extensive experiments show that SAR achieves consistent improvements in perturbation norm, attack success rate, and visual imperceptibility. The code is available at https://anonymous.4open.science/status/SAR-436.

Abstract:
This paper investigates fairnessaware graph adaptation, aiming to transfer knowledge from a labeled source graph to an unlabeled target graph while explicitly accounting for fairness. Most prior methods rely on adversarial learning to learn invariant graph representations of sensitive attributes. However, these approaches assume that sensitive attributes of the target domain are available, which often fails in real-world deployments. To address this limitation, we propose \underlineCausality-attended Repres\underlineentation Dientang\underlinelement with Structural A\underlinelignment (CELL) for fairness-aware graph adaptation without requiring target sensitive labels. The key idea of CELL is to build a causal graph that captures the underlying graph-generation mechanism and guides representation disentanglement toward improved fairness. In particular, CELL employs a sensitive encoder and a causal encoder to extract sensitive and causal factors respectively. We encourage disentanglement by minimizing the mutual information between causal and sensitive representations, considering the conditional distribution. To leverage unlabeled target data, we further generate pseudo-labels for both target task labels and sensitive attributes, and use similarity relations to derive unbiased node representations. Finally, to further mitigate domain shift, we build a fairness-aware bipartite graph that provides additional structural supervision for cross-domain alignment. Extensive experiments on benchmark datasets demonstrate that CELL consistently outperforms strong baselines in both predictive performance and fairness.

Abstract:
Policyplanner bootstrapping has emerged as a powerful paradigm in model-based reinforcement learning (MBRL). We formalize this process as a dual policy improvement mechanism synergizing: (i) exploitative improvement via off-policy Q-maximization, and (ii) lookahead improvement via planner alignment. While we theoretically prove that these improvements anchor to the same optimum, practical training process inevitably encounters gradient disagreement. Exacerbated by approximation inaccuracies and non-stationary data, this disagreement induces destructive interference in policy updates, destabilizing the bootstrapping loop and leading to suboptimal convergence. To address this, we propose harmonized dual policy improvement (HDPI), a gradient-level framework that reconciles exploitative and lookahead improvements through a harmonic optimization scheme. This scheme effectively maximizes the worst-case inner product between the harmonized update and the original gradients, ensuring directional consistency and stabilizing policy evolution. Extensive empirical evaluations on 14 challenging tasks from the DeepMind Control Suite and the Humanoid-Bench demonstrate that HDPI significantly enhances training stability and asymptotic performance, outperforming a wide range of strong baselines.

Abstract:
Multimodal unlearning aims to eliminate specific data from pretrained multimodal models, which offers significant advantages in data privacy and model efficiency. Current methods struggle to achieve the desired properties of effectiveness, reliability and locality, due to the complex interdependency of unimodal and multimodal knowledge. By introducing a causal perspective, we propose multimodal unlearning with decoupled knowledge components. To promote finegrained understanding of multimodal context, we introduce Multimodal Variational Inference (MVI) to infer modal-specific and -consistent factors with incomplete sample observation. With foundation of decoupled knowledge, we propose contrastive semantic editing to regulate multimodal unlearning towards refined forgetting. Experiments on privacy- and copyright-sensitive scenarios validate effectiveness of our method across multiple scenarios, ensuring the unlearned model maintains high reliability and locality.

Abstract:
Adversarial imitation learning (AIL) demonstrates superior expert sample efficiency compared to behavioral cloning (BC), yet requires substantial online environment interaction. While recent empirical work has explored initializing AIL algorithms with BCpretrained policies to address this limitation, a rigorous theoretical understanding of pretraining's role in AIL remains lacking. This paper provides a systematic theoretical analysis and develops principled pretraining algorithms for accelerating AIL. We first analyze AIL with policy pretraining alone, identifying reward error as the dominant error source and thereby uncovering a critical yet previously unexplored gap: the omission of reward pretraining. Leveraging this insight, we introduce a principled policy-reward co-pretraining mechanism through reward-shaping analysis. Our analysis reveals a fundamental connection between expert policies and shaping rewards, naturally motivating CoPT-AIL, an approach that jointly pretrains both policies and rewards through a single BC procedure. We prove that CoPT-AIL achieves an improved imitation gap bound compared to standard AIL, establishing the first theoretical guarantee for pretraining benefits in AIL. Experimental results validate CoPT-AIL's superior performance over existing AIL methods.

Abstract:
This paper introduces a nonparametric test for conditional mean independence between a manifold‑valued Y and Euclidean predictors X. The test is built on a new measure called the Manifold Martingale Difference Divergence (MMDD), which characterizes conditional mean dependence by projecting observations onto the tangent space via the logarithmic map. We provide an empirical estimator for the MMDD, establish its asymptotic null distribution, and implement a wild bootstrap procedure for finite‑sample inference. Simulations on three representative manifolds demonstrate that the proposed test maintains correct size under the null even when the distribution of Y depends on X, in contrast to the severe size distortion exhibited by the distance covariance (dCov) test. At the same time, it achieves competitive power across a range of alternatives. An application to real data illustrates its practical utility.

Abstract:
Clustering is commonly compared through leaderboards that collapse performance into a single aggregate ranking. Such summaries obscure why methods succeed, which data properties align with failure, and how conclusions shift under representation changes and realistic tuning constraints. We present CHB, a diagnostic toolkit for hardnessaware clustering evaluation. CHB maps each dataset--representation pair to an interpretable hardness fingerprint capturing (i) separation, (ii) cohesion and scale heterogeneity, and (iii) topology through scalable persistent-homology summaries. Using this diagnostic space, CHB evaluates clustering algorithms under standardized, compute-aware tracks. Conditioning results on hardness coordinates turns comparison into diagnosis: across a broad range of datasets and their representations, CHB reveals reproducible structural regimes, uncovers regime-dependent ranking reversals across method families, and surfaces robustness signatures, including topology-linked breakdowns. CHB further enables representation auditing by attributing gains to measurable shifts in the hardness fingerprint rather than just external performance changes. We release CHB as an open, extensible artifact for evaluating new clustering methods and embeddings within a shared diagnostic framework.

Abstract:
Quantifying the distance between datasets is a fundamental question in mathematics and machine learning. We propose magnitude distance, a novel distance metric defined on finite datasets using the notion of the \emphmagnitude of a metric space. The proposed distance incorporates a tunable scaling parameter, t, that controls the sensitivity to global structure (small t) and finer details (large t). We prove several theoretical properties of magnitude distance, including its limiting behavior across scales and conditions under which it satisfies key metric properties. In contrast to classical distances, we show that magnitude distance remains discriminative in highdimensional settings when the scale is appropriately tuned. We further demonstrate how magnitude distance can be used as a training objective for push-forward generative models. Our experimental results support our theoretical analysis and demonstrate that magnitude distance provides meaningful signals, comparable to established distance-based generative approaches.

Abstract:
Training data reconstruction from KKT conditions has shown striking empirical success, yet it remains unclear when the resulting KKT equations have unique solutions and, even in identifiable regimes, how to reliably recover solutions by optimization. This work hereby focuses on these two complementary questions: identifiability and optimization. On the identifiability side, we discuss the sufficient conditions for KKT system of twolayer networks with polynomial activations to uniquely determine the training data, providing a theoretical explanation of when and why reconstruction is possible. On the optimization side, we introduce sample splitting, a curvature-aware refinement step applicable to general reconstruction objectives (not limited to KKT-based formulations): it creates additional descent directions to escape poor stationary points and refine solutions. Experiments demonstrate that augmenting several existing reconstruction methods with sample splitting consistently improves reconstruction performance.

Abstract:
Token merging accelerates Transformer inference by clustering similar tokens to reduce sequence length (retention ratio r), but distorts attention outputs, inducing covariate shift in residual streams and performance collapse under high compression. Existing heuristics, such as proportional attention, mitigate mild compression effectively but degrade sharply at aggressive ratios due to unaddressed energy drift and biased attention distributions. We reframe token merging as a statistical reconstruction problem in high dimensions and introduce an asymptotic radialangular decomposition of the reconstruction error, an analytical framework decoupling magnitude and distributional distortions. Minimizing this decomposed risk under minimal assumptions of finite second moments and variance stationarity yields closed-form optimal corrections governed by a single scaling factor \sqrtr: scaling merged values and shrinking merged logits toward the cluster-size prior. This calibrates both energy balance and distributional fidelity. Extensive experiments on vision Transformers demonstrate superior accuracy and robustness across compression levels.

Abstract:
Recent work asks whether large language models (LLMs) condition their reasoning on explicit rules rather than statistical regularities from pretraining. Program execution provides a canonical instance: formal semantics define behavior through symbolic transition rules that can be systematically altered under distribution shift. We investigate whether LLMs can condition their reasoning on formal semantics through program execution and introduce PLSEMANTICSBENCH, pairing featherweight C programs with two se- mantic systems—small-step operational seman- tics and K semantics—and probing four capabil- ities: composing rules for final states, selecting rules when state is unmutated, sustaining such conditioning over long traces, and following sup- plied rules under novel semantics. To decou- ple semantic reasoning from syntactic familiarity, we redefine familiar operators to induce symbol- meaning conflict and introduce novel symbols de- fined only through the supplied rules, and stress- test models on Human-Written, LLM-Translated, and Fuzzer-Generated splits with increasing struc- tural complexity. Across 11 frontier LLMs, strong final-state accu- racy under standard semantics (up to 90%) drops sharply—by as much as 40–60% points—under semantic mutations and increasing structural com- plexity. Only a handful of models achieve non- zero long-horizon conditioning accuracy, and even the best systems reach just 35%. Together, these results suggest that contemporary LLMs of- ten rely on pretrained lexical associations rather than systematically conditioning on supplied for- mal rules. Anonymized PLSEMANTICSBENCH is available at https://huggingface.co/dat asets/LambdaadbmaL/PLSemanticsBench.

Abstract:
Rectifiedflow-based diffusion transformers, particularly FLUX, have demonstrated outstanding performance in high-quality image generation. However, achieving fast and accurate inversion—transforming images back to latent noise for faithful reconstruction and editing—remains a challenging bottleneck due to the discretization errors of linear solvers. This paper introduces SlerpFlow, a straightforward yet highly effective zero-shot approach that unlocks the full potential of FLUX for high-fidelity inversion and editing. Unlike existing approaches (e.g., RF-Solver) that rely on complex numerical approximations such as high-order Taylor expansions to correct trajectory errors, we present a geometric view based on the Manifold Hypothesis: the empirically observed trajectory curvature is not a numerical artifact, but rather serves as a necessary ``centripetal force" that constrains the flow to remain on the data manifold. Guided by this insight, SlerpFlow integrates Spherical Linear Interpolation (Slerp) to rectify flow velocity directions on the hypersphere, strictly adhering to the intrinsic curvature of the latent space. Crucially, by caching the corrected velocity for subsequent steps, SlerpFlow achieves high-precision inversion while maintaining the computational efficiency of a first-order Euler solver. Extensive experiments on FLUX-based reconstruction and editing tasks demonstrate that our method delivers superior structural consistency and lower reconstruction errors compared to state-of-the-art baselines without requiring additional training.

Abstract:
Highperformance GPU kernels are critical for reducing the exponentially growing computational costs of large language models (LLMs), but their development heavily relies on manual tuning by domain experts. While recent advances in LLM-based approaches show promise for automating kernel generation, they still struggle to achieve both correctness and high performance. This limitation primarily arises from the lack of domain-specific optimization guidance, hindering effective exploration of the optimization space. We propose EGG, an \underlineExpert-\underlineGuided Agent Framework for Kernel \underlineGeneration, which incorporates expert optimization principles to guide LLMs’ decisions. Inspired by expert workflows, we decompose kernel generation into two hierarchical stages: 1) algorithmic structure design, which establishes a high-quality computational structure foundation; 2) hardware-specific tuning, which performs targeted adjustments through parallel mapping, tensor tiling, and memory optimization. This staged decomposition defines explicit optimization objectives, structuring the design space to achieve progressive refinements. To this end, a stage-aware multi-agent collaboration mechanism is designed for inter and intra-stage context management, ensuring stable optimization trajectories. Experiments on KernelBench and real-world workloads show that EGG achieves a 2.13× average speedup over PyTorch, outperforming existing agent-based and RL-based approaches.

Abstract:
Largescale learning systems often face the challenge of balancing multiple, potentially competing objectives, such as fairness, accuracy, and latency. While recent work has formalized this as an optimization problem over binary states, many real-world control parameters—such as fairness thresholds, diversity mixing rates, or resource budgets—are continuous. In this work, we extend the framework to continuous state spaces. We model the problem as minimizing a sum of linear objectives subject to movement costs that penalize system instability. We capture the local structure of the objectives using a dependency graph (or factor graph), where each objective is determined by a subset of the state attributes. To address the tension between exploration and stability, we propose Lazy Graph-LinUCB, an algorithm that performs lazy updates to minimize switching costs while maintaining near-optimal regret. Beyond stability, we introduce three advanced mechanisms to exploit the underlying graph structure: (1) an asynchronous update schedule that eliminates synchronization overhead in sparse graphs; (2) an adaptive algorithm that learns the graph structure from data; and (3) a joint estimator that leverages data sharing among correlated objectives to significantly tighten regret bounds. Empirically, we demonstrate that these structural exploitations reduce movement costs by more than a factor of three in heterogeneous systems while maintaining similar cumulative losses.

Abstract:
Instruction tuning aligns large (multimodal) language models with diverse user intents, but scaling to heterogeneous mixtures is hindered by (i) gradient interference that causes negative transfer and stiff highcurvature dynamics, and (ii) bandwidth-heavy synchronization that is often impractical on fragmented compute. We propose MERIT, a decentralized, merge-ready pipeline that splits mixtures before fine-tuning. Starting from a merge-ready initialization, MERIT estimates dataset-level gradients, builds a cosine-similarity conflict matrix, applies PCA to extract dominant conflict axes, and partitions datasets accordingly. Each partition is fine-tuned independently with no inter-partition communication, and merged once via one-shot token-weighted averaging. A local quadratic flat-basin analysis shows that merging acts as a curvature-weighted spectral filter, and that PCA-aligned splitting amplifies cancellation of high-curvature disagreement components. On Qwen2.5-VL-3B fine-tuned on 136 Vision-FLAN tasks, MERIT improves the overall benchmark average from 54.7 (centralized joint training) to 57.0 while enabling communication-free parallel fine-tuning. We further validate MERIT at 7B scale on a 1.6M-example mixture and on text-only instruction mixtures.

Abstract:
Predeployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates, while maintaining calibration. Are frontier models even capable of that? We prompt GPT-5 and Qwen-3 families to take a target action at low probabilities (e.g. 0.01%), either given directly or requiring derivation, and evaluate their calibration (i.e. whether they perform the target action roughly 1 in 10,000 times when resampling). We find that frontier models are surprisingly good at this task. If there is a source of entropy in the context (such as a UUID), they maintain high calibration at rates lower than 1 in 100,000 actions. Without external entropy, some models can still reach rates lower than 1 in 10,000. Larger models achieve good calibration at lower rates. Successful low-rate strategies require explicit Chain-of-Thought (CoT) reasoning, so malicious models attempting this approach could currently be caught by a CoT monitor. However, scaling trends suggest future evaluations may be unable to rely on models’ lack of target rate calibration, especially if CoT is no longer legible.

Abstract:
Multidomain graph pre-training is a crucial step in constructing foundational graph models with cross-domain generalization capabilities. However, existing methods predominantly rely on jointly training all source domain graphs, resulting in high computational costs. Furthermore, it remains unclear whether all source domain graph data contribute equally to effective transfer. This paper empirically reveals significant data redundancy in multi-domain graph pre-training. Based on this finding, we propose the Multi-domain Graph Pre-training Framework, MDGMIX, which combines boundary-aware subgraph mixing with hierarchical discrimination. By selecting boundary nodes to construct challenging mixed-domain subgraphs, MDGMIX employs coarse-grained domain discrimination and fine-grained domain decomposition losses to decouple shared patterns from domain-specific patterns. During adaptation, MDGMIX employs a lightweight prompt weighting mechanism to transfer source domain knowledge. Extensive experiments demonstrate that MDGMIX consistently outperforms strong baselines in few-shot classification tasks while exhibiting superior time and memory efficiency.

Abstract:
The automated design of agentic systems offers a promising pathway for scaling large language models (LLMs) beyond singleagent reasoning. While prior work has advanced task performance through handcrafted or automatically generated multi-agent workflows, robustness is often treated as an afterthought, leaving systems vulnerable to external adversaries and internal failures. We propose AutoRAS, a framework for the Automated design of Robust Agentic Systems. AutoRAS formulates system design as generating a sequence of symbolic primitives that jointly encode structural connectivity and behavioral actions, and learns to optimize this sequence using execution-derived safety signals and flow-based sequence-level objectives. Extensive experiments show that AutoRAS achieves the best performance in both vanilla and adversarial settings, with the smallest performance degradation under attacks. Further analyses demonstrate strong transferability, stable optimization behavior, stability across primitive sets, and favorable cost trade-offs. Our code is available atthis link.

Abstract:
People with limb differences often face significant challenges in accessing inclusive AI services, largely due to the lack of structured, highquality resources centered on disability contexts. In this work, we introduce a limb-deficiency aware body-centric learning and evaluation paradigm that involves (i) a large-scale limb-aware vision–language dataset and evaluation benchmark for multimodal reasoning, and (ii) a model adaptation strategy for Vision-Language Models (VLM) in limb-difference contexts. Specifically, we first collect limb-difference data covering all eight limb-deficiency types across diverse real-world scenarios. The data are systematically organized into 96 limb-affected human action categories and 68 medical-functional classes defined by the World Health Organization (WHO). Then, we curate an expert-annotated vision–language dataset for limb-aware multimodal understanding, named Inclusive VQA for Limb Deficiency (IVQA-LD). IVQA-LD comprises 80K VQA pairs spanning eight core tasks including visual grounding, quantitative reasoning, functional semantic classification, and instructional text generation. We benchmark state-of-the-art VLMs on IVQA-LD and find that they consistently struggle across all tasks, exposing substantial deficiencies in limb-aware perception and reasoning. To address this, we further propose a Body-centric Structure-aware Initialization (BSI) strategy that aligns model representations with limb-specific semantics. With BSI, VLMs fine-tuned on IVQA-LD achieve significant performance improvements across all the tasks. We will publicly release the dataset to support future research.

Abstract:
Large VisionLanguage Models (LVLMs) have demonstrated remarkable capabilities in multimodal reasoning. However, prevailing reinforcement learning (RL) paradigms lack explicit counterfactual enhancement and causal learning mechanisms. This fundamental deficiency results in severe grounding failures, manifesting as a tendency to ignore visual evidence in favor of language priors or exhibiting hallucination drift during long chain-of-thought reasoning. To address this root cause, we propose CounterFactual Policy Optimization (CFPO), a novel framework that enforces causal consistency between visual perception and textual reasoning. CFPO introduces a cross-modal counterfactual enhancement mechanism, which regularizes the policy by maximizing the discrepancy between the model’s predictions and those from a counterfactual state where critical visual cues are suppressed. This approach seamlessly integrates with standard algorithms like GRPO and DAPO without requiring external reward models or additional supervision. Extensive experiments demonstrate that CFPO significantly improves reasoning fidelity, achieving consistent gains of 3.17%-6.25% over standard RL baselines and 1.32%-2.13% over the state-of-the-art perception-aware method (PAPO).

Abstract:
Deep learning models for supervised learning on tabular data are rapidly improving. Notably, ensembles (mixtures of multiple models) often play an important role in achieving top performance, which motivates designing ensemblefirst systems rather than treating ensembling as an ad hoc trick. In this work, we present TabPack --- a new ensembling approach that packs many base model-optimizer pairs with different hyperparameters into a single neural network and a single optimizer. The base model-optimizer hyperparameters are sampled randomly, after which all base models are trained in parallel, and the final ensemble is built on the fly during training. As a result, TabPack produces powerful ensembles in a single run, with substantial efficiency gains over traditional approaches. With its remarkable efficiency, strong performance on public benchmarks, and reduced reliance on traditional hyperparameter tuning, TabPack becomes an appealing solution for practitioners, and suggests a new avenue for designing better tabular deep learning systems.

Abstract:
Transformer sequence models such as Decision Transformer can learn strong offline policies from logged trajectories, but they can suffer from causal confusion: reliance on spurious correlations that predict reward in the data but do not reflect the true causal mechanisms of the environment. We propose CSET (Counterfactual State Editing Transformer), which improves robustness in strictly offline reinforcement learning without learning environment transition dynamics. CSET first fits a causal reward model as a conditional variational autoencoder to infer a posterior over reward disturbances for each transition. Conditioning on the factual action and a sampled disturbance, a counterfactual state generator proposes a minimally edited state whose predicted reward matches the factual reward; a normalized moveband constraint and an acceptance gate enforce state plausibility and reward consistency. We then augment trajectories by replacing only the observation token with the edited state while keeping the next observation factual, so the policy is not trained on synthetic successor transitions. On the model side, CSET uses a causally structured hybrid transformer: separate convolutional encoders process return-to-go, state, and action streams for local temporal structure, and a final attention block is softly supervised so action prediction focuses on its direct causal parents (state and return-to-go). Experiments on D4RL locomotion and AntMaze tasks and on offline recommendation benchmarks show consistent gains over transformer baselines and substantially improved robustness to injected spurious distractors.

Abstract:
While prompt engineering is instrumental in maximizing the capabilities of Large Language Models (LLMs) during inference, the role of prompts during training remains critically underexplored. Prevailing finetuning paradigms typically treat training prompts as mere surface forms, assuming that semantically equivalent instructions yield identical learning outcomes. However, we reveal that this equivalence is deceptive: while paraphrased prompts often lead to comparable in-task performance, they induce drastically different cross-task impacts regarding catastrophic forgetting and generalization. Crucially, these impacts are positively correlated across tasks, indicating the existence of superior prompts that consistently yield better performance. Furthermore, we discover that these superior prompts can be robustly identified by task loss prior to learning. Leveraging these insights, we introduce State-Adaptive Prompt Optimization (SAPO), a lightweight yet effective training strategy that shifts task formulation from a static input to a dynamic, state-adaptive variable. Comprehensive experiments on diverse benchmarks confirm its effectiveness, which significantly mitigates forgetting while improving generalization, achieving substantial performance gains over state-of-the-art methods. These results provide insights into how training prompts shape learning dynamics and offer a practical recipe for robust fine-tuning.

Abstract:
Transformer architectures have established strong baselines in time series forecasting, yet they typically rely on positional encodings that assume uniform, indexbased temporal progression. However, real-world systems, from shifting financial cycles to elastic biological rhythms, frequently exhibit ``time-warped'' dynamics where the effective flow of time decouples from the sampling index. In this work, we first formalize this misalignment and prove that rotary position embedding (RoPE) is mathematically incapable of representing non-affine temporal warping. To address this, we propose Symplectic Positional Embeddings (SyPE), a learnable encoding framework derived from Hamiltonian mechanics. SyPE strictly generalizes RoPE by extending the rotation group \mathrmSO(2) to the symplectic group \mathrmSp(2,\mathbbR), modulated by a novel input-dependent adaptive warp module. By allowing the attention mechanism to adaptively dilate or contract temporal coordinates end-to-end, our approach captures locally varying periodicities without requiring pre-defined warping functions. We implement this mechanism in StretchTime, a multivariate forecasting architecture that achieves state-of-the-art performance on standard benchmarks, demonstrating superior robustness on datasets exhibiting non-stationary temporal dynamics.

Abstract:
Sparse Mixtureof-Experts (SMoE) has emerged as a powerful conditional computation paradigm for scaling Large Language Models efficiently. While recent efforts have begun exploring SMoE architectures in recommender systems, achieving commensurate efficiency-performance tradeoffs has proven considerably more challenging than in language modeling. We attribute this difficulty to two structural impediments: (i) conventional token-level routing mechanisms poorly align with the fundamental objective of user-item relevance prediction; and (ii) relevance signals in recommendation models emerge through distributed, multi-stage interactions rather than through a single, consistently traversed transformation, limiting the effectiveness of standard expert selection strategies. To address these challenges, we propose the Massive Routing Network (MRN), a scalable sparse framework that explicitly aligns conditional computation with the unique computational topology of recommendation models. Extensive evaluations on public benchmarks and billion-user-scale industrial datasets demonstrate that MRN consistently outperforms competitive dense and sparse baselines under comparable compute budgets. Crucially, MRN overcomes commonly observed performance saturation and exhibits much more favorable scaling laws than prior state-of-the-art dense and sparse baselines.

Abstract:
Prevailing autonomous agents are often constrained by a single, predefined action space, which limits their generalization capabilities across diverse tasks and can introduce compounding errors through decoupled policy execution. To address these limitations, we introduce the Deep Hierarchical Agent (DeepHA), a unified architecture that operates across a mixture of heterogeneous action spaces, flexibly generating actions ranging from highlevel semantic skills to low-level motor controls. We further propose a Chain-of-Action (CoA) reasoning framework, which enables the agent to use higher-level abstract actions as structured `thoughts' to guide the generation of more granular, subsequent actions. To manage the computational demands of this deep reasoning in long-horizon tasks, we develop a memory-efficient mechanism that dynamically compresses historical context and leverages Key-Value (KV) caching, reducing context length by approximately 75% without sacrificing performance. We conduct extensive evaluations on a new, large-scale benchmark of over 800 diverse Minecraft tasks. Results show that DHA significantly outperforms prior methods, establishing a new state-of-the-art and demonstrating superior generalization, particularly in complex, multi-step planning tasks. Our work presents a novel, unified framework for building more capable and efficient autonomous agents.

Abstract:
Conventional timeseries discriminative forecasting relies on point-wise regression, which inherently induces over-smoothing and fails to capture stochastic volatility in complex systems. While first-order generative flow matching methods mitigate this issue, they ignore system inertia, resulting in phase-space ambiguities and high sensitivity to noise. We introduce KineFlow, a generative time-series forecasting framework that augments flow matching with a phase-space Neural Acceleration Field, treating exogenous inputs as driving forces that produce gradual momentum shifts rather than abrupt state perturbations. This second-order formulation serves as a structural filter via double integration, suppressing high-frequency noise and producing robust, physically consistent predictions. Extensive experiments on six real-world benchmarks demonstrate that KineFlow achieves an average 15% MSE improvement over discriminative baselines and an 8% gain in CRPS compared to state-of-the-art generative methods.

Abstract:
Time series modeling increasingly demands highquality supervision, yet target observations remain scarce—exogenous inputs are broadly available, but target measurements are often unavailable due to cost, infrastructure, or accessibility constraints. Can models trained on observed locations reconstruct target time series where measurements have never been collected? We term this zero-shot time series reconstruction. A naive approach—directly mapping exogenous inputs to targets—can yield predictions at unobserved locations, but without target signals, such models fail to capture the intrinsic dynamics of the target variable, producing overly smooth outputs that underestimate extremes. This reveals systematic errors that call for explicit modeling and calibration. We propose ZeroDiff, which constructs an informed prior from exogenous variables alone, then learns to calibrate reconstruction errors through diffusion—training on observed locations and generalizing to unobserved ones. Experiments across diverse real-world datasets demonstrate significant improvements over existing approaches. Our code is available at https://anonymous.4open.science/r/ZeroDiff/.

Abstract:
Multiagent LLM debates achieve strong performance on decision-making tasks as well as problem-solving benchmarks, yet their safety and fairness risks remain poorly understood. Notably, interaction can amplify the biases of single LLMs, raising concerns for real-world deployment. We identify the emergence of collective (often biased) norms in multi-agent LLM debates and show that noise (e.g., LLM sampling temperature) is a key driver. To explain this, we propose an analytical framework drawing on physics-inspired theoretical models of social dynamics. We predict a phase transition to collective bias when conformity surpasses a critical threshold given the LLMs' initial bias and debate noise. We test the theoretical predictions through controlled experiments and observe a finite-size crossover consistent with an underlying phase transition. We further find that agent heterogeneity suppresses emergence by smoothing (rounding) this transition. Finally, we show that these insights generalize to realistic decision-making tasks, including investment decisions and LLM-as-a-judge evaluation.

Abstract:
Visual perception in modern VisionLanguage Models (VLM) is constrained by a fundamental perceptual bandwidth bottleneck: a broad field-of-view inevitably sacrifices the fine-grained details necessary for complex reasoning. Inspired by the classical paradigms of active vision and information foraging, we frame overcoming this limitation as a sequential decision-making process. We formalise this process through the lens of the sequential Bayesian optimal experimental design (S-BOED) problem. While exact Bayesian inference is intractable in continuous gigapixel spaces, we derive principled yet tractable approximations that balance spatial coverage against resolution. To validate this framework, we present a training-free inference strategy as a practical instantiation of the S-BOED objective for agents equipped with multiple vision tools. Designed as a flexible template, this strategy accommodates arbitrary optimisation algorithms—ranging from efficient greedy sampling to look-ahead planning—to approximate the optimal design. Empirical evaluations on gigapixel-level benchmarks demonstrate that our approach further boosts the performance of state-of-the-art models, significantly outperforming standard baselines and effectively narrowing the gap toward human-annotated oracles.

Abstract:
Realworld agricultural counting often operates in the extreme regime of Dense and Indiscernible Object Counting (DIOC), where targets are tiny, clustered, and highly camouflaged. To facilitate research in this domain, we introduce DIOCblueberry, a large-scale benchmark that pushes the boundaries of visual perception. Unlike general datasets with salient objects, DIOCblueberry features extreme occlusion and camouflage. Compared to the popular FSC147 benchmark, it contains 1.9× more instances per image (avg. 108) with an average box pixel ratio that is 7.9× smaller, serving as a rigorous testbed for model robustness. Standard counting methods struggle in these scenarios due to severe visual ambiguity and scale mismatch. To address this, we propose MaskCount, a coarse-to-fine framework that incorporates semantic guidance. MaskCount leverages Vision-Language Models (CLIP) to generate pseudo segmentation masks for background suppression and employs a contrastive loss to maximize feature discriminability between fruits and foliage. Additionally, we design an edge-aware cropping mechanism to resolve boundary truncation in dense clusters. Extensive experiments demonstrate that MaskCount achieves a new state-of-the-art, reducing MAE and RMSE by 49.16% and 70.50% respectively on DIOCblueberry, with strong generalization to other agricultural scenes.

Abstract:
Recent neural methods, such as 3D Gaussian Splatting, have achieved stateof-the-art rendering quality and speed. However, these methods frequently encounter challenges in regions with overlapping Gaussians, leading to blurring and artifacts in the rendered images. We observed that widely used view-space positional gradients are insufficient for handling such circumstances. To address this, we introduce PCGS, a Patch Comparison Gaussian Splatting method to control the densification of corresponding Gaussians adaptively. Specifically, PCGS divides the rendered image into patches and identifies those with significant errors by comparing the loss between the rendered and ground truth images. Additional densification operations are then applied to the Gaussians in these error-prone regions. Furthermore, to prevent over-densification and redundant Gaussians, we design a Gaussian control strategy to regulate the densification process. Specifically, we set a Gaussian number budget that dynamically changes according to the progress of densification, and sample the Gaussians required for each densification step based on their importance scores. Our method results in significantly fewer artifacts and less blur while maintaining a Gaussian count approximately equal to that of 3DGS. Extensive experiments on multiple standard benchmarks demonstrate the superiority of our approach.

Abstract:
Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. While this capability is essential for scientific reasoning, most existing evaluations of relational reasoning in large language models focus on structured inputs such as tables, graphs, or synthetic relational tasks, and do not isolate the sources of difficulty that arise from higherarity relational binding. We study this problem through the lens ofRelational Complexity (RC), defined as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation. RC provides a principled way to vary reasoning difficulty independently of confounders such as input size, vocabulary, and representational choices. Building on RC, we introduce REL, a generative benchmark framework spanning algebra, chemistry, and biology that varies RC within each domain. Evaluating frontier LLMs, we observe a consistent and monotonic degradation in performance as RC increases, even when the total number of entities is held fixed. This failure mode persists under increased test-time compute and with in-context learning, suggesting a limitation tied to the arity of the required relational binding rather than insufficient inference steps or exposure to examples. Our results identify a well-defined regime of higher-arity reasoning in which current models struggle and motivate revisiting reasoning benchmarks through the lens of relational complexity.

Abstract:
Domain Adaptive Object Detection (DAOD) addresses the challenge of transferring object detectors from labeled source domains to unlabeled target domains. Existing domain adaptation methods primarily rely on feature distribution alignment, which enhances domaininvariant features (statistical invariance) but also inadvertently increases inherent domain-common spurious factors (e.g., common environmental contexts), which act as shortcut features rather than the true causal factors for object classification. We propose Dynamic Causal Refinement (DCR), a novel framework that establishes a closed-loop feedback mechanism between data augmentation and model optimization to progressively refine causal features. Specifically, we design Semantic Prediction Consistency (SPC) to filter domain-specific spurious factors and establish a robust statistical invariance, and Discrepancy-Guided Causal Refinement (DGCR) to actively suppress the dependence on domain-common spurious factors via spectral perturbation for causal refinement. This process encourages the detector to suppress its reliance on shortcut features and instead prioritize semantically meaningful causal representations. Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art counterparts significantly.

Abstract:
Multiagent reinforcement learning (MARL) has received increasing attention for solving complex decision-making tasks. Networked MARL approaches offer a decentralized solution for parameter sharing to accelerate training via value aggregation. However, existing federated aggregations rely on convex averaging that may fail to converge to global optima and cause learning rollback in the online learning setting. In this paper, we formally characterize the learning rollback phenomenon arising from aggregating value estimates with unequal uncertainty under heterogeneous online update depths. We propose a novel adaptive global consensus (AGC) mechanism for Q-value aggregation in decentralized MARL policy evaluation, which dynamically adjusts aggregation weights based on agents’ uncertainty. We establish theoretical guarantees on accelerated convergence and bounded learning variance with empirical validations, advancing the state-of-art MARL theory.

Abstract:
Scaling Diffusion Transformers to generate highresolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation quality is determined not by the sparsity ratio itself, but by how well the sparse mask aligns with the tile-wise geometry of full attention. Based on this insight, we propose Veda, a distilled sparse attention framework that formulates tile selection as an explicit reconstruction problem from full attention. Veda integrates statistics-aware tile scoring with head-aware tiling to reduce estimation error and structural mismatch, enabling aggressive sparsity. A hardware-efficient tile-skipping kernel converts theoretical sparsity into practical wall-clock speedups. Experiments on large video diffusion models, including Waver and Wan, demonstrate substantial acceleration without quality degradation. To generate 720P 10-second videos on Waver-T2V-12B, Veda achieves a 5.1× end-to-end speedup and a 10.5× self-attention speedup, reducing attention overhead from 92% to 50%. Notably, the gains increase with sequence length, indicating that Veda scales favorably with spatiotemporal resolution across models.

Abstract:
While Diffusion Transformers (DiTs) have revolutionized highfidelity image synthesis, the prohibitive computational costs of training at ultra-high resolutions necessitate robust inference-time extrapolation. Existing extrapolation methods typically operate under a scale-agnostic assumption, treating the denoising dynamics identically across resolutions. In this work, we identify a critical oversight in this paradigm: the spectral evolution of the diffusion process, transitioning from low-frequency structural construction to high-frequency texture refinement, is inherently scale-dependent. Consequently, applying a uniform strategy across scales causes a spectral misalignment, manifesting as structural collapse or textural degradation. To bridge this gap, we introduce SigMa (\sigma), a training-free framework that utilizes Sigmoid Modulation for scale-adaptive calibration of the extrapolation process. SigMa orchestrates the spectral evolution via a parameterized schedule with two core mechanisms: Decoupled Geometric Center Alignment, which synchronizes the transition timing to secure global structure, and Iso-Variance Rate Adaptation, which scales the transition velocity to ensure a smooth feature handover. Extensive experiments demonstrate that SigMa effectively rectifies spectral deviations, enabling training-free extrapolation up to 16 megapixels and achieving state-of-the-art performance on standard benchmarks.

Abstract:
Formal verification provides strong guarantees of software correctness, but its adoption is limited by the high cost of writing precise formal specifications. While recent large language models (LLMs) have demonstrated impressive capabilities in theorem proving and verified code generation, how powerful they truly are in generating program specifications remains unclear. Existing evaluations require either verifying implementation conformance or proving semantic equivalence between specifications, both are formidably difficult, yielding sparse and often inconclusive results about specification quality. To address this problem, we introduce Coins, a Coqbased evaluation framework that assesses specification quality through provable behavioral correctness on instantiated test cases. This design aligns the evaluation with the asymmetric nature of formal reasoning, where successful proofs provide reliable evidence while proof failures are inherently ambiguous. Using Coins, we conduct a large-scale empirical study of specification generation on HumanEval, supported by a curated set of human-written Coq specifications. Our results show that even generating specifications remains a formidable challenge, and that verification complexity substantially obscures genuine differences in specification quality. Overall, we find that accurately evaluating specifications—rather than increasing model capacity alone—is the central challenge in understanding the power of LLMs for specification synthesis, and that the test-case--based formal reasoning offers a more faithful and discriminative measure of progress.

Abstract:
Anomaly detection (AD) on realworld tabular data is challenged by scarce labels, diverse anomaly types, and high sensitivity to data-specific hyperparameter tuning. We propose Discriminant Vector Machine for Anomaly detection (DVM-AD), a robust one-class method derived from discriminant analysis without data-specific tuning. DVM-AD makes one-class discriminant learning effective via a deterministic reference point, then solves a generalized eigenproblem using a Moore–Penrose pseudo-inverse to remain stable under high-dimensional or rank-deficient settings. Moreover, it selects discriminant directions from both extremes of a bounded inverse-scatter spectrum, combining directions to tighten the normal class with structure-preserving directions that retain informative geometry, addressing the compactness–structure tradeoff. At inference, DVM-AD samples scores by nearest-neighbor distance in the learned space with a training-geometry normalization to yield bounded and threshold-ready scores. Across 47 ADBench tabular datasets and 10 NLP/CV embedding benchmarks against 28 baselines, DVM-AD achieves the best average AUROC (89.65%, average rank 2.98) on tabular datasets and remains top-ranked across four anomaly types and on embedding tasks (average rank 1.60, AUROC 72.68%).

Abstract:
We study the Awakening Crowd of Experts (ACE) problem, an online learning problem where the set of experts available to the learner grows at each round. ACE is a special case of the wellknown sleeping experts problem (Kleinberg et al., 2010), where the number of experts is huge (K=T). Existing results on sleeping experts preclude any learner from achieving a sublinear regret when the number of available experts is linear in T. Inspired by real-world applications, such as Q\&A platforms and social proof marketing, we thus focus on the awakening version of the sleeping experts problem, where a new expert arrives at every round and never leaves. We show that in the stochastic version of ACE, it is possible to obtain regret \tilde\mathcalO(T^2/3) using an unusual pessimism in the face of the uncertainty principle. Moreover, we characterize the dependence of the regret on the stability of an optimal strategy. For both results, we present matching lower bounds. Surprisingly, the adversarial version of ACE is sensibly harder. In particular, we provide a lower bound precluding sublinear \alpha-regret when the competitive ratio is constant. We provide an algorithm to face this crucial trade-off between competitive ratio and regret, and bound its \alpha-regret, almost matching the aforementioned lower bound. As a corollary, we get a \tilde\mathcalO(\log(\log(T)) competitive ratio when an optimal strategy enjoys a reward linear in T.

Abstract:
Building physically grounded protein representations is central to computational biology, yet most existing approaches rely on sequencepretrained language models or backbone-only graphs that overlook side-chain geometry and chemical detail. We present SLAE, a unified all-atom framework for learning protein representations from each residue’s local atomic neighborhood using only atom types and interatomic geometries. To encourage expressive feature extraction, we introduce a novel multi-task autoencoder objective that combines coordinate reconstruction, sequence recovery, and energy regression. SLAE reconstructs allatom structures with high fidelity from latent residue environments and achieves state-of-the-art performance across diverse downstream tasks via transfer learning. SLAE’s latent space is chemically informative and environmentally sensitive, enabling quantitative assessment of structural qualities and smooth interpolation between conformations at all-atom resolution.

Abstract:
The Rademacher complexity of a function class is among the most basic notions of its ``size'' and yields classical offline generalization bounds for Lipschitz loss functions that lead in turn to a modern understanding of statistical learning. More recently, thesequentialandoffsetRademacher complexities were introduced to prove analogous generalization bounds for online learning and for prediction with squared loss. A fundamental structural result in the theory of Rademacher complexity, with many applications to learning theory, is the LedouxTalagrand contraction lemma, which states that the Rademacher complexity of a composition of a function class with a fixed Lipschitz function is at most that of the original class. We show that, under structural assumptions on the function class, this contraction extends to sequential and offset Rademacher complexity at the price of polylogarithmic factors. We further show that these logarithmic factors cannot be removed in general and, absent these additional structural assumptions, no such contraction inequality can hold. These results together indicate that the sequential and offset Rademacher complexities behave fundamentally differently from the classical Rademacher complexity with respect to contraction, which in turn has broad implications for understanding the sample complexities of online learning and regression with squared loss for composed function classes.

Abstract:
Passive models for long video understanding typically rely on a ``watchit-all'' paradigm, processing data uniformly regardless of query difficulty, causing input complexity to scale linearly with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, failing to strictly decouple perception from temporal length. We introduce OmniAgent, a POMDP-based active perception framework. OmniAgent executes on-demand actions to selectively distill audio-visual signals into a persistent textual memory, imposing an information bottleneck that fundamentally decouples reasoning complexity from raw video duration. To operationalize this, we bootstrap the agent via success-driven trajectory synthesis and optimize its policy using TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage). TAURA addresses credit assignment ambiguity in long-horizon reasoning by leveraging token entropy as a proxy for decision criticality, explicitly steering gradients toward pivotal discovery steps. Crucially, OmniAgent demonstrates a positive test-time scaling property, where performance improves as the reasoning turns increase, validating the efficacy of adaptive perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent establishes new state-of-the-art performance. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5 vs. 47.3), validating the effectiveness of query-conditional active perception.

Abstract:
Finetuning a large language model is a ubiquitous method for enhancing its capability on a specific downstream task. However, prior work has shown that this increase in capability comes with a cost: it can increase a model's tendency to respond to unsafe adversarial prompts, even when fine-tuning with non-adversarial data. We present the first comprehensive empirical study of this phenomenon in multilingual settings by fine-tuning Llama-3.2, Qwen3, and Gemma-3 models using benign data translated across nine languages. We find that safety outcomes are highly sensitive to both the choice of fine-tuning language and the evaluation language, with adversarial compliance rates increasing four-fold in some settings. Multilingual safety drift is decoupled from general capability metrics, and occurs heterogeneously across languages and models. Fine-tuning in non-English languages often induces smaller internal representational drifts than English, but these shifts lead models to default to either exaggerated compliance or refusal. As such, assessing fine-tuning impacts solely in English provides inadequate assurance for deployment. To facilitate further research into these cross-lingual safety blind spots, we release the Multilingual-Benign-Tune dataset and the SORRY-Bench-Multilingual evaluation suite.

Abstract:
Offline handwritten text recognition (HTR) is blankdominated: task-relevant evidence lies in sparse ink strokes, yet mainstream recognizers still expend dense spatial compute and full-length width-axis token mixing across the canvas. Spiking neural networks (SNNs) promise activity-proportional computation, but static inputs make common frame repetition redundant and stochastic coding unstable under small timestep budgets. We propose Spike-HTR, a budgeted spiking Transformer that controls two coupled knobs: the spiking horizon T and the effective token length \ell_b after blank-guided reduction. InkCoder deterministically gates a shared static stem feature to form a stable coarse-to-fine temporal stream, and a stop-gradient CTC preview drives a CTC-aware keep-and-merge reducer to shorten the width-axis token stream before deep mixing. Trained from scratch without external pretraining, Spike-HTR reaches a rapid-response operating point and achieves T=2 val/test CERs of 3.5/5.4 on IAM, 2.3/2.5 on LAM, and 4.2/3.9 on READ2016. The implementation and scripts are included in the supplementary material.

Abstract:
We present a constructive lower bound of \Omega(1/\sqrt\varepsilon) for FrankWolfe (FW) when both the objective and the constraint set are smooth and strongly convex, showing that the known uniform \mathcalO(1/\sqrt\varepsilon) guarantees in this regime are tight. It is known that under additional assumptions on the position of the optimizer, FW can converge linearly. However, it remained unclear whether strong convexity of the set can yield rates \emphuniformly faster than \mathcalO(1/\sqrt\varepsilon), i.e., irrespective of the position of the optimizer. To investigate this question, we focus on a simple yet representative problem class: minimizing a strongly convex quadratic over the Euclidean unit ball, with the optimizer on the boundary. We analyze the dynamics of FW for this problem in detail and develop a novel computational approach to construct worst-case FW trajectories, which is of independent interest. Guided by these constructions, we develop an analytical proof establishing the lower bound.

Abstract:
Large language models have shown strong reasoning abilities and are increasingly explored as highlevel coordinators for multi-agent systems. However, directly deploying LLMs for coordination remains challenging, as effective policies often fail to reliably emerge at the low-level control level, and inference costs limit scalability. We propose SynCoord (Synthetic Coordination Distillation), a self-supervised pipeline that distills task-level decision-making for cooperation from high-capacity reasoning models into lightweight agent policies. Our approach does not rely on explicit supervision or handcrafted coordination rules. Instead, we define a set of task-level tool interfaces that constrain LLM interaction and enable the collection of interaction trajectories, which are then used to train compact coordinated policies. This distillation process transfers coordination behaviors that are difficult to elicit through prompting alone, while substantially reducing inference overhead at execution time. We evaluate our method in the multi-agent cooperation benchmark Overcooked-AI with varying team sizes and environment layouts. Experimental results show that the distilled policies achieve success rates and efficiency comparable to reinforcement learning–based methods, while exhibiting fewer erroneous or redundant actions and generalizing across team sizes without retraining.

Abstract:
Retrievalaugmented generation (RAG) enhances large language models with external knowledge, and tree-based RAG organizes documents into hierarchical indexes to support queries at multiple granularities. However, existing Tree-RAG methods designed for single-document retrieval face critical challenges in scaling to cross-document multi-hop questions: (1) poor distribution adaptability, where k-means clustering introduces noise due to rigid distribution assumptions; (2) structural isolation, as tree indexes lack explicit cross-document connections; and (3) coarse abstraction, which obscures fine-grained details. To address these limitations, we propose \Psi-RAG, a tree-RAG framework with two key components. First, a hierarchical abstract tree index built through an iterative "merging and collapse" process that adapts to data distributions without a priori assumption. Second, a multi-granular retrieval agent that intelligently interacts with the knowledge base with reorganized queries and an agent-powered hybrid retriever. \Psi-RAG supports diverse tasks from token-level question answering to document-level summarization. On cross-document multi-hop QA benchmarks, it outperforms RAPTOR by 25.9% and HippoRAG 2 by 7.4% in average F1 score. Code is available at https://anonymous.4open.science/r/Psi-RAG-7831/.

Abstract:
Empirically, optionbased hierarchical reinforcement (HRL) learning often produces longer and more diverse options when a deliberation cost is charged at option boundaries. However, when options are executed for many steps under an approximate dynamics model, small model errors compound along the option, degrading the quality of the resulting plan. In this work, we introduce the commitment loss to formalize the tradeoff between deliberation cost and model error as a function of option duration. We characterize how optimal termination probabilities vary with this tradeoff under two model-error mechanisms. First, the model is learned from finite data via maximum-likelihood estimation, producing statistical error that interacts with option duration. Second, we consider an input-driven setting where an exogenous input is only observed at option boundaries and evolves unobserved between them, creating a drift-induced mismatch between planned and realized dynamics. In both cases, we solve for the optimal termination behavior as a function of deliberation cost and the error scale, clarifying the behavior of some popular HRL algorithms that approach the deliberation cost as a heuristic.

Abstract:
We study the problem of computing a Ustatistic with a kernel function f of degree k \geq 2, i.e., the average of some function f over all k-tuples of instances, in a federated learning setting. U-statistics of degree 2 include several useful statistics such as Kendall's \tau coefficient, the Area under the Receiver-Operator Curve and the Gini mean difference. Existing methods provide solutions only under the lower-utility local differential privacy model and/or scale poorly in the size of the domain discretization. In this work, we propose a protocol that securely computes U-statistics of degree k \geq 2 under central differential privacy by leveraging Multi Party Computation (MPC). Our method substantially improves accuracy when compared to prior solutions. We provide a detailed theoretical analysis of its accuracy, communication and computational properties. We evaluate its performance empirically, obtaining favorable results, e.g., for Kendall's \tau coefficient, our approach reduces the Mean Squared Error by up to four orders of magnitude over existing baselines.

Abstract:
Recent advances have shifted the paradigm of tabular learning toward tabular foundation models, yet their accuracy relies on a heavy inference cost that scales poorly with context size. Deep neural networks remain a highly competitive and more efficient modeling paradigm when equipped with welldesigned architectures; however, identifying such architectures in a data-adaptive and budget-aware manner remains challenging. We propose pTNAS, the first progressive neural architecture search (NAS) approach tailored for tabular data, to enable fast identification of a viable architecture and continuously improve its search performance as more budget becomes available. pTNAS adopts a filter-and-refine optimization strategy that combines both efficient training-free and effective training-based architecture evaluation. At the filtering phase, we introduce pTProxy, a novel zero-cost proxy specifically designed for tabular networks that jointly captures architectural trainability and expressivity, to facilitate fast filtering of large architecture search spaces. At the refinement phase, pTNAS employs a fixed-budget scheduling algorithm to accurately identify the best-performing architecture from a small set of promising candidates. We further propose a budget-aware coordinator to optimize budget allocation holistically. Experiments show that pTNAS reduces the time to reach the globally best architecture by up to 82.75 X compared with other NAS approaches, and improves average predictive accuracy and end-to-end efficiency by up to 4.95 X compared with TabPFN.

Abstract:
Research on continual learning for LLMs seeks to acquire new skills without catastrophic forgetting of established prior knowledge. However, domainspecific fine-tuning still triggers severe, long-tailed forgetting issues even under narrow updates, particularly when the pre-training data is inaccessible. To tackle this challenge, we propose ALER, a data-free continual learning framework that adversarially searches for a small set of latent prompt embeddings to maximize logit divergence from a frozen reference model, proactively exposing high-risk forgetting modes at each step. It then performs online distillation from the frozen reference using the discovered embeddings to retain prior behaviors while preserving target-domain adaptation. We provide theoretical guarantees on the efficiency of our targeted repair, and extensive experiments demonstrate consistent improvements in the retention–adaptation frontier over representative baselines across 2 domain-specific fine-tuning datasets and 6 general-purpose benchmarks, suggesting a more proactive approach for LLM continual learning.

Abstract:
Distribution shift poses a fundamental challenge in offline reinforcement learning, often leading to value overestimation when querying outof-distribution actions. We introduce Regularized Evaluation Gap (REG) as a bridge between implicit methods like IQL and explicit conservative methods. We formulate policy evaluation as a robust optimization problem over an ambiguity set of critics and show that IQL’s objective can be viewed as an approximate dual solution to this problem. To extract a policy from the learned value function, we propose a practical Orthogonal Policy Gradient (OPG) update. This method regularizes an aggressive, mode-seeking policy gradient by projecting it onto the subspace orthogonal to a stable, in-sample behavior cloning gradient. Extensive D4RL experiments demonstrate that REG matches state-of-the-art performance among both Gaussian methods and diffusion-based approaches without the computational burden of the latter.

Abstract:
Offpolicy reinforcement learning is vulnerable to overestimation bias, which is rooted in the total value uncertainty. However, existing methods typically misaddress this by targeting the epistemic component, neglecting the aleatoric component. We identify for the first time that this oversight fails to contain a massive bias surge, termed theAleatoric Impulse. Although transient, this impulse fundamentally derails the learning trajectory, permanently locking the agent into suboptimal policies. To counteract this, we proposeAleatoricImpulseDamping(AID), the first mechanism that models total value uncertainty by disentangling the return variance into epistemic and aleatoric components, followed by their adaptive weighted recombination. Leveraging this derived uncertainty, the critic constructs a pessimistic lower confidence bound to surgically suppress the impulse. Complementing this, the actor utilizes a symmetrical upper confidence bound to drive optimistic exploration, ensuring that the necessary pessimism does not compromise exploration efficiency. We integrate this mechanism into the Distributional Soft Actor-Critic algorithm to establishDSAC-AID. Extensive experiments on the high-dimensional Gym-MuJoCo and DeepMind Control Suite benchmarks demonstrate that it achieves state-of-the-art results in final performance.

Abstract:
The high cost of finetuning LLMs poses a significant economic barrier; pre-hoc performance prediction offers a critical solution to substantially reduce this expense. However, the theoretical limits of pre-hoc performance prediction remain unexplored. We formulate it as a stochastic estimation problem under information constraints, decomposing prediction risk into two components: an intrinsic limit (static data-model compatibility) and a reducible optimization variance. We prove that optimization variance admits a necessary lower bound on its decay rate, implying fundamental constraints on how quickly uncertainty dissipates, regardless of the predictor used. Based on these dynamics, we derive a budget-optimal probing principle and introduce a predictability phase diagram that organizes tasks into three distinct regimes: Static-Sufficient, Dynamic-Critical, and Noise-Dominant. Extensive experiments on synthetic and real-world benchmarks validate these theoretical regimes and demonstrate the efficiency of our probing strategy.

Abstract:
With the widespread application of Video Diffusion Models (VDMs), video synthesis has achieved remarkable temporal dynamics. Imageto-Video (I2V) generation allows users to provide reference images, which enables attackers to inject adversarial noise into these conditions. Due to the robust spatio-temporal priors in VDMs, conventional frame-level attacks merely induce superficial artifacts and struggle to suppress the synthesis of motion semantics. In this work, we approach the problem by exploring the underlying mechanism of temporal dynamics. We reveal that the static video manifests as a temporal rank collapse, a degenerate state characterized by rank-1 degeneracy within the temporal attention matrix. Guided by this insight, we propose Motion Elimination in Diffusion Using Spectral Attack (MEDUSA) to freeze the video. It minimizes the nuclear norm of the attention matrix to induce the temporal rank collapse. This objective circumvents the vanishing gradient problem encountered when directly imposing a rigid temporal mapping on the attention matrix. Furthermore, we provide a mathematical analysis of this phenomenon and the gradient vanishing problem during the optimization. Experiments confirm that MEDUSA achieves excellent performance and validates the effectiveness of spectral constraints.

Abstract:
Stateof-the-art reasoning models can utilize long chain-of-thought to solve sophisticated coding and math problems. During this process, the model often attemps at a solution multiple times by utilizing verification and self-reflection capabilities. In this work, we view a long CoT as a process where the model makes K attempts at solving a problem in which each attempt is allowed to build on earlier solutions. This way, we formalize long CoT as a pass@K problem with dependent samples. Under this formalism, we derive the policy gradient and RL algorithms for optimizing long CoT reward and derive how each attempt should be weighed for unbiased gradient computation while maintaining small variance. Our theory reveals how the self-correction capability and dense feedback influence the training and eventual performance of long CoT-based reasoning. We provide both synthetic and real experiments corroborating our theory and the benefits of the associated algorithms. As a by product, our research also reveals when verification and long chain-of-thought is beneficial over parallel sampling strategies and the role of the model capability.

Abstract:
A surge of recent advancements has consistently highlighted the superiority of multimodal learning over unimodal approaches across a variety of tasks. However, the theoretical foundations elucidating this advantage remain underexplored: existing theoretical analyses are often constrained by tight assumptions, and lack empirical validation. In this paper, we link this gap by proposing a novel theoretical framework grounded in convolutional smoothing, offering a new perspective on how multimodal learning contributes to a smoother loss landscape compared to unimodal learning. Building upon this theoretical foundation, we introduce a simple yet effective distributional training approach based on stochastic modality pairing instead of fixed pairing; thus, further promoting flatter landscape via convolutional smoothing. Our empirical results across various multimodal datasets demonstrate that multimodal models not only achieve better performance but also exhibit smoother loss landscape, which represent better robustness and generalization.

Abstract:
Surface reconstruction with differentiable rendering has achieved impressive performance in recent years, yet the pervasive photometric ambiguities have strictly bottlenecked existing approaches. This paper presents AmbiSuR, a framework that explores an intrinsic solution upon Gaussian Splatting for the photometric ambiguityrobust surface reconstruction with high performance. Started by revisiting the foundation, our investigation uncovers two built-in primitive-wise ambiguities in representation, while revealing an intrinsic potential for ambiguity self-indication in Gaussian Splatting. Stemming from these, a photometric disambiguation is first introduced, constraining ill-posed geometry solution for definite surface formation. Then, we propose an ambiguity indication module that unleashes the self-indication potential to identify and further guide correcting underconstrained reconstructions. Extensive experiments demonstrate our superior performance in surface reconstruction compared to existing methods across various challenging scenarios, while excelling in broad compatibility. Our code will be made open-source upon acceptance.

Abstract:
Modelbased reinforcement learning aims to construct world models for efficient sampling. Current mainstream algorithms can be broadly categorized into two paradigms: maximum likelihood and value-aware world models. The former employs structured Recurrent/Transformer State-Space Models to capture environmental dynamics but overlooks task-relevant features. The latter prioritizes decision-critical states but suffers from sub-optimal performance. While recent efforts have sought to integrate these approaches, they typically rely on auxiliary modules or heavy external priors that significantly increase computational complexity. In this work, we propose a Value-Aligned World Model, a minimalist framework designed to synergize these two paradigms with negligible overhead. Specifically, We introduce an intrinsic latent-space value-alignment regularization that compels the world model to prioritize task-relevant features while maintaining the structural integrity of stochastic dynamics. To ensure stable optimization, we develop an adaptive weighting mechanism that acts as a self-regulating curriculum, balancing reconstruction fidelity with decision-making utility. Extensive experiments on Atari 100k and DeepMind Control benchmarks demonstrate that our algorithm consistently boosts existing methods with minimal added code and computational overhead. Code is available at supplementary material.

Abstract:
Gradientbased optimization in class-incremental learning (CIL) often faces the plasticity–stability dilemma, since continuous weight updates can distort decision boundaries learned from earlier tasks. We revisit this problem from the viewpoint of stochastic geometric memory allocation and propose BTSP-CAM, a gradient-free memory system that instantiates theoretical insights from the hippocampal simpleBTSP model into a practical algorithm. Rather than fine-tuning a frozen encoder by backpropagation, BTSP-CAM externalizes plasticity into a binary synaptic matrix that evolves through local stochastic bit-flip updates. A trace-gated plateau process, driven by eligibility traces together with familiarity and collision signals, modulates when and where synapses are rewritten and suppresses cross-class interference in Hamming space. The resulting geometric memory states are mapped to semantic logits through a CA1-like competitive layer and a closed-form ridge readout, enabling fast consolidation after each task. Empirically, BTSP-CAM rivals gradient-based methods in a strictly exemplar-free setting and consistently boosts SOTA baselines as a lightweight plugin. Mechanistic analysis validates our geometric theory, confirming that stochastic repulsion actively bounds class overlap and stabilizes decision margins.

Abstract:
Multimodal large language models (MLLMs) depend on in-context learning (ICL) for rapid task adaptation, but their scalability is severely limited by finite context windows and the growing cost of key–value (KV) caches in long multi-modal sequences. Existing memory compression approaches typically rely on rigid token removal or sample-dependent importance estimation, which introduces bias, disrupts semantic structure—particularly for visual representations—and yields static memories that cannot adapt to new queries. We introduce TASM (Task-Aware Structured Memory), a training-free framework that addresses these limitations through task-aware, structure-preserving, and dynamically accessible memory construction. TASM employs Task-Vector Guided Compression to replace sample-specific signals with a task-level direction that captures shared relevance across demonstrations. To preserve the underlying information manifold, it further applies Semantics-Aware Token Merging, formulating compression as a Bipartite Graph Matching problem that merges tokens without destructive pruning. Finally, TASM organizes compressed representations into a multi-resolution hierarchy consisting of a compact Core Memory and a Latent Bank, enabling Query-Adaptive Dynamic Activation and Dynamic Retrieval at inference time. Empirical evaluations show that TASM sustains strong multi-modal ICL performance under high compression ratios, demonstrating an effective balance between efficiency, adaptability, and semantic fidelity.

Abstract:
Identifying optimal intervention sequences from offline data to guide temporal systems toward target outcomes is a critical challenge with profound implications for fields like personalized medicine. While existing methods are mostly evaluated in offline settings, practical applications demand online, adaptive strategies that can respond in realtime. To address this, we propose Goal-conditioned Intervention via Factual-Targeted Training (GIFT), a novel framework for learning sequential intervention policies from observational data. GIFT learns a goal-conditioned policy by rescaling rewards with clipped importance weights, stabilizing learning and steering toward the target. Under standard assumptions, the induced operator has a unique fixed point and our procedure converges to it. We also bound the bias from clipping and approximation via the gap to the policy’s true value. Experiments show GIFT significantly outperforms existing methods in creating goal-conditioned policies for online deployment.

Abstract:
Federated Graph Learning (FGL) is a distributed machine learning paradigm based on graph neural networks, enabling secure and collaborative modeling of local graph data among clients. However, label noise in graph data can degrade the generalization performance of the global model. Existing federated label noise learning methods, primarily focused on computer vision tasks, often yield suboptimal results when directly applied to FGL. To address this issue, we propose a robust federated graph learning method with label noise, termedFedRGL. Specifically, FedRGL leverages the globally aggregated model and local subgraph structural information to implement a dualperspective consistency noise-node filtering mechanism under class-aware dynamic thresholds. The resultingclass-aware dual-consistency filtering (CADF)can also serve as a plug-and-play module, enhancing noise robustness across various subgraph federated learning frameworks. To better exploit the supervisory information from filtered noisy nodes, we employ the natural augmentation techniques from graph contrastive learning to assign high-confidence pseudo-labels to the noise nodes. Additionally, we measure model quality via the average predictive entropy of unlabeled nodes, enabling adaptive robust aggregation on the server side. Extensive experiments on real-world graph datasets show that FedRGL consistently outperforms existing methods under different noise rates, noise types, and client scales, achieving on average5--8%higher accuracy and up to30%improvement over the weakest baselines under noisy conditions. The anonymous source code is available at https://anonymous.4open.science/r/FedRGL_ICML26-376F.

Abstract:
Frozen selfsupervised representations often transfer well with only a few labels across many semantic tasks. We argue that a single geometric quantity,directionalCDNV (decision-axis variance), sits at the core of two favorable behaviors: strong few-shot transfer within a task, and low interference across many tasks. We show that both emerge when variabilityalongclass-separating directions is small. First, we prove sharp non-asymptotic multiclass generalization bounds for downstream classification whose leading term is the directional CDNV. The bounds include finite-shot corrections that cleanly separate intrinsic decision-axis variability from centroid-estimation error. Second, we link decision-axis collapse to multitask geometry: for independent balanced labelings, small directional CDNV across tasks forces the corresponding decision axes to be nearly orthogonal, helping a single representation support many tasks with minimal interference. Empirically, across SSL objectives, directional CDNV collapses during pretraining even when classical CDNV remains large, and our bounds closely track few-shot error at practical shot sizes. Additionally, on synthetic multitask data, we verify that SSL learns representations whose induced decision axes are nearly orthogonal.

Abstract:
To make largescale distributed training practical outside high-bandwidth datacenters, we must reduce blocking, high-volume synchronization. While DiLoCo communicates infrequently, its outer synchronization remains bandwidth-heavy and brittle to stragglers and transient failures. We relax exact synchronization to approximate synchronization via mixing/gossip, which degrades gracefully under delays and communication failures. This allows us to factorize DiLoCo synchronization into a non-blocking mixing step that overlaps computation with no staleness, and a blocking mixing step that tightens worker agreement, yielding a tunable trade-off between compute utilization and optimization stability. On up to billion-parameter language models in low-bandwidth settings, our method substantially improves compute utilization while matching DiLoCo’s training progress, and is more robust to failures.

Abstract:
Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing preference optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose languagerouted preference optimization (LRPO), an online preference optimization framework that treats language as a selectable variable rather than a fixed input constraint. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under the fixed rollout budget. To adaptively determine which languages to explore, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation for training.

Abstract:
Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition—the martingale property of predictive beliefs—has been shown to fail in controlled synthetic incontext learning settings. We revisit this question in a more typical usage regime: generic multiple-choice question answering. Exploiting the discrete answer space, we compute exact predictive distributions and study belief dynamics induced by autoregressive answer resampling. We introduce prompted predictive resampling (PPR), where an LLM generates a sequence of answers to the same question. Empirically, PPR reveals early-stage belief drift, indicating martingale violations. However, after sufficient resampling steps, the belief process self-stabilizes and converges to a coherent predictive distribution. Based on this observation, we further propose (i) a seed-answer prompting strategy to accelerate stabilization, and (ii) a self-consistency loss that amortizes early-stage drift into the model via fine-tuning. Experiments on multiple-choice QA benchmarks show that our methods substantially reduce belief drift and improve predictive coherence without sacrificing accuracy.

Abstract:
Extended reasoning in large language models (LLMs) requires long and accurate decoding and creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent postRoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions—Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5× higher throughput or 10.7× KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency.

Abstract:
The forest matrix of a signed graph plays an important role in network science and social opinion dynamics, yet existing algorithms are mainly designed for unsigned graphs and are difficult to extend to signed graphs. In this paper, we study the problem of efficiently estimating the forest matrix of signed graphs with (n) nodes and introduce the signed forest matrix theorem, which establishes the relationship between generalized spanning converging forests and the forest matrix. Based on this result, we propose a novel algorithm GSCF, built on a variant of looperased random walks, to generate generalized spanning converging forests in expected (O(n)) time. We further develop two sampling algorithms, FMDE and FMDE+, for estimating the diagonal of the forest matrix, both with time complexity (O(ln)), where (l) is the number of samples. Extensive experiments on various signed graphs show that our methods achieve high estimation accuracy, significantly improve computational efficiency, and scale to graphs with over twenty million nodes. Our source code is publicly available on \urlhttps://anonymous.4open.science/r/SignedForestDiagonal-FA09.

Abstract:
Molecular representation learning has achieved remarkable progress in molecular property prediction, yet outof-distribution (OOD) generalization remains challenging. In practice, training data typically cover only a limited portion of the chemical space, causing models to rely on environment-dependent factors that fail to transfer when scaffold structures or functional compositions shift. To address this issue, we propose MoSIR, a framework for learning molecular semantic invariant representation with prototype constraint, which projects entangled molecular embeddings into a learnable semantic prototype space to extract semantic invariant representation while isolating environment-sensitive variations. Building upon this decomposition, we optimize a bi-level min-max objective that introduces representation perturbations to simulate plausible environment shifts and enforce semantic stability. We further provide theoretical guarantees for MoSIR by deriving an OOD generalization bound under distribution shifts. Extensive experiments on multiple molecular OOD benchmarks demonstrate that MoSIR consistently outperforms strong baselines across diverse shift settings, and qualitative analyses confirm that the learned prototypes capture meaningful chemical semantics.

Abstract:
For communicationefficient decentralized learning, advanced network (NW) topologies, such as exponential and 1-peer exponential graphs, have been studied under homogeneous communication delays. However, real-world NWs exhibit heterogeneous communication delays, making node assignment optimization crucial for minimizing the Bottleneck Communication Delay (BCD). We propose BTSP-MSR, an approximate method for minimizing BCD on circulant digraphs, including exponential and 1-peer exponential graphs. Leveraging the fact that circulant digraphs can be viewed as a union of (directed) ring graphs, we derive an upper bound on the BCD by combining the ring-graph BCD (BTSP) with a deviation term (MSR). We then construct a solver that sequentially minimizes these two terms. Numerical experiments show that BTSP-MSR consistently reduces BCD across several circulant digraphs with large numbers of nodes. Notably, incorporating the exponential or 1-peer exponential graph enables communication-efficient decentralized learning under heterogeneous delay settings.

Abstract:
We introduce physicsas-representation, a learning paradigm that encodes physical structure and geometric laws into visual representations, enabling models to see the unseen—the underlying 3D geometry and motion dynamics not apparent in raw pixels. We instantiate this paradigm in gaze perception by proposing SG-Gaze, a framework that learns a Structurally and Geometrically Consistent Representation (SGR) through dual-branch adversarial learning. An analytical branch embeds appearance features onto a spherical manifold aligned with gaze geodesics, while a model-guided branch reconstructs the 3D eyeball with weak 2D edge supervision. We further introduce View-Consistent Regularization, which augments SGR learning with synthetic view perturbations and enforces rotation-equivariant consistency across gaze vectors and structural projections, eliminating the need for multi-view calibration or explicit 3D labels. Extensive experiments across 12 challenging cross-domain transfers demonstrate that SG-Gaze achieves state-of-the-art accuracy and strong generalization. Our work highlights that enforcing structural and geometric consistency with equivariant regularization serves as effective inductive biases for interpretable and generalizable representation learning—a step toward machines that perceive the world not only from pixels, but from physics.

Abstract:
With growing concerns about reviewers using Large Language Models (LLMs) for writing peer reviews, several conferences and journals have enacted policies thatprohibit LLM usage except for polishing, paraphrasing, and grammar correction of otherwise humanwritten reviews. But, are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human-AI collaboration, and evaluate 5 state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a substantial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. If enforcement of policies is a priority, we recommend completely prohibiting AI use for writing peer reviews.

Abstract:
Modeling recognition between Tcell receptors (TCRs) and peptide-MHC (pMHC) complexes is a fundamental challenge in computational immunology, constrained by sparse paired interaction data relative to abundant unpaired sequences. We introduce DecoderTCR, a masked language model framework that addresses this through two contributions: (1) a compositional continual pre-training curriculum that learns component representations from marginal data before refining cross-chain dependencies from limited pairs, and (2) Iterative Entropy-Guided Refinement (IEGR), a non-autoregressive decoding algorithm that resolves high-confidence positions first to provide context for uncertain regions. On held-out benchmarks, DecoderTCR achieves 0.96 AUROC for zero-shot pMHC binding prediction and 0.76 AUROC for epitope-specific TCR recognition, approaching supervised baselines without epitope-specific training. Learned representations recover structural contacts without coordinate supervision, and generated sequences exhibit realistic recombination statistics. Experimental validation reveals a prediction-generation gap: strong discrimination does not yet yield reliable generation, highlighting an open challenge for the field.

Abstract:
Machine unlearning aims to eliminate the influence of specific data from trained models to safeguard privacy. However, this presents a significant challenge in the context of continual learning (CL), where models update sequentially on dynamic datasets. A major limitation is that current certified unlearning algorithms fail to account for the complex, cumulative model evolution inherent to CL framework. In this work, we establish the first theoretical foundation bridging CL and machine unlearning. We formulate the CL's unlearning objective as the minimization of postunlearning excess risk, which decomposes into CL excess risk and unlearning loss, characterizing the fundamental trade-off between preserving historical knowledge and targeted forgetting. Under mild assumptions, we first establish an upper bound for the CL excess risk in non-convex models. We then adapt two certified unlearning approaches, gradient-based and Hessian-based, to the CL framework. Our analysis reveals that while the gradient-based approach is less effective than the Hessian-based method in minimizing unlearning loss, it offers the distinct advantage of nearly zero storage overhead for enabling unlearning. This insight inspire us for a combination strategy to minimize storage costs while maintaining post-unlearning performance. Experimental results further validate our theoretical findings.

Abstract:
Extended Chainof-Thought (CoT) reasoning has significantly bolstered the capabilities of medical large language models (LLMs). However, current models exhibit static computational expenditure, applying lengthy reasoning processes indiscriminately to both simple queries and complex diagnostic cases. This inefficiency is particularly prohibitive in real-world healthcare, where clinical scenarios range from time-sensitive emergencies requiring rapid response to intricate pathologies demanding deep analysis. To address this, we propose AdaThink-Med, an end-to-end framework for adaptive reasoning via uncertainty-guided length calibration. Although the underlying mechanism is generalizable, we demonstrate its critical value in the medical domain, where balancing inference latency with diagnostic precision is paramount. AdaThink-Med leverages entropy-based uncertainty estimation within reinforcement fine-tuning to dynamically shape reward signals: it penalizes verbosity for high-confidence correct answers (e.g., straightforward knowledge retrieval) while incentivizing extended exploration for uncertain or ambiguous scenarios. Across six medical benchmarks, AdaThink-Med reduces inference token consumption by 4.7× to 6.4× on Qwen and Llama architectures, respectively, with minimal performance trade-offs. Notably, the model spontaneously develops distinct "non-thinking'' and "thinking'' modes, demonstrating an autonomous ability to allocate computational resources efficiently based on clinical urgency and complexity.

Abstract:
We study mechanism design for selling data as a digital good when the value derived from training AI models follows a scaling law. The seller faces a linear cost when producing data, while the buyers benefit from additional data with diminishing returns as data volume increases. This departs from classical auction models by allowing allocations to be continuous quantities of data rather than binary outcomes. We first analyze an offline setting in which all buyer types are realized simultaneously, characterizing profitoptimal mechanisms and showing how virtual-value methods extend to continuous data allocations. We then consider an online setting with sequential arrivals, where production decisions must be made under demand uncertainty. We show that myopic allocation and fixed production plans can be arbitrarily suboptimal, whereas a simple two-stage algorithm that combines upfront production with adaptive expansion achieves a constant-factor approximation to the offline optimum. Finally, we study bilateral data trading under asymmetric information, where both the buyer’s value and the seller’s cost are private. Although the optimal truthful mechanism has a complex structure, we show that simple and implementable mechanisms recover a constant fraction of the first-best gain-from-trade. Overall, our results highlight how scaling laws introduce new algorithmic trade-offs in market design and provide performance guarantees for data markets under uncertainty.

Abstract:
Visionlanguage models can perform new tasks without parameter updates through in-context learning (ICL), whose core mechanism is utilizing the support set for task induction. In standard ICL setting, once the task is induced, its decision boundary, i.e., the criterion, remains fixed. However, in real-world applications, many tasks exhibit a stable high-level intent, while their decision criteria shift according to specific requirements. Thus we introduce a new test setting, denoted as Criterion-Conditional In-Context Learning (CC-ICL), where models must infer the latent criterion from context under a fixed task semantics. To evaluate this capability, we propose two complementary metrics, Criterion-Sensitivity and Criterion-Invariance, capturing model's robustness and adaptability under criterion shifts. We further construct CC-Bench, a multi-domain benchmark that supports evaluation under the CC-ICL setting through hierarchical annotation, enabling legitimate ground-truth variation under fixed tasks. Experiments on CC-Bench reveal that most models exhibit a ''rigid boundary'' bias, struggling to align their decisions with the latent criterion. We also find that even a simple multi-criteria training strategy can significantly reduce this bias, improving Criterion-Sensitivity and enabling 7B-scale models to surpass proprietary models without degrading general multimodal performance.

Abstract:
Recent advances in audio deepfake detection have been driven by increasingly large speech foundation models and growing amounts of synthetic data. Despite steady improvements on different benchmarks, it remains unclear how detection performance scales with model capacity and training data under realistic deployment conditions, where detectors operate under distribution shift, signal corruption, and unseen synthesis pipelines. In this work, we present the first systematic study of scaling laws in posttraining audio deepfake detection, focusing on fine-tuning regimes rather than large-scale pretraining. Using a controlled family of speech foundation models with shared architecture and pretraining, we analyze how detection performance, robustness, and generalization evolve as a function of model size and training data scale. Our evaluation covers multiple dimensions, including out-of-distribution datasets, common audio corruptions, cross-language generalization, and cross-TTS (Text-to-Speech) generalization to unseen speech synthesis systems. Across settings, we observe consistent but highly non-uniform scaling behavior: while larger models are more sample-efficient and generalize better overall, scaling benefits weaken under corruptions and linguistic shift, and persistent error gaps remain even at the largest scales. Our results reveal a fundamental asymmetry between performance scaling and robustness scaling in audio deepfake detection. While larger detectors consistently improve in-distribution detection performance, gains in robustness and generalization, particularly under cross-language and cross-TTS evaluation, are substantially weaker and exhibit persistent error gaps.

Abstract:
Online reinforcement learning is widely used to align large language models (LLMs) with reward signals, yet training can be unstable under noisy or misspecified rewards. We identify a failure mode we call directional inconsistency: within a batch, a small set of highreward rollouts induces representation-space preference directions that sharply disagree with the batch majority, resulting in high-variance and destabilizing updates. We propose GeoAlign, a lightweight plug-in for rollout curation in iterative policy optimization. GeoAlign (i) forms within-prompt preference pairs, (ii) learns an online projector on per-rollout hidden states to concentrate reward-ordered displacement directions, and (iii) detects directionally inconsistent rollouts via their angular deviation from a batch consensus prototype and rectifies them with within-prompt stable alternatives. GeoAlign is forward-pass only and adds negligible overhead. Across dialogue alignment with a learned reward model and mathematical reasoning with binary verified rewards, GeoAlign improves final performance and reduces training oscillation, outperforming PF-PPO, PAR, PODS, and Seed-GRPO. These results suggest latent directional consensus as an effective reliability signal for online LLM RL.

Abstract:
Designing heuristic algorithms for complex optimization problems is a timeconsuming and expert-driven process. Recently, Automated Heuristic Design (AHD) using Large Language Models (LLMs) has shown significant promise for automating algorithm development. However, existing works mainly rely on programs to represent heuristics, which are inherently taskspecific and fail to generalize as effectively as established metaheuristics like tabu search or guided local search. To bridge this gap, we introduce Multi-Task Hierarchical Search (MTHS), an LLM-guided evolutionary method that co-designs general-purpose metaheuristics and task-specific programs. MTHS employs a hierarchical representation and adopts a two-level evolution framework to evolve task-agnostic metaheuristics and task-specific program implementations simultaneously across multiple heuristic design tasks. During this evolution, a knowledge transfer mechanism allows learning from elite programs designed for other tasks. We evaluated MTHS on distinct combinatorial optimization problems, where it outperforms both commonly-used heuristics and existing LLM-driven AHD approaches. Our results demonstrate that the hierarchical representations facilitate effective multi-task AHD, and the evolved metaheuristics exhibit strong generalization to related tasks.

Abstract:
This paper studies causal discovery in irregularly sampled time series—a key challenge in risksensitive domains like finance, healthcare, and climate science, where missing data and inconsistent sampling frequencies distort causal mechanisms. The main challenge comes from the interdependence between missing data imputation and causal structure recovery: errors in imputation and structure learning can reinforce each other, leading to an inaccurate causal graph. Existing methods either impute first and then discover, or jointly optimize both via neural representation learning, but lack explicit mechanisms to ensure mutual consistency of imputation and structure learning. We address this challenge with ReTimeCausal, an EM-based framework that alternates between imputation and structure learning, which encourages structural consistency throughout the optimization process. Our framework provides theoretical consistency guarantees for structure recovery and extends classical results to settings with irregular sampling and high missingness. ReTimeCausal combines kernel-based sparse regression and structural constraints in an alternating process that updates the completed data and the causal graph in turn. Experiments on synthetic and real-world datasets show that ReTimeCausal is more effective than existing methods under challenging irregular sampling and missing time series data.

Abstract:
Training large language models for code generation requires selecting highquality data from solution pools where each problem admits multiple correct implementations. Conventional studies on data selection hold that sophisticated strategies that employ various optimization objectives, such as diversity maximization or difficulty ranking, should outperform naive random sampling. However, research onwhether sophisticated selection methods truly benefit code generationremains limited. In this work, we systematically evaluate multiple selection strategies across different representation spaces, including continuous embeddings, discrete tokens, and syntactic structures, using various base language models. We observe counterintuitive phenomena, suggesting that sophisticated methods may not yield robust benefits. Instead, simple random sampling achieves consistently competitive performance across all models, exhibiting greater stability and transferability than sophisticated methods. We attribute that animplicit knowledge consensusexists among diverse correct solutions, such that random selection already covers the common algorithmic knowledge required for training. Our findings provide practical insights into data selection for code generation, suggesting practitioners can adopt simple random sampling without sacrificing performance.

Abstract:
Deep Reinforcement Learning agents are increasingly used in safety-critical domains but remain vulnerable to stealthy backdoor attacks. Existing outer-loop attacks face a trade-off be- tween perceptual stealth, poisoning efficiency, and value-function consistency, often making the at- tack ineffective or easily exposed. To address these challenges, we propose SpecDRL, a uni- fied framework that ❶ embeds triggers in the least sensitive subspaces of the state manifold via Subspace-Aware Injection, exploiting percep- tual blind spots, ❷ selects the most influential time steps for poisoning through Value-Guided Strategic Sampling based on Return-to-Go and Temporal-Difference error, and ❸ preserves re- ward integrity via Bellman-Consistent Dynamic Reward Poisoning, which analytically enforces ϵ- consistency of value functions and bounds global return deviations. Experiments across 12 Atari en- vironments demonstrate that SpecDRL achieves near-100% attack success, accelerates backdoor convergence, and maintains benign task perfor- mance.

Abstract:
Generating runnable frontend code from UI screenshots is a long-standing goal in automated software engineering. Existing MLLM-based methods predominantly focused on HTML/CSS, leaving multi-framework generation for React/Vue/Angular underexplored. Naively modifying prompts leads to substantial performance gaps across multi-framework and highly framework-specific error modes. To address this, we propose MulFCoder, a framework-conditioned multi-agent method that explicitly encodes framework constraints to bring multi-framework differences into a decidable rule space. MulFCoder orchestrates four agents: Grounder constructs an ElementTable, ContentTable, and macro-layout regions from detected UI elements; Planner builds a DOM-like hierarchical layout tree, produces a task schedule, and derives a framework-specific file Contract; Writer generates structured file writes or patches within a restricted edit window; Judger enforces lightweight, framework-conditioned FastGate constraints to accept or reject updates and trigger bounded repairs, preventing drift and deadlocks without expensive builds. Experiments demonstrate that MulFCoder substantially improves multi-framework compilation success and reduces framework-specific errors, with particularly pronounced gains on constraint-heavy frameworks.

Abstract:
Safe offline reinforcement learning (RL) requires optimizing policies within the support of static datasets while satisfying strict safety constraints. Although recent latent generative policies achieve strong empirical performance, they rely heavily on implicit regularization and lack systematic control over distributional shift during policy improvement. In this work, we propose a geometric control framework that leverages the bijective structure of conditional normalizing flows to provide a tractable mechanism to regulate distributional deviation of the policy. By constraining divergence in the latent base space, we derive tractable upper bounds on the induced Wasserstein distance and total variation of the policy distribution, establishing an analyzable connection between latent geometry and downstream behaviors. This insight motivates a decoupled architecture: a flow prior shapes a feasibilityweighted latent manifold using Hamilton--Jacobi reachability signals, while a latent refiner performs geometrically constrained optimization directly in the base space. Across multiple safe RL benchmarks, our method achieves robustly low violation rates with competitive returns, highlighting the benefits of structured geometric regularization.

Abstract:
While data augmentation is essential for robust point cloud recognition, conventional spatial mixup strategies often compromise geometric integrity by generating physically unrealistic samples. To overcome this limitation, we propose PSMix, which shifts the mixing paradigm to the spectral domain via the Spherical Harmonic Transform. Instead of simple coordinate interpolation, PSMix performs a rotationaware hierarchical mixing on spectral coefficients. This approach explicitly preserves global structural properties while diversifying local details, achieving a balance that spatial methods struggle to maintain. Complementing this, we introduce an adversarial rotation optimization strategy to enforce invariance against challenging orientations. Extensive experiments on ModelNet-C and ScanObjectNN-C demonstrate that PSMix achieves state-of-the-art robustness, while also serving as an orthogonal plug-in that further boosts the performance of existing spatial strategies.

Abstract:
Financial reasoning requires precise execution. While Large Language Model (LLM) agents have shown encouraging progress in financial reasoning, their effectiveness in realistic financial workflows is severely hindered by the lack of holistic benchmarks and the fragility of unstructured reasoning. To evaluate these capabilities, we introduce ProFinR, the first Professional Finance Reasoning benchmark that covers four types of financial tasks, comprising 528 expertdesigned tasks. To solve these complex financial reasoning questions, we construct Financial Tool Universe, a tool library containing 53 domain-specific tools organized into 13 categories. Building on the tool library, we introduce ProFinAgents, a structured agent framework based on Directed Acyclic Graph (DAG) and Case-Based Memory (CBM). Compared with strictly sequential workflows, ProFinAgent coordinates tool execution through DAG. This allows for parallel execution and reduces latency compared to serial pipelines. Furthermore, the CBM component refines decision-making over time by retrieving prior cases to mitigate reasoning failures. Experimental results demonstrate that ProFinAgent achieves a 49.81% performance gain over state-of-the-art baselines with a 47.1% reduction in inference latency.

Abstract:
Supervised finetuning (SFT) and reinforcement learning with verifiable rewards (RLVR) are two widely used post-training paradigms for improving the reasoning ability of large language models (LLMs). Recent methods attempt to integrate SFT and RLVR in a single stage by reweighting or scheduling their objectives. However, such coupling can be counterproductive because supervised updates are not uniformly beneficial for reward optimization, which can diminish reward gains. To address this, we propose \textscBRIDGE, a scalable framework in which SFT learns to supervise RL by selectively transferring knowledge that improves reward optimization. Specifically, \textscBRIDGE employs two nested optimization loops during meta-training: the inner loop updates base model parameters using a fused SFT--RL gradient. Concurrently, the outer loop updates a lightweight low-rank adapter (LoRA) to coordinate the two objectives by maximizing a reward-gap signal, defined as the reward of joint SFT--RL training over an RL-only baseline. Across three model scales and five reasoning benchmarks, \textscBRIDGE consistently outperforms two-stage cold start, naive mixing, and representative single-stage integration baselines, yielding over three points average absolute improvement and more stable training dynamics.

Abstract:
Learning state representations in Markov Decision Processes (MDPs) has proven crucial for addressing the curse of dimensionality in largescale reinforcement learning (RL) problems. A widely recognized approach exploits structural priors on the MDP by constructing state representations as linear combinations of the state-graph Laplacian eigenvectors. When there is no prior knowledge of the graph, or the state space is too large, the spectral features can be estimated in a model-free fashion. In this work, we prove an upper bound on the approximation error of linear value function approximation based on learned spectral features. We show how this error scales with the algebraic connectivity of the state-graph, making it interpretable in terms of the MDP topology. We also bound the error arising from the estimation of the spectral features themselves, providing an all-round view of the representation learning pipeline. Additionally, we provide a new expression of the Laplacian operator in the RL setting that clarifies some subtle ambiguities in the literature. Our results apply to general (non-uniform) policies without making any assumptions on the symmetry of the induced transition kernel. We validate our theoretical findings with numerical simulations on gridworld environments.

Abstract:
Existing fair multiview clustering methods typically suffer from a severe trade-off between clustering utility and fairness, while incurring prohibitive quadratic complexity on large-scale datasets. To address these challenges, we propose Causal Disentangled Anchor Learning (CDAL), a novel framework that achieves scalable fairness via structural disentanglement. Guided by a structural causal model perspective, CDAL utilizes a dual-anchor mechanism to structurally separate latent representations into orthogonal semantic and sensitive subspaces. We further ensure statistical independence through a linearized Hilbert-Schmidt Independence Criterion (HSIC) constraint, which is optimized via an efficient alternating scheme. Theoretically, we prove the identifiability of the disentangled factors and guarantee the algorithm's global convergence and linear time complexity \mathcalO(n). Extensive experiments on large-scale benchmarks demonstrate that CDAL outperforms state-of-the-art methods, achieving a superior utility-fairness trade-off.

Abstract:
Large Language Model (LLM)based multi-agent systems often suffer from high communication cost due to redundant interactions, as existing methods optimize communication structures without explicitly measuring whether exchanged messages contribute to the final decision. To better utilize the semantic information in the execution stage to further optimize the structure of multi-agent systems and reduce token costs, we proposeAgentTailor, a cost-aware framework that evaluates the semantic contribution of communication edges via an edge judgment mechanism, and employs anEdge Prediction Network (EPN)to estimate edge utilities through virtual execution without invoking LLMs. Experiments show that AgentTailor achieves the best average accuracy (91.36%) on six datasets of diverse fields, while reducing total tokens by 21.2%--61.6%. Our work demonstrates that explicitly modeling semantic edge contribution is crucial for scalable and efficient multi-agent systems, providing a principled approach to communication optimization that goes beyond structural heuristics. (We will open-source our code once accepted.)

Abstract:
Stereotactic electroencephalography (sEEG) provides temporally precise intracranial recordings but is inherently constrained by sparse and irregular spatial sampling due to clinical limitations on electrode implantation. Signal reconstruction under this setting aims to infer neural activity at unmonitored locations, potentially expanding the coverage of neural recordings without increasing the number of implanted electrodes. However, most existing sEEG reconstruction methods underutilize the spatial information of electrode contacts in both encoding and modeling, and rely on deterministic objectives that favor average patterns, leading to oversmoothed reconstructions. We propose EpiTwin, a conditional spatial graph transformer for sEEG signal reconstruction, comprising three key components. Hybrid Spatial Positional Encoding (HSPE) constructs explicit spatial identities from electrode coordinates, graph topology, and anatomical priors. Geometry–Functional Biased Attention (GFBA) incorporates geometric distance and data-driven functional similarity biases into attention computation. Adversarial Refinement Training employs a multi-scale discriminator to counter reconstruction over-smoothing. Experiments on real-world clinical sEEG data demonstrate that EpiTwin consistently achieves lower reconstruction error under electrode series-level masking, outperforming recent foundation models such as LaBraM with a 16.8% relative reduction in RMSE. Furthermore, EpiTwin effectively mitigates spectral over-smoothing and improves reconstruction fidelity.

Abstract:
Large language models (LLMs) are increasingly used as assistive interfaces for users who cannot reliably produce clean text due to accessibility constraints, yet existing evaluations assume iterative input repair and focus on task accuracy or generic noise robustness. We introduce Assistive Prompt Mediation (APM), a theorygrounded evaluation paradigm that reframes assistance as a constrained mediation problem: recovering latent user intent from accessibility-impaired input without clarification, while minimizing cognitive burden and hallucination risk. APM decomposes assistive quality along these axes and is instantiated across 8 languages, 4 accessibility-driven noise classes, and 10 frontier LLMs, with impairment severity yielding accessibility sensitivity curves. Results show that apparent robustness often masks trade-offs—high intent preservation frequently coincides with increased burden or hallucinated mediation, hallucination rates vary by more than 2× across noise types, and assistive decisions exhibit bounded entropy (<0.81 normalized), indicating systematic rather than unstable behavior. These findings demonstrate that standard robustness metrics substantially overestimate assistive reliability and motivate evaluating LLMs as constrained mediators under accessibility-driven input degradation.

Abstract:
ClassifierFree Guidance (CFG) is a cornerstone of flow-matching models, significantly enhancing visual quality and prompt adherence. However, high guidance scales inherently violate the optimal transport dynamics, leading to visual artifacts and mode collapse. In this paper, we investigate the mechanisms of this failure through the lens of velocity moment decomposition. Our analysis reveals that the distributional shift induced by CFG decouples into two geometric components: a Linear Barycentric Drift that shifts the global distribution center, and a Quadratic Energetic Instability that injects surplus kinetic energy, disrupting the transport cost and triggering variance explosion. To mitigate these issues, we introduce MIST (Moment-aligned Invariant Stability Transform), a training-free method designed to confine the sampling trajectory to the learned data manifold. MIST comprises two hierarchical stages: (1) Invariant Alignment (IA), a global statistical rectifier that restores structural integrity by removing the linear drift and realigning the energy profile; and (2) Stability Thresholding (ST), a local dynamical regulator that enforces Lipschitz-like smoothness via temporal decay and spatial suppression. MIST enables robust, high-fidelity generation across a wide range of guidance scales while consistently improving performance at moderate scales. Extensive experiments on diverse text-to-image and text-to-video benchmarks demonstrate that MIST outperforms standard CFG and state-of-the-art corrections, establishing a new benchmark for robust guidance in flow-based generative models.

Abstract:
Generative Image Steganography (GIS) embeds complex secrets within stegoimages that are indistinguishable from the stochastic synthesis process itself. It achieves this by exploiting the reversible probability flow between Gaussian noise and the natural image manifold. However, existing steganography faces a key-dependency paradox: precise secret extraction usually requires an external private key or random seed to synchronize the denoising path. In this paper, we present KFStego, a training-free framework substituting cryptographic secrecy with structural redundancy for key-free, high-resolution secure distribution. Our dual-guidance mechanism utilizes structural latent guidance to project secrets into a bipartite manifold via downsampling and halftoning, yielding self-synchronizing shares. While measurement posterior sampling leverages these shares as discrete invariants to steer a differentiable restoration. By backpropagating through a differentiable measurement surrogate, KFStego reconstructs high-fidelity continuous-tone details from sparse binary observations, mitigating fidelity loss from inversion drift. KFStego offers an endogenous secure image distribution paradigm by connecting discrete structural invariants with high-fidelity generative reconstruction.

Abstract:
Effective time series representation is critical for revealing temporal dynamics in many fields. However, existing approaches encounter fundamental limitations. Discretetime representations struggle with irregular sampling and the tradeoff of fidelity and efficiency, while traditional implicit neural representations suffer from spectral bias and frequency entanglement. To address these challenges, we conceptualize time series as the superposition of continuous trends and discrete events from a continuous-time perspective and propose DualTimesField, a framework that utilizes dual implicit neural fields. Its Continuous Time Field captures smooth trends through bandwidth-limited parameterization, while a Discrete Geometric Field models transient events using learnable Gabor atoms, gated sparsity, and coarse-to-fine scale annealing. This explicit field separation effectively overcomes both limitations. Experiments on nine real-world benchmarks demonstrate substantial improvements in representation fidelity, achieving 51.2% average MSE reduction over discrete-time baselines and competitive interpolation on irregular data. Code is available at https://anonymous.4open.science/r/DualTimesField-AF32.

Abstract:
Large multimodal models (LMMs) have shown strong capabilities in audiovisual understanding by jointly reasoning over visual, auditory, and linguistic inputs. However, processing long-form audio-visual content often requires a large number of tokens, leading to substantial computational and memory overhead during inference. Existing efficiency-oriented methods typically apply uniform compression or pruning strategies, which overlook modality-specific characteristics and instruction-dependent reasoning behaviors in multimodal models. In this work, we present PRIM, an inference framework for efficient multimodal reasoning that systematically compresses audio-visual representations based on attention dynamics and instruction relevance. Motivated by an attention-based analysis revealing modality imbalance and layer-wise redundancy in LMMs, PRIM introduces a cooperative compression pipeline that spans both multimodal encoders and the language model. Specifically, PRIM performs early text-conditioned audio-visual fusion to externalize cross-modal interactions, applies attention-guided and frequency-aware strategies to remove redundant audio and video tokens, and further adapts token retention inside the language model according to task demands. Extensive experiments on multiple audio-visual benchmark datasets demonstrate that PRIM consistently achieves stable and superior efficiency--accuracy trade-offs across diverse tasks and datasets. These results demonstrate that PRIM, a multimodal cooperative compression approach, provides an effective pathway toward scalable and efficient audio-visual reasoning.

Abstract:
Modern segmentation models achieve strong predictive performance but remain largely opaque, limiting our ability to diagnose failures, understand dataset shift, or intervene in a principled manner. We introduce MedSegLens, a model-diffing framework that decomposes segmentation model activations into interpretable latent features using sparse autoencoders trained on SegFormer and U-Net. Through cross-architecture and cross-dataset latent alignment across healthy, adult, pediatric, and sub-Saharan African glioma cohorts, we identify a stable backbone of shared representations, while dataset shift is driven by differential reliance on population-specific latents. We show that these latents act as causal bottlenecks for segmentation failures, and that targeted latent-level interventions can correct errors and improve cross-dataset adaption without retraining, recovering performance in 70% of failure cases and improving Dice score from 39.4% to 74.2%. Our results demonstrate that latent-level model diffing provides a practical and mechanistic tool for diagnosing failures and mitigating dataset shift in segmentation models.

Abstract:
Predictive Coding (PC) offers a braininspired alternative to backpropagation for neural network training, described as a physical system minimizing its internal energy. However, in practice, PC is predominantly digitally simulated, requiring excessive amounts of compute while struggling to scale to deeper architectures. This paper reformulates PC to overcome this hardware-algorithm mismatch. First, we uncover how the canonical state-based formulation of PC (sPC) is, by design, deeply inefficient in digital simulation, inevitably resulting in exponential signal decay that stalls the entire minimization process. Then, to overcome this fundamental limitation, we introduce error-based PC (ePC), a novel reparameterization of PC which does not suffer from signal decay. Though no longer biologically plausible, ePC numerically computes exact PC weights gradients and runs orders of magnitude faster than sPC. Experiments across multiple architectures and datasets demonstrate that ePC matches backpropagation's performance even for deeper models where sPC struggles. Besides practical improvements, our work provides theoretical insight into PC dynamics and establishes a foundation for scaling PC-based learning to deeper architectures on digital hardware and beyond.

Abstract:
Multitask vehicle routing solvers via deep reinforcement learning have attracted broad attention and achieved significant progress in handling multiple constraints. However, existing neural solvers still face critical challenges, including insufficient representation, unstable training, and inefficient exploration in large combinatorial action spaces, which often prevents performance from meeting its full potential. To address these issues, we propose PoMtVRS (Preference-Optimized Multi-Task Vehicle Routing Solver with Preference Gating), a plug-and-play framework that jointly improves decoder representations and exploration efficiency through a synergistic combination of decoder-side augmentation and preference-driven optimization. Specifically, we introduce the preference optimization objective to learn relative comparisons among candidate solutions for different routing tasks, encouraging a higher generation probability of better solutions. Meanwhile, we design a preference-gated block that adaptively modulates decoder representations via sparse gated attention and nonlinear residual refinement. Extensive experiments demonstrate that PoMtVRS elevates state-of-the-art unified neural VRP backbones, achieving leading performance in multi-task benchmarks and stronger generalization.

Abstract:
Recent advances in image generation have been largely driven by image tokenization, which compresses raw pixels into compact latent representations. While existing tokenizers excel at preserving lowlevel visual details through reconstruction-based training, they often lack explicit semantic guidance, which limits their ability to capture semantically structured representations and thus hinders their performance on downstream tasks like image generation. To overcome this limitation, we propose a novel tokenization framework that incorporates high-level semantics through two key innovations: (1) a text-integrated encoder that jointly processes images and textual descriptions to produce semantically enriched latent representations, and (2) a concept-guided training objective that leverages sparse autoencoders to decompose pre-trained vision-language model features to a semantic concept space, employing sparse and disentangled concept indices for guidance. Our approach achieves strong alignment with semantic concepts, maintaining high reconstruction fidelity with an rFID of 1.39 on ImageNet, while achieving a gFID of 2.65 on the class-conditional image generation task and 10.73 on the text-to-image generation task. By infusing high-level semantic structures into low-level visual fidelity, our method bridges the reconstruction-generation divide and drives generative modeling as a powerful foundation.

Abstract:
Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient singlestep variants, where robustness to multi-step attacks collapses despite high single-step performance. We address this failure mode with two contributions. First, we identify Epsilon Overfitting (EO), a previously overlooked phenomenon in which fixed perturbation magnitudes exacerbate CO, and show that introducing perturbation variability significantly improves robust generalization across different architectures and datasets. Second, we proposePertAlign(Perturbation Alignment), a theoretically grounded, computationally negligible metric that predicts CO onset by measuring gradient alignment across attack stages. Leveraging these insights, we introduceSORA, an adaptive step-size adversarial training method that dynamically adjusts perturbations based on loss-surface geometry. SORA consistently prevents CO, achieves state-of-the-art robustness and clean accuracy, and generalizes across datasets and architectures using a single fixed set of hyperparameters. Extensive experiments on diverse datasets and architectures, show that SORA matches or surpasses the robustness of prior methods while delivering higher clean accuracy and superior efficiency.

Abstract:
The learning rate in stochastic gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search, especially for training modern largescale models with billions of parameters. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely-used cosine schedule, by demonstrating their increased robustness to initial parameter misspecification due to a coarse grid search. We present an analysis in a stochastic convex optimization setup demonstrating that the convergence rate of stochastic gradient descent with annealed schedules depends sublinearly on the multiplicative misspecification factor \rho (i.e., the grid resolution), achieving a rate of \smashO(\rho^1/(2p+1)/\sqrtT) where p is the degree of polynomial decay and T is the number of steps. This is in contrast to the \smashO(\rho/\sqrtT) rate obtained under the inverse-square-root and fixed stepsize schedules, which depend linearly on \rho. Experiments confirm the increased robustness compared to tuning with a fixed stepsize, that has significant implications for the computational overhead of hyperparameter search in practical training scenarios.

Abstract:
From a statistical evidence perspective, we establish some asymptotic optimality properties of certain multiple testing rules based on the relative belief ratio (Evans, 2015). Under the twogroups model with an additive 0-1 loss and within a Bayesian decision theoretic asymptotic framework of Bogdan et al. (2011), we show that relative belief multiple testing rules induced by a simple one-group light-tailed normal prior with a single hyperparameter achieve the same asymptotic Bayes risk as the Bayes oracle benchmark. This risk is the minimum achievable in this asymptotic framework. Despite originating from a different starting point, the evidential relative belief approach enjoys oracle properties. The relative belief multiple testing approach is fundamentally different from existing Bayesian multiple testing procedures, virtually all induced by more complex heavy-tailed one-group global-local shrinkage priors using purely posterior-based inferences (Datta & Ghosh, 2013; Ghosh et al., 2016; Bhadra et al., 2017; Ghosh & Chakrabarti, 2017; Qin & Ghosh, 2025). By measuring statistical evidence via both the prior and posterior, the relative belief approach reveals an alternative new inferential paradigm for attaining asymptotic Bayes optimality under sparsity, one that does not rely on developing increasingly elaborate priors.

Abstract:
Latent diffusion models have become the dominant paradigm for video generation, making the video tokenizer a critical role. While most existing tokenizers are trained primarily for reconstruction, diffusion models are optimized to denoise heavily corrupted latents, which creates a mismatch between tokenizer training objectives and downstream generative learning. As a result, reconstruction metrics (e.g., rFVD) can be a poor proxy for generation quality (gFVD), and overly prioritizing reconstruction may even hinder diffusion training. We propose VideoMAETok, a simple family of ViTbased video tokenizers trained explicitly as corruption-inversion models for latent video diffusion. VideoMAETok builds on masked autoencoders: we (i) apply high-ratio token masking and encode only visible spatiotemporal tokens for efficiency, and (ii) corrupt latent tokens with interpolative Gaussian noise to better match the denoising nature of diffusion generators. Training under such corruption encourages latents that remain informative and well-conditioned for downstream denoising. Extensive experiments show that VideoMAETok consistently improves generation quality when paired with off-the-shelf diffusion models (SiT and LightningDiT), achieving state-of-the-art gFVD on Kinetics-600 and UCF-101 while remaining compute-efficient.

Abstract:
Loss spikes remain a persistent obstacle in largescale language model pretraining. While previous research has attempted to identify the root cause of loss spikes by investigating individual factors, we observe that, in practice, such spikes are typically triggered by the confluence of heterogeneous factors. Empirically, loss spikes may arise from a combination of data outliers, hardware or transient computational faults, numerical precision issues, and hyperparameter settings. Regardless of the underlying cause, these spikes manifest as unstable optimizer updates, as abnormal gradients contaminate both first- and second-moment states. In this paper, we propose a principled gradient-centric remedy: AdaGC, an adaptive per-tensor gradient clipping scheme that mitigates such contamination by bounding gradient norms relative to a tensor-wise exponential moving average of their historical clipped values. AdaGC is optimizer-agnostic, introduces negligible memory overhead, and reduces communication costs compared to GlobalGC, particularly in hybrid-parallel distributed training environments. Experiments on Llama-2 7B, Mixtral 8×1B, and ERNIE 10B-A1.4B demonstrate that AdaGC robustly eliminates training instabilities, consistently reducing spike scores to zero for all models and improving downstream accuracy over GlobalGC by 1.32%, 1.27%, and 2.48%, respectively. Furthermore, AdaGC seamlessly integrates with optimizers such as Muon and Lion, consistently yielding higher average accuracy and zero spike scores.

Abstract:
Knowledgebased Visual Question Answering (KB-VQA) remains a challenging task, particularly when queries require precise identification and grounding of fine-grained entities within large-scale knowledge base. Existing methods often treat visual and textual signals in isolation and rely heavily on image-centric retrieval, which makes them sensitive to visual ambiguities. To address these limitations, we propose EntRAG, an entity-centric retrieval-augmented generation framework. Our approach first introduces EntBind to align query representations with multimodal entity embeddings by explicitly binding entity tokens to latent visual features, retrieving a set of relevant candidate entities. A reranking mechanism is applied to these candidate entities to select the most informative context by combining entity-level alignment with overall contextual relevance. The selected evidence is incorporated into context-aware generation module to produce final answer. By explicitly operating at the entity level, EntRAG achieves more consistent and reliable results. Extensive experiments demonstrate that EntRAG consistently outperforms prior methods, achieving scores of 45.2 on E-VQA and 43.8 on InfoSeek.

Abstract:
In modern software development, particularly in emerging ``vibe coding'' paradigms, project implementation increasingly begins with visual interactions between users and AI coding assistants, where system architectures are communicated through visual designs before coding. This visualfirst approach necessitates AI systems capable of interpreting diagrams across multiple programming languages. However, the development of such systems is severely hindered by the lack of large-scale multimodal training data and evaluation benchmarks. To address these limitations, we present M2C-INSTRUCT, a comprehensive multilingual multimodal instruction-tuning dataset containing over 13.1M samples across 50+ programming languages, designed for visual understanding and diagram interpretation in code generation tasks. We validate our dataset by training M2-CODER, a multilingual multimodal software developer that successfully integrates visual design inputs with textual instructions. We also introduce M2EVAL, a novel multilingual evaluation benchmark for multimodal code generation performance. Experiments show our 7B M2-CODER performs on par with much larger 70B+ models, confirming the quality and effectiveness of our M2C-INSTRUCT. Together, M2C-INSTRUCT, M2-CODER, and M2EVAL provide essential infrastructure for visual-assisted programming in vibe-coding and visual-interactive development workflows.

Abstract:
Graph Edit Distance (GED), which aims to find an edit path with minimum number of edit operations to transform one graph into another, is a fundamental NPhard problem and a widely used graph similarity measure. Recent matching-based hybrid approaches have demonstrated better scalability than A search-based hybrids by reformulating GED as a graph matching problem. In these methods, a neural network predicts a single deterministic node matching matrix, from which top-k node mappings are extracted iteratively to derive candidate edit paths. However, these methods often suffer from highly correlated candidates that easily lead to suboptimal solutions, while the iterative extraction becomes inefficient for large k. In this paper, we propose DiffGED, the first generative approach for GED computation. Specifically, we formulate the graph matching problem as a generative task, and employ a diffusion-based model to generate multiple diverse node matching matrices simultaneously, from which diverse node mappings can be efficiently extracted. The generative diversity introduced by the diffusion process enables DiffGED to avoid suboptimal solutions and achieve superior solution quality close to the exact solution. Experiments on real-world datasets show that DiffGED generates multiple diverse edit paths with accuracy comparable to exact solutions, while running faster than existing hybrid approaches.

Abstract:
Deep hypergraph clustering has shown strong potential in learning node representations by modeling highorder relationships. However, most existing methods rely on fixed propagation mechanisms that implicitly assume homophily, where connected nodes tend to be similar. This assumption often fails in real-world scenarios, especially in heterophilic settings, leading to degraded clustering performance. To bridge this gap, we propose a Unified Framework for Deep Hypergraph Clustering (Uni-DHC). Specifically, we introduce a learnable high-order hypergraph propagation scheme that aggregates information from multiple propagation orders and adaptively learns their importance from data. To stabilize unsupervised training and prevent structural redundancy introduced by high-order aggregation, we further impose consistency at the node level and decorrelation at the hyperedge level. From a spectral perspective, we show that conventional HGNN-style propagation corresponds to a fixed low-pass filter, while the proposed approach induces a learnable polynomial spectral filter. Extensive experiments on homophilic and heterophilic benchmarks demonstrate that Uni-DHC consistently outperforms state-of-the-art methods, with particularly strong gains in heterophilic settings.

Abstract:
A common approach to perform PCA on probability measures is to embed them into a Hilbert space where standard functional PCA techniques apply. While convergence rates for estimating the embedding of a single measure from m samples are well understood, the literature has not addressed the setting involving multiple measures. In this paper, we study PCA in a double asymptotic regime where n probability measures are observed, each through m samples. We derive convergence rates of the form n^1/2 + m^-\alpha for the empirical covariance operator and the PCA excess risk, where \alpha>0 depends on the chosen embedding. This characterizes the relationship between the number n of measures and the number m of samples per measure, revealing a sparse (small m) to dense (large m) transition in the convergence behavior. Moreover, we prove that the dense-regime rate is minimax optimal for the empirical covariance error. Our numerical experiments validate these theoretical rates and demonstrate that appropriate subsampling preserves PCA accuracy while reducing computational cost.

Abstract:
Frontier models are transitioning frommultimodal large language models(MLLMs) that merely ingest visual information tounified multimodal models(UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to humanmental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goaloriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal,visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

Abstract:
As video content creation shifts towards longform narratives, retrieving and composing short clips into coherent storylines becomes a critical challenge. Standard retrieval formulations, however, perform context-agnostic retrieval, prioritizing local semantic alignment while neglecting procedural state and identity consistency across time. To address this, we introduce the task of Consistent Video Retrieval (CVR) and establish a rigorous benchmark across YouCook2, COIN, and CrossTask, designed to explicitly evaluate temporal and identity consistency. We propose CAST (Context-Aware State Transition), a lightweight, embedding-agnostic adapter that models procedural progression by predicting a state-conditioned residual update (\Delta) from visual history, decoupling procedural progression from static identity. Extensive experiments demonstrate that CAST yields significant and consistent gains across diverse datasets over standard baselines. Furthermore, we showcase its potential as a plug-and-play consistency verifier, guiding black-box generation models (e.g., Sora, Veo) toward physically plausible continuations.

Abstract:
Scaling large language models (LLMs) has driven their success, yet dense Transformers couple capacity and computation: every parameter is activated for every token, making training and inference costs grow linearly with model size—a critical bottleneck as models approach trillionparameter regimes. We aim to scale capacity through MoE-style mixture throughout the LLM pipeline rather than only the FFN. Prior pipeline-level approaches include ParaScale, which introduces virtual tokens and parallel streams but incurs substantial overhead and suffers from homogenized routing and gradient collapse, and AltUp, which uses an auxiliary prediction branch but offers limited adaptivity and slow convergence. We establish that MoE-style mixture layers can be reformulated as variable-kernel dynamic convolutions, where each expert corresponds to a 1×1 convolutional kernel and routing implements input-conditioned kernel aggregation. Building on this equivalence, we introduce cMoLLM: a convolutionally gated mixture-of-LLMs that routes over end-to-end streams through fully differentiable dynamic convolution. In GPT-2-style models trained on FineWeb, cMoLLM improves language modeling perplexity and downstream GLUE and SQuAD accuracy under matched compute, with better stream utilization, more stable optimization, and favorable scaling compared to ParaScale- and AltUp-style baselines.

Abstract:
We propose contrastive order learning (ConOrd), a contrastive learning framework for ordinal regression that integrates the strengths of contrastive learning and order learning. While contrastive learning effectively leverages all samples in a batch, it typically ignores the inherent ordering among rank labels. Conversely, order learning explicitly models label ordinality but often relies on local, marginbased comparisons, limiting its ability to capture global ordinal structure. ConOrd addresses these limitations by introducing a contrastive order loss with soft affinity and disparity weights based on rank differences, enabling fine-grained modeling of ordinal relationships across all sample pairs within a batch. Extensive experiments on a range of ordinal regression tasks, including facial age estimation, blind image quality assessment, and blind video quality assessment, demonstrate that ConOrd consistently achieves state-of-the-art performance and generalizes well across diverse ordinal regression scenarios.

Abstract:
Multimodal sentiment analysis (MSA) aims to predict human sentiments by integrating signals from different modalities such as text, video, and audio. However, raw multimodal sequences often suffer from semantic inconsistenciesexhibiting redundancy or conflicts within and across modalities--which hinders robust understanding and increases computational cost. To this end, we introduce ConsMSA, which explicitly formalizes semantic distribution consistency across both intra- and inter-modality, providing a principled mechanism for robust and efficient multimodal sentiment prediction. Specifically, ConsMSA projects multimodal token features into a shared sentiment space to compute an Intra- and Inter-modality Consistency Score (I^2CS). By coupling this score with predictive relevance, we formulate principled importance signals that are utilized: (i) as a consistency regularizer to align latent distributions during training, (ii) to derive semantic-aware weights for adaptive multimodal token reweighting, and (iii) as a principled criterion to prune redundant or conflicting tokens. Extensive experiments on CMU-MOSI and CMU-MOSEI demonstrate that ConsMSA achieves state-of-the-art performance while remaining robust under aggressive token compression--retaining only 10% of tokens yields comparable accuracy. These results establish semantic distribution consistency as a principled foundation for synergizing predictive robustness with computational efficiency.

Abstract:
Stochastic optimization (SO) plays a central role in addressing decision‐making problems under uncertainty. Among them, timevarying stochastic optimization (TV-SO) is particularly important due to its applications in adaptive control and machine learning. Non-parametric approaches have been proposed for time-varying deterministic optimization, however, they have not been devised for their stochastic counterparts. This work specifically addresses non-parametric optimality by developing a stochastic variational framework based on Malliavin calculus. This framework enables deriving non-parametric optimality conditions for SO problems with a stochastic decision and supports the design of a scalable deep-learning algorithm that is insensitive to the parameterization dimension. Such an algorithm, called the stochastic path follower (SPF), is applied to solve two key problems under distribution drift, namely least-squares recovery and logistic regression. Experimental results show the merit of the proposed approach against learning-based and gradient-based methods in the state of the art in terms of both performance and scalability.

Abstract:
We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeekR1. However, much of this progress has been fueled by the abundance of internet question–answer (QA) pairs—a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack sufficient large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question–answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5–10 percentage points across different model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data.

Abstract:
Large language models (LLMs) face a "loyalty dilemma" when correctness is conditioned on an active worldof-discourse. We identify a systemic failure mode---world misattribution---where models implicitly ground generation in an incompatible regime and drift from the target world. We propose World Entropy Tethering (WET), an inference-time monitor-and-tether: a world-entropy probe flags drift risk on prompt anchors, and a conditional score matching geometry model identifies tethering heads for entropy-gated rescaling. Experiments show: (I) Linear Separability: world labels are linearly decodable from internal states; (II) Geometric Drift: hallucinations are preceded by measurable deviations from the target world region; and (III) Targeted Mitigation: WET improves world consistency and reduces hallucination rates by up to 22.4% without compromising generation quality. Code is available at https://anonymous.4open.science/r/WET-ADA0/.

Abstract:
Sparse Mixture of Experts (MoE) models scale more efficiently than dense models by routing tokens to modular expert networks that are only active when relevant to the task. A leading hypothesis for the performance of MoE models is that each expert specialises in a single, coherent domain. However, interpretability efforts that assume this hypothesis have generally been unsuccessful. We propose and present evidence for an alternative account that we call theSuperposed Specialisation Hypothesis(SSH): experts specialise in a disjoint union of finegrained features rather than one broad domain. Leveraging the SSH, we introduceRouterInterp, a method for interpreting expert routing that identifies Sparse Autoencoder features most predictive of routing decisions and produces unified natural language explanations. On gpt-oss-20b, explanations from RouterInterp predict expert routing with 77% higher accuracy than prior methods. This work provides a scalable method for generating concise and more accurate explanations of expert routing and increases our understanding of a previously uninterpretable component of foundation models.

Abstract:
Learning on graphs with missing node attributes is a prevalent yet challenging problem in realworld scenarios, as graph neural networks (GNNs) typically rely on complete attribute information. Existing solutions often employ adversarial learning in a shared latent space to align graph structure and attributes. However, these methods frequently suffer from training instability and mode collapse, failing to fully capture the complex, multi-modal joint distribution of topology and features. To address these limitations, we present GLAD (Graph Latent Attribute Diffusion with Bidirectional Alignment), a novel generative framework for robust node attribute completion. GLAD leverages the strong generative capabilities of diffusion models to learn the conditional distribution of attributes given the graph structure within a decoupled latent space. Unlike previous unidirectional approaches, GLAD introduces a robust bidirectional alignment mechanism. Specifically, we incorporate a structure reconstruction constraint during training and structure-aware classifier-free guidance during sampling, ensuring that generated attributes are not only plausible but also maintain strict topological consistency with the underlying graph. Theoretically, we show that GLAD maximizes a tighter variational lower bound on the joint log-likelihood compared to GAN-based predecessors, leading to superior mode coverage. Extensive experiments on standard and large-scale benchmarks demonstrate that GLAD significantly outperforms state-of-the-art baselines in both attribute recovery quality and downstream task performance.

Abstract:
With the advance of federated learning (FL) in privacysensitive domains such as healthcare, finance, and mobile intelligence, the need for efficient and robust training becomes increasingly urgent. Communication bottlenecks, heterogeneous client distributions, and fairness requirements make it essential to select the “right” data and features for model training. Yet existing FL research often addresses feature selection and data selection separately, ignoring their interplay in real-world high-dimensional and noisy datasets, leading to suboptimal performance. In this paper, we propose a unified framework for data and feature selection by formulating the problem as a generalized CUR decomposition problem. We introduce FedGCUR, a practical framework that integrates a federated column-pivoted QR (FedCPQR) decomposition routine with per-silo row selection. Specifically, FedCPQR is designed to securely compute a global pivot order without exposing raw data, while FedGCUR leverages this to jointly select shared features and silo-specific samples. We prove that FedCPQR produces exactly the same decomposition results as centralized CPQR and establish an upper bound of the reconstruction error of FedGCUR. Extensive empirical results show that the proposed framework achieves higher accuracy compared to the baselines of data and feature selection methods, demonstrating its effectiveness and efficiency.

Abstract:
Existing ParameterEfficient Fine-Tuning (PEFT) methods are fundamentally constrained by a static allocation paradigm, which overlooks the model's evolving optimization priorities during training. To address this, we introduce Dynamic Adaptive Fine-tuning (DAF), a novel framework that periodically evaluates and reconfigures the trainable structure based on a context-aware decoupled sensitivity analysis. DAF employs a Rebuild-and-Refocus strategy to preserve learned knowledge by freezing outdated modules while decisively reallocating the parameter budget to newly identified critical regions. Extensive experiments on challenging vision benchmarks demonstrate that DAF significantly outperforms mainstream static PEFT methods and achieves state-of-the-art (SOTA) performance and efficiency, particularly under extreme parameter budgets. Our work fundamentally challenges the static nature of the field, offering a more intelligent and efficient paradigm for adapting large pretrained models. The code is available at https://anonymous.4open.science/r/DAF-9372.

Abstract:
Decentralized SGD is a fundamental algorithm in decentralized learning, although the influence of an underlying network topology on its convergence behavior is not yet fully understood. Existing convergence analyses have shown that topologies with a small spectral gap significantly deteriorate the convergence rate of Decentralized SGD in both homogeneous and heterogeneous cases. However, many prior papers have reported that indeed the choice of the topology has a significant experimental impact in the heterogeneous case, but has little experimental impact on training behavior in the homogeneous case. In this paper, we present a tighter convergence analysis of Decentralized SGD, offering a more precise understanding of how topologies affect the convergence rate than the prior analysis. Specifically, unlike existing convergence analyses that used only the spectral gap as a property of the topology, our novel analysis shows that all eigenvalues of the mixing matrix affect the convergence rate. Throughout the experiments, we carefully evaluated the convergence behavior of Decentralized SGD and demonstrated that our novel convergence analysis can more accurately describe the effect of topology on the convergence rate.

Abstract:
Time series forecasting has long relied on dense endogenous observations, yet in many realworld scenarios, such data is scarce or even absent. Existing approaches attempt to compensate with exogenous variables, but their reliance on incomplete endogenous histories makes them brittle under data scarcity. In this work, we introduce sparse endogenous forecasting as a new setting, where exogenous sequences and only sparse endogenous observations are available. To tackle this problem, we propose TimeSeed, a lightweight architecture that redefines sparse forecasting as a context reconstruction task. By jointly exploiting the stability of exogenous sequences and the limited but informative endogenous signals, TimeSeed reconstructs robust historical representations and transforms forecasting into a tractable sequence-based prediction problem. Remarkably, TimeSeed achieves this with a purely linear architecture using only 0.19M parameters, consistently outperforming state-of-the-art deep models on seven real-world benchmarks, with an average improvement of 13.01% in MSE and 7.54% in MAE. These results establish sparse endogenous forecasting as a practical and promising paradigm, opening a new direction for time series analysis under extreme data scarcity. Code is available at this repository: \urlhttps://anonymous.4open.science/r/Alistair-7.

Abstract:
We propose a nonautoregressive branching diffusion model for generating spatio-temporal point processes. Starting from a geometric principle---the Wasserstein-Fisher-Rao (WFR) gradient flow of a generalized KL divergence toward a simple reference intensity---we obtain a tractable forward noising mechanism with two interpretable components: (i) a Langevin-type \emphdrift-diffusion step that perturbs event locations and times, and (ii) a \emphbirth-death branching step that changes the event count via location-dependent thinning (deaths) and Poisson offspring replication (births). We learn the reverse-time dynamics using a permutation-equivariant denoiser that predicts a drift field and a net-growth field, and we train it using an entropic-regularized unbalanced optimal transport (UOT), which naturally handles count mismatch between noisy and clean samples. The resulting generator produces complete spatio-temporal event sets without autoregressive simulation or explicit intensity normalization.

Abstract:
Textto-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation--a hallmark of human creativity. Current reasoning-augmented paradigms mostly rely on explicit thought processes, where intermediate reasoning is decoded into discrete text at fixed steps with frequent image decoding and re-encoding, leading to inefficiencies, information loss, and cognitive mismatches. To bridge this gap, we introduceLatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph introduces four lightweight components: (i) acondenserfor summarizing intermediate generation states into compact visual memory, (ii) atranslatorfor converting latent thoughts into actionable guidance, (iii) ashaperfor dynamically steering next image token predictions, and (iv) an RL-trainedinvokerfor adaptively determining when to invoke reasoning. By performing reasoning entirely in continuous latent spaces, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. Extensive experiments demonstrate that LatentMorph(I)enhances the base model Janus-Pro by 16% on GenEval and 25% on T2I-CompBench;(II)outperforms explicit paradigms (e.g., TwiG) by 15% and 11% on abstract reasoning tasks like WISE and IPV-Txt,(III)while reducing inference time by 44% and token consumption by 51%; and(IV)exhibits 71% cognitive alignment with human intuition on reasoning invocation.

Abstract:
Openvocabulary mobile manipulation (OVMM) requires long-horizon navigation in unseen environments and object-centric manipulation. Most existing methods treat navigation and manipulation as separate stages, which can yield navigation endpoints that are poor for manipulation or manipulation-friendly poses that are globally inefficient. We address this mismatch with 3D Interaction Chains (3D-IC), a unified framework that couples multi-stage navigation and manipulation planning. 3D-IC maintains a shared 3D feature map for both skills, generates stage-aligned interaction waypoints, and links them into candidate multi-stage chains. A hierarchical policy then scores these chains by jointly considering feasibility (via VLM reasoning over waypoint-centric 3D features) and transition cost, selecting the best trade-off between success and path efficiency. The robot executes the next waypoint and replans as new observations arrive. Experiments in simulation and on a real Stretch 3 robot demonstrate consistent gains in both task success and trajectory efficiency.

Abstract:
Large language models (LLMs) have demonstrated remarkable performance on standard benchmarks, yet it remains largely unexplored whether they truly meet user expectations. Existing evaluation approaches, relying on model heuristics, expert rubrics, or user simulation, fail to capture the diversity and subtlety of real human expectations, causing models to appear competent while misaligning with what users actually seek. we present the first systematic study of user expectations in realworld LLM interactions, proposing a principled procedure to extract semantically rich expectations and introducing ExpectBench, a benchmark grounded in real user expectations. Analyses reveal that current LLMs struggle to satisfy and anticipate what users hope to obtain, highlighting a fundamental source of misalignment. Building on these observations, we propose LENS, a lightweight latent expectation–aware response generation framework. LENS enables models to internalize user expectations and generate better-aligned responses, consistently improving expectation satisfaction and underscoring the importance of explicitly modeling user expectations for realistic human–AI alignment.

Abstract:
Recently, visionlanguage models (e.g., CLIP) are increasingly adopted for continual learning to mitigate catastrophic forgetting. However, existing CLIP-based methods typically freeze the backbone to preserve pre-trained knowledge, which limits the model's ability to learn discriminative features for downstream tasks. In this paper, we introduce HypCL, a parameter-efficient framework that continually adapts CLIP in hyperbolic space for continual learning. Our key insight is that the exponentially expanding capacity of hyperbolic geometry naturally accommodates the growing class space and promotes stronger inter-class separation. Specifically, HypCL learns task-specific hyperbolic transformations and employs a lightweight task-weighting mechanism to aggregate transformations across tasks. To fully exploit the enhanced feature separability afforded by hyperbolic geometry, HypCL maintains class prototypes computed from the adapted features, which serve as stable anchors for calibrating predictions during inference. Extensive experiments on standard class-incremental benchmarks demonstrate that HypCL consistently outperforms existing CLIP-based continual learning methods.

Abstract:
Taskvector–based model merging enables low-cost, training-free multi-task learning for large language models, but suffers from severe performance degradation due to task conflict. Prior mitigation strategies largely rely on validation data for costly hyperparameter tuning, limiting both interpretability and practicality. We therefore propose OPIC, an evolutionary optimization–based model merging framework. Our preliminary experiments reveal that the degradation of In-Context Learning (ICL) capabilities is a primary driver of task conflict. Motivated by this insight, we formulate model merging as an optimization problem with ICL preservation as the objective. OPIC introduces a hierarchical refinement operators and optimizes it using self-generated data, effectively eliminating the reliance on external validation sets. Experimental results demonstrate that OPIC achieves an average performance retention of 80.73%, outperforming SOTA methods and improving by up to 11.1% over recent validation-free approaches. In addition, OPIC is compatible with existing merging pipelines, offering a new alternative solution for deploying without validation dependencies. Code is available at: https://anonymous.4open.science/r/OPIC-CFFE.

Abstract:
The Fisher Information Matrix (FIM) provides a principled geometric framework for parameter sensitivity in neural networks, but directly computing and using the full FIM is infeasible in highdimensional models. As a result, most existing methods rely on diagonal approximations that discard important correlation structure. We introduce Matrix-free Fisher Factorization (MFF), a GPU-tractable algorithm that captures both diagonal and off-diagonal dependencies without materializing the full matrix. For post-training neural network layer compression, we prove that under Matrix-Variate Normal assumptions, MFF yields GFWSVD, a unique closed-form linear layer decomposition that optimally minimizes the expected second-order loss increase. Experiments on controlled numerical benchmarks with large neural networks show that GFWSVD achieves up to 50% compression while matching or exceeding state-of-the-art diagonal and activation-based baselines across most tasks, and it reliably avoids collapse in dense architectures such as Llama 3. Moreover, when used to initialize existing optimization pipelines (e.g., Dobi-SVD), GFWSVD better preserves accuracy at 40% parameter reduction in regimes where standard methods substantially degrade. Together, these results position MFF and GFWSVD as foundational algorithmic primitives for scalable, second-order-aware neural network approximation and parameter sensitivity.

Abstract:
Estimating Individual Treatment Effects (ITE) in multitreatment scenarios faces two critical challenges: the Hyperparameter Selection Dilemma for balancing weights and the Curse of Dimensionality in computational scalability. This paper derives a novel multi-treatment generalization bound and proposes a theoretical estimator for the optimal balancing weight \alpha, eliminating expensive heuristic tuning. We investigate three balancing strategies: Pairwise, One-vs-All (OVA), and Treatment Aggregation. While OVA achieves superior precision in low-dimensional settings, our proposed Treatment Aggregation ensures both accuracy and \mathcalO(1) scalability as the treatment space expands. Furthermore, we extend our framework to a generative architecture, Multi-Treatment CausalEGM, which preserves the Wasserstein geodesic structure of the treatment manifold. Experiments on semi-synthetic and image datasets demonstrate that our approach significantly outperforms traditional models in estimation accuracy and efficiency, particularly in large-scale intervention scenarios.

Abstract:
Recent progress in large language models (LLMs) has led to systems capable of producing text with remarkable fluency. However, these models are still prone to factual inaccuracies, often referred to as ``hallucinations''. One strategy to alleviate this issue is uncertainty quantification (UQ), but most existing approaches are computationally intensive or require supervision. In this work, we propose Recurrent Attentionbased Uncertainty Quantification (RAUQ), an unsupervised and efficient framework for identifying hallucinations. The method leverages an observation about transformer attention behavior: when incorrect information is generated, certain ``uncertainty-aware'' attention heads, tend to reduce their focus on preceding tokens. RAUQ automatically detects these attention heads and combines their activation patterns with token-level confidence measures in a recurrent scheme, producing a sequence-level uncertainty estimate in just a single forward pass. Through experiments on twelve tasks spanning question answering, summarization, and translation across four different LLMs, we show that RAUQ consistently outperforms state-of-the-art UQ baselines. Importantly, it does so with minimal cost, less than 1% additional computation. Since it requires neither labeled data nor extensive parameter tuning, RAUQ serves as a lightweight, plug-and-play solution for real-time hallucination detection in white-box LLMs.

Abstract:
Neural networks for continual reinforcement learning (CRL) often suffer from plasticity loss, i.e., a progressive decline in their ability to learn new tasks arising from increased representational drift (churn) and Neural Tangent Kernel (NTK) rank collapse. Current methods mitigating this problem involve algorithmic interventions such as regularization, resets, and optimization schedules. Here, we propose InterpLayers, a lightweight architectural solution that combines a fixed, parameterfree reference pathway with a learnable projection pathway using input-dependent interpolation weights. This structure makes InterpLayers orthogonal to existing algorithmic solutions. We show through theoretical analysis that InterpLayers upper-bounds the output variability, bounds churn, and prevents a collapse of the NTK rank through continual non-zero rank contribution from the interpolation mechanism. Across different distributional shifts, including permutation, windowing, and expansion, InterpLayers outperform similar gated architectures and achieve similar performance as current state-of-the-art methods without the need for optimization-level intervention or the introduction of sensitive hyperparameters. Ablation studies highlight that these improvements are sustained when InterpLayers are combined with existing algorithmic methods for preventing plasticity loss. These results position InterpLayers as a simple, complementary solution for maintaining plasticity in CRL.

Abstract:
Mixedinteger optimization solvers often find optimal solutions early in the search, yet spend the majority of computation time proving optimality. We exploit this by learning when to terminate solvers early on distributions of similar problem instances. Our method trains a neural network to estimate the true optimality gap from the solver state, then uses conformal prediction to calibrate a stopping threshold with rigorous probabilistic guarantees on solution quality. On five problem families from the distributional MIPLIB library, our method reduces solve time by over 60% while guaranteeing 0.1%-optimal solutions with 95% probability.

Abstract:
Traditional robot codesign approaches typically converge to one configuration, which do not explore the flexibility from reconfiguration on heterogeneous environments. On the other hand, existing designs for reconfigurable robots require human-designed configurations. We present Learning to Reconfigure, a holistic pipeline for co-designing the configurations and control of reconfigurable robots in heterogeneous locomotion tasks consisting of several sub-tasks. Our pipeline proposes low-level specialized primitives with a high-level scheduler. To jointly optimize configuration design and control, our primitives employ a multi-tail architecture that disentangles these distinct objectives. Building on this, the scheduler learns to dynamically switch configurations based on global task progress. We evaluate our pipeline on locomotion tasks across walking, flying, and swimming, and compare with the state-of-the-art baselines, including single-robot control and multi-morphology co-design algorithms. Quantitative results based on traversal progress show that our pipeline outperforms single-robot baselines by 5.95x average progress. Compared with the reconfiguration-free design given by the co-design algorithms, our robots also exhibit 9.99x progress on average. These results highlight the critical role of configuration adaptation in achieving versatile robotic autonomy in complex worlds.

Abstract:
Learning from geometrically structured data is central to applications in biology, physics, and computer vision. In many tasks, meaningful comparisons depend on how features are aligned in space. Graph Neural Networks capture local structure but are constrained by message passing. Transformers model longrange dependencies but largely ignore geometry. We introduce the Vector Bundle Attention Transformer (VBA-Transformer), a framework that redefines attention as an intrinsic geometric operator. Each token couples a base manifold coordinate with a fiber feature vector, following vector bundle theory. A principled parallel transport mechanism aligns fiber features across local coordinate systems before similarity is computed. This embeds geometry directly into the attention operator. Unlike prior methods that inject geometry as an external bias or positional encoding, VBA integrates geometry natively inside attention. On challenging single-cell RNA sequencing benchmarks, VBA achieves state-of-the-art accuracy, outperforming Transformer baselines by over 3--5%. On spatial transcriptomics, it demonstrates superior clustering performance. On 3D point clouds, it achieves competitive accuracy, validating broad generalization across domains. Beyond empirical gains, we provide theoretical analysis of invariance and perturbation stability. We also demonstrate robust transport behavior empirically. Together, these results establish intrinsic geometric alignment as a powerful principle for scalable representation learning.

Abstract:
Textto-Image Person Re-Identification (TI-ReID) retrieves visible pedestrian images using text queries. Yet in low-light or nighttime settings, visible images lack sufficient identity details, while infrared images effectively capture pedestrian contours and textures. To enable all-day surveillance, we propose a dual cross-modal retrieval task called Text-to-Visible-Infrared Re-Identification (TVI-ReID) and construct corresponding tri-modal datasets. Compared to TI-ReID, TVI-ReID faces two key challenges: (1)complex hybrid discrepanciesin dual cross-modal retrieval from three modalities, and (2)semantic inconsistencybetween pretraining and downstream tasks. To address these issues, we propose a Cross-Modal Semantic Decoupling and Transfer (CSDT) framework. CSDT constructs color-related and color-irrelevant feature subspaces via Semantic Decoupling Learning (SDL) to align shared semantics across text and dual image modalities, reducing hybrid discrepancies. Moreover, Semantic Distribution Transfer (SDT) adapts pretrained text-visible alignment to text-infrared matching. Extensive experiments on tri-modal datasets show our approach outperforms existing state-of-the-art TI-ReID methods. Datasets and code will be released publicly.

Abstract:
Multimodal image generation, particularly subject-driven customization, has garnered growing attention in recent years. Despite the rapid advancement of generative models, their evaluation remains largely lagging. Existing methods, whether embedding-based or Multi-modal Large Language Model (MLLM)-based, evaluate alignment with each modal condition in isolation, which contradicts the simultaneous condition alignment objective of multi-modal image generation, leading to poor consistency with human judgments. To address this challenge, we propose UFO, the first UniFied framework for Omini-condition alignment simultaneous evaluation. Specifically, UFO introduces a novel Atomized Chain-of-Evaluation paradigm, \emphi.e., it first decomposes omni-condition alignment into a sequential chain of fine-grained, disentangled Atomic Evaluation Units (AEUs), categorizes them into distinct modality-relevance classes, and then employs general or dedicated functional calls for accurate verification of different AEU types. Experimental results demonstrate that UFO achieves the highest correlation with human evaluation preferences, delivering an average improvement of 15.25%. Furthermore, we present UFO-Bench, a dedicated benchmark designed to holistically evaluate the performance of existing customization models under the diverse mutual interactions of textual and visual conditions.

Abstract:
Openvocabulary navigation requires embodied agents to manage significant perception uncertainty stemming from semantic ambiguity and model errors. However, most existing works settle for local optimal deterministic approaches, depriving complex navigation decision-making over multiple composite possibilities that are critical for globally better solutions. In this paper, we propose Probabilistic Scene Graph Navigation (PSG-Nav), which constructs a 3D Probabilistic Scene Graph that uses full semantic categorical distributions to account for perception uncertainty. To efficiently use the local distributions to compose and reason about the optimal navigation landmarks, we propose Multiverse Decision to sample multiple most likely world settings from the joint distribution, and evaluate navigation landmarks based on the compatibility between landmarks and multiverses. To mitigate false positives due to epistemic uncertainty in open-vocabulary navigation, we introduce the Evidential Experience Calibrator, which enables online lifelong adaptation by cross-validating detections against memories of past successes and failures. Extensive experiments on widely-used benchmarks MP3D, HD3D, and HSSD demonstrate that PSG-Nav establishes new state-of-the-art results, achieving a Success Rate of 66.1%, 44.8%, and 67.9%, respectively.

Abstract:
Multiperson human mesh recovery (HMR) from a single image is inherently ill-posed, as multiple 3D poses can produce identical 2D projections due to depth ambiguity. Existing methods typically regress 3D translation implicitly from image features, which often leads to unreliable depth estimation. To address this issue, we propose a depth-guided multi-person HMR framework that explicitly models instance-level depth cues and integrates them into mesh recovery. Specifically, we first introduce an instance-aware depth estimator that predicts per-person pelvis depth from the full image, providing reliable instance-level 3D anchors and decoupling depth estimation from mesh regression. Then, based on these anchors, we design a geometry-anchored refinement decoder that injects instance-specific depth and spatial priors into the decoder initialization, guiding mesh refinement under joint 2D-3D supervision. Finally, we adopt a single-stage joint training strategy to coordinate depth estimation and mesh recovery in a unified framework. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance in both mesh reconstruction accuracy and depth ordering.

Abstract:
Evolutionary multitasking accelerates black-box optimization via knowledge transfer but falters in scenarios involving many low-similarity tasks. We identify this scalability barrier as theMulti-Task Curse, driven by evaluation budget dispersion and negative transfer. To overcome this, we propose MES-RET (Many-taskEvolutionStrategy withReward-weightedEvaluation andTransfer), which combats budget dispersion via a reward-weighted evaluation scheme that guarantees superior expected improvement, while simultaneously mitigating negative transfer through a robust reward-weighted aggregation of mean and covariance statistics, ensuring a safe fallback to independent evolution. Furthermore, to handle neural dimensional mismatches in many-task policy search, we introduce a semantic parameter alignment strategy that bridges heterogeneous state-action spaces. Extensive experiments on synthetic benchmarks, real-world engineering problems, and reinforcement learning tasks demonstrate that MES-RET consistently outperforms state-of-the-art methods, notably enabling skill transfer across morphologically distinct policies.

Abstract:
There have been numerous attempts to distill quadratic attentionbased LLMs into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.

Abstract:
LowRank Adaptation (LoRA) has gained popularity as a fine-tuning approach for Large Language Models (LLMs) due to its low resource requirements and good performance. While numerous studies have investigated improving LoRA serving efficiency by serving multiple LoRAs concurrently, existing methods assume that a wide range of LoRA adapters are available for serving. In our work, we conduct extensive empirical studies to show that current LoRA training paradigms do not efficiently utilize hardware resources and incur high overhead to obtain a performant LoRA adapter. Leveraging these insights, we propose PLoRA, which automatically orchestrates concurrent LoRA fine-tuning jobs under given hardware and model constraints and develops performant kernels to improve training efficiency. Across a range of LLMs and LoRA configurations, PLoRA improves training throughput by up to 12.8x and reduces the overall fine-tuning makespan by up to 7.52x compared to existing approaches.

Abstract:
Robust machine learning for graphstructured data has made significant progress against test-time attacks, yet certified robustness to poisoning – where adversaries manipulate the training data – remains largely underexplored. For image data, state-of-the-art poisoning certificates rely on partitioning-and-aggregation schemes. However, we show that these methods fail when applied in the graph domain due to the inherent label and structure sparsity found in common graph datasets, making effective graph-partitioning difficult. To address this challenge, we propose a novel semi-supervised learning framework called deep Self-Training Graph Partition Aggregation (ST-GPA), which enriches each graph partition with informative pseudo-labels and synthetic edges, enabling effective certification against node-label and graph-structure poisoning under sparse conditions. Our method is architecture-agnostic, scales to large numbers of partitions, and consistently and significantly improves robustness guarantees against both label and structure poisoning across multiple benchmarks, while maintaining strong clean accuracy. Overall, our results establish a promising direction for certifiably robust learning on graph-structured data against poisoning under sparse conditions.

Abstract:
Millimeterwave (mmWave) sensing enables privacy-preserving, always-on edge perception, but its measurements are often sparse, temporally irregular, and corrupted by high-frequency noise. Existing mmWave pipelines predominantly rely on artificial neural networks (ANNs), which achieve robustness through extensive preprocessing or deep architectures, thereby limiting their efficiency on edge devices. In this work, we study spiking neural networks (SNNs) for mmWave sensing from a mechanism–data alignment perspective. By leveraging the low-pass filtering behavior of leaky integrate-and-fire (LIF) dynamics, we analyze how their implicit temporal filtering interacts with the frequency structure of mmWave signals. Our analysis shows that when discriminative information resides in low-to-mid frequencies, LIF dynamics can inherently suppress high-frequency noise, clarifying when and why SNNs outperform ANNs. Based on this insight, we derive a principled criterion for configuring the membrane decay factor by matching the effective bandwidth of LIF dynamics to the data’s discriminative spectral content. Experimental results across four widely used mmWave datasets validate the proposed frequency-matching hypothesis, yielding an average test-accuracy improvement of 6.22% and a 3.64× reduction in theoretical energy consumption relative to ANN baselines, under a unified evaluation protocol.

Abstract:
Selective State Space Models (SSMs) such as Mamba have emerged as efficient alternatives to Transformers, achieving linear complexity through inputdependent parameterization. However, this selectivity transforms the system from linear time-invariant (LTI) to linear parameter-varying (LPV), where individually stable matrices can produce unbounded trajectories under switching. Existing work focuses on empirical performance, leaving global stability, robustness bounds, and practical certification unresolved. This paper develops a control-theoretic framework providing the first comprehensive stability and robustness analysis for selective SSMs. We prove BIBO stability by viewing selective scans as continuous-time LTI sampling and establish two-term robustness bounds with linear growth in sequence length. For general LPV systems, we provide common quadratic Lyapunov function conditions and develop algorithms to extract certificate constants directly from trained weights. These results bridge control theory and SSM architectures, enabling formal guarantees for safety-critical deployment.

Abstract:
The quantity and 3D shape of wear particles are essential indicators for assessing the health of jet engines, enabling early detection of potential damage and preventing accidents caused by catastrophic failures. However, capturing wear particles is difficult due to their minute sizes and ultra high-speed movement within intense jet flows. Existing technologies struggle with the extreme background noise and low resolution in such harsh environments. In this paper, we propose QuantWear, the first quantum sensing system designed to directly quantify and profile wear particles on the sub-millimeter scale. QuantWear innovatively tracks wear particles by monitoring the spectral signatures of Sodium (Na) and Potassium (K) atoms within jet flow, which naturally adhere to particle surfaces due to electrochemical reactions in high-temperature combustion. We construct a custom atomic detector that leverages quantum jump and Faraday rotation effects to isolate these specific atomic signals, effectively suppressing the broad-spectrum flame noise. Next, we apply a deep learning framework to effectively measure the quantity of wear particles in dynamic vaporous backgrounds. Finally, we generate a fully reconstructed 3-D model of the wear particles from multiple 2-D images. Extensive field tests and high-fidelity simulations demonstrate that QuantWear achieves an imaging Signal-to-Noise Ratio (SNR) of 22.5 dB and a 3-D reconstruction similarity exceeding 95%, significantly outperforming state-of-the-art technologies.

Abstract:
Federated learning enables decentralized clients to collaboratively train models without sharing local data. However, heterogeneous client distributions often induce client drift and hinder convergence. This paper proposes FedCDWA, a decoupled hierarchical federated distillation framework. FedCDWA decouples clientside personalized distillation from server-side mutual distillation to mitigate distillation-induced optimization conflicts. It further adopts Hierarchical Wasserstein Aggregation to aggregate prototypes without restrictive parametric assumptions while preserving intra-class structure and inter-class geometry. To achieve finer-grained feature alignment, Prototype–Variance Dual Alignment matches feature means and variances in the feature space. We prove convergence guarantees for FedCDWA. Experiments on three datasets demonstrate that FedCDWA consistently improves both global and personalized accuracy across heterogeneity levels, with smaller performance degradation under more severe heterogeneity.

Abstract:
Antibody expression ranking is a critical task in antibody design, yet its modeling is severely hindered by the scarcity of labeled experimental data. To address this, we propose a unified preferencebased learning framework that integrates scarce quantitative expression data with large-scale weak positive supervision from immunization data. We adapt Direct Preference Optimization (DPO) to protein language models by introducing a union-masked log-likelihood approximation and IMGT-based alignment, enabling efficient training on variable-length sequences. Evaluating on a diverse internal dataset of 1254 labeled sequences and 4 million unlabeled camelid-derived antibodies, we show that our method consistently outperforms baselines on most metrics. Our results demonstrate that preference learning can effectively learn from weak supervision, providing a scalable solution for antibody expressibility optimization in data-constrained settings.

Abstract:
Crossmodal Hashing (CMH) typically assumes a perfectly complete data annotation, whereas noisy labels are unavoidable in practical scenarios. Existing CMH methods often overlook the uncertainty introduced by noise or semantic ambiguity, making models susceptible to overfitting noisy labels and yielding unreliable similarity judgments during inference. To address this issue, we propose a Robust Self-reflective Hashing (RSH) framework that prudently analyzes semantic discrepancies while accounting for uncertainty, thereby effectively mitigating interference from noisy labels. Specifically, the Double Feature Representation (DFR) method is introduced, employing semantic and uncertainty features to represent the semantic representation and fuzziness of samples. With a double feature, we propose a novel cross-modal similarity metric - the Self-reflective Similarity Metric (SSM), which judges sample similarity by integrating semantic discrepancy and fuzziness, enabling the model to adaptively weaken semantic discrepancy according to uncertainty level. The proposed method is plug-and-play, enabling seamless integration into diverse objective functions to enhance model robustness and reliability. Extensive experiments on benchmark datasets demonstrate that RSH outperforms existing methods.

Abstract:
Lateinteraction retrievers like ColBERT achieve high quality but suffer from large multi-vector indices. Standard compression minimizes token reconstruction error, while ranking depends critically on preserving scores of sparse "winner" tokens. We introduce CrossQ which adaptively allocates precision within documents by conditioning token codes on lightweight document context computed at indexing time (but not stored). Trained with ranking-aligned objectives that preserve candidate score distributions and protect hard-negative margins, CrossQ improves MRR@10 by +0.012 on MS MARCO and average nDCG@10 by +0.018 on BEIR over strong baselines at matched footprints (2-8 B/token). With light fine-tuning, CrossQ achieves 61x compression while narrowing the gap with full-precision ColBERT to just 2.3% MRR@10, establishing a new state-of-the-art in the latency-quality tradeoff for memory-constrained retrieval.

Abstract:
Ensuring safe behavior in reinforcement learning (RL) is challenging when safety constraints are implicit and cannot be densely measured. In many settings, supervision is limited to coarse approvals or rejections of whole trajectories (e.g., whether a rollout remained within an unknown safety threshold). We proposeTraCeS(Trajectorybased Constraint Estimation for Safety), a method for learningper-timestep violation creditfrom such sparse trajectory-level labels.TraCeStrains a sequential violation estimator whose per-step credits factorize the predicted probability that a trajectory hasnot yet violatedthe constraint, and integrates this learned signal into constrained policy optimization. The method requires neither a known cost function nor a known threshold, and remains compatible with standard continuous-control algorithms. We provide a theoretical analysis of the approximation gap introduced by the learning objective, and demonstrate empirically thatTraCeSimproves constraint satisfaction and feedback efficiency over baselines across multiple continuous-control benchmarks, including long-horizon tasks and settings with noisy or inconsistent labels.

Abstract:
The growing capability of video generation poses escalating security risks, making reliable detection increasingly essential. In this paper, we introduceVideoVeritas, a framework that integrates finegrained perception and fact-based reasoning. We observe that while current multi-modal large language models (MLLMs) exhibit strong reasoning capacity, their granular perception ability remains limited. To mitigate this, we introduceJoint Preference AlignmentandPerception Pretext Reinforcement Learning (PPRL). Specifically, rather than directly optimizing for detection task, we adopt general spatiotemporal grounding and self-supervised object counting in the RL stage, enhancing detection performance with simpleperception pretext tasks. To facilitate robust evaluation, we further introduceMintVid, a light yet high-quality dataset containing 3K videos from 9 state-of-the-art generators, along with a real-world collected subset that has factual errors in content. Experimental results demonstrate that existing methods tend to bias towards eithersuperficialreasoning ormechanicalanalysis, whileVideoVeritasachieves more balanced performance across diverse benchmarks.

Abstract:
The natural world is richly structured over space and time. Much of this structure arises from the interplay between spatial geometry and motion. However, most existing world models ignore this structure, leading to an inability to generalize in dynamic environments. In this work, we show that enforcing equivariance between an agent's representations and the world's dynamics necessarily induces an efficient, structured memory. Concretely, we introduce Flow Equivariant World Modeling, a framework in which both selfmotion and external object motion are unified as one-parameter Lie-group ``flows'' acting on a latent world memory; and models are built to be equivariant with respect to these transformations. On 2D and 3D partially observed video world modeling benchmarks, we demonstrate that Flow Equivariant World Models significantly outperform comparable state-of-the-art diffusion-based and memory-augmented world modeling architectures in their ability to track and predict the locations of moving objects over long horizons. Project page: https://anonflowm.github.io/

Abstract:
Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception for openworld understanding, but exacerbate LLMs' hallucinations, in which generated text contradicts visual evidence or common sense. To mitigate hallucinations, a dominant strategy is Direct Preference Optimization (DPO) using hallucination-labeled responses. Existing pipelines, however, face two key limitations: they either (i) rely on human inspection or proprietary models to correct hallucinated outputs, producing off-policy preference data that violate the basic assumptions of DPO, or (ii) depend on stronger peer models to evaluate responses, leading to an unfavorable trade-off between performance and scalability. Departing from these paradigms, we propose an on-policy \emphself-feedback framework that constructs preference data for hallucination mitigation without any external supervision (e.g., large models or humans). Specifically, we present a novel \emphlocal fuzzy semantic evaluation paradigm that derives a hallucination-sensitive confidence signal directly from the model's own logits, which is then used to automatically rank diverse generated responses to build high-quality preference pairs for fine-tuning. Trained on a 10k-scale self-generated preference dataset, our self-feedback pipeline achieves over a 50% relative reduction in HalRate\downarrow on AMBER compared to the GPT-4V feedback baselines. Models, code, and datasets will be released upon acceptance.

Abstract:
Existing sentencelevel watermarking methods enhance robustness to paraphrasing by anchoring watermarks in sentence semantics. However, their prefix-based designs remain vulnerable to structural perturbations, such as sentence splitting and merging, which commonly arise under strong paraphrasers like DIPPER and GPT-3.5. To mitigate this issue, we propose AliMark, a framework that reformulates sentence-level watermarking as a bit sequence encoding and alignment problem between a potentially watermarked text and a secret bit sequence. Notably, our approach adopts a two-stage detection strategy: we generate multiple restructured text variants and adaptively align their extracted bit sequences with the secret bit sequence to minimize alignment cost. This multi-candidate alignment design naturally improves robustness to sentence merges and splits. Extensive experiments demonstrate that AliMark substantially outperforms state-of-the-art baselines under diverse paraphrasing attacks.

Abstract:
Measuring singlecell protein abundance is essential for resolving biological mechanisms and disease progression with high resolution. However, due to the high costs and antibody throughput limitations of current proteomics, inferring protein levels from readily available RNA data has become a critical computational necessity. Existing regression and generative methods face a fundamental geometric bottleneck: enforcing deterministic constraints on noisy, heteroscedastic data collapses intrinsic uncertainty into a rough latent manifold, which destabilizes the learning process. To overcome this, we present scChord, a noise-decoupled conditional flow matching framework built on Probabilistic Manifold Rectification. Our approach utilizes a probabilistic decoder to disentangle technical noise and over-dispersion from the raw counts, absorbing them into distributional parameters. This allows the rectified latent manifold to focus more on biological signals, serving as a robust geometric regularizer for learning efficient transport trajectories. Extensive experiments on multiple multi-omics benchmarks demonstrate that scChord not only achieves state-of-the-art inference accuracy but also faithfully reconstructs high-fidelity biological heterogeneity and complex protein distributions.

Abstract:
Learning motion latents for robotic manipulation heavily relies on extracting motion patterns from visual sequences, yet effective action abstractions require understanding threedimensional geometric transformations. Here, we introduce GeoMoLa (Geometry-Aware Motion Latents), which learns discrete motion latent codes by predicting how point clouds evolve during manipulation rather than reconstructing visual observations. This four-dimensional objective – spatial geometry changing through time – forces latent representations to encode actual physical motion rather than appearance patterns. GeoMoLa achieves state-of-the-art performance using only single-view RGB-D input, while existing methods require multi-view reconstruction, succeeding across diverse manipulation benchmarks. Our ablations reveal that geometric prediction is the key to driving performance, quantitatively validating that manipulation depends on spatial understanding. Furthermore, the learned codes exhibit effective motion abstraction: applying them to novel scenes produces physically consistent transformations regardless of visual context. Our real-world experiments also confirm this robustness capability, achieving robust manipulation with minimal demonstrations in cluttered environments where geometric reasoning determines success. Thus, we demonstrate that effective motion latents for robot control can better emerge from understanding motion through its three-dimensional effects rather than pixel-level patterns.

Abstract:
Large Language Models (LLMs) exhibit strong capabilities in highlevel semantic understanding and strategic planning, yet they suffer from persistent quantitative failure modes, such as imprecise computation and the illusion of quantitative coherence, which limit their reliability in high-stakes decision-making. To address these limitations, we decouple reasoning from computation by assigning LLMs to planning, analysis, and result interpretation, while delegating numerical computation and statistical inference to specialized external tools. These tools are not hard-coded; instead, they are created in a constrained and structured manner during planning as explicit intermediate reasoning artifacts, enabling adaptive and scenario-dependent quantitative reasoning. LLMs iteratively analyze tool outputs under diverse market conditions and leverage performance-based feedback to refine subsequent tool selection and construction, forming a bounded self-evolving loop. We instantiate this process through self-play in a controllable digital twin market, DecoupledMarket, where LLM agents continuously test, compare, and adapt their strategies. By coupling high-level planning with robust quantitative execution, the proposed framework improves the quantitative reliability of LLM-driven decision-making. Code will be released soon.

Abstract:
We present ReflFlow, a novel framework for highfidelity rendering of dynamic specular scenes by addressing two key challenges: precise reflection direction estimation and physically accurate modeling. To achieve this, we propose a Residual Material-Augmented 2D Gaussian Splatting representation that models dynamic geometry and material properties, allowing accurate reflection ray computation. Furthermore, we introduce a Dynamic Environment Gaussian and a hybrid rendering pipeline that decomposes rendering into diffuse and specular components, enabling physically informed specular synthesis via rasterization and ray tracing. Finally, we devise a coarse-to-fine training strategy to improve optimization stability and promote physically meaningful decomposition. Extensive experiments on dynamic scene benchmarks demonstrate that ReflFlow outperforms prior methods quantitatively and qualitatively, producing sharper and more realistic specular reflections in complex dynamic environments.

Abstract:
We study a learner who sequentially makes and broadcasts predictions of some underlying adversarially varying state. Many downstream decision makers with different goals and different longterm constraints consume these decisions to choose actions. In this setting we give the first algorithm that obtains simultaneousdynamic regretguarantees for all of the decision makers --- where regret for each agent is measured against a potentially changing sequence of actions across rounds of interaction, while also ensuring vanishing constraint violation for each agent. We can promise these dynamic regret bounds not just marginally, but simultaneously on many different intersecting subsequences, which lets decision makers compete with strategies that adapt with both long-term drift and short-term variation. Our results do not require the decision makers to maintain any state, but just to react myopically to our predictions.

Abstract:
Recent advances in VisionLanguage-Action (VLA) models have significantly improved robotic perception and manipulation capabilities, but still struggling to adapt in dynamic, open-ended real-world environments due to a lack of reliable task progress feedback and improvement mechanisms. To address these challenges, we propose a generalist Vision Language Action-Critic model, VLAC, which can integrate both human and robot data, and unify action policy and task progress critic within a single autoregressive architecture. Specifically, we propose a scalable and generalizable pair-wise progress understanding approach that can predict the delta of task progress between two steps in a trajectory and generate correct actions to complete the task. Then, we trained the model on large-scale, multi-source human, robot, and general vision-language data for a generalist. Furthermore, we deploy reinforcement learning where VLAC can autonomously evaluate task progress to provide intrinsic rewards. Extensive evaluations demonstrate that our model generalizes effectively across diverse tasks and environments, leveraging its pair-wise progress understanding to provide reliable dense rewards, robust action generation, and significant improvements in real-world reinforcement learning.

Abstract:
Neural architecture design lacks first principles: innovations are discovered empirically and justified posthoc, with no systematic way to diagnose why an architecture fails or derive what repair will succeed. We introduce the Axiomatic Atlas, encoding requirements as composable axioms over graph connectivity, operator contracts, numerical stability, and information preservation. Given an operator library and wiring conventions, the Atlas constructs certificates lower-bounding output variation via min-cut analysis and diagnoses failures by locating axiom violations. Crucially, the framework is prescriptive: each violation implies a targeted repair, reducing architecture design to constraint satisfaction. We prove variation bounds under exact and finite-precision arithmetic, enabling modular verification across transformers, MoEs, SSMs, and GNNs. Four Atlas-derived interventions validate the approach: +46 percentage points on GNN bottlenecks, 3× robustness to MoE quantization, 83% gap closure with adaptive expert budgets, and 0%\to100% retrieval via orthogonal keys---each against matched negative controls.

Abstract:
Time series forecasting plays a pivotal role in datadriven decision-making across various time series domains. Recently, leveraging their ability to extract semantically rich representations, Large Language Models (LLMs) have achieved promising results in time series forecasting. However, existing LLM-based methods struggle to obtain multi-scale retrieval-augmented representations due to entangled multi-scale representations and redundant multi-scale interference. To address this, we propose TimeMRA, an LLM-empowered Time series forecasting framework via Multi-Scale Retrieval-Augmented representations. Specifically, a scale-aware prompt generation (SAPG) module is designed to decompose time series into multiple scales and generate augmented multi-scale representations. Then, a cross-scale disentanglement constraint (CSDC) mechanism with a router network is designed to obtain the disentangled multi-scale semantic representations while mitigating interference from irrelevant scales. Finally, a cross-modality retrieval module is designed to obtain multi-scale retrieval-augmented representations for time series forecasting. Experiments on 10 real-world datasets demonstrate that TimeMRA achieves state-of-the-art (SOTA) performance.

Abstract:
Longform TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storyline often relies onspeaker recognition, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduceDramaSR-532K, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We proposeDramaSR-LRM, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool-use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable.All the data and code will be made publicly available.

Abstract:
Generative Graph Language Models (GLMs) must reconcile topology with causal language modeling. Linearization obscures multihop connectivity, while encoder-based methods bottleneck token-level reasoning during generation. Viewing context modeling as a form of message passing, we introduceWeaver, an encoder-free framework that extends the attention mechanism of decoder-only LLMs to enable graph reasoning. Weaver maps graph distances into rotary positional embeddings so that structurally connected nodes become proximate in attention space, propagating information over graph topology as if it were sequential context. To achieve this, we combine: 1) a masking mechanism for causal tokens with graph structures; 2) a unified geometric encoding that couples sequential position and graph distance in joint rotary embeddings (Graph-over-Tokens RoPE); and 3) a design principle to prioritize local information to resolve positional ambiguity under graph symmetries. On zero-shot benchmarks, Weaver achieves state-of-the-art performance among generative GLMs, with gains of up to 30% over prior generative methods on heterophilic graphs, while matching specialized discriminative models on citation networks---all within a unified decoder-only framework.

Abstract:
We formulate the learning of generalist VisionLanguage-Action (VLA) models as a Gromov-Wasserstein alignment problem, aiming to map semantically similar VL embeddings to physically similar motion primitives. However, solving this is challenging due to the mathematical heterogeneity between the domains: the semantic space of vision-language is topologically linear and isotropic, while the physical manifold of robotic action is non-Euclidean and anisotropic. As a result, direct regression approaches fail due to the disjoint metric structures of these domains, making standard distance minimization ill-posed. To resolve this incompatibility, we introduce LAST (Lie-algebraic Action Space Tokenizer). LAST reconstructs the action space to establish a more consistent metric alignment between the VL and Action modalities. Specifically, LAST bridges the heterogeneity via two stages: (1)Global Topological Linearization, which linearizes the action manifold through Lie-algebraic mapping, converting trajectories into a fixed-length, physically additive representation; and (2)Local Metric Discretization, where the representation is discretized hierarchically into schemas and whitened residuals, establishing a mathematical isomorphism with the isotropic Euclidean metric. By addressing the structural mismatch globally and locally, LAST enables VLA models with enhanced convergence and generalizability.

Abstract:
Dataset Distillation (DD) has emerged as a promising technique for compressing largescale datasets into compact synthetic sets while preserving model performance. However, the security implications of this paradigm, particularly within the Transformer-based text classification domain, remain underexplored. In this paper, we identify "Distilled Attention Labels" as a pivotal yet overlooked vulnerability. We propose Attention Hijacking (AH), a stealthy backdoor attack that manipulates the bi-level optimization process to explicitly hijack the attention mechanism of target models via synthetic data. Distinct from traditional poisoning that often compromises clean accuracy, AH achieves stealthiness without utility degradation. To explain this, we formulate the "Semantic Anchoring Hypothesis", characterizing the interaction between trigger semantics and attack mechanisms. We demonstrate that AH functions as a semantic-adaptive mechanism: when triggers align with domain-specific semantic anchors (e.g., "film" in sentiment analysis), our method achieves a synergistic effect, boosting both Attack Success Rates (ASR >99%) and Clean Test Accuracy (CTA). Conversely, for functional or noise triggers, AH enforces attention segregation to prevent utility collapse, maintaining exceptional robustness where baseline attacks fail. Extensive experiments across multiple datasets and varying model scales—ranging from BERT-Tiny to BERT-Base—validate the scalability and dominance of AH. Our findings reveal that attention-based distillation is a double-edged sword, underscoring the urgent need for robust defenses in the era of data-efficient learning.

Abstract:
Realworld time series inherently exhibit complex temporal patterns. Within chaotic systems, significant mixing and entanglement occur between different time-varying modes. Given that time series exhibit distinctly different patterns at various sampling scales, downsampling to extract multiscale features is a common approach. However, conventional downsampling causes high-frequency components in the original signal, those exceeding the new Nyquist frequency, to undergo spectral folding. This erroneously introduces spurious low-frequency patterns, perceived as low-frequency noise, thereby leading to thealiasing problem. To address this problem, we propose a Decomposition-Prevention-Fusion architecture framework calledDMANet, which introduces theDynamicMultiscaleAnti-AliasingNetwork. Specifically, DMANet comprises two key components: Multiscale Convolutional Downsampling, designed to capture temporal dependencies and inter-channel interactions, and an Anti-Aliasing Operation, which includes Pre-Sampling Anti-Aliasing Filtering and Post-Sampling Interpolation. These designs guarantee the fidelity of multiscale features before and after downsampling. We show that by mitigating the risk of aliasing, our proposed simple convolutional downsampling architecture achieves performance competitive with common baselines. Our codes are available at https://anonymous.4open.science/r/DMANet.

Abstract:
In autoregressive large language models (LLMs), temporal straightening offers an account of how the nexttoken prediction objective shapes representations. Across layers, models progressively straighten the trajectory of input sequences in activation space, potentially facilitating extrapolation to the next token. However, a direct link between this geometry and token-level behavior has been missing. We provide such a link by relating contextual curvature—a geometric measure of how sharply the representation trajectory bends over recent context—to next-token entropy. Across model families (GPT-2 XL and Pythia-2.8B), contextual curvature is correlated with entropy, and this relationship emerges during training. Perturbation experiments provide causal evidence: reducing curvature through trajectory-aligned interventions selectively lowers entropy, while geometrically misaligned perturbations have no effect. Finally, explicitly regularizing representations to be straighter during training modestly reduces token-level entropy without degrading validation loss. These results identify trajectory curvature as a task-aligned representational feature that influences output uncertainty, suggesting that temporal straightening could be a functional principle shaping prediction in autoregressive language models.

Abstract:
Graph structure learning is a core problem in graphbased machine learning, essential for uncovering latent relationships and ensuring model interpretability. However, most existing approaches are ill-suited for large-scale and dynamically evolving graphs, as they often require complete re-learning of the structure upon the arrival of new nodes and incur substantial computational and memory costs. In this work, we propose GraphFLEx—a unified and scalable framework for Graph Structure Learning in Large and Expanding Graphs. GraphFLEx mitigates the scalability bottlenecks by restricting edge formation to structurally relevant subsets of nodes identified through a combination of clustering and coarsening techniques. This dramatically reduces the search space and enables efficient, incremental graph updates. The framework supports 48 flexible configurations by integrating diverse choices of learning paradigms, coarsening strategies, and clustering methods, making it adaptable to a wide range of graph settings and learning objectives. Extensive experiments across 26 diverse datasets and graph neural network architectures demonstrate that GraphFLEx achieves state-of-the-art performance with significantly improved scalability. Our implementation is publicly available here: https://anonymous.4open.science/r/ScalingGraphLearning-CB67/README.md.

Abstract:
Pixelbased reinforcement learning agents often fail under purely visual distribution shift even when latent dynamics and rewards are unchanged, but existing benchmarks entangle multiple sources of shift and hinder systematic analysis. We introduce KAGE-Env, a JAX-native 2D platformer that factorizes the observation process into independently controllable visual axes while keeping the underlying control problem fixed. By construction, varying a visual axis affects performance only through the induced state-conditional action distribution of a pixel policy, providing a clean abstraction for visual generalization. Building on this environment, we define KAGE-Bench, a benchmark of six known-axis suites comprising 34 train-evaluation configuration pairs that isolate individual visual shifts. Using a standard PPO-CNN baseline, we observe strong axis-dependent failures, with background and photometric shifts often collapsing success, while agent-appearance shifts are comparatively benign. Several shifts preserve forward motion while breaking task completion, showing that return alone can obscure generalization failures. Finally, the fully vectorized JAX implementation enables up to 33M environment steps per second on a single GPU, enabling fast and reproducible sweeps over visual factors.

Abstract:
By introducing routers to selectively activate experts in Transformer layers, the mixtureof-experts (MoE) architecture significantly reduces computational costs in large language models (LLMs) while maintaining competitive performance, especially for models with massive parameters. However, prior work has largely focused on utility and efficiency, leaving the safety risks associated with this sparse architecture underexplored. In this work, we show that the safety of MoE LLMs is as sparse as their architecture by discovering \text\emphunsafe routes: routing configurations that, once activated, convert safe outputs into harmful ones. Specifically, we first introduce the \underline\textRouter \underline\textSafety \underline\textimportance \underline\textscore (RoSais) to quantify the safety criticality of each layer's router. Manipulation of only the high-RoSais router(s) can flip the default route into an unsafe one. For instance, on JailbreakBench, masking 5 routers in DeepSeek-V2-Lite increases attack success rate (ASR) by over 4× to 0.79, highlighting an inherent risk that router manipulation may naturally occur in MoE LLMs. We further propose a \underline\textFine-grained token-layer-wise \underline\textStochastic \underline\textOptimization framework to discover more concrete \underline\textUnsafe \underline\textRoutes (F-SOUR), which explicitly considers the sequentiality and dynamics of input tokens. Across four representative MoE LLM families, F-SOUR achieves an average ASR of 0.90 and 0.98 on JailbreakBench and AdvBench, respectively. Finally, we outline defensive perspectives, including safety-aware route disabling and router training, as promising directions to safeguard MoE LLMs. We hope our work can inform future red-teaming and safeguarding of MoE LLMs.

Abstract:
Gradient boosting decision forests, used by XGBoost or AdaBoost, offer higher accuracy and lower training times than decision trees for large datasets. Protocols for private inference over decision trees can be used to preserve the privacy of the input data as well as the privacy of the trees. However, naively extending private inference over decision trees to private inference over decision forests by replicating the protocols leads to impractical running times. In this paper, we propose an efficient private decision inference protocol using homomorphic encryption. We present several optimizations that identify and then remove (approximate) duplication between the trees in a forest, thereby achieving significant improvements in communication and computation cost over the naive approach. To the best of our knowledge, we present the first private inference protocol for highly scalable gradient boosting decision forests. Our protocol's (SilentWood) inference time is faster than the baseline of parallel running \revisethe RCCPDTE protocol by Mahdavi et al.~by up to 42.5x, and faster than Zama's Concrete ML XGBoost by up to 27.8x, and faster than SoK-GGG's two-party garbled circuit protocol by 2.94x.

Abstract:
Lagrangianbased methodologies are one of the fundamental paradigms of safe reinforcement learning (RL) for constrained Markov decision processes, particularly when dealing with multi-constraint cases. While the specific details of the methodologies may differ, with some using a single estimator for the overall mixed penalty term of the constraints and others using separate estimators for the constraints, the fundamental question of the theoretical validity of the methodologies has remained largely unexplored. The present paper performs the first theoretical analysis of the methodologies and proves that the use of the mixed critic structure leads to the presence of a bias due to the target drift of the Lagrange multipliers. On the other hand, the use of the dedicated critic structure, where separate critics are used for the reward function and the constraint functions, does not suffer from this bias. The theoretical analysis is supported with experiments on a realistic power system environment with multiple constraints, where the dedicated critic structure succeeds in satisfying the constraints, whereas the mixed critic structure fails.

Abstract:
Wireless channel modeling is essential for the design, analysis, and optimization of modern wireless sensing and communication systems. However, accurately modeling wireless channels in electrically large and complex environments remains a longstanding challenge, owing to the intricate interactions between radio-frequency (RF) signals and surrounding objects (e.g., reflection, diffraction, and scattering). Unlike conventional ray-tracing pipelines that rely on hand-engineer interaction rules, or black-box neural surrogates that do not explicitly model physical structure, we introduce SNRFT, a novel framework that integrates neural representations with physics-based RF propagation modeling. Our key idea is to view RF transport as a stochastic propagation process, from which a material-dependent attenuation coefficient emerges naturally as the rate parameter governing transport dynamics. This formulation inherently satisfies key physical constraints such as reciprocity and reversibility. Building on this foundation, we employ implicit neural representations to capture complex RF-object interactions while preserving the composability of traditional ray tracing. Extensive evaluations on real-world wireless communication and sensing testbeds demonstrate that SNRFT consistently outperforms existing methods, while requiring significantly fewer training samples.

Abstract:
Concept erasure serves as a vital safety mechanism for removing unwanted concepts from textto-image (T2I) models. While extensively studied in U-Net and dual-stream architectures (e.g., Flux), this task remains under-explored in the recent emerging paradigm of single-stream diffusion transformers (e.g., Z-Image). In this new paradigm, text and image tokens are processed as a single unified sequence via shared parameters. Consequently, directly applying prior erasure methods typically leads to generation collapse. To bridge this gap, we introduce Z-Erase, the first concept erasure method tailored for single-stream T2I models. To guarantee stable image generation, Z-Erase first proposes a Stream Disentangled Concept Erasure Framework that decouples updates and enables existing methods on single-stream models. Subsequently, within this framework, we introduce Lagrangian-Guided Adaptive Erasure Modulation, a constrained algorithm that further balances the sensitive erasure-preservation trade-off. Moreover, we provide a rigorous convergence analysis proving that Z-Erase can converge to a Pareto stationary point. Experiments demonstrate that Z-Erase successfully overcomes the generation collapse issue, achieving state-of-the-art performance across a wide range of tasks.

Abstract:
Quantile estimation and inference play essential roles in diverse scientific and industrial applications, and their accuracy can often be enhanced by integrating auxiliary data from multiple sites. However, developing efficient aggregation methods for quantile inference under potential privacy constraints, particularly with heterogeneous datasets, remains challenging. To address these issues, we propose a systematic framework for quantile estimation and inference under potential local differential privacy (LDP). The key idea is to construct weighted estimators by adaptively aggregating quantile estimates from target and source sites. The adaptive weights are determined by minimizing the asymptotic variance, incorporating an additional \ell_2 penalty to account for parameter shift. A parallel stochastic gradient descent algorithm under LDP constraints is developed for weight estimation and valid inference. Additionally, we introduce a conservative weighted estimator to ensure robust inference across diverse heterogeneous scenarios. Rigorous theoretical analysis establishes the consistency, normality, and effectiveness of the proposed methods. Extensive numerical studies and real data application corroborate our theoretical findings.

Abstract:
Logistic regression is widely used in applications; however, when the dimension scales with the sample size, theory reveals that the asymptotic behavior of common Mestimators depends on bias and variance scaling constants, which are functions of the signal strength. To leverage the theory to design statistical methodologies, it is essential to obtain accurate estimates of the signal strength. In this work, we utilize a data-splitting strategy to efficiently estimate the signal strength. To alleviate issues caused by separable data, we analyze the exact asymptotics of an M-estimator with a data-driven, non-decomposable regularizer that adapts to the true covariance structure. We justify the validity of our method through both theoretical analysis and numerical experiments.

Abstract:
Empirical successes show that regularization improves the stability and efficiency of reinforcement learning (RL), with applications in robotics and posttraining of large language models. Yet, theoretical analyses of regularized Markov decision processes (MDPs) have mostly been confined to the standard RL setting. In this work, we investigate regularized MDPs through the lens of robust RL. We introduce a doubly regularized MDP framework that combines policy and dynamics regularization to enable robust policy learning against reward and dynamics perturbations. Within this framework, we develop an optimism-based online algorithm and provide the first finite-sample regret guarantees in both tabular and rich-observation settings, where the state-action space may be continuous. Our results show that algorithms for doubly regularized MDPs are as sample-efficient as well-studied robust MDP algorithms, while additionally benefiting from the flexibility of soft policies. Finally, we use experiments to demonstrate that our approach efficiently and effectively handles function approximation and exploration in large state-action spaces, achieving robust performances.

Abstract:
The alignment of large language models with human preferences is typically achieved via Reinforcement Learning from Human Feedback or Direct Preference Optimization. However, these methods are susceptible to the significant noise prevalent in realworld preference datasets. To address this critical issue, we present a theoretical framework for unbiased alignment, introducing theUnbiased Reward Model(URM) loss and theUnbiased Direct Preference Optimization(UDPO) loss. These novel objectives allow for the training of unbiased models directly from noisy preferences by mathematically correcting for label noise without requiring clean ground-truth supervision. We provide rigorous theoretical analyses demonstrating that our methods are noise-tolerant, parameter downward compatible, and classification-calibrated. Comprehensive experiments across diverse datasets demonstrate that our approaches outperform state-of-the-art baselines.

Abstract:
Neural Controlled Differential Equations (Neural CDEs) provide a powerful continuoustime framework for sequence modeling, yet the roughness of the driving control path often restricts their efficiency. Standard splines introduce high-frequency variations that force adaptive solvers to take excessively small steps, driving up the Number of Function Evaluations (NFE). We propose a novel approach to Neural CDE path construction that replaces exact interpolation with Kernel and Gaussian Process (GP) smoothing, enabling explicit control over trajectory regularity. To recover details lost during smoothing, we propose an attention-based Multi-View CDE (MV-CDE) and its convolutional extension (MVC-CDE), which employ learnable queries to inform path reconstruction. This framework allows the model to distribute representational capacity across multiple trajectories, each capturing distinct temporal patterns. Empirical results demonstrate that our method, MVC-CDE with GP, achieves state-of-the-art accuracy while significantly reducing NFEs and total inference time compared to spline-based baselines.

Abstract:
KullbackLeibler divergence (KL) regularization is widely used in reinforcement learning, but it becomes infinite under support mismatch and can degenerate in low-noise limits. Utilizing a unified information-geometric framework we introduce (Kalman)-Wasserstein-based KL analogues by replacing the Fisher–Rao geometry in the dynamical formulation of the KL with transport-based geometries, and we derive closed-form values for common distribution families. These divergences remain finite under support mismatch and yield a geometric interpretation of regularization heuristics used in Kalman ensemble methods. We demonstrate the utility of these divergences in KL-regularized optimal control. In the fully tractable setting of linear time-invariant systems with Gaussian process noise, the classical KL reduces to a quadratic control penalty that becomes singular as process noise vanishes. Our variants remove this singularity and yield well-posed problems. On a double integrator and a cart-pole example, the resulting controls outperform KL-based regularization.

Abstract:
Instancewise feature selection is a valuable tool for interpreting labeled data and the predictions of black-box models. In contrast to global feature selection techniques, instance-wise methods dynamically identify important features for each instance. A growing number of methods learn aselector, which identifies important features, and apredictor, which uses these to make predictions. However, these pioneering methods face challenges including information leakage and lack of differentiability, which can slow training. In this paper, we present Hide&Seek, an end-to-end differentiable model for instance-wise feature selection. We jointly learn feature selection and prediction under a single objective without information leakage. Hide&Seek outperforms existing state-of-the-art models across a range of synthetic and real-data experiments and is fast to train. We achieve this by reformulating feature removal as a differentiable operation where instead of discretely removing features, we replace a proportion of each feature. Training is further stabilized via a parsimony-weight annealing framework.

Abstract:
Generalized Category Discovery (GCD) aims to recognize known classes while autonomously discovering novel ones in openworld settings. However, current approaches primarily focus on designing clustering objectives, often overlooking a critical bottleneck: standard vision backbones yield high-rank, entangled token representations that are ill-suited for unsupervised discovery of latent concepts and structures. In this paper, we propose Compositional Primitive Fields (CPF-GCD), a novel representation learning framework that reshapes the feature space to make such latent structure identifiable by enforcing a low-rank compositional organization. Our core hypothesis is that all categories (known/novel) reside on a latent manifold spanned by a finite set of learnable visual primitives that capture reusable concepts. CPF instantiates this geometric constraint via a spatial field mechanism. Inserted between the backbone and the head, it projects noisy patch tokens onto this structured manifold, effectively decomposing images into reusable atomic parts and their spatial layouts. By explicitly modeling the spatial distribution of primitives, CPF enables novel categories to emerge naturally as new activation patterns over a shared vocabulary. Extensive experiments demonstrate that CPF serves as a generic, plug-and-play module that consistently boosts performance across diverse GCD baselines, validating that identifying and leveraging low-rank compositional structure is a crucial inductive bias for open-world recognition.

Abstract:
Continual learning (CL) aims to acquire new knowledge from a nonstationary data stream while retaining performance on previously learned tasks. Memory-based replay methods mitigate catastrophic forgetting by storing and revisiting past samples, but their effectiveness is fundamentally constrained by limited memory capacity, as each stored example represents only a single data instance. In this work, we propose data reassembly for CL, a new paradigm that significantly increases memory efficiency by reassembling composite replay samples from existing training data. Instead of storing raw training examples, we partition the current task training data into elementary patches and dynamically reassemble them into coherent replay instances through an energy-based optimization framework. The proposed objective jointly enforces semantic compatibility with target labels and global consistency among assembled patches. To make this optimization tractable, we derive an efficient variational inference algorithm that constructs a compact yet diverse set of reassembled samples for replay. Extensive theoretical analysis and experiments across multiple CL benchmarks demonstrate that data reassembly consistently outperforms existing memory-based approaches, achieving stronger retention of past knowledge while maintaining competitive computational efficiency.

Abstract:
Many geometricstatistics and manifold-learning pipelines produce bundle-valued observations (for example, tangent vectors) that lie in different fibers. Forming empirical averages, therefore, requires transporting data to a common reference fiber, which can introduce curvature- and holonomy-driven effects. We study transported empirical means for a measurable section s: M → E of a rank-k vector bundle (E, π, M) equipped with a bundle metric and a compatible metric connection, and we derive finite-sample, dimension-free concentration bounds in the reference fiber E_x₀. Using sharp Hilbert-space inequalities due to Pinelis, we obtain Hoeffding- and Bernstein-type tail bounds controlled by a uniform per-sample bound B and a variance proxy σ². When minimizing geodesics are not unique, we isolate a deterministic holonomy ambiguity term Δ_hol and bound it in terms of bundle curvature and loop geometry; for the specific case E = T S²_r we give sharp area-based formulas. The resulting bias–variance decomposition provides practical uncertainty quantification for bundle-valued averaging and clarifies when curvature-driven transport ambiguity is negligible relative to sampling error.

Abstract:
Existing CrossTokenizer Knowledge Distillation (CTKD) methods fail to outperform simple supervised fine-tuning when vocabulary overlap is low due to severe alignment noise. We identify this phenomenon as the``Low-Overlap negative transfer regime,''To overcome this, we proposeEntropy-aware Span-Constrained Optimal Transport (E-SCOT), a robust framework that treats distillation as a sparse transport problem with a vocabulary-agnostic ground metric. Unlike prior OT approaches that incur quadratic costs via dense optimization, E-SCOT employs span-anchored lexical alignment to construct a deterministic, locality-preserving coupling in linear time. Furthermore, we introduce R\'enyi-entropy adaptive reweighting to dynamically concentrate the distillation budget on informative positions exhibiting significant uncertainty-profile gaps. Extensive experiments demonstrate that E-SCOT achieves state-of-the-art performance across diverse model families, effectively eliminating negative transfer even in challenging low-overlap scenarios.

Abstract:
While lowrank decomposition offers potential for LLM size reduction, its application is limited by considerable performance degradation. In this work, we identify and formalize a key overlooked issue in LLM decomposition: representation drift. We show that approximation errors introduced by decomposition propagate and amplify non-linearly through the deep layers of the transformer architecture, progressively distorting internal representations and degrading downstream performance. To mitigate this, we introduce a conceptually simple but principled compensation mechanism, named ``\our'', that operates by suppressing error at its source. By learning to align the output distribution of decomposed transformer blocks with their original counterparts, our method effectively counteracts representation drift, achieving notable performance recovery with zero inference overhead. Extensive experiments across OPT, LLaMA-2, LLaMA-3, and QWen exhibit remarkable improvements in language modeling, common-sense reasoning, knowledge-based reasoning, and vision-language tasks. For instance, on LLaMA-3-8B and OPT-13B at 40% compression, perplexity is reduced by more than 70% while reasoning task accuracy improves by over 10%. Our code is available at this anonymous URL.

Abstract:
The pursuit of highfidelity image generation faces a fundamental trade-off between sampling speed and output quality. While diffusion models excel in quality, their iterative nature incurs high computational costs. Current efficient methods primarily focus on distilling pre-trained models into few-step samplers; however, this distillation process is challenging and heavily reliant on teacher model quality. In this paper, we introduce \XYZFlow, a novel framework that rethinks this paradigm through multidimensional scaling of flow matching. Unlike MeanFlow's single-step deterministic mapping, our approach intensively scales the expressive power of generative models by enhancing the uniqueness and learnability of probability paths through structured, multidimensional conditioning. Theoretically, we frame autoregressive modeling as an implicit flow straightening mechanism, where expanding contextual constraints reduce trajectory ambiguity. XYZFlow implements this via two orthogonal scaling dimensions: (1)Temporal scaling through non-Markovian conditioning on the full denoising history, and (2) Spatial scaling through our proposed Next Shortcut Prediction, where patches are generated sequentially using the complete denoising trajectories of preceding patches as priors. This multidimensional conditioning constructs a high-dimensional coordinate system for probability flows, enforcing mapping uniqueness. Our Next Shortcut Prediction mechanism specifically enables efficient generation by leveraging rich contextual information from previously generated patches' full denoising processes. Extensive evaluations demonstrate XYZFlow achieves state-of-the-art performance, with 7.2--8.5× speedup over teachers while maintaining competitive FID. Notably, our structured Next Shortcut Prediction design establishes a more parameter-efficient scaling dimension and achieves superior quality-latency trade-offs compared to simply enlarging models or compressing sampling steps.

Abstract:
Supervised finetuning (SFT) is a dominant post-training strategy for vision-language-action (VLA) models, but its reliance on scarce expert demonstrations limits scalability and generalization. We propose HIER, a plug-and-play four-stage refinement framework that improves real-world VLA policies with minimal demonstrations by combining world-model imagination and human-in-the-loop correction. HIER warm-starts a VLA policy and a pretrained world model from a few demonstrations, then splits the policy into a deployment branch and an exploration branch. The deployment branch interacts with the world model to generate imagined rollouts, which are used to fine-tune the exploration branch for autonomous execution with occasional human interventions. The resulting corrected rollouts are preference-filtered and used to fine-tune the deployment branch, which is the final policy used for real-world inference. Across multiple real-world manipulation tasks on a Franka arm, HIER achieves nearly 100% success with few demonstrations and improves success rates by more than 50% relative to SFT, while in some tasks attaining shorter episode lengths than human demonstrations, indicating improved execution efficiency. Ablations further show that imagination-driven diversification and human correction are crucial for gains in exploration and self-recovery.

Abstract:
Latent reasoning has emerged as a powerful alternative to textbased Chain-of-Thought (CoT), offering significant gains in computational efficiency by compressing verbose reasoning into compact embeddings. However, compressing reasoning into the latent space renders the thinking opaque, hindering its interpretability. Current methods present a stark trade-off: they either function as unexplainable “black boxes” (e.g., Coconut), where the latent reasoning is not human-readable, or rely on separate post-hoc decoders for explainability (e.g., Heima), introducing architectural overhead and decoupling the explanation from the actual reasoning process. In this work, we present a unified framework for Self-Explainable Latent Reasoning (SELR) that trains a single model to perform efficient and inherently explainable latent reasoning. Our core contribution is a novel multi-task training objective that optimizes for two goals simultaneously: (1) an Answer Loss that optimizes the latent reasoning trajectory to produce accurate final answers, and (2) a CoT Loss that explicitly trains the same model to decode its own latent representations back into human-understandable reasoning steps. This design ensures that generated latent representations are both task-effective and semantically interpretable, eliminating the need for external decoders. We validate the effectiveness of SELR on both Large Language Models (LLMs) and Vision-Language Models (VLMs), demonstrating that SELR achieves superior token efficiency and accuracy compared to baselines, while uniquely providing self-contained explainability without auxiliary models.

Abstract:
Model merging in the pretrainfinetune paradigm has proven effective by combining multiple finetuned models into one with multi-task capabilities. However, existing methods rely on fix or manually tuned merging coefficients, making the unified model sensitive to the initial merging strategy and suboptimal for downstream adaptation. Thus, this paper proposed an innovative model merging framework called MetaMerging, a novel meta-learning algorithm to adaptively optimize the merging coefficients to construct a unified model tailored for task-specific adapter training. By simulating adapter updates in an inner loop and meta-optimizing merging coefficients in an outer loop, MetaMerging produces more balanced and generalizable unified models. Extensive experiments on CV and NLP fields show strong performance of MetaMerging on various downstream tasks and demonstrate the effectiveness of meta-learning in our method compared to other parameter merging methods. Our code is available at https://anonymous.4open.science/r/MetaMerging-53A1

Abstract:
Item response theory (IRT) is widely used in areas such as recommender systems, education, psychology, and other fields. A popular model for IRT is the Rasch model. Under this model, if a user with ability \theta performs a task with difficulty \beta then its label X ～ \textBernoulli (1 / (1 + \exp((\theta - \beta))). Existing joint maximum likelihood estimation approaches for this problem do not perform well on small datasets and also lack theoretical guarantees. Recently, Nguyen and Zhang proposed a two step approach: (1) spectral method for estimation of task parameters, (2) likelihood optimization for estimation of user parameters. While this approach is theoretically sound, it is not computationally efficient. In this work, we propose an EM-based algorithm for joint estimation of item and user parameters by introducing Pólya-Gamma latent variables, which simplify the logistic log-likelihood. We show that our algorithm is both theoretically sound and consistently outperforms existing methods on synthetic and real-world datasets.

Abstract:
Recent curriculum techniques in the posttraining stage of LLMs have been empirically observed to outperform non-curriculum approaches in improving reasoning performance, yet a principled understanding of their effectiveness and limitations remains incomplete. To bridge this gap, we develop an abstract theoretical framework and identify sufficient conditions under which curriculum post-training yields exponential improvements in sample complexity. To substantiate this framework, we model the base model’s Chain-of-Thought generation as a state-conditioned autoregressive reasoning tree, and formalize curriculum subtasks as either depth-increasing curricula that progressively extend reasoning horizons or hint-decreasing curricula that gradually remove partial hints. Our analysis shows that, under outcome-only reward signals, reinforcement learning finetuning with both curriculum strategies achieves high accuracy with polynomial sample complexity, whereas non-curriculum counterpart encounters an exponential complexity bottleneck. We further establish analogous guarantees for test-time scaling, demonstrating that curriculum-aware querying strategies reduce both reward oracle complexity and sampling cost from exponential to polynomial order. Empirical simulations support our theoretical findings.

Abstract:
Flowbased generative models have emerged as powerful priors for solving inverse problems. One option is to directly optimize the initial latent code (noise), such that the flow output solves the inverse problem. However, this requires backpropagating through the entire generative trajectory, incurring high memory costs and numerical instability. We propose MS-Flow, which represents the trajectory as a sequence of intermediate latent states rather than a single initial code. By enforcing the flow dynamics locally and coupling segments through trajectory-matching penalties, MS-Flow alternates between updating intermediate latent states and enforcing consistency with observed data. This reduces memory consumption while improving reconstruction quality. We demonstrate the effectiveness of MS-Flow over existing methods on image recovery and inverse problems, including inpainting, super-resolution, and computed tomography.

Abstract:
Point cloud data, with its inherent geometric and topological structures, plays a critical role in 3D vision tasks. However, existing parameterefficient fine-tuning (PEFT) methods predominantly focus on input token prompting, overlooking the intrinsic geometric information. To address this limitation, we propose TopAdapter, a novel PEFT framework that enhances geometric perception by injecting local topological information into pre-trained 3D vision models. TopAdapter leverages 0D, 1D, and 2D simplices from algebraic topology as fundamental building blocks, introducing two core modules: the Topology Injection module (ToInjection) and the Topology Transfer module (ToTransfer). ToInjection constructs multi-scale topological features using a simplex generator and dynamically fuses them with semantic features via a geometric controller, thereby enhancing geometric adaptability. ToTransfer propagates these topological primitives across Transformer layers, ensuring efficient transmission of geometric information. Extensive experiments demonstrate that TopAdapter outperforms existing PEFT methods, achieving performance comparable to full fine-tuning across various benchmarks.

Abstract:
Continual Offline Reinforcement Learning (CORL) enables building longterm autonomous agents from static datasets. However, it faces heterogeneity in environment dynamics, reward functions, and behavior policies across tasks. Combined with the inherent distribution shift in offline learning, this requires agents to selectively reuse shared knowledge during transfer while isolating task-specific features. The flat knowledge sharing mechanisms employed by existing methods struggle to capture such distinctions, limiting cross-task generalization. To address this, we propose Hierarchical Task-Aware Composition (HTAC), which balances plasticity and stability through dual-level task encoding and soft composition mechanisms. HTAC comprises four modules: (1) a Hierarchical Semantic Task Representation that decomposes tasks into domain-level and task-level embeddings; (2) a Dual-level Expert Network that creates domain and task experts on demand for parameter-efficient knowledge isolation; (3) an Adaptive Knowledge Composition module that integrates historical expert outputs via attention mechanisms for knowledge reuse; (4) Task Adapters that preserve historical routing weights to prevent forgetting. Experiments on Offline Continual World show that HTAC outperforms existing baselines, demonstrating better knowledge reuse and transfer capabilities.

Abstract:
Continual learning with large pretrained models offers significant potential for cross-task knowledge accumulation, but faces critical challenges such as catastrophic forgetting and parameter interference, especially when historical data is unavailable. Existing approaches typically rely on sequential fine-tuning or model merging strategies, yet often overlook the impact of loss landscape sharpness and dominant singular value directions, which leads to subspace misalignment and severe knowledge forgetting. In this paper, we propose the Sharpness-Aware Isotropic Merging (SAIM) framework, which introduces targeted optimizations in both the fine-tuning and merging stages to address these issues. Specifically, SAIM consists of two synergistic modules: (1) a Sharpness-Aware Block Coordinate Descent (SA-BCD) optimizer that guides the model toward flatter minima and selectively updates the most task-sensitive parameters, thereby mitigating parameter interference and enhancing robustness; (2) an adaptive isotropic merging algorithm that dynamically balances the singular value spectrum across tasks, effectively preventing the model from overemphasizing any single task direction, maintaining balanced knowledge representation, and improving subspace alignment. Extensive experiments on vision and language benchmarks demonstrate that SAIM achieves 5-10% higher accuracy than existing methods and maintains robust performance as the number of tasks increases. Ablation studies further validate the effectiveness of the SA-BCD fine-tuning strategy in promoting flat minima and reducing parameter interference, as well as its compatibility with various merging approaches.

Abstract:
Multiobjective optimization (MOO) has emerged as a powerful approach to solving complex optimization problems involving multiple objectives. In many practical scenarios, function evaluations are unavailable or prohibitively expensive, necessitating optimization solely based on a fixed offline dataset. In this setting, known as offline MOO, the goal is to find out the Pareto set without access to the true objective functions. This setting suffers from the out-of-distribution (OOD) issue, where the surrogate model is not accurate for unseen designs. Due to the OOD issue, surrogate errors may cause the optimizer to select solutions that do not lie on the true Pareto front and are biased toward its extremes. To address this, this paper proposes Diversity-driven Offline Multi-Objective Optimization (DOMOO), which aims to find out a diverse and high-quality set of solutions. Firstly, DOMOO incorporates an accumulative risk control module that estimates the potential risk of candidate solutions and alleviates the OOD issue between the training data and the generated solutions. In addition, a nested Pareto set learning (PSL) strategy is proposed to jointly learn preference and PSL parameters, then optimize them, enabling adaptation to diverse Pareto front geometries. To further enhance solution quality, we design a diversity-driven selection strategy that extracts a representative and well-distributed set of final solutions. To achieve this diversity-driven selection strategy, we propose \textIGD\_\textoffline, a tailored indicator for the offline setting that considers both diversity and convergence, and avoids the bias of hypervolume indicator. Extensive experiments on synthetic and real-world benchmarks, such as neural architecture search, show that, on average across benchmarks, DOMOO achieves a 1.44× improvement in convergence and diversity over comparable methods.

Abstract:
Spiking Neural Networks (SNNs) enable eventdriven computation with sparse activations, but building multimodal Transformers on SNNs is hindered by unstable training in deep spiking stacks and a mismatch between dense softmax attention and spike-based communication. We propose SMM Transformer, an SNN-based multimodal Transformer framework that combines (i) a Parallel LIF with Multistage Learnable Parameters (PLMP) neuron and a tailored P-STBP algorithm to stabilize training, (ii) a spike-driven attention approximation (SMSA) with a lightweight self-compensation branch, and (iii) a spiking mixture-of-experts (SMoE) module for modality-aware fusion. Across visual and multimodal benchmarks, SMM Transformer achieves competitive accuracy compared to ANN baselines while reducing the estimated compute energy of the attention module by up to 97% under a standard MAC/AC cost model.

Abstract:
While CLIP has achieved strong performance across vision–language tasks, finegrained image–text alignment remains challenging. Recent efforts improve textual granularity by leveraging long, detailed descriptions and replacing CLIP’s text encoder with LLM, but often overlook the visual-side bottleneck: achieving finer alignment requires region- and pixel-level visual grounding, not just finer text. To address this issue, we propose PixCLIP, a framework that jointly enhances both sides by accommodating visual prompt regions and long-form text within a unified training objective. Firstly, to support training at this granularity, we develop an automated annotation pipeline that produces long-form descriptions with pixel-level localization, and use it to construct LongGRIT, a high-quality dataset with nearly 1.5M samples. Furthermore, we introduce a three-branch pixel–text alignment framework that aligns image regions with corresponding textual descriptions across multiple granularities. Experiments show that PixCLIP achieves state-of-the-art performance on pixel- and region-level alignment tasks while preserving strong results on standard global image–text retrieval benchmarks, even with arbitrarily shaped region prompts and long compositional texts.

Abstract:
We study the online routing problem in large language model serving, where requests arrive sequentially and must be dispatched to parallel decode workers under tight batchsize and KV-cache constraints. Unlike widely used routing heuristics that are not tied to explicit service-level objectives (SLOs) and offer limited control over latency–throughput trade-offs, we introduce an multi-objective optimization framework that formulates routing as an online linear programming with interpretable decision rewards. We apply an efficient bid-price control policy based on the online linear programming that admits requests when their SLO-weighted benefit exceeds their shadow prices. To meet millisecond decision requirements, we develop a warm-started, projected first-order updates that track the evolving dual shadow prices online with predictable runtime. We integrate our router into the Vidur simulator and demonstrate substantial improvements over standard baselines across multiple SLO regimes, including end-to-end latency, time-to-first-token, throughput, and tail performance. A big picture from our result: a science-based approach outperforms others based on heuristics.

Abstract:
Predicting protein mutation effects is fundamental to protein engineering and disease variant interpretation, yet experimental mutation data remain accurate but extremely sparse. Largescale computational augmentation offers scalability, but introduces heterogeneous and poorly calibrated supervision signals arising from distinct modeling paradigms. We construct a PDB-wide, structure-aligned mutation augmentation dataset that exhaustively enumerates single-site substitutions and aligns mutation signals from physics-based energy models, protein language models, and structure-conditioned inverse folding models. Large-scale analysis under a unified mutation preference representation reveals substantial differences in preference structure, confidence, and cross-model agreement, indicating that disagreement is pervasive and reflects conflicting inductive biases rather than random noise. Motivated by these observations, we propose an unsupervised multi-source mutation preference distillation framework that learns from relative mutation preferences while explicitly modeling cross-source disagreement. Without using any experimental mutation labels during training, our approach achieves consistent improvements on the ProteinGym benchmark over zero-shot baselines and naive multi-source fusion strategies. We release the dataset and evaluation pipeline to support reproducible studies of protein mutation effects.

Abstract:
Graph Neural Networks (GNNs) have achieved remarkable successes in graph analysis due to the MessagePassing (MP) mechanism, yet they struggle with heterophilic graphs where connected nodes often have distinct labels or dissimilar attributes. Graph Contrastive Learning (GCL) serves as a promising approach to extract the information beyond neighboring nodes, effectively mitigating the limitations of the MP mechanism in handling heterophilic graphs. Nevertheless, GCL faces two critical challenges when applied to heterophilic graphs, i.e., the potential distribution shift from data augmentation and the loss of robustness caused by high-frequency signals. To address these problems, we propose a novel model, namely the Spectral Fusion Contrastive Learning with Topology-Adaptive Graph Augmentation (SFCLTA) for unsupervised graph representation learning. Our method dynamically adjusts graph structures by a heterophily-aware augmentation strategy, and constrains high-frequency distortions by spectral regularization. We utilize the confidence-weighted fusion to enhance the robustness. Additionally, we introduce a feature reconstruction task as the prerequisites to explicitly mitigate feature-level distribution shifts. Experiments on multiple real-world datasets demonstrate that the proposed SFCLTA consistently outperforms baseline models in multiple tasks.

Abstract:
Solving largescale Generalized Eigenvalue Problems (GEPs) is a fundamental yet computationally prohibitive task in science and engineering. As a promising direction, contour integral (CI) methods offer an efficient and parallelizable framework. However, their performance is critically dependent on the selection of integration contours---improper selection without reliable prior knowledge of eigenvalue distribution can incur significant computational overhead and compromise numerical accuracy. To address this challenge, we proposeDeepcontour, a novel hybrid framework that integrates a deep learning-based spectral predictor with Kernel Density Estimation (KDE) for principled contour design. Specifically, Deepcontour utilizes its specialized Eigen-Neural-Operator (ENO) to provide rapid spectral distribution priors, driving a KDE module to automatically construct the optimized integration contours, which guide the CI solver to efficiently find the desired eigenvalues. Deepcontour achieves up to a 5.63x speedup across diverse scientific datasets while maintaining strict numerical rigor. By merging the predictive power of deep learning with the numerical rigor of classical solvers, this work establishes an efficient and robust paradigm for solving large-scale GEPs.

Abstract:
PageRank (PR) is a fundamental algorithm in graph machine learning tasks. Owing to the increasing importance of algorithmic fairness, we consider the problem of computing PR vectors subject to various groupfairness criteria based on sensitive attributes of the vertices. At present, principled algorithms for this problem are lacking - some cannot guarantee that a target fairness level is achieved, while others do not feature optimality guarantees. In order to overcome these shortcomings, we put forth a unified in-processing convex optimization framework, termed FairRARI, for tackling different group-fairness criteria in a ``plug and play'' fashion. Leveraging a variational formulation of PR, the framework computes fair PR vectors by solving a strongly convex optimization problem with fairness constraints, thereby ensuring that a target fairness level is achieved. We further introduce three different fairness criteria which can be efficiently tackled using FairRARI to compute fair PR vectors with the same asymptotic time-complexity as the original PR algorithm. Extensive experiments on real-world datasets showcase that FairRARI outperforms existing methods in terms of utility, while achieving the desired fairness levels across multiple vertex groups; thereby highlighting its effectiveness.

Abstract:
Voxelbased 3D object detectors typically discretize the spatial domain using a uniform Cartesian grid, which allocates the same voxel size to both near-range and far-range regions. However, this uniform discretization is suboptimal for small objects such as pedestrians and cyclists, as they occupy only a few voxels and thus struggle to capture fine-grained geometric details. Although increasing the global voxel resolution can alleviate this problem, it inevitably increases substantial memory consumption and computational overhead. In this paper, we propose Radial Scaling Voxelization (RSV), a simple yet effective non-uniform discretization strategy that adaptively modulates the effective voxel size based on the radial distance from the LiDAR sensor. Unlike previous cylindrical or polar discretization schemes, RSV preserves the Cartesian grid topology by applying a continuous radial scaling function to the input coordinates before standard voxelization. This operation yields a near-high, far-unchanged resolution pattern: the effective voxel size becomes finer in near regions, where the geometric structures of small objects are difficult to capture, while remaining nearly unchanged in far regions to avoid unnecessary computational cost. Importantly, RSV is architecture-agnostic and can directly replace the discretization module in any voxel-based detector without modifying the backbone, network design, or training pipeline. Extensive experiments on the KITTI and nuScenes datasets demonstrate that integrating our RSV into several voxel-based baselines consistently enhances small-object detection performance, especially for the Pedestrian and Cyclist categories, while incurring only marginal additional computational overhead.

Abstract:
Diffusion models are widely used for highquality generation, but their performance is sensitive to the accuracy of the estimated score. We first develop our main bounds in a Gaussian-mixture setting, where the score admits a closed-form structure and the score Hessian can be controlled explicitly, leading to sharp Wasserstein estimates. We then extend the analysis to general data distributions, which yields a more general but typically looser upper bound. This general bound can be sharpened under mild regularity: when the initial distribution has a globally Lipschitz score, the curvature contribution at small times is uniformly bounded, avoiding the worst-case blow-up. Our results make precise how discretization choices govern the accumulation of score error, aligning with empirical observations on the benefits of certain step-size schedules. The results hold for both variance-preserving (VP) and variance-exploding (VE) diffusions, and apply to both the reverse-time SDE and the associated probability-flow ODE.

Abstract:
Automatic Speech Recognition (ASR) systems, such as those in intelligent assistants, are vulnerable to adversarial examples (AEs). Benign audio clips like music, when embedded with small perturbations, can trick ASR models into recognizing attackerspecified commands. Prior studies focus on minimizing perturbation magnitude to craft AEs. However, they fails to achieve high attack stealthiness against black-box ASR systems in the physical world. In this paper, we introduce the first music carrier selection algorithm and an attention-aware stealthiness loss function to generate stealthy AEs. Extensive evaluations on five commercial ASR APIs and three widely-used voice assistants demonstrate that our method significantly outperforms state-of-the-art techniques in both effectiveness and stealthiness. Notably, in a user study involving 200 participants, 55.6% of participants perceived our physical adversarial examples as benign audio, which is an improvement of over 20% compared to existing methods.

Abstract:
As semiconductor technology nodes continue to shrink, computational lithography has become critical to yield and performance. However, realworld lithography is a continuous, multi-stage physical process driven by implicit interventions, which cannot be captured by the existing static or stage-wise models. To address these issues, we present LithoDreamer, the first physics-informed World Model (WM) framework for computational lithography, designed to represent the ``Layout-Mask-Resist Image-After Development Image (ADI)'' pipeline as a decision-driven multi-stage physical evolution system, enabling multi-step latent state rollouts within stages and intervention-aware decision-making across stages. First, we learn the feature variations between adjacent states in latent spaces to capture the physical dynamics of each stage. Second, the model plans continuous process interventions through physical mappings in the spaces, which in turn drive subsequent state transitions. Furthermore, we propose a contrastive variational optimization paradigm that jointly explores the evolutions of the interventions and states without discrete action supervision, enabling stable and continuous process rollouts in the WM. Extensive experiments show that LithoDreamer achieves state-of-the-art accuracy and generalization performance.

Abstract:
The Automated Design of MultiAgent Systems (Auto-MAS) has emerged as a promising framework for addressing complex reasoning tasks. However, existing approaches often suffer from structural rigidity and entangle the design of system topology with the implementation of individual agents. To overcome these limitations, we propose MAS-Architect, a framework that automates MAS design through a novel code-based declarative MAS paradigm rooted in the Separation of Concerns principle. By decoupling topology planning from node implementation via a unified interface, our approach enables the from-scratch generation of task-adaptive architectures. We further employ a Distill-then-Explore training strategy to optimize these designs. Comprehensive experiments on five benchmarks show that MAS-Architect sets a new Pareto frontier in the efficiency–performance trade-off: it surpasses state-of-the-art methods while substantially lowering token usage. Notably, the framework achieves a strong average accuracy of 78.7% across benchmarks with an inference cost of only 2,533 tokens per query. Qualitative analysis reveals the autonomous emergence of advanced collaboration patterns, validating the generative flexibility of the declarative paradigm. Code and data will be released.

Abstract:
Coldstart personalization requires inferring preferences from minimal interaction when no user-specific historical data is available. The space of possible preferences is vast, yet users care about only a sparse subset and rarely articulate them upfront; combined with limited interaction budgets, this makes preference elicitation challenging. Our key insight is that preferences exhibit predictable structure across populations; e.g., users who want detailed explanations often also value worked examples. We propose PEP (Preference Elicitation with Priors), a principled system decomposition framework for cold-start personalization: learning a structured world model of preference correlations offline using latent variables, then performing Bayesian inference online without retraining. Even simple belief model instantiations (e.g., linear regression) substantially outperform end-to-end RL. Across medical, mathematical, social, and commonsense reasoning, PEP achieves 80.8% alignment with ground-truth user preferences versus 68.5% for RL, requires 3-5× fewer interactions, and adapts twice as often. Our contribution is a principled decomposition of cold-start personalization that makes Bayesian preference elicitation practical at scale for LLM systems.

Abstract:
Many realworld tasks are recorded as large collections of unannotated task executions, such as videos, which contain rich information about task progress but lack the supervision assumed by standard reinforcement learning (RL) pipelines. In many practical settings, the goal is not to train a reusable policy but simply to recover one feasible solution, making policy-centered learning unnecessarily costly. We propose Policy-Free Recursive Search (PFR-Search), a framework that directly recovers solutions from unannotated task executions without policy-grounded supervision or policy training. PFR-Search organizes videos into high-level task flows and performs recursive backward-forward search to recover solutions without policy modeling. To evaluate the efficiency of policy-free search in exploiting unannotated data, we use RL as an evaluation interface, incorporating task-flow-aligned intrinsic rewards and compare against video-driven RL methods. Experiments on long-horizon Minecraft tasks show that PFR-Search recovers feasible solutions from unannotated videos with minimal exploration.

Abstract:
Selfsupervised learning (SSL) effectively learns representations from massive unlabeled data, yet the resulting models typically operate as black boxes, necessitating domain-specific post-hoc explanations. We introduce KREPES, a unified framework that learns inherently interpretable representations for arbitrary SSL objectives, including SimCLR, BYOL, VICReg. By bridging empirical neural tangent kernel approximations of neural networks with the Representer Theorem for kernels, we express the learned latent space directly via "Representer Landmarks", which are the representations of influential unlabeled training examples. We introduce two novel metrics, "Sample-Specific Influence Score" and "Conceptual Influence Profile", to quantify the transparency of the learned representations. KREPES enables direct audit of the latent space without supervision, for example, revealing an algorithmic bias in the Adult-1M dataset where SSL uses demographic proxies for income. Finally, to ensure scalability to SSL benchmarks with 1M+ samples (ImageNet-1K, Adult-1M), KREPES introduces a novel Nyström approximation-based optimization of any non-convex SSL objective.

Abstract:
Large VisionLanguage Models (LVLMs) demonstrate impressive multimodal capabilities, yet suffer from hallucination—generating factually inaccurate content. Contrastive Decoding (CD) mitigates this by contrasting amateur and expert branches at the logit level. However, our investigation reveals that such logit-level interventions fundamentally compromise generation coherence, necessitating restrictive penalty constraints unrelated to hallucination suppression. We introduce Attention Contrastive Decoding (ACD), a training-free plug-in that complements logit-level CD by relocating part of the contrastive operations to the attention mechanism. Operating at an earlier stage of the forward pass, ACD performs smooth semantic-preserving interventions through an Adaptive Subtraction Strategy (ASS), which attenuates hallucination-associated attention patterns while amplifying critical visual information. Extensive experiments demonstrate that combining ACD with existing CD methods (e.g., VCD+ACD) produces substantially more coherent outputs with further reduced hallucinations, eliminating restrictive penalties while enabling trustworthy multimodal generation.

Abstract:
Sparse cameraconditioned image-to-video generation presents a pivotal challenge: synthesizing geometrically consistent 3D motion from minimal pose cues. Existing methods, which largely rely on dense supervision or naive interpolation, suffer from severe pose drift and motion discontinuities due to the lack of robust 3D priors. In this paper, we introduce CamGeo, a novel framework that distills rich 3D geometric knowledge from a pre-trained video-to-3D model (VGGT) directly into the diffusion backbone. To achieve this without incurring inference latency, we propose a training-only distillation strategy. Specifically, CamGeo incorporates: (1) keyframe trajectory distillation that enforces cycle-consistency with sparse input poses, (2) cross-frame consistency distillation with both camera trajectory and depth constraints to generate consistent structure across unsupervised frames, and (3) a three-stage coarse-to-fine curriculum learning, progressively scales geometric complexity, from global structure coherence to fine-grained refinement, achieving stable optimization. Extensive experiments demonstrate that CamGeo achieves consistent improvements under various sparsity ratios.

Abstract:
With the widespread application of datadriven classifiers in high-risk domains, group fairness has increasingly become a key research focus. However, most existing methods rely on model constraints or data reweighting, which often suffer from limited interpretability and may distort the original data distribution. Granular-ball computing (GBC), as a structured and highly interpretable learning framework, provides a natural foundation for incorporating group fairness into the data partitioning process. Building on this insight, we first propose a Fair Granular-Ball Generation framework (FairGBG), which employs the fair clustering algorithm to ensure a balanced proportion of sensitive groups within each granular-ball (GB) during its construction, aiming to enhance within-ball group fairness. Theoretical analysis shows that FairGBG preserves high purity within each GB while satisfying group fairness. Furthermore, we introduce a Fair Granular-Ball-based Fair data Classification method (FairGBFC), which enhances classification fairness by leveraging group fairness within GBs. Experimental results on multiple benchmark datasets demonstrate that, compared to existing methods, FairGBFC significantly improves classification fairness while maintaining competitive accuracy. Notably, FairGBFC exhibits superior classification performance compared to standard GB-based methods across all benchmark datasets. Furthermore, compared with state-of-the-art fairness-aware baselines, it achieves a superior trade-off between accuracy and fairness, effectively mitigating bias while preserving high utility.

Abstract:
Modern federated and streaming learning systems often release intermediate models, so privacy must hold for the full trajectory under adaptive interaction. Motivated by participation privacy, we study singleedit neighboring user streams, where one insertion/deletion shifts all subsequent updates and defeats standard Hamming-neighbor continual-release analyses. We give an auditable modular recipe. A randomized buffering wrapper emits bins of size [U,2U], reducing single-edit streams to a Hamming-style per-bin update stream with explicit backlog/delay guarantees, where U is calibrated by the privacy parameters (\varepsilon,\delta). We then prove a certification theorem for independently decomposable (prefix-causal, fresh-noise) continual mechanisms: any non-adaptive Hamming-neighbor DP proof lifts to adaptive inputs. Together, these ingredients yield trajectory-level (\varepsilon,\delta)-DP for single-edit streams using standard primitives (e.g., tree prefix sums), with an explicit privacy--latency link via U. Streaming DP-SGD experiments validate the privacy-utility-latency tradeoffs and the induced delay distributions.

Abstract:
LowRank Adaptation (LoRA) and its variants enable parameter-efficient fine-tuning of large language models under the supervised fine-tuning (SFT) paradigm. However, their efficacy and behavior under Reinforcement Learning with Verifiable Rewards (RLVR) are less well understood. In particular, two structurally initialized LoRA variants, PiSSA and MiLoRA, which outperform standard LoRA under SFT, can underperform standard LoRA under RLVR and may even exhibit training instability. These observations suggest that how to initialize the low-rank matrices in RLVR remains unclear. In this work, we develop a theoretical analysis of LoRA in RLVR, showing that orthonormal initialization achieves the minimal gap between LoRA’s outcome and that of full fine-tuning. Guided by this insight, we propose geometry-preserving orthonormal initialization for low-rank adaptation in RLVR, leading to two new variants, LoRA-RLPO and LoRA-RLMO. Experiments on mathematical reasoning benchmarks show that our orthonormal initialization stabilizes RLVR training and outperforms standard LoRA, contrasting with PiSSA and MiLoRA. Finally, our unified analysis also explains why PiSSA and MiLoRA can underperform in RLVR, which may be of independent interest.

Abstract:
Competitive programming is increasingly being used to evaluate the algorithmic reasoning capabilities of large language models (LLMs). However, existing benchmarks primarily focus on fullinformation tasks where all problem inputs are provided upfront. This overlooks a critical dimension of algorithmic reasoning: the ability of generated programs to operate when key information is not revealed upfront.Interactiveproblems, a distinctive component of competitive programming, embody this challenge. These problems require programs to engage in multi-round interaction with an interactor (a judge program) under strict protocol constraints and limited query budgets. Crucially, new information is revealedonlyin response to queries. To address this gap, we introduceInteractBench, a benchmark comprising 322 high-quality interactive problems curated from Codeforces, AtCoder, IOI, and ICPC. Each problem is packaged with executable local interactors, enabling fully offline evaluation without external judge submission. Unlike existing benchmarks, InteractBench assesses whether model-generated code can acquire information and track state dynamically. Our evaluation reveals a significant interaction gap: even the most advanced reasoning models achieve limited success on interactive problems. Beyond success rates, we propose a fine-grained failure taxonomy to systematically diagnose the root causes of these deficiencies. Although algorithmic logic errors remain the dominant failure mode, protocol violations and query-budget overruns are frequently observed. The benchmark is provided in the supplementary material.

Abstract:
Posttraining quantization (PTQ) suffers from severe accuracy degradation in ultra-low-bit regimes. To address this challenge, we propose SHARP-Q, a unified framework grounded in Information Geometry that aligns the quantization objective with the intrinsic Fisher geometry. Following a "Rectify-then-Approximate" strategy, SHARP-Q first preconditions the optimization landscape via Hessian-Aware Rectification (HAR) and subsequently approximates the rectified Fisher Information Matrix through Dynamic Fisher-Subspace Compensation (DFSC). Our findings reveal a pivotal insight: precise geometric alignment enables hardware-friendly uniform quantizers to consistently outperform specialized non-uniform designs. Extensive experiments across representative Vision Transformer and convolutional architectures confirm that SHARP-Q establishes new state-of-the-art results, achieving substantial accuracy gains in the challenging W2A2 and W3A3 settings. Code is available in the supplementary material.

Abstract:
The Incontext generation paradigm has demonstrated strong power in instructional image editing for better synthesis quality. Nevertheless, shaping such in-context learning for instructional video editing is not trivial. Without specifying editing regions, the results can suffer from the issue of inaccurate editing regions and the token interference between different areas. To address these, we present ReCo, a new instructional video editing paradigm that novelly delves intoRegionalConstraint modeling between editing and non-editing areas. Technically, ReCo width-wise concatenates source and target video for joint denoising. In model training, ReCo formulates regional constraints with two regularization terms, i.e., latent and attention regularization, on one-step backward denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas, emphasizing editing area modification and alleviating unexpected content generation. The latter suppresses the attention of tokens in the editing region to the tokens in counterpart of the source video, thereby mitigating their interference during novel object generation in target video. Furthermore, we propose a large-scale, high-quality video editing dataset, i.e., ReCo-Data, comprising 500K instruction-video pairs. Extensive experiments conducted on four major instruction-based video editing tasks verify the superiority of ReCo.

Abstract:
Longtailed recognition suffers from extreme class imbalance, where scarce tail data leads to biased and fragile feature distributions that exacerbate confusion with semantically or visually similar classes. Prior feature-space reconstruction methods transfer head-class structure or train conditional generators to synthesize tail features, yet the resulting statistical updates are often heuristic and can degrade multi-class separability when tail estimates are unreliable. Given this issue, we propose Risk-Bounded Distribution Reconstruction (RBDR), an offline statistic calibration framework for the two-stage long-tailed pipeline, grounded in an analysis of rival-induced discriminative directions. RBDR performs (i) risk-aware mean calibration by softly projecting any candidate update onto a supportive set such that a surrogate discriminative-risk upper bound does not increase, and (ii) covariance control by shrinking dispersion in a rival subspace while preserving diversity in orthogonal components. These plug-and-play modules transform heuristic reconstruction signals into controllable updates, improving performance and stability across multiple long-tailed benchmarks.

Abstract:
We introduce Euler–Poincaré Neural Dynamics (EPND), a geometricmechanics–driven framework that redefines how Koopman-type neural models learn dynamical systems. Unlike conventional operator-learning methods that rely on function-space linearization, EPND places geometric mechanics at the core of its architecture the mathematical engine governing evolution through Lie-group flows. This foundation enables a principled treatment of curvature, symmetry, and conservation, that ensures both interpretability and physical consistency. Building on this foundation, we develop the Euler–Poincaré Parallel Scan, a parallel algorithm that leverages the associative algebra of Lie-group compositions to overcome the inefficiencies of sequential computation. By unifying geometric structure with scalable computation, EPND achieves high accuracy, strong stability, and significant parallel acceleration in modeling long-horizon dynamics of versatile scientific simulation.

Abstract:
Standard contrastive learning minimizes geometric distance between positive pairs, implicitly assuming that strict compactness optimizes discrimination. However, this topologyagnostic confusion neglects intrinsic data structures and topological complexity, leading to class confusion—particularly when aggressive augmentations induce semantic drift. To address this, we propose Topology-Aware Contrastive Learning, a framework that shifts the objective from geometric singularity to topological connectivity. Leveraging Persistent Homology, we explicitly regulate the connectivity of the latent space, ensuring positive pairs maintain an α–β that balances intra-class cohesion with separability. Theoretically, we formally define the topology-agnostic confusion phenomenon, prove that excessive compactness strictly lower-bounds the probability of confusion and derive a generalization bound demonstrating that richer topological connectivity tightens downstream risk. Furthermore, we establish a measure-theoretic framework to mitigating the sensitivity of our method against varying augmentation strengths. Empirical results on benchmarks confirm that our approach enhances representation quality and reduces reliance on specific augmentation strategies compared to standard baselines. Our code will be made publicly available upon acceptance.

Abstract:
Embodied 3D object detection is a fundamental perception capability for embodied agents, where observations are partial, heavily occluded, and sequential, requiring modeling of temporal continuity. However, existing benchmarks and methods are primarily designed for fully reconstructed global scenes and fail to capture temporal scene context and instance evolution in firstperson perception. We introduceEmbodied-Det, a new benchmark for egocentric 3D object detection that evaluates detection accuracy, temporal stability, and consistency under embodied settings. Building on this benchmark, we proposeEmbodied-DETR, an end-to-end temporal detection framework that models scene-level context and instance-level continuity through two complementary temporal modules,Scene-aware Feature AggregationandInstance-aware Query Embedding. Experiments on Embodied-Det show that existing methods suffer substantial performance degradation in egocentric temporal settings, while Embodied-DETR achieves superior accuracy and temporal consistency, demonstrating the effectiveness of temporal modeling for embodied 3D perception.

Abstract:
Highresolution visual encoders in multimodal large language models (MLLMs) substantially improve fine-grained perception, yet incur prohibitive computational costs.Existing token pruning methods are effective on natural images but struggle with spatially sparse structured inputs (e.g., charts), where critical high-frequency information is sparse, localized, and structurally essential. To address this challenge, we propose Adaptive Dual-Branch Token Sparsification (AD-BTS), a density-aware framework that dynamically allocates computation according to input signal characteristics. Specifically, AD-BTS introduces a Gradient-based Routing Gate (GRG) that uses lightweight pixel-level gradient statistics to estimate structural flatness and guide routing. Then, AD-BTS activates either a Redundancy Selection Branch (RSB) for aggressive token pruning with a frozen encoder, or a Structural Fusion Branch (SFB) with conditional LoRA and context fusion to preserve sparse structural information.Extensive experiments on Qwen2.5-VL demonstrate that AD-BTS establishes a new Pareto frontier between efficiency and accuracy. Under extreme compression (20% token retention), AD-BTS outperforms the strongest baseline by 12.1% on ChartQA while achieving a 1.8× prefill speedup, effectively reconciling computational efficiency with structural robustness.

Abstract:
Twoserver secure inference allows a client to query a hosted large language model (LLM) without revealing prompts or embeddings. Recent GPU systems based on function secret sharing (FSS) make linear layers efficient, but fixed-point nonlinearities and helper operations remain a bottleneck because each operator is typically implemented as a bespoke protocol with its own comparisons, wrap-around corrections, and preprocessing material. We present FuseFSS, a compiler that replaces per-operator protocol design with a single compilation pipeline. For each scalar fixed-point operator, a compact specification lists its interval partition, low-degree arithmetic pieces, and required predicate bits. The compiler emits two batched FSS evaluations on the public masked value: one packed comparison that returns all predicate bits, and one vector interval lookup that returns the active coefficients and constants. Compared to the current state-of-the-art FSS-based GPU secure inference, FuseFSS preserves accuracy while achieving a 1.24×--1.50× end-to-end speedup and reducing online communication by 9%--16% on BERT and GPT-style models; preprocessing is also lighter, with 14%--23% lower key-generation time and 20%--24% smaller keys.

Abstract:
As realworld knowledge continues to evolve, the parametric knowledge acquired by multimodal models during pretraining becomes increasingly difficult to remain consistent with real-world knowledge. Existing research on multimodal knowledge updating focuses only on learning previously unknown knowledge, while overlooking the need to update knowledge that the model has already mastered but that later changes; moreover, evaluation is limited to the same modality, lacking a systematic analysis of cross-modal consistency. To address these issues, this paper proposes MMKU-Bench, a comprehensive evaluation benchmark for multimodal knowledge updating, which contains over 25k knowledge instances and more than 49k images, covering two scenarios, updated knowledge and unknown knowledge, thereby enabling comparative analysis of learning across different knowledge types. On this benchmark, we evaluate a variety of representative approaches, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and knowledge editing (KE). Experimental results show that SFT and RLHF are prone to catastrophic forgetting, while KE better preserve general capabilities but exhibit clear limitations in continual updating. Overall, MMKU-Bench provides a reliable and comprehensive evaluation benchmark for multimodal knowledge updating, advancing progress in this field.

Abstract:
Large Reasoning Models (LRMs) enhance performance by generating explicit Chainof-Thought (CoT) trajectories, yet enabling them to self-evaluate correctness without external supervision remains a critical challenge. Existing methods often rely on ground-truth labels or shallow output probabilities, neglecting the layerwise evolution of the reasoning trajectory. In this work, we introduce \ourmethod (Geometry of Reasoning), a white-box self-evaluation framework based on layerwise trajectory evolution. \ourmethod decomposes reasoning fidelity into two complementary dimensions: (1) Geometric Evolution, which synthesizes the first- and second-order evolution of layerwise hidden-state trajectories to quantify geometric progress in reasoning; and (2) Difficulty-Aware Calibration, which utilizes cross-entropy of reasoning progress to normalize the Geometric Evolution against intrinsic query uncertainty. By jointly modeling these factors, \ourmethod effectively distinguishes the coherent evolution of correct reasoning from the chaotic trajectories of errors. Extensive experiments across eight LRMs and seven benchmarks demonstrate that \ourmethod consistently outperforms state-of-the-art baselines in AUROC, AUPR, and FPR@95.

Abstract:
We present DMSampler, a framework that accelerates diffusion reinforcement learning by using fast distillation models as its trainingtime sampling engine. It overcomes the key bottleneck of sampling from the policy model—typically requiring around 50 denoising steps—by employing a co-evolving distilled sampler that needs only 4–8 steps, yielding an order-of-magnitude speedup. This approach inherently offers several advantages: it drastically reduces sampling steps, operates without classifier-free guidance to prevent potential optimization bias, and often yields superior sample quality due to more deterministic denoising trajectories. The core of DMSampler is a dual iterative training scheme, where the policy model and the distillation sampler are alternately optimized to convergence. This scheme is enhanced by two key innovations: hybrid distillation sampling, which blends outputs from both models to ensure training stability, and reward-aware distillation, which explicitly preserves high-reward capabilities during knowledge transfer. Extensive experiments on text-to-image and text-to-video generation demonstrate that DMSampler produces a final policy model which achieves state-of-the-art performance—significantly boosting textual accuracy on OCR-specific benchmarks and outperforming existing diffusion RL methods on comprehensive GenEval and VBench benchmarks.

Abstract:
Coreset Selection (CS) aims to extract a small yet representative subset from a large dataset, reducing the complexity of model training. Although CS has been primarily investigated for classification tasks, it is still underexplored for object Reidentification (ReID). In this paper, we first formulate Coreset Selection for Object Re-identification (CSOR) as a joint optimization problem to find both the optimal coreset and the optimal class subset. We identify intra-class diversity as a key factor for effective coreset construction for ReID. Based on this insight, we propose a novel two-stage framework, consisting of Diversity-driven Class Pruning (DCP) and Coverage-Prioritized Sampling (CPS), to address the unique challenges of ReID datasets. First, classes with low feature diversity are pruned to allocate the storage budget to the remaining informative classes. Then, samples are greedily selected in an easy-to-hard class order to maximize feature coverage within each class. Extensive experiments on three person ReID datasets and one vehicle ReID dataset demonstrate that our method consistently outperforms existing CS approaches, establishing a new state-of-the-art in CSOR.

Abstract:
Small language models (SLMs) are attractive for agent deployment, but they struggle to reliably retain and reuse decisionrelevant state information over long interactions. This issue is exacerbated when working memory is maintained via unstructured natural-language summarization. Some recent work addresses this limitation by fine-tuning or distilling smaller models to better construct and utilize working memory, but such approaches typically incur substantial additional training cost and require continuous data construction. We present a training-free working-memory framework for SLM-based agents that makes decision-relevant state explicit: conditioned on the active (sub)goal, the agent maintains a compact information state needed for progress assessment and the currently effective action set. Our approach decomposes tasks into subgoals and organizes memory hierarchically into task-level global memory and subtask-level local memory, where local memory directly conditions SLM action selection and is updated from new observations. To instantiate goal-conditioned memories without parameter updates, we introduce an offline LLM-based induction pipeline that builds a reusable schema once per task family from a small number of representative traces. Training-free refers to no parameter updates of the deployed SLM and no online LLM calls; we only use a one-time offline LLM-based schema induction per task family. On ALFWorld valid_unseen, a 4B SLM achieves 0.910 success, while representative prompting and prior working-memory baselines under the same setting remain below 0.320.

Abstract:
Diffusionbased models on continuous spaces have seen substantial recent progress through the mathematical framework of gradient flows, leveraging the Wasserstein-2 (W_2) metric via the Jordan-Kinderlehrer-Otto (JKO) scheme. Despite the increasing popularity of diffusion models on discrete spaces using continuous-time Markov chains, a parallel theoretical framework based on gradient flows has remained elusive due to intrinsic challenges in translating the W_2 distance directly into these settings. In this work, we propose the first successful approach addressing these challenges, leveraging an appropriate metric W_K on the simplex of probability distributions, which enables us to interpret widely used discrete diffusion paths, such as the discrete heat equation, as gradient flows of specific free-energy functionals. Through this theoretical insight, we introduce a novel methodology for learning diffusion dynamics over discrete spaces, which recovers the underlying functional directly by leveraging first-order optimality conditions for the JKO scheme. The resulting method optimizes a simple quadratic loss, trains extremely fast, does not require individual sample trajectories, and only needs a numerical preprocessing computing W_K-geodesics. We validate our method through extensive numerical experiments on synthetic data, showing that we can recover the underlying functional for a variety of graph classes.

Abstract:
Unified multimodal models seek a shared visual token space that supports both multimodal understanding and image generation. Discrete methods unify the interface via a shared codebook, whereas continuous pipelines often rely on two disparate representations—semantic features (e.g., ViT) for understanding and lowlevel latents (e.g., VAE) for synthesis—resulting in mismatched latent spaces. We propose Twins, a unified continuous token space formed by channel-wise concatenating ViT and VAE features on the same token grid, so the sequence length is unchanged and attention cost does not increase. However, jointly modeling Twins in a Diffusion Transformer exposes a severe optimization imbalance: the model fits the ViT component well but struggles to match the VAE latent distribution. We trace this imbalance to three sources of heterogeneity: frequency bias, intrinsic dimensionality, and condition-aligned vs condition-independent uncertainty. To address it, we adapt a focal regression objective for flow matching that upweights large-error VAE dimensions, better balancing optimization across the ViT and VAE components. On ImageNet, this yields up to 10.57 gFID gain over naive MSE loss without classifier-free guidance. Twins also performs competitively on multimodal understanding benchmarks and improves reconstruction fidelity, narrowing the gap between understanding- and generation-oriented representations.

Abstract:
Parameterefficient fine-tuning (PEFT) is a standard way to adapt multimodal large language models, yet extremely low-rank settings---especially rank-1 LoRA---are often unstable. We show that this instability is not solely due to limited capacity: in the rank-1 regime, optimization is highly sensitive to the update direction. Concretely, pretrained vision and text features form mismatched anisotropic regions, yielding a dominant "gap" direction that acts like a translation component and disproportionately steers early gradients under rank-1 constraints. Analyzing pretrained representations, we identify a modality-gap axis that dominates early gradient flow, while a random rank-1 initialization is unlikely to align with it, leading to weak gradients and training collapse. We propose Gap-Init, a geometry-aware initialization that aligns the rank-1 LoRA direction with an estimated modality-gap vector from a small calibration set, while keeping the initial LoRA update zero. Across multiple vision-language tasks and backbones, Gap-Init consistently stabilizes rank-1 training and can match or outperform strong rank-8 baselines. Our results suggest that at the extreme low-rank limit, initial alignment can matter as much as rank itself.

Abstract:
LiDARbased 3D object detectors are fundamental to autonomous driving, where missed detections pose severe safety risks. While adversarial attacks are crucial for evaluating the robustness of these detectors, existing point-level perturbation methods rarely cause complete object disappearance and prove difficult to implement in physical environments. We introduce OBJVanish, a prompt-driven text-to-3D adversarial generation framework that enables physically realizable attacks by generating 3D object models that are effectively invisible to LiDAR-based 3D object detectors. We first conduct a systematic empirical study of detection vulnerability in LiDAR-based 3D object detectors, revealing multi-object compositions as the dominant factor. Based on this analysis, the proposed framework iteratively refines text prompts—optimizing verbs, objects, and poses—to generate LiDAR-invisible pedestrian instances as representative vulnerable road users under physical constraints. To ensure realizability, the framework operates over a curated pool of representative real-world 3D object models and restricts generation to their valid combinations. Extensive experiments show that OBJVanish consistently evades six state-of-the-art (SOTA) LiDAR-based 3D object detectors in both simulation and real-world physical settings, exposing critical vulnerabilities in safety-critical detection systems.

Abstract:
Recent textto-image models built on large-scale Transformer backbones and flow-based objectives deliver strong text–image alignment and high visual quality, yet often produce overly similar samples under a fixed prompt. Existing diversity-enhancement methods can increase sample-to-sample variation, but typically rely on extra sampling, auxiliary optimization, or careful tuning—incurring non-trivial runtime and memory overhead. We examine intermediate Transformer features and observe that the lowest-frequency (DC) component rapidly homogenizes across seeds early in generation, infivsyinh an early trajectory lock-in that limits downstream variation. Building on this, we propose DC Attenuation for diVersity Enhancement (DAVE), a training-free representation-level intervention that selectively attenuates this component in the early regime. DAVE preserves the sampling pipeline and incurs negligible overhead, improving prompt-consistent diversity without sacrificing image quality.

Abstract:
Simulating deformable objects under rich interactions remains a fundamental challenge for realto-sim robot manipulation, with dynamics jointly driven by environmental effects and robot actions. Existing simulators rely on predefined physics or data-driven dynamics without robot-conditioned control, limiting accuracy, stability, and generalization. This paper presents SoMA, a 3D Gaussian Splat simulator for soft-body manipulation. SoMA couples deformable dynamics, environmental forces, and robot joint actions in a unified latent neural space for end-to-end real-to-sim simulation. Modeling interactions over learned Gaussian splats enables controllable, stable long-horizon manipulation and generalization beyond observed trajectories without predefined physical models. SoMA improves resimulation accuracy and generalization on real-world robot manipulation by 20%, enabling stable simulation of complex tasks such as long-horizon cloth folding.

Abstract:
Spatial proteomics enables singlecell-resolution characterization of protein expression within tissue architecture, playing a critical role in understanding tumor microenvironments and guiding precision medicine. However, current analysis workflows remain fragmented, requiring expert manual orchestration of heterogeneous tools and limiting research scalability and reproducibility. We presentSP-Mind, the first autonomous AI agent designed to unify the spatial proteomics analysis pipeline, from raw multiplexed tissue imaging to downstream phenotype discovery. Equipped with expert-curated biological analysis skills and specialized computational tools, SP-Mind converts natural-language queries into end-to-end analytical workflows without task-specific fine-tuning. To rigorously evaluate its capabilities, we introduceSP-Bench, a comprehensive benchmark spanning diverse tissue types and imaging technologies (fluorescence-based and mass spectrometry imaging), comprising 102 tasks across 18 distinct categories. Through extensive evaluation on SP-Bench and established downstream tasks, SP-Mind achievesstate-of-the-artperformance compared to existing open-source biomedical agent baselines. Code and benchmark will be publicly available upon acceptance.

Abstract:
Generating longhorizon, all-atom molecular dynamics (MD) is difficult due to error accumulation in time-domain autoregressive models (causing drift) and fixed step-size constraints on temporal resolution. We propose BioDynaSpec, which reformulates protein dynamics as spatio-spectral generation: Independent Windowed Fourier Decomposition (IWFD) decomposes trajectories into independent windowed frequency representations, and a generator combines low-to-high frequency autoregression with diffusion denoising to reconstruct continuous motion. We introduce Inter-Residue Frequency Coupling (IRFC) bias, a learnable Gaussian distance bias in attention that injects a resonance-inspired structural prior to stabilize training and improve cross-residue, cross-frequency consistency. On ATLAS, BioDynaSpec improves 250-frame trajectory generation with R_250=1.509 Å (where R_s denotes the mean per-frame C\alpha-RMSE over the first s frames after alignment), reducing error by 60.4% vs. MDGEN and 57.2% vs. ProAR, and achieves the best PCA-2D displacement-profile correlation and stepwise distribution matching. For equilibrium conformational sampling, it achieves Root Mean W_2=1.31, MD PCA W_2=0.90, and Joint PCA W_2=1.19 (50.03%, 35.25%, and 47.58% lower than the next best), while ablations show removing IRFC severely degrades RMSE/MAE and correlation.

Abstract:
AI benchmark ecosystems compress rich evaluation data into aggregate leaderboard scores, but these scores contain substantial measurement noise whose sources and magnitudes remain unquantified. Without systematic methods to measure this noise and separate signal from artifact, it is unclear when benchmark rankings reflect genuine capability differences versus measurement error. We introduce a psychometric framework to methodically test hypotheses about benchmark ecosystem structure and quantify the reliability of common benchmarkderived claims. Applying Confirmatory Factor Analysis and Generalizability Theory to 4,000+ models from the Open LLM Leaderboard, we decompose sources of ranking variance and find that human contributors account for more variance (9%) than model architecture (4.8%), revealing that benchmark noise stems as much from contributor practices as from model characteristics. We further demonstrate methods to assess the reliability of scaling laws by controlling for model size and other confounds. Our findings provide actionable diagnostics for when benchmark rankings can be trusted and establish a measurement framework for evaluating the validity of AI evaluation claims.

Abstract:
Reinforcement Learning from Verifiable Rewards (RLVR) on chainof-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer (faithfulness), and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone (verifiability). Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) While RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question. (2) A small amount of SFT before RLVR can be a remedy for low CIR and SR. (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.

Abstract:
Handling largescale incomplete multi-view data poses a significant challenge in unsupervised representation learning. While anchor-based strategies have alleviated computational burdens, they typically rely on shallow bipartite graphs restricted to pairwise relations, failing to capture complex high-order correlations among samples. Furthermore, existing methods often treat observed and missing instances indiscriminately, ignoring the distributional shifts that lead to systematic bias in consensus anchor learning. To address these limitations, we propose a novel framework tailored for scalability and robustness, termed Hypergraph-Augmented Incomplete Multi-View Clustering (HA-IMVC). Unlike traditional approaches, HA-IMVC constructs a consensus anchor-guided hypergraph that explicitly models group-wise interactions, thereby preserving structural integrity even under high missing rates. Crucially, we incorporate a dual-adaptive reweighting mechanism that calibrates importance at both the view and sample levels. This strategy adaptively penalizes severely incomplete samples to mitigate bias while harmonizing inconsistent views. Extensive experiments on diverse benchmarks demonstrate that HA-IMVC achieves superior clustering accuracy and maintains high efficiency, even in scenarios characterized by severe data incompleteness.

Abstract:
Reinforcement Learning (RL) has become a cornerstone for enhancing the reasoning capabilities of Large Language Models (LLMs). However, standard actorcritic methods, such as PPO, rely on scalar value functions that estimate only the expectation of cumulative returns. This reduction inherently discards higher-order statistical information (e.g., variance and multimodality), leading to inaccurate value estimation and suboptimal credit assignment in complex tasks. While Distributional RL offers a solution by modeling the full return distribution, its application to LLMs remains challenging due to the computational intractability of value-based operations over large vocabularies and the instability and memory burden of off-policy replay mechanisms. In this paper, we propose DisPPO, a novel on-policy framework that seamlessly integrates non-parametric quantile regression into PPO. Theoretically, we prove that our distributional update operator---composed of the \lambda-return Bellman operator and quantile projection---is a contraction mapping in the Wasserstein metric, guaranteeing convergence to a unique fixed point. Empirically, we evaluate DisPPO using Llama and Qwen models across diverse benchmarks, including mathematical reasoning and Text-to-SQL generation. DisPPO consistently outperforms standard PPO and recent group-based baselines in both Pass@1 and Pass@k metrics, demonstrating that distributional critics provide a richer, more robust learning signal for large-scale reasoning models.

Abstract:
3D Gaussian Splatting (3DGS) enables realtime, high-fidelity view synthesis via explicit scene representations and has recently been extended to dynamic scene modeling. In spite of excellent quality and interpretability, we find explicit Dynamic GS often exhibits generalization degradation in large-motion scenes. Motivated by generalization insights from deep learning and the characteristics of Gaussian primitive optimization, we propose an update mixture strategy. This work focuses on two representative open-source explicit Dynamic GS pipelines and our approach includes: (i) a space–time dependent Strictly Sparse Update with additional regularization to stabilize adaptive updates; (ii) a constant-corrected adaptive algorithm to attenuate the over-scaling of primitive gradients, yielding a stable mixture of adaptive and non-adaptive steps; and (iii) attributes mixture via Stochastic Attribute Averaging to mitigate frame-preference under motion disturbances. Experiments show consistent improvements and reduced generalization issues, highlighting the role of non-adaptive updates and the influence from frame-preference in explicit Dynamic GS optimization.

Abstract:
Stereo vision is widely utilized in automotive imagery and 3D reconstruction, creating a demand for compressing stereo images. Existing methods for stereo image compression often employ VAElike architectures based on distortion optimization, leading to subpar perceptual quality at low bitrates. While generative compression excels in high perceptual fidelity at low bitrates, it struggles to maintain consistency across viewpoints, making decoded images less useful for critical downstream tasks. To address this, we introduce SDiD, a distributed stereo image compression architecture based on shared pre-trained diffusion priors. We employ a diffusion prior alignment module to efficiently obtain the main-view-prior from the foundation diffusion, and utilize a prior transformation structure to enable the auxiliary view to achieve reliable and fast perceptual enhancement while maintaining consistency. Through extensive experiments, we demonstrate that SDiD outperforms existing methods in perceptual quality across multiple datasets. Even at extremely low bitrates, SDiD can accurately recover depth information between decoded images. On the InStereo2K dataset, SDiD requires only one-third of the bits compared to the state-of-the-art baseline (0.02 bpp vs. 0.06 bpp) to reconstruct image pairs with similar depth information.

Abstract:
Pearl’s Causal Hierarchy (PCH) is a central framework for reasoning about probabilistic, interventional, and counterfactual statements, yet the satisfiability problem for PCH formulas is computationally intractable in almost all classical settings. We revisit this challenge through the lens of parameterized complexity and identify the first gateways to tractability. Our results include fixedparameter and XP-algorithms for satisfiability in key probabilistic and counterfactual fragments, using parameters such as primal treewidth and the number of variables, together with matching hardness results that map the limits of tractability. Technically, we depart from the dynamic programming paradigm typically employed for treewidth-based algorithms and instead exploit structural characterizations of well-formed causal models, providing a new algorithmic toolkit for causal reasoning.

Abstract:
Hallucinations in large visionlanguage models (LVLMs) remain a critical challenge, with models often generate tokens that fail to align with visual evidence. To address this issue, we propose AFS: Anchor-Final Self-Supervision, a novel framework for hallucination-aware optimization in LVLMs. By leveraging discrepancies between intermediate and final layer predictions, AFS selectively applies self-supervision to visually descriptive tokens, incorporates hallucination-aware token classification, and encourages consistency between intermediate and final layer distributions. Unlike traditional methods that rely on explicit supervision or post-hoc interventions, AFS optimizes the model via Group Relative Policy Optimization (GRPO), using token-specific rewards derived solely from internal model signals. Experiments demonstrate that AFS significantly reduces hallucinations without compromising recall in caption generation. Beyond captioning, AFS excels in discriminative tasks, improving the reliability of object existence predictions and multimodal reasoning. Furthermore, AFS demonstrates strong cross-dataset generalization, transferring effectively across diverse visual domains.

Abstract:
Egocentric reasoning fundamentally differs from thirdperson understanding in LVLMs. Third-person settings offer wide and stable contexts with consistent global regularities, allowing models to utilize broad statistical correlations. In contrast, egocentric scenes are highly dynamic and heterogeneous, where decisive cues are localized and atypical. Therefore, robust egocentric reasoning requires models to focus on ''what is seen now'', i.e., the immediate visual input. However, existing methods tend to exhibit "inertial thinking'', relying excessively on language priors and global context. To address this limitation, we propose a novel three-stage Ego3S framework to ground models' reasoning in interaction evidence. Specifically, before training, we first utilize the counterfactual-based paradigm to select high-value samples that effectively activate multimodal reasoning, thus mitigating the over-reliance on language priors and global context. Moreover, we introduce an interaction-centric reward for reinforcement learning that strengthens the model’s sensitivity to localized interaction cues. Finally, during training, we employ a variance-aware learning schedule that monitors reward distributions to dynamically synchronize data selection with the evolving model competence. Experiments on five datasets show that our Ego3S consistently achieves superior performance using only 26.5% of the training data, while reducing computational costs by over 46%. Code is available at https://anonymous.4open.science/r/Ego3S-70A2.

Abstract:
Recommender systems (RS) may suffer from feedbackdriven exposure concentration, where repeated engagement optimization collapses exposure onto a narrow set of categories, reducing catalog coverage and degrading long-horizon learning. Existing methods are often post hoc and typically lack principled uncertainty-aware risk estimates for regulating exposure under endogenous feedback. We therefore proposeCORAL, a model-agnostic, uncertainty-aware framework that formulates exposure regulation as a constrained sequential decision problem. Specifically, we model self-reinforcing interactions to construct an exposure-saturation state, then derive an upper confidence bound on category-conditioned violation risk from observed history and incorporate it through a state-dependent penalty for adaptive intervention near saturation. Moreover, we provide theoretical guarantees for risk bounds, finite-time recovery, and efficient long-term performance. Extensive experiments on real-world datasets and controlled simulations validate the effectiveness of the proposed framework, which aligns with our theoretical analysis. Our code is available at: https://anonymous.4open.science/r/Coral_Rec-8400.

Abstract:
Blockwise diffusion language models (DLMs) generate multiple tokens in parallel, offering a promising alternative to autoregressive decoding. However, their inference efficiency remains bottlenecked by memory-bound attention in long-context scenarios. Naïve sparse attention is ineffective for DLMs due to the KV inflation problem: different queries select different prefix positions, causing the union of accessed KV pages to remain large. To address this challenge, we observe that block-wise diffusion exhibits locality of representation changes across denoising steps: only a small fraction of tokens (active tokens) undergo significant hidden-state updates, while most tokens (stable tokens) remain nearly unchanged. Based on this insight, we propose LoSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens with large representation changes. This design reduces the number of queries contributing to the union of KV indices, substantially shrinking the KV pages that must be loaded. Across multiple block-wise DLMs and reasoning benchmarks, LoSA preserves near-dense accuracy while significantly improving efficiency, achieving up to 4.14× speedup over dense attention on RTX A6000 GPUs. LoSA also delivers up to 5% average improvement over baselines across all datasets and configurations, demonstrating the effectiveness of the proposed method.

Abstract:
Sparse Mixtureof-Experts (MoE) architectures scale model capacity efficiently but suffer from massive static parameter footprints, creating significant deployment burdens on memory-constrained hardware. Existing post-training pruning methods often rely on scalar statistics, ignoring the representational geometry of expert feature spaces. This leads to sub-optimal resource allocation across layers and the retention of redundant experts. To address this, we propose a Rank-aware Geometric Expert Pruning (RaGEP) framework to compress MoE models by analyzing the geometric properties of expert activations. First, in the inter-layer allocation stage, we introduce a Rank-aware budget allocation mechanism that adaptively assigns expert budgets based on the effective rank of layer-wise representations. Second, in the intra-layer selection stage, we propose a Spectral-Salience Pruning metric that harmonizes subspace orthogonality and activation magnitude to identify high-energy orthogonal experts. Extensive experiments across MoE models of different scales show that our method consistently outperforms state-of-the-art baselines on a diverse set of zero-shot tasks, while reducing model size and inference cost. Code is available at supplementary material.

Abstract:
Mean Field Variational Inference (MFVI) is widely understood to underestimate posterior variance. By analysing conjugate Bayesian linear regression, we show that this characterisation is incomplete: while MFVI underestimates the variance in parameter space, it can overestimate the predictive variance compared to the exact posterior. We show that if the MFVI posterior underestimates predictive variances in some directions, it necessarily overestimates them in others. Crucially, this overestimation occurs in directions where the training data concentrates. This leads to the surprising result that, for a test point drawn from the training distribution, MFVI's expected predictive variance exceeds that of the exact posterior. We demonstrate a pathological case of this effect, where the MFVI posterior fails to reduce predictive variance compared to the prior on i.i.d. data. We connect these results to the Cold Posterior Effect, arguing that varying the temperature can correct this overestimation, yielding predictions closer to those of the exact posterior. We validate our theory on synthetic and realworld regression tasks.

Abstract:
While previous research has documented the sensitivity of Large Language Models (LLMs) to surfacelevel performance degradation, the underlying impact on internal representations and learning dynamics remains under-explored. In this work, we study this question using a controlled setup with paired reasoning tasks that are logically identical but expressed either in an abstract formal language (FL) or in natural language (NL). We find that converting FL problems into NL consistently degrades reasoning accuracy. More importantly, we show that FL and NL inputs activate largely separate internal representations and exhibit weak learning transfer between them. We refer to this phenomenon as reasoning compartmentalization. To test whether this compartmentalization can be mitigated, we introduce abstraction-based alignment, where models are trained to translate NL inputs into their corresponding FL forms. While this significantly improves reasoning performance, FL and NL representations remain largely distinct, and learning transfer across formulations remains limited. Through activation-level interventions, we further show that performance improvements arise not from representational fusion, but from improved routing. This suggests that abstraction alleviates formulation sensitivity by strengthening connections between formulation-specific reasoning pathways, rather than by aligning their representations.

Abstract:
Existing lifelong person reidentification (Re-ID) methods assume that each identity maintains a relatively stable appearance distribution over time. However, in real-world scenarios, identities often reappear asynchronously with substantial clothing changes, which is not modeled in existing lifelong Re-ID formulations. We therefore study a realistic lifelong cloth-changing Re-ID (LCC) setting, in which identities reappear asynchronously under substantial clothing changes. This setting leads to two core difficulties: the model must acquire new identities while adapting representations of recurring ones, and at the same time remain robust to the substantial representation drift induced by clothing changes over time, which undermines cross-temporal identity consistency. To address these, we develop a framework that disentangles identity-intrinsic representations from clothing-induced biases, enabling identity modeling beyond appearance changes. We further introduce a Dynamic Identity-Anchor Alignment to maintain stable identity anchors under stage-wise distribution shifts. Experiments on the LTCC and PRCC benchmarks demonstrate superior performance and representational stability across multiple learning stages.

Abstract:
This paper develops a general transfer learning framework for nonparametric regression with heterogeneous data consisting of multiple groups. Under the assumption that groups share a common structure along with groupspecific deviations in additive form, the proposed method employs a two-stage offset learning procedure: the first stage pools data from all groups to estimate an overall mean function, and the second stage estimates offsets for each group, yielding final group-level estimators through additive combination. Non-asymptotic upper bounds are established for the proposed framework, covering a broad class of nonparametric estimators under mild complexity and noise conditions. When instantiated with deep ReLU networks, explicit convergence rates are derived under hierarchical composition models, demonstrating the ability to overcome the curse of dimensionality. Conditions that enable positive transfer with faster rates are considered, including learning with simpler functions and data augmentation through pooling samples across groups. Various simulations and real-data experiments further validate the effectiveness of the proposed method.

Abstract:
Recent advances in zeroshot referring image segmentation (RIS), driven by foundation models such as SAM and CLIP, have improved cross-modal alignment between visual regions and natural language expressions. Nevertheless, selecting the correct segmentation proposal remains challenging, as existing methods typically rely on independent proposal scoring and lack contextual reasoning among visually similar candidates. To address this limitation, we propose RefChess, a training-free framework that reformulates proposal selection as a decision-making problem under contextual perturbations rather than a single-step ranking task. RefChess models each proposal as a candidate chess move and applies Monte-Carlo Tree Search to evaluate its robustness by simulating interactions with competing regions, guided by a stability-aware reward that integrates language decomposition, vision–language similarity, object-centric cues, and spatial guidance signals. Experimental results on standard RIS benchmarks indicate that this decision-centric formulation leads to consistent improvements in robustness and referring segmentation performance. Code is available at \urlanonymous URL.

Abstract:
Domain adaptation theory studies upper bounds on the target risk in order to mitigate performance loss of machine learning models due to distribution shift. In this paper, we take a closer look at the optimization of one such bound based on optimal transport (OT) and propose various strategies that improve the optimization in practice. We first introduce gradual shift and probabilistic margin assumptions to control the incomputable entanglement term that appears in the bounds. We prove that under these assumptions, better optimization of the computable part of the bound can translate to better target accuracies. Motivated by this fact, we tighten the bound, via importance weighting of the source (output) distribution, to obtain the weighted Wasserstein regularized risk (\mathrmW^2\mathrmR^2), that is often easier to minimize than the original bound. \mathrmW^2\mathrmR^2 is shown to be equivalent to an unbalanced OT problem, which in the limit converges to a nearest neighbor based alignment strategy. We highlight the tradeoffs faced with such an approach and show that a suitably regularized \mathrmW^2\mathrmR^2 improves over the state of the art and is robust to multiple distribution shifts under different models, confirming, moreover, the validity of our assumptions.

Abstract:
We present MMDReg, a novel correspondence-free approach to point-cloud registration that is differentiable and has linear computational complexity in the number of points. We model registration as a nonlinear least-squares problem based on the Maximum Mean Discrepancy, approximated using random Fourier features. The resulting objective can be solved efficiently with standard methods such as Levenberg–Marquardt, and the solution is differentiable via the implicit function theorem. This allows MMD-Reg to be used as a differentiable optimization layer within end-to-end trainable models, supporting registration under challenging conditions such as poor initial alignment and partial overlap. We demonstrate this Neural MMD-Reg formulation by integrating the layer with a set transformer, training the resulting model in supervised and unsupervised settings, and comparing its performance against recent learning-based methods. We also evaluate standalone MMD-Reg, comparing its accuracy and scalability against widely used non-learning-based registration methods.

Abstract:
Existing neural combinatorial optimization (CO) solvers often rely heavily on expensive labeled data and additional postprocessing to produce feasible solutions. Research into mixed integer linear programs (MILPs) is particularly limited due to the lack of effective heuristics for feasibility and the challenge of modeling mixed-type variables for neural solvers. To address these issues, we propose a novel unsupervised Langevin sampler for solving MILPs. Our framework learns only integer variables, while continuous variables are solved using an exact linear programming solver, thus isolating the combinatorial hardness of the problem and avoiding unnecessary modeling complexity. The sampler is based on Langevin dynamics and incorporates both objective optimization and constraint satisfaction into a unified energy function, enabling the model to jointly learn feasibility and optimality. Experiments demonstrate that our method achieves 100% feasibility without expensive post-processing and matches or outperforms supervised solvers on benchmark datasets, highlighting its effectiveness and scalability.

Abstract:
Continuous Variational Autoencoders (VAEs) serve as the fundamental continuous tokenizer for modern neural audio generation systems, enabling highfidelity reconstruction while providing a compact, smooth latent space for downstream generative priors. However, continuous VAEs face a fundamental conflict when balancing compression rate, reconstruction fidelity, and latent space topology—a challenge we formalize as the Rate-Distortion-Regularity Trilemma. This trilemma stems from a critical topological mismatch: the prevailing isotropic Gaussian prior in standard VAEs imposes a flat latent geometry that fails to accommodate audio's hierarchical nature, where low-frequency components are structured and compressible while high-frequency components are stochastic and incompressible, leading to disordered information packing where crucial semantic features are randomly interleaved with high-entropy noise. To resolve this challenge, we propose Structured Topology-Aware Regularization (STAR), a general training strategy that reshapes latent space geometry by imposing a growth-based constraint field, routing structural and textural information into channel subspaces with matching capacities. STAR is applicable to any VAE architecture and effectively resolves the trilemma, as demonstrated in CNN-based VAEs. To fully exploit STAR's potential, we present STAR-VAE, combining STAR with a hybrid CNN-Mamba architecture that synergizes local feature extraction with linear-complexity global context modeling, achieving state-of-the-art performance. We further propose STAR-Gen, an LLM-based Flow Matching framework that leverages STAR-VAE's structured latent space for high-fidelity generation without suffering from vector quantization artifacts. Empirical results demonstrate that STAR-VAE successfully resolves the trilemma, achieving state-of-the-art reconstruction fidelity and enhanced semantic information preservation across diverse audio domains. The structured latent space improves both traditional diffusion models and our STAR-Gen paradigm, achieving state-of-the-art performance in text-to-audio generation. The project page is available at~\urlhttps://STAR-VAE.github.io.

Abstract:
Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft–verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safetyaware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety–efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06× inference speedup, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.

Abstract:
Differentiable logic gate networks map learned models directly to gatelevel circuits, enabling ultra-low-latency inference, yet their logic footprint often exceeds FPGA capacity budgets. Tightly fitting a trained model to a target FPGA requires a post-training mechanism to trade off network complexity and accuracy—analogous to pruning in standard neural networks. To this end, we introduceunit tying: a simplification that forces selected gates to constants (0 or 1), enabling constant propagation and downstream logic elimination. However, we observe that naively extending pruning criteria to logic networks is unreliable under such near-discrete modifications. We therefore propose a two-stage algorithm for unit tying: (i) a fast Gauss–Newton screening step under a teacher-referenced logit-distortion objective that constructs a high-recall overshoot set and (ii) a refinement step that corrects approximation and interaction-driven errors using a small number of finite-difference evaluations. On CIFAR-10 and MNIST, our method consistently improves the accuracy–area trade-off over common saliency baselines, yielding substantial post-synthesis LUT reductions of up to 48% on CIFAR-10 and 43% on MNIST, with modest accuracy degradation.

Abstract:
Despite their strong general capabilities, large language models (LLMs) often remain unreliable when outputs must be numerically precise. A key reason is the training objective: standard crossentropy treats numeric tokens as unstructured categories and ignores the metric structure of their values. We address this mismatch by proposingSmoothMaximumMeanDiscrepancy (SMMD), which builds on the classic MMD by incorporating value-distance kernels over numeric tokens and graph-based smoothness. With this kernel defined over a numeric sub-vocabulary, SMMD aligns the predicted numeric distribution to the target via kernel matching and smooths the prediction--target residual over the induced kernel graph to encourage local consistency. We evaluate SMMD on four numeric-target tasks---mathematical reasoning, arithmetic calculation, clock-time recognition, and chart question answering---across multiple open-weight LLM and VLM backbones. SMMD consistently improves accuracy over both cross-entropy and recent numeric-target losses; analyses show complementary effects between MMD and smoothness and underscore the importance of distance-based kernel design.

Abstract:
Generative Flow Networks (GFNs) offer a powerful paradigm for diverse sampling, yet they often exhibit instability and poor convergence when applied to stochastic or sparsereward environments. To mitigate the high variance inherent in these settings, we propose a fundamental re-framing of the GFlowNet training objective within the frequency domain. We present Spectral Time-Dependent GFlowNets (ST-GFNs), a framework that leverages Fourier analysis to enforce smoothness and stability in learned policies. Our theoretical analysis proves that our proposed spectral loss is mathematically equivalent to regularized value iteration, acting as a principled low-pass filter that separates signal from noise. Furthermore, we tackle the challenge of exploration in sparse landscapes by introducing a novel autocorrelated intrinsic reward derived from the Wiener-Khinchin theorem. Through extensive experiments ranging from adversarial games and noisy sequence generation to high-dimensional single-cell perturbation modelling, we demonstrate that ST-GFNs significantly outperform existing baselines in terms of robustness, sample efficiency, and mode discovery.

Abstract:
Decentralized Federated Learning (DFL) enables collaborative model training across connected clients without a central server, effectively mitigating communication bottlenecks and avoiding the single point of failure in Centralized Federated Learning (CFL). However, existing DFL methods mostly focus on parameter averaging with compromised update directions, which limits their performance potential due to insufficient exploration of the loss landscape, especially for complex models. We observe that layer exchanges among clients enhance exploration while introducing instability due to highly diverse update directions. To address these limitations, we propose Decentralized Layerwise Fusion (DecFus), the first DFL framework that unifies layer-level exchange and averaging to balance exploration and exploitation. DecFus dynamically transitions the decentralized training process from exploration-dominant to exploitation-dominant phases, guided by the loss variance among connected neighbors. Furthermore, a layer-wise fusion strategy, informed by pairwise cosine similarity, categorizes all layers into two groups: an exchange group for exploration and an averaging group for exploitation. Specifically, we theoretically establish the convergence of DecFus without relying on the common assumption in existing literature that the aggregation matrix must be doubly stochastic. Extensive experiments demonstrate that DecFus achieves superior performance in both IID and non-IID scenarios, substantially outperforming existing CFL and DFL methods.

Abstract:
Diffusion models have become a dominant paradigm for highquality generative modeling, while post-training is essential for adapting them to diverse downstream applications. However, post-training of large diffusion models is still challenging due to the prohibitive memory footprints and slow training speed, which existing parameter-efficient fine-tuning methods only partially address. To overcome these limitations, we propose FourTune, an efficient post-training framework for diffusion models based on an end-to-end W4A4G4 paradigm. FourTune introduces a triple-branch hybrid pipeline that augments the standard LoRA architecture with a frozen numerical stabilizer to isolate quantization-sensitive outliers, enabling stable training under native 4-bit computation. In addition, FourTune employs hardware-efficient block-wise quantization and customized fused kernels to support efficient quantized backpropagation and reduce memory bandwidth overhead. Across customization, reinforcement learning, and distillation tasks, FourTune matches the quality of full-precision fine-tuning. On FLUX.1-dev (12B), FourTune reduces memory overhead by 2.25× and increases end-to-end training throughput by 2.27× compared to BF16 LoRA.

Abstract:
Existing VLA models frequently fail in robotic manipulation tasks, with poorly structured fault types that often require expert diagnosis.While VLMs offer strong explanatory capabilities, their effectiveness in assisting VLAs is limited by their unclear role in diagnostics and inadequate collaboration mechanisms.To address this, we introduce VLAFixBench, a fault evaluation dataset that spans perception, planning, and control failures, and provides annotations for task stages, fault types, and spatiotemporal repair strategies.We further propose FaultEval, a static-to-dynamic-to-real evaluation framework that benchmarks 20 VLMs across multiple fault-related dimensions.Building on these insights, we design a VLM–VLA collaboration mechanism that localizes spatiotemporal deviations and rolls back task execution to enable targeted recovery.Experiments show that FaultEval reliably characterizes VLM-based closed-loop diagnosis and repair.The upper-bound analysis using human expert intervention shows that an idealized feedback loop can improve task success rates by 13% on LIBERO and 35% on real-world robots.

Abstract:
Singleview 3D object reconstruction presents a formidable challenge in computer vision due to the inherent limitations of information obtainable from a solitary viewpoint. Recent 3D Gaussian Splatting (3DGS) inspired approaches perform a feed-forward way of learning a neural network that predicts 3D Gaussians which compose the 3D object, given a single image. However, they often struggle with occlusions and exhibit high sensitivity to small changes in input viewpoint, leading to inconsistencies and blurry artifacts in novel view renderings. Our method leverages 3DGS and introduces a new learning scheme that continuously adapts to input viewpoints. To address inherent continuity of camera viewpoints that are represented by polar and azimuthal angles, we use Neural Ordinary Differential Equations to continuously model filter subspace of neural network, thus seamlessly embedding inductive bias of perspective distortions into its structure. By continuously adapting to view-specific features, our approach fosters view consistency in 3D reconstruction, allowing better coherency and accuracy across different angles. Experiments demonstrate that our model outperforms previous methods on multiple single-view 3D reconstruction benchmark datasets and excels in extrapolating to unseen camera angles and categories.

Abstract:
The GromovWasserstein (GW) distance enables comparison across different spaces but remains fragile to structural noise due to its global quadratic coupling. Existing robust extensions primarily rely on node-centric mass relaxation. However, we argue that this strategy is far from sufficient: it only addresses node-induced structural noise (outliers) while neglecting edge-induced distortions where spurious connections exist between valid nodes. To overcome this limitation, we propose the Structurally Robust Gromov-Wasserstein (SRGW) distance, a novel formulation that adaptively filters geometric distortions during optimization. By introducing a structure-aware dual-mask mechanism, our method effectively isolates these stubborn structural outliers while preserving strict marginal constraints for balanced transport. We solve this objective using a Mask-Guided GW Algorithm, which jointly optimizes the transport plan and the structural noise filters. We provide a rigorous theoretical analysis proving that our algorithm converges to a critical point under the Kurdyka-Łojasiewicz framework. Extensive experiments on synthetic geometric matching and real-world subgraph alignment benchmarks demonstrate that Mask-Guided GW achieves superior alignment quality, particularly under severe structural noise.

Abstract:
AI models readily refuse explicitly unlawful requests, but realworld illegality often depends on context. We evaluate frontier models on contextual illegality across four corporate law domains in which routine actions—editing documents, trading stock, requesting payment, approving communications—become unlawful due to triggers such as pending investigations or bankruptcy filings. We study both chat and agentic settings and compare results to a human baseline. The best-performing models achieved near-zero compliance with illegal requests while maintaining high compliance with legal ones, though performance varied sharply by domain. We also identify distinct failure modes such as excessive refusal of legal requests and find improved performance from reasoning models and agentic environments. By utilizing the structure of contextual illegality to create controlled evaluations, our methodology provides empirical grounding for emerging research on law-following AI and extends naturally to additional legal domains.

Abstract:
Many realworld graphs support multiple predictive tasks over the same underlying structure, creating an opportunity to reuse supervision across node classification (NC) and link prediction (LP). However, existing evaluations often rely on incompatible splits, observed-graph assumptions, and negative sampling rules, making conclusions about same-graph cross-task transfer unreliable. We formalize same-graph NC–LP transfer and propose a leakage-free protocol that fixes node and edge splits, uses a shared message-passing graph that excludes evaluated edges, and employs fixed negatives for LP. Across three backbones (GCN, GraphSAGE, GPS), we find transfer is strongly directional and predictable: NC\toLP is consistently beneficial on homophilic graphs, while LP\toNC is fragile and can even degrade accuracy under naive representation reuse. LP\toNC becomes reliably positive mainly in a structure-dominant regime where LP is easy but NC is unsaturated, suggesting LP acts as structural pretraining. Finally, we introduce CoTask Score (CTS) to summarize joint NC+LP utility when a shared encoder must serve both tasks, and show that simple dataset statistics, especially homophily, can guide mechanism choice and help avoid negative transfer.

Abstract:
Recent works have explored integrating VisionLanguage Models (VLMs) with classical planners that rely on symbolic representations of planning problem to generate long-horizon plans for complex embodied tasks. However, in open-ended environments, these symbolic representations obtained from perception are often incomplete, leading to suboptimal performance. To address this, we introduce SCOPE, a self-adaptive symbolic planning framework that supports refining action plans and evolving the symbolic world—the symbolic representations of open-ended environments. SCOPE comprises two synergistic modules: a Symbolic Execution Simulator (SESim) that conducts symbolic validation and real execution of action plans, leveraging the feedback to refine the plans and evolve the symbolic world; and a Self-Adaptive Symbolic Memory (SASMem) that further distills feedback into evolving symbolic knowledge to enhance long-horizon planning and modeling of the symbolic world. Experiments in open-ended environments show that SCOPE significantly improves the completeness of the symbolic world, the success rate of plans under environment perturbations, and cross-task grounding and adaptability across diverse embodied scenarios.

Abstract:
Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at training time and whose temporal grounding remains poorly understood. In this work, we study the impact of pretraining dynamics on the acquisition of timesensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter language models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pretraining. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pretraining yields improved factual freshness, while shuffled pretraining peaks on older data, possibly due to increased factual repetition. These findings, along with the release of our checkpoints and datasets, provide a foundation for future research on continual learning for large language models.

Abstract:
Robust conformal prediction (RCP) addresses confidence miscalibration in machine learning models by producing prediction sets with guaranteed coverage — these sets are guaranteed to include the true label with a userspecified high probability, even under worst-case noise. Recent works use randomized smoothing, as it provides robustness for black-box models at larger radii. Currently, there exist two setups for smoothing-based RCP: one requires extensive Monte Carlo sampling at calibration and test time but results in smaller prediction sets; the other setup produces larger prediction sets but uses a single sample at both stages. Since calibration is a one-time preprocessing step, it can accommodate substantially higher computational overhead than inference. Inspired by that, we offer procedures in between: we increase the sample rate at calibration time while using only one or few samples at test time. Increased calibration-time sampling can reduce the size of the prediction sets. With a large enough test set (which is often the case in production), our Front-Loaded RCPs have the same computational complexity as the state of the art, while producing considerably smaller sets at larger radii.

Abstract:
Automatic loop invariant generation remains a challenging problem in program verification, particularly for memorymanipulating programs where shape invariants are required to characterize heap-allocated structures and memory layouts. While existing approaches succeed on numerical invariants, they achieve limited accuracy on shape invariants. We hypothesize that this stems from the need to reason about memory state evolution—information that remains implicit in source code. To address this, we ground LLM reasoning in symbolic execution traces that explicitly capture such transitions. We propose \textscMalice, a two-stage framework incorporating these traces: (1) guided multi-step reasoning that predicts invariants via chain-of-thought analysis of traces, and (2) agentic iterative refinement that corrects candidates through verification tool feedback. Evaluated on LIG-MM+, a benchmark featuring common operations on typical memory structures, \textscMalice substantially outperforms existing approaches.

Abstract:
While frontier formal mathematics systems now routinely develop repositoryscale proof engineering artifacts requiring multi-file coordination and semantic correctness beyond compilation, existing evaluation benchmarks remain focused on isolated theorem proving. We introduce Automated Proof Engineering (APE), the first systematic framework for evaluating repository-scale proof engineering through dual verification that validates both syntactic compilation and semantic requirement satisfaction in pinned library environments. We present a complete infrastructure comprising APE-Bench, which automatically extracts proof engineering tasks from real library commit histories, and APE-Harness, a unified execution framework based on task contract abstraction. This contract-based design enables standardized evaluation across diverse formal mathematics tasks and fair systematic comparison of different agent implementations (including our APE-Agent reference scaffold alongside Claude Code and Codex CLI) on identical task specifications. We demonstrate the framework's effectiveness through comprehensive evaluation. All code, benchmarks, and infrastructure will be open-sourced.

Abstract:
Cotraining, which combines limited in-domain real-world data with abundant surrogate data such as simulation or cross-embodiment demonstrations, has been widely adopted for training generative visuomotor robot policies. Despite its empirical success, the mechanisms underlying when and why co-training works remain poorly understood. Starting from theoretical analysis and a toy example, we identify these two key intrinsic factors for end-to-end co-training systems:balanced mixing ratio" andstructured representation alignment". We propose an explanation that when simulation and real-world data are combined with a balanced mixing ratio, co-training naturally learns representations that are aligned across domains while remaining domain-distinguishable, enabling effective knowledge transfer without sacrificing real-world adaptation, which we refer to as structured representation alignment. We validate the hypothesis with comprehensive sim-and-sim and sim-and-real robotic experiments, showing that structured representation alignment reliably emerges under balanced mixing ratios and largely determines downstream performance. Benchmarking several recent co-training methods further supports this explanation. Guided by our analysis, we propose a simple combination of co-training techniques that jointly promote alignment and domain discernibility, achieving substantial improvements over prior approaches.

Abstract:
Disentangled representation learning is a powerful paradigm for robust attribute prediction. While recent methods address attribute correlations, hidden correlations remain underexplored, where data under the value of a certain attribute exhibit underlying modes correlated with other attributes. To preserve mode information and achieve disentanglement, we jointly discover modes and enforce modebased conditional independence. Yet, the interdependency between these two modules may lead to error amplification under naive iterations. We propose Coordinated Disentanglement with Iterative mode Discovery (CoDID), an end-to-end framework featuring a dynamic architecture that adapts to evolving number of modes, and a coordination mechanism that mitigates error amplification via meta-optimization. Empirical results demonstrate the state-of-the-art performance on diverse tasks. Codes are available at anonymous Github https://anonymous.4open.science/r/CoDID-B038.

Abstract:
Accurate seasonal‑to‑interannual climate forecasting provides critical support for decisionmaking in agriculture, energy, and disaster preparedness. Current deterministic models often fail to capture climate uncertainty, while existing generative approaches oversimplify the system by neglecting key spatiotemporal dependencies and cross-scale interactions. To address these limitations, we introduce ClimateAR, an AutoRegressive generative model for probabilistic seasonal-to-interannual Climate forecasting. The framework incorporates two novel components: (1) an aligned tokenizer that bridges and aligns heterogeneous simulation and real-world data to improve transferability across domains, and (2) a mixed-scale conditioning mechanism that captures multi-scale climate interactions for robust probabilistic forecasting. Extensive evaluations on the ERA5 reanalysis dataset show that ClimateAR achieves state-of-the-art performance, improving anomaly correlation skill by 37.56% on average compared to leading baselines. The Code is available at https://anonymous.4open.science/r/ClimateAR-956D.

Abstract:
Masked Diffusion Models (MDMs) as language models generate by iteratively unmasking tokens, yet their performance crucially depends on the inferencetime order of unmasking. Conventional methods such as confidence-based sampling are short-sighted, focusing on local optimization which neglects test-time computation and allows early decoding errors to cascade. We propose Lookahead Unmasking (LookUM), which addresses these concerns by guiding sampling path with a verifier over alternative unmasking orders, without requiring an external reward model. Our framework couples (i) a path generator that proposes paths by sampling from pools of unmasking sets with (ii) a verifier that computes the uncertainty of the proposed paths and performs importance sampling to subsequently select the final paths. Erroneous unmasking inflates sequence-level uncertainty, and our method exploits this to avoid error-prone trajectories. We validate our framework across six benchmarks, such as mathematics, planning, and coding, and demonstrate consistent performance improvements. LookUM requires only two to three paths to achieve peak performance. LLaDA with LookUM matches the performance of RL-tuned LLaDA 1.5 and yields additional gains when applied to LLaDA 1.5, suggesting complementarity with reinforcement learning.

Abstract:
Discrete diffusion models have recently emerged as a compelling alternative for language generation, enabling efficient nonautoregressive sampling while achieving strong empirical performance. A key design choice in discrete diffusion---absent in most continuous diffusion formulations---is the forward corruption kernel, with masked/absorbing corruption now dominating practice. Despite this empirical preference, there is limited statistical theory explaining when and why masking should outperform alternative kernels such as uniform replacement. In this paper, we take a step toward closing this gap from a statistical learning perspective. Our analysis establishes generalization bounds and, through an explicit comparison across different forward corruption kernels, reveals a central advantage of masking: it scales with the effective data support rather than the full ambient state space, thereby mitigating the curse of state space cardinality. We further derive structure-aware refinements that capture how concentration and sparsity in real sequential data sharpen the sample complexities. Together, these results offer a principled explanation for the empirical strength of masked diffusion and provide guidance for forward-kernel design in discrete generative modeling.

Abstract:
EEG foundation models aim to learn transferable representations, yet EEG recordings are dominated by highfrequency noise and large cross-subject variability. Existing pretraining strategies such as masked autoencoding or autoregressive modeling often treat waveform reconstruction as the learning signal, making the objective sensitive to stochastic fluctuations rather than consistent neurophysiological structure. To address this overlap, we propose PATCHCODE, a region-aware discrete predictive learning framework that keeps the encoder input continuous while introducing region-aware discrete codes as stable supervision targets. We pretrain a masked predictive encoder on continuous EEG patches with dual-granularity learning: it predicts missing patch-level representations to preserve fine spatiotemporal structure, while aligning them to discretized code targets from a frozen tokenizer to anchor robust semantics. Extensive Experiments across ten downstream datasets spanning emotion recognition, motor imagery, sleep staging, and seizure detection demonstrate that PATCHCODE achieves competitive performance compared to state-of-the-art baselines, with notable gains in data efficiency under limited labels. Our code is available at https://anonymous.4open.science/r/PATCHCODE-323D/.

Abstract:
Existing preference datasets for textto-image (T2I) models typically store only the final winner/loser images. This representation is insufficient for rectified flow (RF) models, whose generation is naturally indexed by a specific prior noise sample and follows a nearly straight denoising trajectory. In contrast, prior DPO-style alignment for diffusion models commonly estimates trajectories using an independent forward noising process, which can be mismatched to the true reverse dynamics and introduces unnecessary variance. We propose Prior Noise-Aware Preference Optimization (PNAPO), an off-policy alignment framework specialized for rectified flow. PNAPO augments preference data by retaining the paired prior noises used to generate each winner/loser image, turning the standard (prompt, winner, loser) triplet into a sextuple. Leveraging the straight-line property of RF, we estimate intermediate states via noise–image interpolation, which constrains the trajectory estimation space and yields a tighter surrogate objective for preference optimization. In addition, we introduce a dynamic regularization strategy that adapts the DPO temperature/regularization based on (i) the reward gap between winner and loser and (ii) training progress, improving stability and sample efficiency. Experiments on state-of-the-art RF T2I backbones (FLUX.1-dev and SD3-Medium) show that PNAPO consistently improves preference metrics while substantially reducing training compute compared to Diffusion-DPO.

Abstract:
Multimodal sentiment analysis aims to infer human emotions by integrating signals from diverse modalities. However, missing modalities are common in realworld applications due to sensor failure, data corruption, or privacy concerns. Existing approaches typically follow two main paradigms: recovery-based and non-recovery-based methods. This dichotomy results in two critical limitations: I) computational inefficiency and semantic inconsistency (recovery-based methods rely on heavy generators that incur prohibitive inference latency and risk semantic drift due to lack of class-level priors); II) lack of instance specificity (non-recovery-based methods rely on static global mappings that fail to capture sample-specific affective cues). To address these gaps, we propose Adaptive Prototype Imputation (API). To mitigate I), we introduceSemantic-anchored Class-Temporal Prototype Estimation (SCOPE)to construct non-trainable prototypes as stable semantic anchors, ensuring semantic reliability. To resolve II), we designDirectional Instance-Adaptive Affine Modulation (DIAM)to dynamically modulate these anchors via direction-specific affine transformations, capturing instance-unique affective characteristics without generative overhead. Experimental results on CMU-MOSI and CMU-MOSEI demonstrate that API outperforms state-of-the-art baselines, establishing a robust and lightweight prototype-centric paradigm for multimodal sentiment analysis.

Abstract:
Domain generalized object detection (DGOD) aims to train an object detector on a single source domain and generalize it to unseen target domains. Recent advances in DGOD have increasingly exploited vision foundation models (VFMs) via parameterefficient finetuning strategies. However, existing approaches typically adapt VFMs with fixed, style-agnostic parameters, overlooking that different visual styles may induce distinct task discrepancies. To address this challenge, we propose SCoA, a novel Style Conditioned Adaptation framework for dynamic, style-aware task compensation. Specifically, we introduce a Spectral Style Modeling (SSM) module that preserves local style cues via a memory-based mechanism, enabling diverse style characterization from a single source domain. Conditioned on the extracted style signals, we design a Mixture-of-Tokens Adaptation (MTA) mechanism, which maintains multiple adaptation tokens and dynamically routes each sample to an optimal combination of tokens, thereby explicitly modeling style-dependent task mismatches. In addition, we propose a Style-Conditioned Query Refinement (SCQR) module that injects style information into object queries, enabling a style-aware detection head. By jointly integrating these components, SCoA allows the model to follow style-specific adaptation trajectories, achieving effective and flexible task compensation for VFM-based DGOD. Extensive experiments demonstrate that the proposed SCoA achieves state-of-the-art performance across two challenging scenarios.

Abstract:
VisionText Compression (VTC) offers a scalable path for long-context multimodal modeling by rendering textual data into dense visual tokens. While recent Vision-Language Models (VLMs) demonstrate high decoding fidelity (OCR) on such inputs, they exhibit a severe reasoning gap: models that reason robustly on native text often fail on visually compressed equivalents, particularly in long-range retrieval and multi-step deduction. We identify a phenomenon of post-training transfer failure, where standard supervised fine-tuning and reinforcement learning on visual prompts yield marginal gains compared to their textual counterparts. To address this, we propose CoRe (Coordinated Reasoning), a training framework that enforces lockstep consistency between the reasoning processes of textual and visual modalities. By treating the text-conditioned policy as a dynamic anchor, CoRe aligns the visual-conditioned policy via step-wise distribution matching, seamlessly integrating into both SFT and RL pipelines. Extensive evaluations across mathematical reasoning, long-context memory, and tabular retrieval benchmarks show that CoRe significantly outperforms standard visual post-training, recovering up to 70% of the performance gap relative to the textual upper bound and effectively activating latent reasoning capabilities in the compressed visual modality.

Abstract:
Domain adaptation transforms generalpurpose LLMs into specialized experts for specific domains or tasks. This process typically follows a two-stage recipe: first, Supervised Fine-Tuning (SFT) to inject domain knowledge or induce specific behaviors (e.g., reasoning patterns), followed by Reinforcement Learning (RL) for self-improvement. However,does RL truly require a pre-SFT as cold-start phase?We argue that pre-SFT is inherently problematic: (1) it indiscriminately reinforces knowledge and behaviors from references regardless of whether the LLM has already acquired them, leading to distribution contraction that constrains subsequent exploration; (2) it introduces substantial overhead in multi-stage training and data curation. While our pilot studies reveal that, without pre-SFT, RL struggles to acquire off-policy knowledge from scratch, we bridge this gap withOne-stage Policy Optimization (OnePO). OnePO is an SFT-free paradigm that enables LLMs to selectively internalize off-policy knowledge and behaviors directly during RL evolution. Crucially, we design anAdaptive Objective Evolutionmechanism for rapid knowledge injection and aTeacher Retirementmechanism that prevents off-policy anchoring. Experiments demonstrate that OnePO successfully transforms the Qwen3-8B-Base model into a high-performance medical LLM in one RL stage, achieving competitive performance on HealthBench (67.2) and other benchmarks using only 20K samples. This highlights SFT-free RL can efficiently cultivate domain experts without the need for traditional multi-stage pipelines.

Abstract:
Modern neural networks often achieve high accuracy on their training distribution but degrade sharply under distribution shifts. We address this problem throughRobust Satisficing(RS), an optimization objective that seeks parameters which attain a target level of indistribution performance while minimizingfragility, defined as the rate at which performance deteriorates as the data distribution departs from training. We develop a gradient-based algorithm,Iterative Robust Satisficing(IRS), that directly optimizes this criterion. Across a range of synthetic and real-world distribution shifts, including long-tailed image classification, group shifts induced by spurious correlations, and natural shifts in tabular regression, IRS consistently improves performance on minority and worst-case groups without sacrificing overall accuracy. Notably, IRS achieves these robustness gains with a per-step computational cost similar to standard stochastic gradient descent and requires only a single forward and backward pass per update. Together, these results suggest that minimizing fragility provides a practical and effective alternative to existing robust training methods for learning models that remain reliable under distribution shift.

Abstract:
Programmingby-Example (PBE), as a typical few-shot inductive reasoning paradigm, aims to synthesize corresponding algorithms from a set of input-output examples. Although Large Language Models (LLMs) have demonstrated strong program synthesis potential, they still remain ineffective when handling complex PBE tasks. Specifically, LLMs often struggle to accurately grasp the underlying intent of examples, resulting in synthesized programs that either partially satisfy the examples or completely deviate from the target. To address these limitations, we introduce a process-supervised reinforcement learning method that provides fine-grained feedback during the synthesis process, improving the ability of LLMs to capture the intended behavior of provided examples. Firstly, we develop a reasoning tree construction method that is used to build a PBE process supervision dataset. Subsequently, we train a process reward model through preference learning to evaluate the effectiveness of reasoning steps. Finally, we introduce a curriculum learning strategy based on the difficulty of PBE tasks, using Proximal Policy Optimization (PPO) to optimize the model. Experimental results on representative PBE benchmarks show that our approach achieves an average pass rate of 56.61%, significantly outperforming the state-of-the-art baseline by 8.73%.

Abstract:
We introduce STEERINGSAFETY, a benchmark for evaluating representation steering methods across nine safety perspectives spanning 18 datasets. While prior work highlights general capabilities of representation steering, we focus on safety perspectives including bias, harmfulness, hallucination, social behaviors, reasoning, epistemic integrity, and normative judgment. Our benchmark provides modularized building blocks for stateof-the-art steering methods, enabling unified implementation of DIM, ACE, CAA, PCA, and LAT with recent enhancements like conditional steering. Results on Gemma-2-2B, Llama-3.1-8B, and Qwen-2.5-7B reveal that strong steering performance depends critically on pairing of method, model, and specific perspective. For instance, DIM shows consistent effectiveness, but all methods exhibit substantial entanglement - where improving effectiveness on one perspective changes performance in other safety perspectives. Social behaviors show highest vulnerability (reaching degradation as high as 76%), jailbreaking often compromises normative judgment such as commonsense morality (degradation up to 26%), and hallucination steering unpredictably shifts political views, from 21% shifts right to 19% shifts to the political left. Our findings underscore the critical need for understanding steering methods from various safety angles.

Abstract:
The structural organization of language models plays a crucial role in the inference process of large language models (LLMs), occurring both iteratively within a single model for testtime scaling and interactively across multiple models for collaborative intelligence. While current systems primarily facilitate such interaction through natural language, this paper proposes constructing a high-level neural network, termed LMNet, by treating pre-trained LLMs as optimizable nodes connected via continuous dense vectors. Our approach eliminates the unnecessary embedding and de-embedding steps when one LLM connects to another, enabling more efficient information transfer, a fully differentiable optimization path, and exploration of capabilities beyond human heuristics. We place stripped LLMs as vertexes and optimizable seq2seq modules as edges to construct LMNet, a directed graph with a similar structure to MLPs, and perform end-to-end gradient-descent for efficient optimization. As two exemplar applications, we show the proposed architecture can effectively improve LLM’s general intelligence, and customize LLM with limited data. We also provide detailed discussion and analysis about the emergent behavior of this high-level network.

Abstract:
We present a blackbox model-stealing attack that recovers private vision-tokenizer configurations of deployed vision-language models (VLMs), including the visual patch size and input preprocessing pipeline. The key idea is a task-level side channel induced by ViT-style patchification: when a synthetic grid image is aligned with the hidden patch grid, boundary cues are erased at tokenization, causing periodic accuracy drop. By sweeping the grid cell size and measuring these collapses, we infer the patch size; by introducing padding and a consistency-check test, we further identify whether preprocessing is dynamic- or fixed-resolution and recover the target resize resolution. Across open-source Qwen-VL variants and proprietary models including GPT and Claude, we reliably recover tokenizer-related parameters. Finally, we show that such leakage enables preprocessing-aware transfer attacks and model-targeted adversarial manipulation.

Abstract:
Lowcount Positron Emission Tomography (PET) reconstruction is severely hindered by the dissipative nature of prevailing generative models, where the inherent phase-space contraction leads to the numerical extinction (``wash-out'') of weak but diagnostically critical lesion signals. To overcome this geometric limitation, we propose FlowPET, a physics-informed framework that reformulates reconstruction as volume-preserving transport in a symplectic phase space. By parameterizing the posterior dynamics via a Separable Hamiltonian System, our approach guarantees a divergence-free vector field by construction, theoretically immunizing weak signals against probability mass collapse. To steer this conservative flow, we introduce conjugate boundary conditions based on the Range-Null space decomposition of the PET operator; this strictly enforces data consistency in the range space while confining stochastic uncertainty injection to the unobserved null space. We train the model via symplectic flow matching and perform inference using a symplectic leapfrog integrator. Extensive experiments on BrainWeb, clinical pediatric, and UDPET datasets demonstrate that FlowPET not only surpasses state-of-the-art deterministic and stochastic baselines in SSIM and PSNR but, more crucially, exhibits superior recovery of low-contrast lesions. The results confirm that imposing Hamiltonian structural constraints offers a robust geometric safeguard for medical inverse problems in high-noise regimes.

Abstract:
We study the Probably Approximately Correct (PAC) learnability of partial concept classes in the multiclass setting, where the label space can be infinite. While the Natarajan dimension characterizes learnability for finite label spaces, we show it fails when the label space is unbounded. Instead, we prove that the DanielyShalev (DS) dimension provides a characterization of learnability for partial concept classes in the general multiclass setting. Furthermore, our analysis reveals a surprising phenomenon we call the ``Disambiguation Paradox'': disambiguation schemes with simple label space can destroy learnability, while richer labeling may preserves it. We further characterize how the number and structure of disambiguation labels control the induced DS dimension, yielding a trade-off between label complexity and sample complexity.

Abstract:
Data selection during supervised finetuning (SFT) can critically change the behavior of large language models (LLMs). Although existing work has studied the effect of selecting data based on heuristics such as perplexity, difficulty, or length, the reported findings are often inconsistent or context-dependent. In this work, we systematically study the role of data difficulty in fine-tuning from both empirical and theoretical perspectives, and find that there is no universally optimal difficulty level; rather, its effectiveness depends on the dataset size. We show that for a fixed data budget, there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases. To explain this phenomenon, we conduct controlled synthetic experiments that reveal a simple underlying mechanism: the interplay between the (in-distribution) generalization gap and the extrapolation gap. We further support this mechanism through a theoretical analysis using PAC-Bayesian generalization bounds. Overall, our results clarify how data size and difficulty jointly affect the trade-off between generalization and extrapolation in SFT, providing guidance for difficulty-based data selection under certain model and data conditions.

Abstract:
Geometrypreserving dimension reduction is critical for single-cell transcriptomics, where low-dimensional distances should reflect biological divergence between cell types along the transcriptomic manifold. Due to inadequate metrics, the global structure is not sufficiently preserved in the low-dimensional manifold in standard dimension reduction regimes. We model RNA counts as Multinomial samples, leveraging their hierarchical closure property: gene-level counts refine functional gene-group counts via nested Multinomial distributions. Extending Chentsov's Theorem, we show that the Fisher-Rao metric on coarse (gene-group) and fine (gene) statistical manifolds is isometric. Following this isometry property, we propose InfoGlobe, an information-preserving statistical manifold learning framework that projects cells from high-dimensional hyperspheres (full transcriptome) to low-dimensional hyperspheres (functional groups) while preserving information geometry. Embeddings on the low-dimensional sphere explicitly represent Multinomial distributions by functional gene groups. Benchmarks demonstrate superior preservation of local-and-global cell-type geodesic distances, automatic and robust gene-group discovery, nuanced cell subtype resolution without manual feature engineering and natural batch effect mitigation without explicit alignments.

Abstract:
Generations from large language models often fail to reliably conform to logical constraints such as JSON schema. Existing locallyconstrained decoding (LCD) approaches enforce constraints by myopically masking out next tokens, resulting in biased sampling and degradation in downstream performance. Recent work introduces sequential Monte Carlo (SMC) methods to mitigate such sampling biases, but designing effective proposal distributions or potential functions remains a key challenge. In this work, we propose a generic approach to construct proposals and potentials for SMC sampling from p_\textttlm( \cdot \mid \textttconstraint). First, we show that constraints specified as finite automata (FA) can be tensorized for efficient execution on GPUs, which we use to construct globally-constrained decoding (GCD) proposals. In addition, leveraging the fact that a tensorized FA shares the same circuit structure as hidden Markov models (HMM), we circuit-multiply it with an HMM to obtain the probabilistic GCD (P-GCD) proposal that encodes both logical and probabilistic information about the target distribution p_\textttlm( \cdot \mid \textttconstraint). We evaluate (P-)GCD on xLAM, a widely adopted function-calling dataset, and on CommonGen, a keyword-based constrained generation benchmark. Experiments show that compared to LCD proposals, under the same SMC sampling setup, (P-)GCD achieve faster convergence to the target distribution with significantly fewer particles.

Abstract:
Multivariate timeseries forecasting (MTSF) learns from high-dimensional covariates with strong temporal dependence, periodic structure, and cross-variable correlations. While modern pipelines often mitigate non-stationarity through instance-wise normalization and decomposition, these interventions operate at the data level and do not directly control dependence that can emerge among the parameters during training. We study MTSF optimization from a parameter-decorrelation viewpoint. Modeling stochastic optimization as a Markov chain in parameter space and leveraging its stochastic differential equation interpretation, we use the per-step transition-variance induced by gradient noise as a tractable signal for optimization-induced dependence and update uncertainty. This signal can empirically inflate during training; we theoretically show that such inflation can degrade generalization diagnostics. Motivated by this mechanism, we propose transition-variance alignment (TVA), an architecture-agnostic procedure that regulates transition-variance by smoothly gating the step size based on the mismatch between an estimated noise scale and a chosen target. TVA maintains effective transition-variance near a prescribed scale without architectural changes, incurs negligible overhead, and integrates seamlessly with diverse methods. Across real-world multivariate benchmarks, TVA consistently improves forecasting accuracy.

Abstract:
One essential limitation of neural networks is how to enforce (hard) constraints on prediction. We propose a plugin, differentiable layer, which involves a fast implicit (convex) optimization procedure to enforce the general linear constraint. It aims to minimize a divergence between unconstrained and constrained outputs. Connecting to and beyond existing handcrafted layers, we show that our layer degrades to classic layers like Softmax, Sinkhorn and tanh etc. when the corresponding constraint is enforced by KL-divergence minimization. We further show that by replacing the KL-div with a Euclidean distance, a closed-form solution can be derived for highly-efficient constraint enforcing. We evaluate the above two variants of layers, termed as BLCLayer and GLCLayer, with their corresponding neural solver BLCNet and GLCNet with simple MLP/GNN-like backbone. Experiments on linear programming, as well as two real-world problems: partial graph matching and portfolio allocation which involve other discrete constraints.

Abstract:
OneShot Federated Learning (OSFL) addresses extreme communication regimes in which clients interact with the server only once, amplifying the impact of heterogeneous client data distributions. In particular, the interaction of domain shift and label shift across clients induces misaligned feature representations that cannot be corrected through iterative optimization. Existing OSFL methods rely on distillation, server-side generation or ensemble-based aggregation, but assume aligned representations or address domain and label shift separately. We introduce \textscSLOT-Align (Single-round, Learning-free Optimal Transport Alignment), a geometry-aware feature harmonization framework for OSFL. SLOT-Align uses a shared frozen encoder to extract compact feature statistics, constructs a global reference via Bures–Wasserstein barycenters, and aligns local representations using closed-form geodesic optimal transport maps. The method is computationally efficient and can be combined with existing OSFL pipelines relying on frozen encoders without modifying their training procedures. Extensive experiments across multiple benchmarks, pretrained backbones, and OSFL methods show that SLOT-Align consistently improves accuracy and robustness under joint domain and label shift.

Abstract:
The advancement of diffusionbased text-to-music generation has opened new avenues for zero-shot music editing. However, existing methods fail to achieve context-aware editing, which requires altering specific stems while strictly preserving the background accompaniment. This limitation severely hinders practical application, since real-world production necessitates precise manipulation of components within dense mixtures. Our key finding is that, while vanilla cross-attention captures semantic intent, it lacks the spectral resolution to strictly localize targets in dense mixtures, leading to boundary leakage. To resolve this dilemma, we propose Polyphonia, a training-free editing framework with Acoustic-Informed Attention Calibration. Rather than relying solely on diffuse semantic attention, Polyphonia leverages a probabilistic acoustic prior to establish coarse boundaries, enabling background context preserved precise semantic synthesis. For evaluation, we propose PolyEvalPrompts, a standardized prompt set with 1,170 context-aware music editing tasks. Specifically, Polyphonia achieves an increase of 15.5% in target alignment compared to baselines, while maintaining competitive music fidelity and background integrity.

Abstract:
Foundation models for genomics have the potential to revolutionize therapeutic design, yet the optimal architectural choices for modeling the vast and diverse distribution of metagenomic data remain underexplored. In this work, we present the machine learning methodology behind MODEL, a family of metagenomic foundation models scaled up to 28 billion parameters and trained on 9.7 trillion nucleotide tokens. We provide a systematic empirical study of architectural trade-offs between autoregressive Transformers (Llama-style), State-Space Models (Mamba), and Long-convolutional architectures (Hyena) for nucleotide-level modeling. Contrary to recent trends favoring linear-time sequence models for long-range biological data, we demonstrate that the Llama architecture exhibits superior scaling efficiency and semantic retrieval capabilities as the model capacity grows. We derive a set of quality-aware scaling laws for metagenomics, showing how model performance follows predictable power-law behavior across three orders of magnitude in parameters and data. Through extensive benchmarking, spanning unsupervised zero-shot fitness prediction, semantic completion, and gene recovery, we establish a blueprint for scaling biological foundation models and provide empirical evidence demonstrating why Transformer-based architectures define the current frontier.

Abstract:
While existing AIgenerated image detectors report high performance, we identify that this is largely driven by a criticalprediction asymmetry: a bias toward the real class that severely limits sensitivity to generated content, especially under standard post-processing operations such as compression and resizing. We hypothesize that this stems from the model's reliance on spurious features, distracting signals that obscure true generative artifacts. To address this, we propose DEAR (Dissect and Prune), which leverages inpainted images to identify and prune these interfering components. Specifically, we find that features strongly aligned to either inpainted or non-inpainted regions are less robust to post-processing. By measuring the alignment between channel activations and inpaint masks, DEAR removes features at both extremes, retaining only those that capture genuine generative artifacts. Experimental results demonstrate that our approach significantly enhances robustness against unseen generators and post-processing, effectively mitigating the prediction asymmetry.

Abstract:
Prompt learning has recently emerged as a novel, parameterefficient paradigm to tackle the missing modalities challenge. However, existing prompting methods often overlook the internal structural information within prompt vectors, limiting their effectiveness in guiding frozen backbone models under diverse missing modality scenarios. To address this limitation, we propose a Structured Prompt Refining (SPR) network that refines the internal structure of prompt vectors across multiple dimensions: (1) a Global Interaction Fusion Module captures bidirectional interactions across prompt layers, thereby mitigating sub-optimal adaptation from inconsistent guidance under missing modalities, (2) a Local Feature Refinement Module structures adjacent prompt vectors into coherent semantic units, leveraging local contextual relationships to maintain semantic integrity during modality absence, and (3) a Channel Feature Selection Module uses point-wise gating to adaptively suppress noise and enhance critical channels based on the specific missing modality. Using only 0.8% trainable parameters, SPR achieves significant improvements on three mainstream multimodal classification datasets. Notably, it surpasses state-of-the-art by 3.8% in F1-Macro on the MM-IMDB dataset, even at a 90% modality missing rate. Extensive experiments and in-depth ablations validate SPR's effectiveness and robustness under various missing conditions.

Abstract:
Data assimilation (DA) addresses the problem of sequentially estimating the state of a dynamical system from noisy and incomplete observations. In this work, we employ a diffusion model as a world model to simulate and predict the system’s dynamics. Recently, scorebased diffusion models have learned global diffusion priors that effectively model (stochastic) dynamics, revealing strong potential for data assimilation. In this paper, we investigate how information from noisy observations can be incorporated to enable continuous correction and refinement of the predicted system state when using a diffusion prior. Motivated by particle filtering methods, we represent the posterior distribution using a set of particles. After receiving noisy observations, the diffusion model is guided using the observation likelihood to steer the generation process toward observation-consistent states. Nevertheless, such guidance does not guarantee sampling from the true posterior. We therefore employ a Sequential Monte Carlo approach over the diffusion trajectory, viewed as a path measure, to reweight and resample particles, thereby correcting the generation process and ensuring convergence toward the desired posterior distribution. This leads to an unbiased particle filtering method that rigorously fuses observational data with diffusion model simulations.

Abstract:
Adapting generative foundation models, in particular diffusion and flow models, to optimize given reward functions (e.g., binding affinity) while satisfying constraints (e.g., molecular synthesizability) is fundamental for their adoption in realworld scientific discovery applications such as molecular design or protein engineering. While recent works have introduced scalable methods for reward-guided fine-tuning of such models via reinforcement learning and control schemes, it remains an open problem how to algorithmically trade-off reward maximization and constraint satisfaction in a reliable and predictable manner. Motivated by this challenge, we first present a rigorous framework forConstrained Generative Optimization, which brings an optimization viewpoint to the introduced adaptation problem and retrieves the relevant task of constrained generation as a sub-case. Then, we introduce Constrained Flow Optimization (CFO), an algorithm that automatically and provably balances reward maximization and constraint satisfaction by reducing the original problem to progressive fine-tuning via established, scalable methods. We provide convergence guarantees for constrained generative optimization and constrained generation via CFO. Ultimately, we present an experimental evaluation of CFO on both synthetic, yet illustrative, settings, and a molecular design task.

Abstract:
Improving the reasoning capabilities of large language models (LLMs) typically relies either on the model's ability to sample a correct solution to be reinforced or the existence of a stronger model able to solve the problem. However, many difficult problems remain intractable for even current frontier models, preventing the extraction of valid training signals. A promising alternative is to leverage highquality expert human solutions, yet naive imitation of this data fails because it is fundamentally out-of-distribution: expert solutions are typically didactic, containing implicit reasoning gaps intended for human readers rather than computational models. Furthermore, high-quality expert solutions are expensive, necessitating generalizable sample-efficient training methods. We propose Distribution Aligned Imitation Learning (DAIL), a two-step method that bridges the distributional gap by first transforming expert solutions into detailed, in-distribution reasoning traces and then applying a contrastive objective to focus learning on expert insights and methodologies. We find that DAIL can leverage fewer than 1000 high-quality expert solutions to achieve 10–25% pass@k gains on Qwen2.5-Instruct and Qwen3 models, improve reasoning efficiency by 2× to 4×, and enable out-of-domain generalization.

Abstract:
Bayesian Optimization is widely used for expensive blackbox optimization, yet its success often depends on choosing a kernel that matches the objective’s unknown structure. In this work, we propose ALAS, a flexible Gaussian Process kernel family built from symmetric \alpha-stable spectral components. By learning the stability parameter \alpha, ALAS adapts its effective smoothness from data, capturing both smooth trends and sharp irregularities. We present two parameterizations: ALAS, a single stationary component with joint spectral modulation, and ALAS-Sep, a separable variant that learns dimension-wise tail behavior to improve robustness on approximately decomposable objectives. Experiments on standard benchmarks and real-world surrogates demonstrate strong and robust performance across diverse settings.

Abstract:
Amodal instance segmentation is hindered by the scarcity of scalable and transferable annotations. We introduce MaviGen, an automated 3D retail scene modeling and rendering framework that generates photorealistic multiview images with complete amodal masks. Building on MaviGen, we present the IRAIS dataset, a sim-to-real benchmark comprising a large-scale synthetic multi-view set (3D-IRAIS) and a human-annotated real image set (Real-IRAIS), both sharing unified label definitions and evaluation protocols to facilitate rigorous transfer studies. We propose EUREKA, an encoder-only, query-efficient network for amodal instance segmentation that performs full-image multi-task inference via unified amodal/visible queries and dual mask heads. The dual heads enable mutual supervision between complete and visible masks, while the conditional masked self-attention mechanism further strengthens occlusion reasoning. Experiments establish strong baselines on IRAIS and achieve state-of-the-art performance on D2SA and COCOA-cls, demonstrating substantial improvements in sim-to-real transfer.

Abstract:
Recent work on Neural Networkbased methods for nonlinear control use Lyapunov Functions to obtain controllers with guarantees of stability. However, Lyapunov-based methods are fundamentally limited: they cannot be used for smooth blending with formal Region of Attraction (RoA) expansion guarantees, and also fail to certify stability when unstable equilibria or saddle points are present. Density functions provide an alternate stability certificate, and address these limitations by certifying almost everywhere stability, and enable smooth blending of controllers. Learning valid density certificates is challenging due to integrability constraints, and the effect of density-based blending controllers on RoAs is not well understood. In this work, we provide the first guarantee that controllers blended with density functions yield RoAs containing the union of the RoAs achieved by the constituent controllers. Then, we propose a novel exponential characterization of density functions that provably satisfies the integrability condition, and introduce Neural Control Density Functions (NCDFs), that leverage this new parameterization. We also extend NCDFs for synthesizing safe-stable controllers by combining NCDFs with control barrier functions (NCDF-CBFs). Our experiments show that blended controllers obtain superior RoAs to state-of-the-art methods like Neural Lyapunov Control and Sum-of-Squares based techniques.

Abstract:
Multimodal representation models such as CLIP, SigLIP, and their variants are widely used to represent data across multiple modalities in modern learning systems. While these models are commonly evaluated through downstream performance, the analysis of their structural differences in how multi-modal representations group data across modalities remains inadequate. In this work, we aim to identify modality pairs and sample subsets that induce divergent grouping behavior between two candidate embeddings. We propose \emphKernel Optimization for Discrepancy Analysis (KODA), a kernel method that constructs unified multi-modal kernels via modality-wise kernel multiplication and formulates discrepancy identification as an optimization problem that seeks components with high coherence under one embedding while constraining coherence under another. This formulation provides interpretable discrepancy structures associated with specific modality interactions. We establish finite-sample guarantees characterizing the effective reference sample size required for reliable analysis. To enable scalable computation in multi-modal settings, we develop a randomized low-dimensional approximation of joint kernels using random projections, including Random Fourier Features for shift-invariant kernels. Our empirical results indicate that KODA can identify consistent discrepancy structures across modalities.

Abstract:
Lowrank SVD-based compression offers a powerful strategy to reduce the computational costs of Large language models (LLMs); however, existing methods commonly encounter two recurring obstacles: (i) global rank allocation, where uncalibrated error proxies fail to account for complex error propagation, and (ii) decomposition quality, where Fisher-based estimators suffer from severe rank collapse. In this work, we address these limitations by presenting Layer-wise Error Modeling Search (LEMS) and KFAC-SVD. LEMS advances rank allocation by introducing a layer-wise error surrogate that integrates both local and global layer importance alongside a propagation bias, allowing us to determine global rank configurations efficiently as an Integer Linear Program (ILP). Simultaneously, KFAC-SVD improves decomposition quality by utilizing token-wise statistics, preventing the rank deficiency observed in prior Fisher-based SVD. We demonstrate across Mistral, Qwen3, and Llama-3 families that KFAC-SVD achieves an average perplexity improvements of 15%, while LEMS consistently outperforms existing search strategies, delivering significant zero-shot accuracy improvements of up to 4.7 p.p. that generalize to scales of 70B parameters. Code is made available in the Supplement.

Abstract:
The rapid advancement of AIdriven video generation has transformed content creation, while simultaneously increasing the risk of misinformation through localized manipulations in long-form videos. Existing video forensic methods predominantly operate on short, independent clips, and thus fail to capture realistic scenarios where AI-generated content is sparsely embedded within otherwise authentic footage. To bridge this gap, we formulate the task of Temporal AI-Generated Segment Localization and Explanation, which targets authenticity detection, temporal localization, and interpretable analysis of manipulated segments in untrimmed long videos. We further introduce TASLE, a large-scale benchmark comprising 12,472 untrimmed videos with diverse manipulation patterns and rich annotation signals, including temporal boundaries, authenticity labels, and segment-level rationales. In addition, we propose MSLoc, a coarse-to-fine forensic baseline that combines a boundary-sensitive proposal generation module for efficient long-video scanning with an MLLM-based refinement module for precise boundary localization and interpretable reasoning. Experiments validate the effectiveness of the proposed baseline, highlighting the importance of segment-level explainable forensics for long-form AI-generated video analysis. Dataset and code will be made publicly available.

Abstract:
While Vision Transformers (ViTs) offer strong global modeling, their quadratic computational cost limits utility in latencysensitive applications like person re-identification (ReID). Existing compression strategies, such as token pruning or generic merging, typically rely on coarse-grained criteria tailored for image classification. In fine-grained retrieval, these approaches often discard or smooth out subtle but discriminative local details. To resolve this, we propose SRE-Merge, a training-free framework designed for discriminative token compression. SRE-Merge injects spatial priors into the merging process through three mechanisms: (i) Spatial-Entropy Saliency Assessment (SES-Assess), which quantifies token importance as Spatial-Entropic Mass (SE-Mass) by coupling spatial structure with local attention entropy; (ii) Hybrid Context-Affinity Matching (HCA-Match), which guides precise pair selection by combining feature similarity with mass-derived context; and (iii) Energy-Preserving Weighted Fusion (EPW-Fuse), which incorporates SE-Mass weighting to counteract feature variance reduction. Extensive experiments on standard benchmarks show that SRE-Merge reduces GFLOPs of the base ViT model by about 24% while maintaining state-of-the-art accuracy, establishing a superior accuracy-efficiency trade-off.

Abstract:
Panoramic video depth estimation is pivotal for applications such as Virtual Reality and World Models. However, advancements in this field are impeded by two primary obstacles: the scarcity of largescale training data and the unique spatiotemporal challenges of Equirectangular Projection (ERP), which hinder the direct transfer of perspective models. In this paper, we first presentPanoCARLA, a large-scale synthetic RGB-D panoramic video dataset, featuring natural motion trajectories and drone-like roaming perspectives. Building on this foundation, we proposePVDepth, an end-to-end framework adapted from perspective video depth models. To tackle ERP-specific geometric distortions and consequent non-linear temporal dynamics, we introduce two core mechanisms: (1) AProgressive Sphere-aware Noise Initializationstrategy that anneals the noise distribution from planar to spherical, guiding the model to adapt to non-uniform information density; and (2) ACube-rectified Temporal Modelingmodule that incorporates an auxiliary cubemap temporal branch to rectify non-linear temporal dynamics in the ERP domain. Extensive experiments demonstrate that PVDepth achieves superior performance, generating geometrically accurate and temporally consistent depth sequences.Code and data will be released.

Abstract:
While Large language models (LLMs) have strong abilities, they generally rely on finetuning to supplement downstream task-specific knowledge. Due to the prohibitive memory overhead of full fine-tuning (FT), existing parameter-efficient fine-tuning techniques, e.g., LoRA and Adapters, update parameters only in low-rank or restricted subspaces. However, they fail to approximate FT---the performative fine-tuner---and risk performance degradation in tough tasks. Therefore, we naturally raise aLow-cost Full Fine-tuningquestion: Can we approach standard full fine-tuning in theory, yet with much lower costs in practice? Our key insight is that performing selective updates at each step can, theoretically, recover FT asymptotically, while being cost-effective and ignoring no parameter direction. This motivates a new general fine-tuning paradigm (calledThink-Touch): we first predict potentials of parameter groups (think) and then update only the selected (touch) in one step. Theoretically, we show that under a very weak sufficient condition---divergence of the cumulative coverage of the expected gradient norm---any selection strategy can converge in the full-parameter space to a stationary point at which the FT admits no further first-order improvement. Besides, we further derive the general convergence rate for our paradigm and identify a post-hoc greedy strategy that is rate-optimal. Unfortunately, this strategy cannot be directly applied in practice due to its reliance on full and accurate gradient information. Thus, we propose a bandit-based method to online approximate this ideal strategy in the long run with a rigorous regret guarantee. Extensive experimental results on various tasks demonstrate the potential of our paradigm, including much lower space overheads against FT and better performance than LoRAs.

Abstract:
Creating and editing slides is a rich, multimodal activity that is ubiquitous in professional and educational settings, making it an ideal testbed for realworld computer-use agents. Microsoft PowerPoint is among the most widely adopted and feature-rich environments for presentation creation. We introduce PPT-Eval, a benchmark of 120 diverse PowerPoint tasks across 12 files that cover both content creation and presentation editing scenarios, organized by difficulty. A central challenge in this domain is evaluation: tasks are complex, multimodal, and often admit many valid solutions. Moreover, today’s agents frequently make only partial progress, which binary success metrics fail to capture. To address this, we design a robust evaluation framework to help create task-specific rubrics for PowerPoint tasks, taking inspiration from and building on past works for rubric-based evaluation. These rubrics award partial credit for intermediate steps, penalize unnecessary changes and poor aesthetics, and provide natural language feedback. This nuanced approach proves highly effective, achieving a Kendall's \tau_b correlation of 0.77 with human judgments. We find that existing frontier agents still struggle with solving PowerPoint tasks, with strong models like Claude-4.5-Opus achieving only a 43% success rate and an average partial score of 59%.

Abstract:
While large visionlanguage models (LVLMs) have shown remarkable adaptability to downstream applications, their fine-tuning process remains susceptible to bias under long-tailed data. Compared to zero-shot scenarios, fine-tuning LVLMs on imbalanced datasets often yields limited performance improvements on tail data. This is because LVLMs tend to rapidly overfit the head data at an early fine-tuning stage, thereby impairing the learning of the tail data while simultaneously failing to exploit their quantitative advantage. Furthermore, in many downstream LVLM scenarios, quantified long-tailed prior knowledge of data distribution is often unavailable, significantly limiting the applicability of traditional long-tailed techniques that rely heavily on such information. To address these issues, we propose the Adaptive Token Refinement (ATR), a novel framework that adaptively refines the learning process of LVLMs under long-tailed data. Specifically, ATR consists of two token-level operations applied to output and input tokens, respectively: 1) a bounded adaptive loss that dynamically filters and reweights output tokens to mitigate overfitting on head data, and 2) a visual token mask strategy that augments the probability paths of input tokens to enhance long-tailed performance. Extensive experiments across multiple benchmarks demonstrate that ATR consistently enhance both performance and generalization for long-tailed LVLMs fine-tuning in a distribution-agnostic manner.

Abstract:
The remarkable capabilities of large language models (LLMs) are often undermined by their instability. Even subtle and semantically irrelevant changes in prompts can cause dramatic fluctuations in performance, a phenomenon known as prompt sensitivity. Previous studies typically evaluate prompt sensitivity by comparing the LLM's final outputs when prompts change. However, such coarsegrained metrics fail to explain the internal reasons for prompt sensitivity. In this paper, we introduce interactions as a fine-grained tool to analyze prompt sensitivity of LLMs. Specifically, we decompose the output score of the LLM into a set of interactions. Each interaction represents a nonlinear relationship involving a set of input variables. We discover that subtle changes to prompts can trigger severe instability in interactions, even when the outputs of the LLM remain the same. To this end, we propose an Interaction-based Prompt Sensitivity (IPS) metric by quantifying changes in interactions when we introduce subtle changes to prompts. We apply the IPS metric to 50 open-source LLMs and uncover four factors that reduce the prompt sensitivity of LLMs, including supervised fine-tuning, increased model scales, dense architectures, and few-shot learning. More crucially, we discover a common mechanism by which these four factors reduce prompt sensitivity: all four factors tend to reduce the prompt sensitivity of low-order interactions (i.e., interactions involving few input variables).

Abstract:
Learning humanoid control from video provides a scalable alternative to the scarcity of highfidelity robot data. Existing methods, however, often rely on curated datasets and treat video as passive kinematic priors. They fail to capture dynamic humanoid interactions with the environment, which are essential for real-world deployment. To address this, we propose TRansferable Interaction Primitives (TRIP), a framework designed to extract and ground interactions from unstructured, unlabeled game videos for physical controllers. TRIP explicitly models dependencies between motion dynamics and environmental context via a discrete library of interaction-based action primitives. To bridge the reality gap, we introduce a shared context latent space that aligns implicit video-domain features with functional target-domain observations, enabling the seamless transfer of video-mined strategies to reinforcement learning policies. Our experiments on complex terrain navigation demonstrate that TRIP achieves significant improvements in task performance, sample efficiency, and robustness.

Abstract:
We study firstorder methods for solving monotone variational inequalities arising in min-max optimization. Classical approaches such as the extragradient method rely on two gradient queries per iteration, which limits their analysis and applicability in the online and stochastic settings. We propose a family of Generalized Optimistic Methods with Anchoring (GOMA), which combine two time-scale optimistic updates with an anchoring term inspired by Halpern iteration. In particular, we show that for monotone Lipschitz operators, GOMA achieves an accelerated last-iterate convergence rate of \mathcalO(1/k^2) in the squared gradient norm which is optimal. We also show that in stochastic games where classical methods, such as the extragradient and optimistic method, fail, GOMA can converge. Theoretically, we show that it has a last-iterate convergence rate of \mathcalO(1/\sqrtk) for monotone Lipschitz operators in stochastic regimes with linearly increasing minibatches.

Abstract:
Federated MultiObjective Learning (FMOL) enables collaborative training of conflicting objectives but faces a compounded challenge: the recursive coupling between intra-task client drift and inter-task aggregation bias. We propose DREAM, a unified framework that jointly corrects these two coupled error sources through drift-aware control variates and momentum-smoothed local updates. On the server side, DREAM formulates multi-objective aggregation as a regularized quadratic program parameterized by a task correction matrix, which provides a generalized formulation that can flexibly adapt to scalarization, prioritization, and gradient manipulation strategies. Theoretically, we establish a linear speedup convergence rate of \mathcalO(1/\sqrtNT) for non-convex objectives. We further provide theoretical guarantees for the conflict-avoidant descent direction. In the strongly convex setting, DREAM achieves convergence in weighted sub-optimality and admits a unified Lyapunov analysis showing linear convergence to a regularization-dependent neighborhood. Numerical experiments validate the superior performance and effectiveness of DREAM in practice.

Abstract:
This paper studies active regression for singleindex models under general \ell_p-loss with an unknown 1-Lipschitz link function f, formulated as \min_f,x \Vert f(Ax)-b\Vert_p^p with full access to A but coordinate-query access to b. Prior work established upper bounds for known link functions for all p\geq 1 and for unknown link functions only in the p=2 case, together with lower bounds for p\leq 2. This work addresses the more challenging setting of unknown link functions and general p \geq 1. A non-adaptive sampling algorithm is presented that achieves a (1+\epsilon)-approximation using O(d^p/2\vee 1/\epsilon^p\vee 2\textpoly\log(n/\epsilon)) queries. Nearly tight lower bounds are also established for non-adaptive queries when p>2. These results close much of the remaining gap in active \ell_p regression for single-index models.

Abstract:
Robust reinforcement learning (RL) under the averagereward criterion is essential for long-term decision-making, particularly when the environment may differ from its training dynamics. However, most existing studies focus on model-based settings and provide only asymptotic guarantees, hindering their principled understanding and practical deployment, especially in data-limited scenarios. We aim to close this gap by proposing a model-free algorithm, Robust Halpern Iteration (RHI). We first design our algorithm based on a black-box sampling oracle, which can estimate the worst-case performance accurately. We then derive the finite sample complexity of RHI under the generative model setting, assuming the sampling oracle. To concretely design such an oracle, we propose a K-order multi-level Monte-Carlo estimator, which is shown to have a lower bias compared to prior methods. We further instantiate our design for multiple uncertainty models, including KL and \chi^2 divergence sets, and show that our RHI algorithm achieves an \varepsilon-optimal robust policy with a sample complexity of \tilde\mathcalO\left( \fracSA\mathcalH^2\varepsilon^(2+o(1))\right), where S,A are the number of states and actions, and \mathcalH is the robust optimal span. Our result asymptotically matches the best complexity in robust average reward RL.

Abstract:
Relational deep learning (RDL) converts relational databases (RDBs) into heterogeneous graphs, but graphs derived directly from database schemas are often not well suited for how graph neural networks (GNNs) perform relational reasoning. We study what makes a relational graph suitable for deep learning and show that schemaderived graphs suffer from two systematic failures: information overload and semantic fragmentation. Through an empirical analysis on real-world databases, we find that effective graphs arise from a task-dependent balance between removing task-irrelevant structure and injecting task-aligned relational connectivity. Filtering exhibits a non-monotonic effect on performance, while structural injection is beneficial only when it reflects the logic of the downstream task. Based on these findings, we develop an end-to-end structural optimizer that applies both operations to adapt relational graphs automatically. Across 23 tasks spanning classification, regression, and recommendation, the optimized graphs consistently improve accuracy while often reducing inference cost. Code and data are available at https://anonymous.4open.science/r/StructuralOptimizerRDL-0F74/.

Abstract:
Data plays an invaluable role in largescale ML training pipelines. Multiple factors, including the need to incentivize the creation of high-quality data and efforts to compensate creative data work, have led to increased interest in data \em pricing. Data pricing mechanisms seek to establish a market where data providers are compensated based (in part) on the value of their data to the data buyer, e.g., frontier AI labs. However, assessing the exact value that each provider's data adds to the data buyer's objective requires repeated re-training, which is infeasible in practice. Our work studies \em data pricing under compute constraints. In our setting, data buyers cannot make data acquisition decisions optimally due to limited compute. Inspired by existing practice in the field of data selection, we propose a model for this problem called ``pricing with an attribution oracle,'' and provide a theoretical and empirical analysis of compute-efficient acquisition and pricing.

Abstract:
Planning collaboration strategies for multiagent embodied systems remains a core challenge for LLM-based planners, which often fail to capture the physical and coordination constraints of realworld environments. To address this, we present EvoCF, an agentic memory-driven evolutionary counterfactual planning framework for discovering improved multi-agent collaboration strategies through counterfactual plan generation and evaluation. First, we propose a symbolic constraint inductor that induces reusable symbolic constraints from failures, forming an evolving rule library. Then, we propose an evolutionary counterfactual plan generator that systematically explores semantically consistent plan variants through rule-conditioned mutations, enabling robust collaboration strategies beyond short-sighted one-shot LLM plans. Finally, we design an agentic memory-grounded evaluator that ranks candidate plans using retrieval-augmented evidence, producing interpretable, constraint-aware selections. Across multi-agent embodied simulation benchmarks, EvoCF consistently discovers more robust and executable plans compared to baseline approaches. Our results demonstrate that grounding multi-agent planning in agentic memory and counterfactual reasoning significantly enhances both effectiveness and robustness.

Abstract:
Photoplethysmography (PPG), a noninvasive measure of changes in blood volume, is widely used in both wearable devices and clinical settings. Recent PPG foundation models either use open-source ICU datasets with pretraining paradigms that require high-quality data and thus complicate generalization to field-like data, or use closed-source field-like PPG data. In contrast, we propose a PPG foundation model that does not require high-quality or field-like pretraining data, and instead leverages accompanying electrocardiogram and respiratory signals in ICU datasets to select contrastive samples during pretraining. Our approach allows the model to retain and learn from noisy PPG segments, improving robustness at inference. Our model, pretrained on 3x fewer subjects than existing state-of-the-art approaches, achieves performance improvements on 14 out of 15 diverse downstream tasks, including field-like daily activity and heart rate prediction. Our results demonstrate that multimodal supervision can leverage the multimodal nature of clinical data to develop robust PPG foundation models that generalize better to consumer-level data.

Abstract:
Recently, a line of works can generate impressive 3D objects from a single image, but they are limited by restricted representation resolution, making them unsuitable for 3D scene generation. In this work, we introduce \name, a novel method for highquality 3D scene generation based on hierarchical voxel enhancement framework. Specifically, given a single scene image as input, we first produce a coarse initial scene, then introduce image segmentation and attention-based retrieval to align 2D image components with 3D scene components. Subsequently, we organize these scene relations into a hierarchical component tree, where nodes closer to the leaves denote finer-grained components. Finally, we propose a voxel super-resolution model that generates refined voxels for the target instance while maintaining strong consistency with the coarse voxels. Equipped with this model, we perform coarse-to-fine hierarchical super-resolution on images and voxels for each component, producing a high-resolution and high-quality 3D scene. Extensive experiments demonstrate that our method significantly outperforms previous approaches, achieving state-of-the-art performance.

Abstract:
Designing RNA sequences that reliably fold into specific secondary structures is essential for understanding their biological functions but remains a challenging computational problem. We propose CocoRNA, a cooperative multiagent reinforcement learning framework for RNA inverse design. CocoRNA simplifies the design task by decomposing it into smaller sub-problems, each solved collaboratively by multiple agents. This approach reduces the complexity of the problem and improves the exploration of design policies. During training, a centralized critic uses global structural information to guide the agents, enabling them to jointly optimize their design strategies. As a result, CocoRNA learns high-quality RNA design policies that generalize effectively to unseen structures without additional training. Experiments on the Rfam dataset demonstrate that CocoRNA substantially outperforms state-of-the-art methods in both success rate and design speed. Further experiments on other biological sequence design tasks highlight the effectiveness and broad potential of CocoRNA for complex design tasks.

Abstract:
We present Zeus, a unified tuningfree Time Series Foundation Model (TSFM) that delivers superior performance across diverse analysis tasks without any task-specific fine-tuning. Unlike prior studies that primarily focus on zero-shot forecasting but require task-specific tuning for other tasks, Zeus bridges this gap by addressing two fundamental challenges in multi-task generalization. First, to reconcile point-level granularity with long-sequence scalability, Zeus incorporates a multi-scale Transformer featuring point-wise tokenization and a U-shaped hierarchy, effectively balancing fine-grained fidelity with computational efficiency. Second, to accommodate varying inductive biases across different tasks, Zeus introduces Multi-Objective Temporal Masking (MOTM), a unified strategy that supports heterogeneous tasks (e.g., extrapolation, interpolation, and global abstraction) within a single framework. Extensive experiments across five representative tasks demonstrate that Zeus consistently achieves competitive results in tuning-free settings, underscoring its potential as a general-purpose TSFM.

Abstract:
Neural scaling laws describe predictable powerlaw relationships between model size, dataset size, compute cost, and performance. While these laws are applied to improve the performance of modern foundation models, the mechanisms underpinning them are less understood, in part due to the absence of scalable analysis tools. To this end, we introduce a framework for efficiently measuring the alignment between the empirical neural tangent kernel (eNTK) and loss residuals. Applying this framework to scaling experiments reveals a consistent pattern: larger and better-performing models exhibitlowerkernel alignment throughout training. We interpret thisunalignmentthrough the lens ofspectral reach: the capacity of a model to learn from progressively weaker spectral modes in its eNTK. This interpretation allows us to explain why larger models achieve lower losses: they sustain learning on weaker signals that smaller models cannot access. We further demonstrate that feature learning improves spectral reach and provide a mechanistic explanation of how this occurs, suggesting practical avenues for performance improvement.

Abstract:
Polypharmacology provides a powerful strategy for treating complex diseases, but identifying molecules that simultaneously satisfy coupled constraints across multiple biological targets remains difficult. Existing methods typically model protein pockets in isolation and struggle to jointly account for multiple heterogeneous binding sites when designing a single shared ligand. To address these limitations, we propose a pocketstructure-centric generative framework for polypharmacology. This framework introduces a novel protein topological representation that selectively masks ligand-irrelevant residues while explicitly modeling backbone folding geometry and inter-residue spatial proximity within binding pockets. In addition, structural representations are jointly fused with amino acid and nucleotide sequences to capture their complementary information across targets. Experiments on COVID-19, schizophrenia, and tumor targets show that this framework generates valid candidates with significantly improved binding affinities compared to state-of-the-art methods.

Abstract:
This paper presents an optimized approach to enhance the computation of HirschfeldGebelein-Rényi (HGR) maximal correlation, addressing computational and efficiency challenges in large-scale neural networks and multimodal learning. The UniFast HGR framework introduces three key innovations: replacing covariance with cosine similarity to eliminate matrix inversion, removing the diagonal of the correlation matrix to mitigate self-correlation bias, and simplifying variance constraints via \ell_2-normalization. These contributions reduce computational complexity from O(K^3) to O(m^2K) while improving accuracy and stability. The framework scales effectively across diverse multimodal applications. Additionally, the OptFast variant minimizes normalization steps, achieving efficiency comparable to dot-product operations without sacrificing precision. Experimental evaluations across benchmark datasets validate the framework's ability to balance computational efficiency with accuracy, establishing it as an effective solution for addressing contemporary deep learning challenges.

Abstract:
Applying machine learning techniques to solving longstanding mathematical conjectures can be particularly challenging due to their extreme reward sparsity. As an illustrative example, we consider Kalai's algebraic Hirsch conjecture and recast the construction of its counterexamples as a sparse-reward reinforcement learning problem on graphs. We propose a constrained options-based HRL framework with an equivariant graph neural network policy, which allows us to learn useful temporal abstractions for this task. We evaluate our approach over a wide range of degrees and demonstrate that it consistently outperforms classical RL algorithms as well as greedy search. By exploiting the hierarchical structure of the problem, we effectively provide a first-of-its-kind application of HRL to a problem in commutative algebra.

Abstract:
Large language model (LLM) exists a subset of attention heads that are highly responsible for longcontext processing. Existing work has identified different long-context heads in models, but their detection methods mainly rely on model inference on actual long texts and do not analyze the inherent properties of the head parameters. In this paper, we use kernel methods to analyze staticfrequency kernelsformed by different rotation frequency components of attention heads, and we design a Long-context Potential Score (LPS) to measure the potential of attention heads in processing long contexts. Kernels of heads with high LPS exhibit concentrated low-frequency energy and low effective rank, which allow them to effectively capture highly specialized information from distant contexts. Experiments and analysis on long-context tasks and model behaviors show that the LPS metrics can well reflect the actual capability of heads on long contexts. Furthermore, by simply amplifying low-frequency kernels of heads with high retrieval potential, we can further improve model's performance on long-context tasks. Our metrics and head enhancement methods are fully static and offline, and they can be quickly conducted under low-resource constraints.

Abstract:
While existing multiagent systems (MAS) can handle complex problems by enabling collaboration among multiple agents, they are often highly task-specific, relying on manually crafted agent roles and interaction prompts, which leads to increased architectural complexity and limited reusability across tasks. Moreover, most MAS communicate primarily through natural language, making them vulnerable to error accumulation and instability in long-context, multi-stage interactions within internal agent histories. In this work, we propose Agent Primitives, a set of reusable latent building blocks for LLM-based MAS. Inspired by neural network design, where complex models are built from reusable components, we observe that many existing MAS architectures can be decomposed into a small number of recurring internal computation patterns. Based on this observation, we instantiate three primitives: Review, Voting and Selection, and Planning and Execution. All primitives communicate internally via key–value (KV) cache, which improves both robustness and efficiency by mitigating information degradation across multi-stage interactions. To enable automatic system construction, an Organizer agent selects and composes primitives for each query, guided by a lightweight knowledge pool of previously successful configurations, forming a primitive-based MAS. Experiments show that primitives-based MAS improve average accuracy by 12.0–16.5% over single-agent baselines, reduce token usage and inference latency by approximately 3×–4× compared to text-based MAS, while incurring only 1.3×–1.6× overhead relative to single-agent inference and providing more stable performance across model backbones.

Abstract:
This paper addresses the problem of linear seriation: recovering the intrinsic order of noisy samples drawn from an unknown onedimensional manifold embedded in a higher-dimensional space. We propose a multi-stage approach that first robustly estimates local tangent directions using Principal Component Analysis (PCA) on neighborhoods, establishing theoretical consistency for these local estimates. Global orientation consistency of these tangents is then achieved through a spectral relaxation of a pairwise alignment objective. Finally, a globally consistent 1D embedding is computed by solving a carefully formulated linear system (or equivalently, a spectral problem on a derived Laplacian) that aligns the embedding with the oriented local projections. This method effectively leverages local geometric information while ensuring global coherence, producing an ordering robust to noise, curvature, and initial data rotation. We demonstrate its performance on simulated manifold data and discuss the theoretical underpinnings of its core components.

Abstract:
Visual foundation models have achieved remarkable progress in scale and versatility, yet understanding the 3D world remains a fundamental challenge. While 2D images contain cues about 3D structure that humans readily interpret, deep models often fail to exploit them, underperforming on tasks such as multiview semantic consistencycrucial for applications including robotics and autonomous driving. We propose a self-supervised approach to enhance the 3D understanding of vision foundation models by (i) introducing a temporal nearest-neighbor consistency loss that finds corresponding points across video frames and enforces consistency between their nearest neighbors, (ii) incorporating reference-guided ordering that requires patch-level features to be not only expressive but also consistently aligned, and (iii) constructing a mixture of video datasets tailored to these objectives, thereby leveraging rich 3D information. Our method, 3DPoV, achieves state-of-the-art performance in keypoint matching under viewpoint variation, as well as in depth and surface normal estimation, and consistently improves a diverse set of backbones, including DINOv3.

Abstract:
Local posthoc explanations are widely used to understand black-box models on tabular data, with Local Interpretable Model-agnostic Explanations (LIME) being a popular approach. LIME approximates a black-box model using a sparse linear surrogate in a local neighborhood, implicitly assuming feature-wise linear homogeneity. However, this assumption often fails when local feature effects exhibit heterogeneous or nonlinear behaviors, resulting in unfaithful and unstable explanations. Moreover, LIME relies on a decoupled feature selection procedure that is not aligned with the surrogate modeling objective, further exacerbating instability under local sampling. To address these limitations, we propose PL-LIME, a two-stage sparse local explanation framework that ensures objective consistency across stages. PL-LIME models feature-wise local effects using instance-anchored piecewise linear functions, providing a minimal yet principled extension beyond linear surrogates under a fixed explanation budget. Sparsity is enforced through a decoupled nonnegative shrinkage procedure that directly scales the estimated local effects, improving stability while preserving interpretability. Experiments on synthetic and real-world datasets demonstrate that PL-LIME achieves higher local fidelity and stability, and provides more reliable local explanations that capture finer-grained local effect structures than LIME.

Abstract:
Responsible investing aims to generate positive impact across Environment (E), Society (S), and Governance (G), and rating companies along these dimensions is now widespread, making ESG scores highly popular. Allocating retail capital with sustainability in mind could be transformational, yet it remains unclear how individual investors can do so in practice. Current ESG solutions cannot model highdimensional, multi-modal time series capturing the joint evolution of ESG risks, financial returns, news, and sentiment, even though this domain requires jointly reasoning over distinct numerical signals where both numerical proximity and semantic type must be preserved. To bridge this gap, we introduce a novel domain-aware representation learning framework that learns geometry-preserving representations for heterogeneous time series using value-aware tokens with block-wise orthogonal embeddings. To capture trajectory-level structure, we introduce FACET tokens and train the model using a geometry-preserving loss. The resulting model jointly learns to forecast future values and to organize entities in a representation space that reflects their temporal evolution. Trained on ESG, returns, news, and sentiment, the domain-aware LLM learns a representation space that enables accurate ESG forecasting, trajectory-based grouping, and latent-space search for superior asset selection and downstream application like portfolio rebalancing

Abstract:
In multitarget regression and multi-class classification, uncertainty is inherently multivariate: prediction regions must capture joint dependencies across correlated outputs. Conformal prediction provides distribution-free guarantees, yet extending it to vector-valued outputs remains challenging—scalar aggregation discards geometric structure, while optimal transport (OT) approaches are computationally demanding and sensitive to outliers. We introduce two conformal methods based on geometric quantiles and spatial ranks: Geometric Conformalized Quantile Regression (GCQR) constructs prediction regions from learned conditional geometric quantiles, while Geometric Rank Conformal Prediction (GRCP) uses the radial rank of vector-valued conformity scores as the nonconformity measure. We propose multiple estimators offering different tradeoffs between computational cost and adaptivity to feature-dependent heterogeneity, with scalable learning via partially input-convex neural networks. On multi-target regression and multi-class classification benchmarks, GCQR and GRCP attain near-nominal coverage with consistently tighter prediction regions than scalarized and multivariate baselines.

Abstract:
Attention selects information from long contexts, but standard softmax attention scales as O(N_qN_k) in the number of queries N_q and keys N_k, making longcontext training and inference expensive. We propose PLASH, an attention block with provably linear-time complexity in N_k that preserves the usual interface: each query still returns a data-dependent weighted combination of values. PLASH first compresses the key / value side into M \ll N_k learned representatives, and then restores expressivity by enriching these representatives with selective higher-order feature sketching (e.g., TensorSketch), which approximates chosen polynomial interactions without explicit feature expansion. The final softmax readout from \mathbfQ to the enriched (\mathbfK_g,\mathbfV_g) is exact, so PLASH applies to both self- and cross-attention by treating N_q and N_k independently. We give a runtime analysis O(N_k M d + N_q M d) (plus sketching costs), provide error bounds for the randomized sketches and an end-to-end deviation analysis relative to standard attention, and show strong long-context performance with favorable scaling versus efficient-attention baselines.

Abstract:
In the bigworld regime, agents cannot be universally capable and their ability is inevitably specialized across a world in pieces. Consequently, standard uniform guarantees fail to distinguish between the understanding of critical bottlenecks and irrelevant failures. We first formalize this limitation by proving that general agents are not universal, rendering standard worst-case analysis uninformative. To overcome this, we introduce structural certification, a transition-local framework that maps bounded goal-conditioned performance to entry-wise guarantees on the agent's internal world model. Our main contribution is constructive. We provide algorithms that filter specific transitions using deep compositional goals and prove that a general agent on these goals has a structural world model with a \mathcalO(1/n)+\mathcalO(\delta) error bound. Conversely, this bound is tight in the small-\delta regime, whose existence is explicitly guaranteed by our certification. These results enable the certifiable deployment of general agents by localizing the specific transitions where long-horizon planning is reliable.

Abstract:
Parameter Sharing (PS) is widely used to improve efficiency in MultiAgent Reinforcement Learning (MARL), but it can limit behavioral diversity and degrade performance. This limitation stems from gradient conflicts among agents on shared weights, which hinders effective policy learning. To fully characterize this phenomenon, we propose Geometric Gradient Decomposition Analysis that decomposes gradients with respect to weight vector into radial (scale) and tangential (direction) components and uncover a key insight: agents largely agree on directional updates but substantially disagree on scale updates. Consequently, while recent methods split the shared network into agent-specific subnetworks to mitigate conflicts, they also discard shared directional updates, limiting training efficiency. To address this issue, we propose Hyperspherical Parameter Sharing (HPS), which explicitly decouples direction and scale in parameter sharing. Specifically, HPS constrains the shared backbone weights onto a Riemannian manifold(unit hypersphere), enforcing purely directional learning. Building on this, an agent-specific scale generator outputs multiplicative modulation factors to adjust each agent’s scales, thus preserving heterogeneous response magnitudes without disrupting the shared directions. Experiments on SMAC, SMACv2, VMAS and Predator Prey demonstrate that HPS effectively resolves the scale conflict, significantly outperforming state-of-the-art methods.

Abstract:
Recently, the integration of large language models (LLMs) with generative recommendation (GR) has demonstrated promising potential. However, most existing GR methods adopt residual quantization to implicitly model hierarchical relationships across codebook layers in Euclidean space, which distorts the intrinsic treelike hierarchy and leads to low codebook utilization. To address these issues, we propose a Hyperbolic RQ-VAE enhanced Generative Recommendation, namely HG-Rec. Specifically, HG-Rec enhances the residual quantization mechanism by embedding the latent discrete representations into hyperbolic space to explicitly model hierarchical relationships across codebook layers. Motivated by the exponential volume growth of hyperbolic space, we further design a differential-length codebook strategy, i.e. the codebook size follows a pyramidal structure, which aligns with the tree-like structure and effectively compresses the codebook size. Hence, benefiting from the alignment of hyperbolic geometry and codebook hierarchy, HG-Rec achieves lower collision rates, more uniform codebook usage, and less training time compared to existing methods. Extensive experiments across multiple benchmark datasets demonstrate that HG-Rec consistently achieves state-of-the-art performance. The code is available in the Supplementary Material.

Abstract:
Human motion inherently exhibits a sophisticated temporal hierarchical architecture, spanning from global lowfrequency trajectories to local high-frequency dynamics. Inspired by this intrinsic property and the success of multi-scale autoregressive modeling in vision, we propose MotionMAR, a novel framework for human motion reconstruction from sparse observations. Unlike traditional methods, MotionMAR adopts a temporal coarse-to-fine design: it first estimates the global motion envelope and progressively refines temporal details for higher precision. The framework comprises three key components. First, a Temporal Multi-scale VQ-VAE defines hierarchical levels based on temporal resolutions, effectively disentangling global semantic information from fine-grained jitter. Second, the Motion Autoregressive Network (MAN) employs a next-scale prediction strategy: it generates coarse-scale indices to lock in the global structure, followed by finer-scale indices to restore details. This process is strictly guided by a Control Module that incorporates sparse tracking priors to ensure alignment with observed signals. Finally, a Motion Refinement Network functions as a temporal stabilizer on the decoded continuous pose space, mitigating quantization artifacts and smoothing local kinematics. Experiments demonstrate that MotionMAR achieves state-of-the-art accuracy, offering a principled, temporal-hierarchy-aware paradigm for robust motion reconstruction.

Abstract:
Many combinatorial optimization problems admit quadratic unconstrained binary formulations (QUBO) which can often be relaxed to the box [0,1]^n and optimized using scalable gradientbased methods. However, the resulting non-convex landscape can often contain local optima that are spurious or infeasible. In this paper, we establish sufficient structural conditions on quadratic penalties that rule out these failures, guaranteeing that every local minimizer of the relax problem is both binary and feasible. For each problem we study, we examine existing QUBO formulations when available, identify why they fail when they do, and propose alternative relaxed QUBOs that satisfy our conditions. We show for several common combinatorial problems, including open-pit mining, knapsack, and traveling salesman formulations, that these constructions allow gradient-based methods such as projected gradient descent and Adam to be safely applied to obtain valid binary solutions. Our results clarify when differentiable optimization is a reliable local solver for quadratic combinatorial objectives.

Abstract:
Graph neural networks (GNNs) tightly couple their inputoutput parameters to dataset-specific feature spaces and target sets, exhibiting limited transferability across different datasets. In contrast, language models (LMs) generalize flexibly via a unified input-output interface, motivating recent attempts to adapt LMs to graph tasks. However, existing methods struggle to encode whole-graph information, leading to potential information loss and suboptimal graph understanding. In this work, we propose a novel weight-level information injection paradigm for adapting LMs to graph tasks. This paradigm injects whole-graph information by generating task-specific weight updates that interact directly with hidden representations. Instantiating this paradigm following low-rank adaptation (LoRA), we introduce GaRA, a Graph-aware LoRA generation model. GaRA constructs low-rank weight updates conditioned on the original graph structures and constrains the norm of the generated updates, thus injecting whole-graph information and avoiding the optimization bias in the weight generation. Empirical studies demonstrate that GaRA consistently outperforms baselines on zero-shot graph learning tasks. Codes are provided in the supplementary material.

Abstract:
Model mergingoffers an appealing route to multitask learning by composing independently fine-tuned checkpoints without centralized data or retraining. However, this convenience can come with a hidden cost. Model merging mayamplifyperformance disparities across subgroups, raising fairness concerns even when average accuracy remains competitive. To explain this phenomenon, we develop a sensitivity-based theoretical analysis that upper bounds the fairness gap induced by model merging. Theoretical analysis with empirical verifications reveals that the resulting fairness gap is governed by two coupled factors, amerging magnitudeterm that measures how far the merged parameters move from the target model andglobal sensitivityterms that determine how unevenly the perturbation affects subgroup losses. Guided by these insights, we proposeFairMerging, a two-stage merging framework that first reduces the sensitivity of the target model and then performs fairness-aware coefficient optimization with orthogonally normalized task vectors. Experiments across multiple datasets, backbones, and merging baselines demonstrate that FairMerging substantially mitigates unfairness while retaining competitive multi-task performance.

Abstract:
Heterogeneous multimodal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code will be released soon.

Abstract:
Comprehensive urban understanding requires integrating macroscopic spatial structure with finegrained street-level semantics. However, existing urban Multimodal Large Language Models (MLLMs) primarily rely on satellite imagery, limiting their ability to capture detailed urban appearance and cross-view relationships. We propose UrbanMLLM, a unified MLLM that jointly learns from satellite and street-view imagery for cross-view urban perception and reasoning. To support this, we construct a large-scale dataset with paired cross-view urban images, geospatial alignment, and textual annotations. UrbanMLLM introduces a cross-view perceiver to explicitly model interactions between satellite and street-view representations, and adopts a structural interleaved pre-training paradigm that organizes cross-view image–text content as coherent urban documents to enhance cross-view knowledge fusion. We evaluate UrbanMLLM on 13 diverse urban understanding tasks spanning satellite, street-view, and cross-view settings. Experimental results demonstrate consistent improvements over strong open-source and proprietary MLLMs, highlighting effectiveness and scalability of UrbanMLLM for urban environment understanding.

Abstract:
We proposeBlock Adaptive Signum (BAS), which bridges Adam and SignSGD via blockwise scaling of sign updates. By discarding element-wise second moments, BAS reduces memory overhead relative to AdamW without sacrificing performance. Crucially, BAS mimics Adam’s dynamics closely enough to directlyinherit its hyperparameters, matching the performance of AdamW without the need for re‑tuning, a common fragility of prior low‑memory optimizers. This structural alignment makes it particularly suitable for tuning Adam-pretrained models. Furthermore, we exploit the inherent robustness of sign-based updates to store the first moment in FP8 without performance degradation. This shrinks the optimizer‑state footprint to12.5% of AdamW’s. We theoretically prove convergence under standard assumptions and introduce a communication-efficient variant enabled by the sign-based update. Across extensive evaluations, including pre‑training a 1.5B model on 100B tokens and supervised fine-tuning of models up to 32B parameters, we demonstrate that BAS achieves performance on par with AdamW.

Abstract:
Continual learning for visual question answering (VQA) is typically implemented by training one expert per task and routing each query using taskID supervision. Yet continual VQA tasks overlap substantially: on the VQA-v2 task stream, a non-native expert outperforms the task’s own expert on 49.9% of queries, so hard routing both wastes transferable knowledge and can be confidently wrong when mismatched. We propose a calibrated Bayesian mixture-of-experts that trains parameter-efficient per-task adapters, learns routing by directly maximizing expected VQA utility, and marginalizes expert identity at inference via Bayesian aggregation in a unified answer space; an entropy penalty prevents the utility objective from collapsing to one-hot routing, enabling evidence pooling across plausible experts. We reach 64.16 accuracy with 0.63 forgetting on VQA-v2 CL-LS (+5.74% accuracy, -2.99 forgetting vs. the strongest prior method), 78.81 with 0.40 forgetting on TDIUC CL-LS (+3.10, -1.74), and 83.41 with 3.21 forgetting on TDIUC CL-VS (+1.58, -0.82). Calibration also improves on VQA-v2, reducing ECE from 0.15 to 0.07.

Abstract:
Estimating statistical parameters from quantized data has received significant attention in recent years, as recovering information from quantized measurements has numerous applications across signal processing, communications, and data analysis. In this work, we focus on maximum likelihood (ML) estimation of statistical parameters from quantized samples. Directly solving the ML problem is challenging, as the likelihood function involves multiple integrals that are difficult to evaluate. To address this challenge, we propose an expectationconditional-maximization (ECM) algorithm under a general distributional framework. Our approach generalizes the quantization model to multi-bit settings and allows the underlying signal to follow any distribution within the normal mean-variance mixture family. By designing suitable surrogate functions, the ECM algorithm ensures that all model parameters can be updated in closed form at each iteration. Leveraging the ECM framework, we provide convergence guarantees, and under specific distributional assumptions, we further derive bounds on the convergence rate and the statistical error. Extensive experiments demonstrate the effectiveness of our method in recovering statistical parameters from quantized data.

Abstract:
Highquality data is a key input to modern machine learning models, leading to the emergence of platforms that facilitate the buying and selling of data. A central challenge in these platforms is how the data is priced to balance the interests of both buyers and sellers. Traditional market equilibrium notions, where demand meets supply are commonly used to price goods but do not extend naturally to data due to its non-rivalrous nature, whereby multiple buyers can simultaneously benefit from the same dataset. We therefore introduce a new notion of equilibrium for data pricing based on Nash equilibrium and study it in settings where data may be complementary or substitutable, focusing on the canonical utility models for each, namely Leontief and linear, respectively. We show that equilibrium prices fail to exist for linear utilities even with homogeneous buyers and two sellers, while establishing strong existence, efficiency, and polynomial-time computation guarantees for Leontief utilities in general markets with n homogeneous buyers and m sellers. We further examine the role of platform mediation and price discrimination in enabling optimal equilibrium outcomes efficiently. On the technical front, we develop a novel proof technique based on systematically reducing the space of candidate equilibria through the graph-of-deviations, which may be of independent interest.

Abstract:
As Large Language Models (LLMs) integrate into writing workflows, precise governance requires distinguishing ''how AI participated'' rather than merely ''whether AI was used.'' Traditional binary detection often misclassifies ``AIpolished'' content as generated, creating fairness risks. We propose shifting from passive post-hoc detection to active intent attribution, focusing on the distinction between Editing (source-anchored) and Generation (unanchored). We introduce IACW-Instruct, a corpus of diverse editing operations constructed via a Director--Actor--Judge pipeline to enable systematic evaluation. Building on this benchmark, we propose Intent-Aware Controllable Watermarking (IACW), featuring intent-adaptive entropy gating for semantically lossless embedding. Experiments show that IACW achieves 95% attribution accuracy under 20% token deletion while preserving near-unwatermarked semantic fidelity, establishing a practical paradigm for fine-grained provenance.

Abstract:
Recent studies show that Hypergraph Neural Networks (HGNNs) are vulnerable to adversarial attacks, while adversarial learning in the context of hypergraphs remains substantially underinvestigated. In particular, all existing attacks on HGNNs are white-box and customized for either structural or feature perturbation. But in reality, the attacker might not have access to the target model parameters. Motivated by this knowledge gap, we propose a generic meta-objective-based learning framework, MeLA, that leverages the hypergraph Laplacian to conduct gray-box, structural, and feature perturbations, while ensuring the stealthiness of the attack. In contrast to the attack literature, there is no adversarial training mechanism for HGNNs to defend against such attacks. Hence, we propose a novel adversarial training mechanism for HGNNs to obtain a robust classifier. We further prove the convergence of our robust training. Extensive experiments across various HGNN models and datasets show that (a) our proposed attack is stealthy in poisoning and evasion settings, and (b) our adversarial training enhances defense against adversarial attacks.

Abstract:
Supervised FineTuning (SFT) with Negative Log-Likelihood (NLL) remains the standard post-training paradigm for Large Language Models, yet it imposes an excessive penalty on low-probability target tokens. This focus forces the model to prioritize minimizing the loss of difficult samples over optimizing the overall quality of the generation, often leading to unwarranted overconfidence. On the other hand, alternatives like Dynamic Fine-Tuning (DFT) suffer from vanishing gradients on these tokens, which severely hinders the acquisition of new concepts. To bridge this gap, we proposeSAFTSpectrum-AdaptiveFine-Tuning), a unified framework that interpolates between the aggressive learning signal of NLL and the robust nature of probability-weighted optimization. By adaptively balancing these objectives, SAFT effectively mitigates outlier sensitivity without sacrificing learning efficiency. Empirically, our method achieves state-of-the-art performance on mathematical reasoning benchmarks, demonstrating superior generalization on out-of-distribution tasks. Our anonymized code is available at https://anonymous.4open.science/r/SAFT-9FEB.

Abstract:
RetrievalAugmented Generation (RAG) grounds large language models with external evidence, but many implementations rely on pre-built indices that remain static after construction. Related queries therefore repeat similar multi-hop traversal, increasing latency and compute. Motivated by \emphschema-based learning in cognitive neuroscience, we propose GAM-RAG, a training-free framework that accumulates retrieval experience from recurring or related queries and updates retrieval memory over time. GAM-RAG builds a lightweight, relation-free hierarchical index whose links capture potential co-occurrence rather than fixed semantic relations. During inference, successful retrieval episodes provide sentence-level feedback, updating sentence memories so evidence useful for similar reasoning types becomes easier to activate later. To balance stability and adaptability under noisy feedback, we introduce an uncertainty-aware, \emphKalman-inspired gain rule that jointly updates memory states and perplexity-based uncertainty estimates. It applies fast updates for reliable novel signals and conservative refinement for stable or noisy memories. We provide a theoretical analysis of the update dynamics, and empirically show that GAM-RAG improves average performance by 3.95% over the strongest baseline and by 8.19% with 5-turn memory, while reducing inference cost by 61%.\footnoteOur code and datasets are available at: \urlhttps://anonymous.4open.science/r/GAM_RAG-2EF6.

Abstract:
Supervised Graph Outlier Detection has long been constrained by severe class imbalance, and although recent diffusionbased augmentation methods have improved sample quality, their practical utility is hindered by the high computational costs of multi-step iterative sampling and the stochasticity of the generation process. To overcome these bottlenecks, we propose Graph Outlier Synthesis via Origin Consistency Model (GOCM), a single-step graph outlier synthesis framework based on a consistency model. Theoretically, we pioneer the Origin Consistency (OC) mechanism by employing an ``Interval-based Origin Inference'' strategy, which mathematically derives a direct mapping from the noise trajectory to the data origin, achieving robust and efficient single-step sample generation. Architecturally, to address the complexity of heterogeneous graphs containing multiple relations, we design the Multi-input Variational Graph Auto-Encoder (MiVGAE), which decouples intricate structures via relation-level message passing and cross-relation fusion, mapping them into a unified latent space, from which GOCM synthesizes high-quality outlier nodes. Extensive experiments on multiple real-world datasets demonstrate that GOCM achieves superior detection performance with significantly improved generation efficiency. The source code is publicly available at: https://anonymous.4open.science/r/RFS-2026-EB63/.

Abstract:
Despite the growing use of large language models (LLMs) in subjective tasks such as roleplaying, humor, emotional intelligence, and dialogue quality, their evaluation faces a pressing reproducibility crisis: even the same evaluator may contradict itself when re-judging the exact same sample. We attribute this instability to dimension drift, where free-form evaluation protocols (e.g., Chain-of-Thought reasoning) unpredictably shift their implicit criteria, undermining reliability. To address this fundamental challenge, we reformulate subjective evaluation as an information-theoretic optimization problem. Specifically, we propose anExpected Information Gain (EIG)-based frameworkthat constructs a stable yet adaptive personalized rubric to eliminate dimension drift. Our two-stage “generate–then–score” design first produces a diverse pool of candidate evaluation questions and then selects the most informative subset via EIG, yielding explicit and repeatable criteria. Experiments on six benchmarks, including CharacterEval, The rJokes, and MT_bench, demonstrate that our approach substantially improves both evaluation consistency and alignment with human judgments, outperforming CoT-based and fixed-questionnaire baselines. These results highlight that information-theoretic questionnaire construction offers a principled and reliable path toward reproducible evaluation of subjective tasks.

Abstract:
Optimal Transport (OT) traditionally relies on a fixed ground cost to produce a single deterministic transport plan—a practice that overlooks the inherent variability and noise in realworld data. While recent sampling based approaches of OT offer a principled way to quantify this uncertainty, these are computationally prohibitive and struggle to scale. In this paper, we introduce Sinkhorn-parameterized Variational Inference, a first scalable variational framework for performing posterior inference over transport plans. Our key insight is that the Sinkhorn map can be treated as a differentiable reparameterization of the set of entropic plans. This enables the use of flexible generative models like normalizing flows to approximate distributions over transport plans while enforcing marginal constraints. We experimentally demonstrate that our method matches the quality of intensive sampling techniques at a fraction of the computational cost, scaling effectively to large-scale problems.

Abstract:
In multiagent reinforcement learning (MARL), communication enables agents to mitigate partial observability and stochasticity through information sharing, but large-scale systems inherently lead to a rapidly growing number of pairwise interactions. Previous studies often struggle to simultaneously achieve scalability and task adaptivity in large-scale multi-agent communication. To address this challenge, we propose a scalable communication scheme for large-scale MARL, termed Sparse tOpology-aware Pairwise Scoring (SOPS). We argue that scalable MARL communication requires decoupling scalability from task-adaptive link allocation. To ensure scalability, we constrain communication to an exponential-graph backbone with a small diameter, which preserves rapid potential information mixing while keeping per-agent candidates logarithmic. On top of this constraint, we learn a task-conditioned probabilistic subgraph distribution via a pairwise scoring network over agent states and edge-type embeddings to allocate sparse links for maximizing return, optimized end-to-end through differentiable Gumbel-Sigmoid reparameterization. Evaluation results show that SOPS significantly outperforms existing state-of-the-art methods across cooperative benchmarks of diverse scales and exhibits robust zero-shot transfer capabilities.

Abstract:
Predicting transferability within sharedoutput model families (e.g., LLMs that share a vocabulary) poses a dilemma: representation-similarity metrics can be uninformative without assumptions about error geometry, while update-geometry metrics are computationally prohibitive. We show that, in a shared-output head setting, representation metrics (e.g., CKA) are non-identifiable for transfer; models can share identical representations yet have orthogonal head updates, so CKA alone cannot reliably rank transfer. We make head Fisher alignment tractable at vocabulary scale (K=128,256) using FisherSketch, a streaming algorithm that compresses joint (activation, error) geometry to a 16KB task signature (m=4096) with a 192KB per-task streaming state. Beyond the head, we prove a full-network decomposition and bounds, and we propose measurable diagnostics (profile cosine and off-diagonal discrepancy). We validate these on ViT-B/16 and LLMs up to 70B, showing that head/block approximations track the full Fisher and that FisherSketch remains informative in fixed-prefix verbalizer-shift settings where activation similarity cannot distinguish tasks.

Abstract:
Uncovering the causal mechanisms of educational social dynamics is critical for designing effective pedagogical policies. However, traditional methods face a fundamental dilemma. vational studies often lack causal power, while controlled experiments are ethically prohibitive. While LLMpowered multi-agent simulations offer a scalable in silico alternative, current approaches often fail to support rigorous experimentation due to shallow psychological grounding and unquantifiable interactions. To address this, we introduce EduMirror, a multi-agent simulator for the scientific study of educational social dynamics. EduMirror employs a value-driven cognitive architecture for agents that grounds agent behaviors in social value and intrinsic motivation, coupled with a dual-track measurement protocol that utilizes LLMs to quantify both overt actions and latent psychological states. We validate the realism and usability of our platform through case studies on school bullying and group cooperation. The results show that EduMirror generates realistic social phenomena aligned with established theories and measurable by empirical criteria. These properties enable structured in-silico educational research. Results demonstrate that EduMirror generates dynamics aligned with established theories, providing a robust tool for hypothesis testing in educational science.

Abstract:
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multistep interactive tasks. In this paper, we first propose ARLArena, a fair and systematic analysis framework that encompasses a broad spectrum of ARL algorithms and decomposes policy optimization (PO) through multiple policy gradient dimensions. Through this fine-grained analysis, we distill a unified perspective on ARL and, guided by the identified governing factors, propose SAMPO, a stable agentic PO method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines. Our codebase is open-sourced at https://anonymous.4open.science/r/SAMPO-02B3.

Abstract:
Imbalanced Unreliable Partial Label Learning (IUPLL) is a challenging weakly supervised learning setting in which severe class imbalance and unreliable candidate labels jointly degrade model performance. By revisiting existing approaches for imbalanced learning, we observe that most of them fundamentally rely on estimating the class prior to guide balancing operations, such as re-sampling, pseudo-label generation, or logit adjustment. However, under I-UPLL, obtaining stable and accurate prior estimates at the early stage of training is often unrealistic due to the ambiguity and unreliability of partial labels, thereby leading the model to rapidly converge to a suboptimal solution. To address this issue, we propose CLAPOR, a novel CLAss-PriOr perturbation-Robust regularization framework that fundamentally avoids dependence on accurate prior estimation. Specifically, the proposed regularization trains the model under deliberately perturbed class priors, sampled from a Dirichlet distribution that deviates from the current estimated prior. This design encourages consistent performance under prior uncertainty and naturally preserves attention to minority classes. Extensive experiments on benchmark datasets demonstrate the effectiveness of CLAPOR across various settings of I-UPLL.

Abstract:
As large language models (LLMs) are increasingly composed into heterogeneous multiagent systems, a fundamental reliability challenge emerges: knowledge and governance fragment across agents, leading to composition-dependent behaviors and linear scaling of violations. We introduce Judgment Operators (JO), a decision-time framework that adapts corrective projection via precedent memory from agent actions onto admissible sets. JO establishes a unified projection interface in which governance constraints \mathcalC define the target admissible set and corrective precedents \mathcalP provide executable corrective knowledge for adapting the projection map. The centralized operator \Pi_J: \mathcalX \to \mathcalX_J implements four-way intervention semantics (Allow, Edit, Escalate, Deny), enabling minimal repair without modifying agent internals. We formalize JO as an adaptive projection operator and establish guarantees of: (1) composition-invariant enforcement with constant violation probability (vs. linear scaling without JO); (2) sublinear mistake accumulation for online adaptation via JO-A under recurring violations; and (3) semantic preservation for code transformation tasks via structure-preserving projection. Empirically, JO provides portable corrective knowledge transfer: (1) capability---learns and reuses corrective precedents under recurring violations, improving task success over strong baselines; (2) governance---achieves near-perfect constraint enforcement in fully verifiable settings (0% observed violation rate vs. 48--68% for baseline methods); and (3) portability---enables 13.5--20.5% absolute zero-shot cross-model transfer where few-shot prompting fails. Judgment Operators thus provide a portable, auditable, and composable interface for both decision-time governance and capability injection in multi-agent LLM systems, addressing fragmentation at its architectural root through adaptive, composition-invariant projection.

Abstract:
While prior studies have successfully compressed vision Transformers (ViTs) through various pruning techniques, most have concentrated on width pruning to achieve significant reductions in model size. Depth pruning, which involves the removal of entire layers from a ViT, is notoriously difficult for accuracy recovery, although depth pruning usually leads to higher speedups of compressed ViTs. Consequently, existing joint approaches that incorporate both width and depth pruning have exhibited limited acceleration ratios due to the inefficiencies of previous depth pruning methods. In this work, we reveal that the failure of existing depth pruning methods lies in their neglect of heterogeneity between different layers. Through a comprehensive analysis of the heterogeneity, we introduce HetDPT, a method that handles heterogeneity during depth pruning while avoiding dimension mismatch. Comprehensive experiments on ImageNet1k, CIFAR100, COCO, and ADE20K have validated our methods. HetDPT achieves a 1.58× speedup for DeiT-B while maintaining accuracy, and a 1.39× speedup for DeiT-S with nearly no accuracy degradation. Furthermore, when combined with width pruning, i.e., HetDPT+, our method sets a new state-of-the-art record in extreme ViT pruning. HetDPT+ enhances the acceleration ratio from 4.24× to 5.19× for the Isomorphic-Pruning-2.6G configuration while maintaining near-lossless accuracy.

Abstract:
Mixtureof-Experts (MoE) models scale by activating only a small subset of experts per token, but standard deterministic top-k routing is non-differentiable and trained using surrogate gradients that ignore the discrete expert selection used at inference. We introduce SIMoE routing by modeling expert selection as a stochastic latent variable, casting it as probabilistic inference over discrete expert subsets under explicit cardinality constraints. First, we purpose SIMoE Exact-k routing, which samples discrete k-expert subsets and propagates gradients through tractable inclusion probabilities for stochastic expert selection. We then extend this to SIMoE Dynamic-k routing, where both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive expert allocation per token. Across different benchmarks on OLMoE and Qwen MoE backbones, SIMoE Exact-k improves expert utilization and routing diversity over competitive baselines, and Dynamic-k achieves comparable performance with fewer activated experts.

Abstract:
As large language models increasingly serve as autonomous coding agents, code documentation must be optimized for agent comprehension rather than human readability. We frame agentoriented documentation generation as a black-box optimization problem over the documentation space, where quality is measured solely by downstream code correctness. A central challenge for conventional LLM refinement methods isoutput coupling—program entities are interdependent, and refining the documentation of one entity can invalidate its callers, resulting in a persistentwhack-a-molephenomenon during inference-time scaling. We propose DocSearch, a dependency-guided bi-level search framework that systematically exploits test-time feedback. The outer level conducts a priority search over the program-entity dependency DAG, enforcing a callee-before-caller refinement order to prevent downstream interference. The inner level performs a beam search over documentation refinements, using diversified error message sampling from self-generated unit tests to better exploit diagnostic signals and escape local optima. We provide theoretical guarantees of monotonic progress, showing that our worthy condition prevents regression while enabling efficient exploration. On DevEval+, DocSearch achieves a 90.7% solve rate with GPT-4o, outperforming the strongest baseline by 32.6%. Cross-language experiments further demonstrate that optimized documentation transfers effectively to different target programming languages.

Abstract:
Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multistep visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset comprising over 80k samples to guide the model to assign higher prior confidence to correct action paths. Extensive experiments across eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, delivering an average improvement of 19.7% on the Qwen3-VL-8B baseline and consistent gains across both open-source and proprietary models.

Abstract:
Vision–Language Models (VLMs) are increasingly deployed with web search tools, yet we still lack benchmarks that isolate a critical capability for realworld use: deciding when to search and how to steer search from ambiguous visual evidence, especially when multiple images provide overlapping or conflicting cues. We introduce NAVIGATE, a novel benchmark centered on images as primary evidence for open-web search planning and multi-step reasoning. It contains 500 questions across 20 domains and spans three difficulty tiers, from single-image, self-contained problems to multi-image joint search and multi-domain composition. Unlike prior benchmarks that specify explicit search targets, NAVIGATE evaluates search decision-making: models must infer whether external search is necessary and iteratively refine search directions based on holistic reasoning over visual cues. Across a broad set of VLMs and search-enabled systems, performance remains low, Gemini-3-Pro-Preview-Search reaches only 36.4% accuracy, highlighting persistent failures in cross-image grounding, search triggering, and search strategy coordination. We will release NAVIGATE publicly.

Abstract:
Online evaluation of large language models increasingly relies on sequentially collected pairwise preferences, enabling humanaligned assessment and continuous data collection until closely performing models can be reliably distinguished. However, adaptive sampling and continuous monitoring invalidate classical fixed-sample inference, rendering existing ranking procedures largely heuristic. We propose SERPANT (Sequential E-value Ranking and Pruning via Adaptive Null Testing), a principled framework for online LLM ranking with anytime-valid guarantees. SERPANT formulates model comparison as a collection of pairwise hypothesis tests and constructs e-processes to ensure family-wise error rate control at any monitoring time. Anytime validity provides a theoretical justification for early stopping, enabling substantial cost savings from expensive human annotation. To improve efficiency, we introduce a novel tournament-based sampling strategy that adaptively selects comparisons based on past outcomes. The proposed framework further provides anytime-valid confidence sets for top-k model identification. Theoretical and empirical results on benchmark datasets validate the efficiency and statistical guarantees.

Abstract:
Multiagent LLM debates are increasingly deployed in domains such as policy analysis and city planning, where no objective ground truth exists. Despite this, debate quality is typically evaluated using outcome-based proxies such as LLM-as-judge scores that provide little insight into whether meaningful deliberation has occurred. Additionally, consensus and majority vote are viewed as ideal goals without analyzing the underlying interaction dynamics beneath them. In this work, we introduce a diagnostic evaluation framework that measures debate quality by measuring both the outcome and the process. Grounded in deliberative theory, our framework defines four interpretable process-level metrics capturing engagement, responsiveness, influence asymmetry, and balance, and two outcome-based metrics capturing stability and agent utility. Across both objective benchmarks and real-world domains, we find that process-level diagnostics are consistently more informative than commonly used outcome-based proxies. They better reflect correctness when ground truth exists and align more closely with human judgments of deliberative quality when it does not, revealing interaction failures that outcome-only measures fail to capture. These results demonstrate that process-level diagnostics are necessary for reliable evaluation of multi-agent debates and provide a principled foundation for analyzing and designing deliberative LLM systems.

Abstract:
Image quality assessment (IQA) is inherently multimage quality assessment (IQA) is inherently multi-dimensional, yet existing reward models are typically limited to a single task and become unstable when extended to multi-task settings. In particular, heterogeneous reward scales and variances across tasks can lead to conflicting optimization signals during reinforcement learning. We propose VisualScore, a unified visual evaluation framework that formulates multi-task IQA as structured, task-aware reasoning followed by continuous reward optimization. VisualScore produces interpretable rationales together with scalar quality scores under explicit evaluation principles. We construct a reasoning-enhanced reward modeling dataset via rejection sampling and initialize the model through supervised fine-tuning. VisualScore is then optimized with Group Relative Policy Optimization (GRPO) using a Gaussian-based continuous reward. To address multi-task reward conflicts and stabilize training, we introduce standard deviation filtering and entropy gating to normalize task-wise reward signals and suppress noisy updates. Experiments on technical quality, aesthetic quality, and text–image alignment show that VisualScore improves robustness, generalization, and interpretability, and can effectively guide text-to-image generation at test time without retraining.

Abstract:
While sparse autoencoders yield features easier to study than individual neurons, their reliable interpretation remains challenging. We propose Query Lens, which extends Logit Lens to provide more comprehensive and faithful interpretations of sparse features. By jointly considering encoderside key features and decoder-side value features, we characterize both the inputs that activate a feature and the outputs it promotes. We also account for indirect, module-mediated effects that arise when the feature is processed by downstream modules, going beyond the direct effect captured by Logit Lens. In experiments, we find that Query Lens yields coherent token signatures for features that were previously uninterpretable under Logit Lens. Finally, we propose the Subspace Channel Hypothesis, suggesting that downstream modules read features through layer-specific subspaces.

Abstract:
This paper introduces Mixtures of Geodesic Factor Analyzers (MGFA) on Riemannian homogeneous spaces. MGFA uses a geodesic factor model within each mixture component, providing greater expressiveness than mixtures of Riemannian radial distributions and enabling clustering of manifoldvalued data with anisotropic subpopulations. We establish root-n consistency for the MGFA maximum likelihood estimator (MLE), thereby filling a theoretical gap for mixtures of Riemannian radial distributions as a special case. We also propose an iterative estimation algorithm and implement it on spheres, shape spaces, and hyperbolic spaces. Numerical experiments show that MGFA substantially outperforms competing methods in well-specified regimes while remaining robust under model misspecification. Finally, case studies on corpus callosum and left hippocampus shape datasets demonstrate MGFA’s effectiveness for both 2D contour and 3D shape analysis.

Abstract:
Twosided matching markets are pervasive in numerous real-world applications, ranging from labor markets to online advertising. Recently, a rich line of research has studied the matching bandit problem, where participants learn their preferences through iterative interactions. However, existing works assume a static environment with fixed participants and require synchronized learning, in which all participants start simultaneously and have access to a global clock. In reality, matching markets are inherently dynamic: participants may enter and leave at arbitrary time steps without any global signal, leading to an uncoordinated scenario that render existing algorithms inapplicable. To address this challenge, we first investigate the one-sided learning setting under uncoordinated player arrivals, where only players need to learn their preferences. We propose the Way-SE algorithm, which achieves an O(\fracK^2 \log T\Delta_\min^2) regret through a distributed exploration mechanism that enables participants to implicitly coordinate exploration phases using only local clocks, without global synchronization. More importantly, we extend to the fully decentralized, dynamic two-sided learning setting where both sides need to learn their preferences and players may arrive or depart arbitrarily. We introduce Way-SE-2S, the first algorithm to achieve a sublinear regret O(T^\fracK-1K \log T) in this challenging environment, without requiring global signals, restrictive preference structures, or observability of competing agents' outcomes. Our work provides the first theoretical guarantee for stable matching in fully decentralized and uncoordinated bandit markets.

Abstract:
Latent Diffusion Models (LDMs) have become dominant in visual synthesis, but their quality–compute tradeoff is largely constrained by the tokenizer’s fixed compression ratio. Variable-length tokenizers (VLTs) promise adaptive compression by varying token counts, allowing diffusion models to flexibly balance quality and compute. However, conventional VLTs modulate length by truncating ordered token sequences, which changes token semantics across lengths and breaks representational alignment. This leads to significant cross-length variation in the latent distribution that hinders a single variable-length diffusion model from operating effectively. To address this, we propose a novel variable-length tokenizer that modulates length by merging tokens. We show that encouraging similar tokens to merge enables direct cross-length representation alignment when the diffusion transformer operates according to the merging pattern. Since conventional merging methods are data-dependent, making the merging pattern inaccessible during generation, we introduce learnable global merging, which is data-independent, to ensure compatibility with diffusion transformers. On ImageNet 256×256 generation, our merging-based variable-length tokenizer integrated with a diffusion transformer achieves a superior gFID–compute trade-off compared to prior VLT methods.

Abstract:
Textconditioned image generation has made rapid progress, yet rendering images with long-form text remains challenging due to the limitations of existing datasets, which predominantly focus on short and simple text. We introduce TextAtlas5M, a large-scale dataset designed to evaluate long-text rendering, where “long text” encompasses not only textual length but also layout complexity and semantic richness. TextAtlas5M contains 5 million generated and collected images across diverse data types, enabling comprehensive evaluation of large-scale generative models. We further curate 4,000 human-improved test cases (TextAtlasEval) spanning four domains, forming one of the most extensive benchmarks for text rendering. Evaluations show that TextAtlas5M poses substantial challenges even for state-of-the-art proprietary models (e.g., GPT-4o), with significantly larger gaps observed for open-source models. Training on TextAtlas5M consistently improves text rendering for both diffusion-based and autoregressive models, demonstrating its effectiveness for advancing text-rich image generation.

Abstract:
Instructionbased image editing (IIE) is a vital tool for iterative content creation, enabling multi-round interactions that refine visual details while preserving cross-round consistency. However, this workflow is constrained by the compute-bound nature of Diffusion Transformers (DiTs): because DiTs process tokens uniformly, they waste substantial computation on regions untouched by the instruction. We investigate the Round--Step--Layer hierarchy of DiT-based editing and identify a phenomenon we term Delayed Latent Emergence (DLE). Although pronounced latent changes emerge only in the late denoising stages, deep-layer activations within transformer blocks at the very first sampling step already diverge markedly in edited regions. Building on this insight, we propose CacheEdit, a training-free framework centered on an Adaptive Activation Cache (Acache) that exploits early-step sensitivity to detect invariant tokens and reuse their cached activations across subsequent sampling steps, thereby bypassing redundant computation. Experiments on FLUX.1 Kontext and Qwen-Image-Edit show that CacheEdit achieves up to 2.5× end-to-end acceleration. Moreover, by isolating and reusing static features, CacheEdit mitigates stochastic drift and improves instruction-following and structural consistency over full-recomputation baselines.

Abstract:
Multimodal large language models (MLLMs) are trained on massive multimodal data, making data unlearning increasingly important as data owners may request the removal of specific content. In practice, these requests often arrive sequentially over time, giving rise to the challenging problem ofMLLM Lifelong Unlearning. However, most existing benchmarks are limited in scale and scope, failing to capture the complexities of MLLM lifelong unlearning. To fill this gap, we introduce the MLUBench, a largescale and comprehensive benchmark featuring 127 entities across 9 classes under lifelong unlearning requests. We perform extensive experiments using MLUBench and reveal that existing unlearning methods suffer from severe, cumulative degradation. More critically, we further identify the unique challenge of this problem: unlike in unimodal models, MLLM lifelong unlearning is constrained by the need to preserve multimodal alignment. Continually unlearning from one modality could degrade the entire model. To alleviate this challenge, we propose LUMoE, an effective and efficient method. Experiments demonstrate that LUMoE significantly mitigates the degradation problem faced by baselines. We present source code and the MLUBench in this anonymousURL.

Abstract:
Native 4K (2176×3840) video generation remains a critical challenge due to the quadratic computational explosion of fullattention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed T3 (Transform Trained Transformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, T3-Video introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an “attention pattern” transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that T3-Video substantially outperforms existing approaches: while delivering performance improvements (+4.29\uparrow VQA and +0.08\uparrow VTC), it accelerates native 4K video generation by more than 10×. Demo and source code are available in \#Supp.

Abstract:
Denoising diffusion models are widely used for highquality image and video generation. Their performance depend on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-image noise schedules for pixel diffusion, based on the images spectral properties. By deriving theoretical bounds on how efficacy of minimum and maximum noise levels, we design "tight" noise schedules that eliminate redundant steps. During inference, we propose to conditionally sampled such noise schedules. Experiments show that our noise schedules improve generative quality, particularly at the low-step regime.

Abstract:
Proteins are inherently dynamic, with biological functions often emerging from transitions between multiple conformational states. While recent breakthroughs have largely addressed the static structure prediction problem, a systematic benchmark is absent to demonstrate how well current models capture functionally relevant dynamics. We introduce ProMiSE, the first benchmark that provides both a dataset and evaluation scheme, based on native biological assemblies and integrating major conformational change mechanisms—intrinsic, ligandinduced, and protein-induced—within a single curated dataset. We conducted a comprehensive evaluation of state-of-the-art structure prediction models, including AlphaFold3 and recent generative approaches. Our findings reveal that current models exhibit a limited ability to sample intrinsic dynamics and are often insensitive to biological context in induced scenarios. We further investigate whether these multi-state prediction biases are associated with multiple sequence alignment (MSA) signals or training data distributions, while analyzing internal model representations throughout the model to identify where these biases arise. Ultimately, ProMiSE benchmarks limitations in conformational diversity and biological relevance, enabling improved multi-state and dynamics-aware modeling.

Abstract:
ClassIncremental Learning (CIL) is important in building real-world learning systems. In CLIP-based CIL, the model performs classification by comparing similarity between visual and textual embeddings obtained from template prompts, e.g., ``a photo of a [CLASS]''. This seemingly monolithic matching process can be decomposed into two conceptually distinct stages: attribute extraction and attribute aggregation. For example, a model may recognize cat using attributes such as fur texture and whiskers. When learning a new class like car, the model must extract additional attributes like wheels and adjust how they are aggregated in the shared representation space. However, since only data from the current task is available, incremental updates can bias both attribute extraction and aggregation toward new classes, leading to catastrophic forgetting. Therefore, we propose AREA for attribute extraction and aggregation for CLIP-based CIL. To stabilize extraction, we anchor class-level visual and textual attributes on the hyperspherical embedding space via principal geodesic analysis. To stabilize aggregation, we learn lightweight task-specific experts with scoring and residual refinement, regularized by a variational information bottleneck objective. During inference, we perform routing over task attribute manifolds via optimal transport for more concise prediction. Experiments on multiple benchmarks show that AREA consistently outperforms SOTA methods.

Abstract:
Sample inefficiency remains a challenge in pixelbased visual reinforcement learning (RL), primarily due to ineffective state representation learning. While recent advances employ auxiliary tasks to improve representation learning, their representation goals (e.g., mask reconstruction, state prediction) are misaligned with the ultimate RL goal of maximizing return, constraining further improvements in representation quality. To achieve efficient visual reinforcement learning, we propose Return-Critic (RC), an auxiliary framework that bridges goal discrepancy by return prediction. RC samples partial frames from an episode, processes them through a shared visual encoder, and employs a lightweight Transformer to predict the episode's return, forcing the encoder to learn return-relevant representation. The attention weights naturally highlight important frames, enabling a key function for prioritized learning. Theoretically, RC can be shown to bridge goal discrepancy, thereby improving representation quality. Extensive experiments on both online (DMControl) and offline (V-D4RL) benchmarks demonstrate that RC significantly enhances the sample efficiency, particularly achieving 68% performance boost on average across nine challenging tasks from DMControl.

Abstract:
While deep neural networkbased long-term time series forecasting (LTSF) has become indispensable for critical infrastructures such as smart grids and IoT platforms, the deployment of these models as black-box APIs introduces severe security vulnerabilities that remain largely underexplored. In this paper, we propose TSFAdv, a query-efficient adversarial framework for LTSF models. The framework systematically analyzes model sensitivity to spectral perturbations in both magnitude and phase of the frequency domain. By embedding frequency-domain priors into Natural Evolution Strategies, we achieve sensitivity-guided gradient estimation that improves perturbation efficacy without violating practical query constraints. To overcome ambiguities inherent to point-wise regression metrics, we adopt a trajectory-level evaluation protocol based on Dynamic Time Warping (DTW) and Slope Misalignment Error (SME), enabling the capture of complex geometric and directional deviations. Extensive experiments across seven state-of-the-art architectures demonstrate that TSFAdv achieves substantial performance gains, with average DTW improvements of 21.91–85.00% and SME improvements of 15.04–61.97% under a restrictive 200-query budget. These findings reveal that existing defense mechanisms are ineffective against frequency-domain manipulation, underscoring an urgent necessity for robust LTSF models; the code and artifacts are available at https:// anonymous.4open.science/r/TSFAdv.

Abstract:
Recent advances in generative distillation have shown strong potential in constructing high quality surrogate datasets within a fraction of the time required by optimizationbased approaches. However, most existing generative solutions rely on diffusion models, which suffer from two limitations. (i) Indirect matching objectives. Their sequential denoising process makes it difficult to directly match representative prototypes. (ii) Target-agnostic generation. The generation process is often decoupled from the target task, causing the synthesized samples to drift from the desired distribution. Building on this insight, We propose ProtoVAR, a prototype-guided visual autoregressive framework. Instead of relying on latent space, ProtoVAR uses the coarse-to-fine next-scale prediction of Visual AutoRegressive (VAR) modeling to maintain semantic consistency during generation. By injecting multi-scale class prototypes, ProtoVAR enforces clear representativeness constraints while preserving diversity. A pool-based selector further distills the prototype-guided outputs into a compact, task-aligned surrogate dataset. Extensive experiments show that ProtoVAR achieves state-of-the-art performance with comparable or lower computational cost than diffusion-based distillation.

Abstract:
Realworld time-series data often suffer from missing observations, hindering long-range temporal modeling. However, most existing imputation methods formulate imputation as conditional reconstruction over limited context, which restricts temporal information propagation and fails to explicitly model temporal evolution. To overcome this limitation, we propose the Conditional Temporal Inference Paradigm (CTIP), which formulates time-series imputation as conditional inference along temporal evolution. Under this paradigm, we introduce CBiT, which leverages a history compression mechanism to encode long-range history into a compact latent space for history-conditioned temporal imputation. In addition, we adopt a partitioned modeling strategy that distinguishes historical context and temporal imputation targets with only linear-time complexity. Extensive experiments on multiple public benchmarks show that CBiT improves imputation accuracy by reducing Masked MAE and Masked RMSE by 27.3% and 18.6%, respectively, across different missing rates.

Abstract:
While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, datascarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We proposeSPA(ScalingPrompt-engineeredAugmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find thatSPAoutperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area.

Abstract:
One emerging approach to mitigating data heterogeneity in Federated Learning (FL) is to employ diffusion models to generate synthetic data for clients, thereby aligning local data distributions with the global distribution. Prior work has primarily focused on balanceoriented augmentation, which assumes a balanced global class distribution and thus generates samples of rare classes to rebalance each client's local dataset. However, in practice, global data distributions are often inherently imbalanced. Moreover, privacy constraints in FL hinder the server’s ability to accurately estimate the global distribution, rendering balance-oriented augmentation suboptimal. This raises a key, underexplored challenge: How can synthetic data be generated and selected to align local distributions with the true, yet unknown, global distribution? Our key insight is that a model’s performance implicitly reflects the data distribution it has been trained on. Based on this observation, we use the performance discrepancy between local and global models to identify the regions where each client’s local dataset is lacking, and generate corresponding samples for clients. Furthermore, we adapt the diffusion model via preference optimization, enabling it to generate data that better aligns with the true global distribution. Extensive experiments on multiple benchmarks demonstrate that FedPDG outperforms state-of-the-art methods, achieving up to 3.82% improvement.

Abstract:
KnowledgeBased Visual Question Answering (KB-VQA) relies on retrieval from large-scale knowledge bases, yet this stage is often treated simplistically. Existing methods typically adopt single-view indexing or naive multi-view fusion, leading to systematic coverage gaps. In this work, we demonstrate that different views exhibit strong complementarity in retrieval. Motivated by this observation, we propose SCAR, a Structure-aware Cross-View Retrieval framework that exploits cross-view structural complementarity at inference time without additional training. SCAR enhances retrieval via structure-aware similarity propagation within each view and explicit cross-view redundancy regulation. Experiments on multiple KB-VQA benchmarks demonstrate that SCAR substantially improves retrieval recall, approaches retrieval coverage upper bounds, and consistently boosts end-to-end KB-VQA performance with negligible inference overhead.

Abstract:
Token cramming compresses sequences into learned embeddings with nearperfect reconstruction, but prior work used fixed token budgets and 99% accuracy thresholds, obscuring whether residual errors reflect optimization failures or fundamental limits. We introduce progressive cramming, which grows the target prefix token-by-token and stops only when reconstruction is no longer achievable within a fixed optimization budget. Using progressive trajectories, we find that optimization paths occupy surprisingly low-dimensional structure in the embedding space. Attention analysis shows that compression embeddings often become attention sinks in specific intermediate layers, which correlates with both optimization difficulty and downstream degradation. On likelihood-based multiple-choice evaluation, prepending a crammed embedding drops accuracy to close to random guessing, even with the original prefix in context. These results suggest that perfect reconstruction can arise from brittle steering and attention hijacking, rather than a transferable semantic representation. Our results position progressive cramming as a tool for studying compression limits, while showing that perfect reconstruction is insufficient for meaningful compression.

Abstract:
Diffusion transformers (DiTs) equipped with multimodal attention (MMAttn) have become a dominant paradigm for image generation. However, preventing the generation of harmful content remains a critical challenge, particularly in imageto-image (I2I) editing tasks. Existing safety mechanisms are primarily designed for text-toimage (T2I) synthesis or U-Net-based architectures, which limits their effectiveness for unified safety mitigation in DiT-based frameworks. To bridge this gap, we propose Unified Visual Safety Regulator ( UVR) , a training- free safe generation framework that regulates unsafe semantics in generated images. UVR is grounded in an analysis of attention dynamics from the perspective of information flow in MM-Attn. We identify a taskindependent start-up stage, during which unsafe semantics in output patches rapidly emerge and can be accuratelv localized, where unsafe semantics in output patches quickly emerge and can be precisely localized, followed by task-specific semantic amplification and interference stages, where harmful signals are further propagated and entangled with benign content. Based on these observations, UVR mitigates unsafe generation through unified, targeted attention modulation and explicit restriction of harmful information flow over the identified unsafe output patches. Experiments across various concepts show that UVR achieves state-of-the-art safety performance by achieving 91% and 77% erase rate in image synthesis and editing tasks, while preserving visual quality and fidelity with minimal degradation.

Abstract:
Multitenant multi-LoRA serving multiplexes many LoRA adapters on a single GPU under high utilization, where most device memory is reserved for the KV cache, leaving only a small residency budget K for adapters. In this regime, adapter readiness is atomic: if an adapter is not device-resident, the engine must perform a mandatory fetch, stalling shared execution and amplifying tail latency system-wide. With only K residency slots, we identify a LoRA residency cliff: once the active adapter working set exceeds K, time-to-first-token (TTFT) tail latency can exhibit a congestion collapse rather than smooth degradation. To tame this cliff, we propose CLIMB, a minimal ingress controller that enforces feasibility-first admission by queueing non-resident adapters outside the engine, prioritizing critical (VIP) traffic, and rotating background adapters via round-robin. On a cliff-inducing workload, CLIMB averts collapse, reducing VIP TTFT p99 from 38.7 s to 13.1 s at matched throughput (10.66 rps) by keeping VIP engine latency near 0.13 s and shifting the residual tail into explicit ingress queueing. Overall, CLIMB shifts fetch-induced stalls from inside the engine to managed ingress queues, mitigating tail amplification without throughput loss in the evaluated settings.

Abstract:
Selfsupervised Continual Graph Learning (CGL) aims to successively learn from a graph sequence with different tasks without label supervision—a paradigm that has attracted widespread attention. Most existing self-supervised CGL methods rely on instance-level consistency objectives that enforce stability of individual node (or node-pair) embeddings. Due to optimizing nodes in isolation, these methods fail to maintain global relational structure, causing inter-node correspondences to progressively distort under continual learning. To this end, we propose a novel Structure-Aware Optimal Transport (SAOT) framework that explicitly captures and preserves relational structure within graph representations across sequential tasks. Specifically, SAOT leverages optimal transport theory to capture global inter-node correspondences, thereby facilitating and enhancing graph representation learning. Simultaneously, SAOT incorporates a cross-task knowledge distillation mechanism to preserve the previous structural knowledge. Extensive experiments on four CGL benchmark datasets demonstrate that SAOT outperforms existing self-supervised baselines. In particular, SAOT achieves significant performance gains, improving average accuracy by up to 6% on CoraFull-CL and over 15% on Products-CL compared with state-of-the-art methods in the Class-IL setting.

Abstract:
Large language models are increasingly used to accelerate scientific discovery, especially in iteratively searching scientific hypotheses. Yet in many discovery settings the goal is not to identify a single ``best'' hypothesis: validation is noisy and expensive, multiple hypotheses can remain plausible, and scientists benefit from a set of highquality but meaningfully diverse hypotheses that hedge against downstream uncertainty. Nevertheless, commonly used evolutionary search recipes tend to underemphasize this requirement, implicitly prioritizing optimization over exploration, and the resulting selection pressure during the search process leads to diversity collapse. Motivated by these limitations, we formulate hypothesis search as a sampling problem, where the objective is to efficiently produce diverse, high-quality hypotheses under fixed validation budget. Building on this perspective, we propose, EvoDiverse, an evolutionary framework inspired by the classical parallel tempering algorithm that searches hypotheses at multiple temperature levels and enables principled information exchange across temperatures to improve exploration without disrupting convergence. Across domains including molecular discovery, equation discovery, and algorithm discovery, our approach consistently improves both hypothesis quality and diversity under the same validation budget, and produces candidate sets that remain robust under more expensive downstream computational validations.

Abstract:
Link prediction (LP) is a fundamental problem in graph learning and can be broadly categorized into nodebased and subgraph-based approaches. While subgraph-based LP methods often achieve superior predictive performance by exploiting localized structural information, they suffer from efficiency bottlenecks due to the high computational cost of per-query subgraph message-passing during both training and inference. To address this challenge, we propose MIMO-LP, a Multi-Input Multi-Output (MIMO) framework that accelerates subgraph-based LP. Given a batch of query node pairs and their corresponding subgraphs extracted from a shared full graph, MIMO-LP superposes their message-passing processes into a shared latent space while ensuring minimal interference among them. This design enables MIMO-LP to multiplex multiple queries within a single forward pass during both training and inference, substantially reducing redundant message-passing computations in overlapping subgraph regions. Extensive experiments demonstrate that MIMO-LP achieves a 14x-44x speedup over existing one-to-one subgraph-based methods, while maintaining comparable predictive performance. The code for MIMO-LP will be released publicly.

Abstract:
Label shift refers to the distribution shift scenario where the marginal label distribution changes while the classconditional distribution remains invariant. To address this challenge in complex real-world settings, we proposeRegularized Discriminative Alignment for Label Shift (RDALS), a novel framework that adapts to target domains by aligning distributions within the deep latent space. By shifting the focus from raw inputs to learned representations, RDALS effectively operates under a weaker and more practical invariance assumption. Specifically, we construct a moment-matching linear system using Linear Discriminant Analysis (LDA) and show that this choice maximizes numerical stability. We further provide rigorous theoretical analysis, establishing finite-sample error bounds for the importance weight estimation and the generalization bounds for the adapted classifier. Extensive experiments on standard benchmarks demonstrate that RDALS significantly outperforms state-of-the-art baselines, achieving superior robustness and accuracy in both data-scarce and extreme-shift regimes.

Abstract:
We propose to model text generation as a continuoustime latent dynamical process, where token generation is formulated as a Markov Decision Process whose internal state evolves via a neural ODE. This formulation bridges discrete token sequences and continuous semantic evolution, providing a theoretically grounded approach for coherent long-range generation. The framework is optimized via reinforcement learning, maximizing a composite objective that integrates task-specific rewards with knowledge distillation from a powerful pre-trained language model. Experiments demonstrate that our method, Continuous-Time Latent Language Model (CT-LLM), outperforms discrete baselines in generation coherence and long-context performance, offering a new paradigm for fluid and controllable language generation.

Abstract:
Transferable jailbreaking attacks enable redteaming of black-box large language models by optimizing adversarial prompts on open-source surrogates. A natural approach to improve transferability is multi-model training---optimizing against multiple source models simultaneously. Yet this approach has been largely abandoned, as it yields only marginal gains with standard optimizers. We argue the root cause is poor generalization: standard gradient descent lacks stability whenaggregating gradients from diverse models. Since GCG and its variants~\citepzou2023universal, jia2024improved, yang2025guiding mplicitly perform SGD in discrete token space, they inherit this instability in multi-model settings.We address this with Lookahead-GCG, which combines: (1) Stochastic Nesterov Accelerated Gradient (SNAG), whose lookahead mechanism reduces sensitivity to individual gradient updates, providing stability for multi-model optimization; (2) embedding-space momentum accumulation, which enables SNAG in discrete token optimization; and (3) maximally distant initialization, which exploits SNAG's improved generalization by starting from a universally beneficial point. Experiments show our method achieves 50.37% ASR on open-source and 34.03% on closed-source LLMs, outperforming GCG and TransferAttack with +11.78% gains from multi-model optimization.

Abstract:
Extracting skills from multiagent offline dataset improves learning efficiency via sharing task-invariant coordination skills among tasks. In settings where tasks occur sequentially and the space of skills grows exponentially, existing approaches that rely on heuristically designed and fixed-sized skill libraries struggle to resolve the problem of distributional shift and interference, facing catastrophic forgetting and plasticity loss. To address this problem and endow agents with the ability to continually discover and reuse coordination skills in open-environment, we propose COMAD, a principled framework forContinualOfflineMulti-agent SkillDiscovery via Skill Partition and Reuse. We first discover skills from mixed multi-agent behavior data with an auto-encoder to transform coordination knowledge into reusable coordination skills. Then we construct a skill-augmented policy learning objective with multi-head architectures, explicitly guiding the advantage function with reusable skills identified via a density-based reusability estimator. Theoretical analysis shows our method approximates the optimum of a continual skill discovery problem. Empirical results across diverse MARL benchmarks show that COMAD continually expands its skill library to mitigate interference, achieving superior forward and backward transfer for task streams compared to multiple baselines.

Abstract:
Video action models are a promising foundation for Vision–Language–Action (VLA) because they can learn rich visual dynamics directly from video. However, likelihoodoriented training of diffusion predictors emphasizes globally plausible futures and does not guarantee precision-critical visual dynamics needed for manipulation, so small prediction errors can be amplified by downstream policies. We propose Dyn-VPP, a post-training framework that casts multi-step denoising as policy optimization and aligns predicted future latents with expert visual dynamics via verifiable terminal reward, without modifying any architecture. This enables explicit optimization of dynamics signals that are not captured by likelihood-only training. As a result, Dyn-VPP yields more accurate visual dynamics and improves downstream task execution. Experiments across diverse simulated and real-world manipulation settings show improved dynamics consistency and consistently higher task success.

Abstract:
Graph foundation models face several fundamental challenges including transferability across datasets and data scarcity, which calls into question the very feasibility of graph foundation models. However, despite similar challenges, the tabular domain has recently witnessed the emergence of the first successful foundation models such as TabPFNv2 and LimiX. Many of these models are based on the priordata fitted networks (PFN) framework, in which models are pretrained on carefully designed synthetic datasets to make predictions in an in-context learning setting. Recently, G2T-FM has made the first step towards adopting PFNs for graphs, yet it is limited to hand-crafted features and was never pretrained on graph data. In this work, we make the next step by proposing GraphPFN, a PFN-based model designed and pretrained specifically for graph node-level tasks. Following the PFN framework, we first design a prior distribution of synthetic attributed graphs by using a novel combination of multi-level stochastic block models and a preferential attachment process for structure generation and graph-aware structured causal models for attribute generation. Then, we augment the tabular foundation model LimiX with attention-based graph neighborhood aggregation layers and train it on synthetic graphs sampled from our prior. On diverse real-world graph datasets with node-level tasks, GraphPFN shows strong in-context learning performance and achieves state-of-the-art results after finetuning, outperforming both G2T-FM and task-specific GNNs trained from scratch on most datasets. More broadly, GraphPFN shows the potential of PFN-based models for building graph foundation models.

Abstract:
Optimizing pretraining data composition is pivotal for LLM generalization. While dynamic mixing outperforms static strategies by capturing evolving training dynamics, current methods fail to reconcile computational efficiency with sample efficiency and structural flexibility for diverse pipelines.We introduce ActorCritic Online Data Mixing (AC-ODM), which approaches data mixing from a reinforcement learning perspective with a parameterized policy that we theoretically prove to act as a dynamic linear surrogate maximizing the constructive interference of gradients. To enhance practical flexibility, AC-ODM supports two operational modes: (i) a proxy mode for fixed, pre-prepared corpora, where a policy learned on a small model is transferred to a larger target; and (ii) a non-proxy mode for direct end-to-end training from scratch without priors. Empirically, AC-ODM significantly outperforms prior methods in convergence speed and downstream accuracy across various architectures. On Pythia-1B, it reaches optimal validation perplexity using up to 66% fewer training steps than competitive baselines, delivering a 27.5% relative improvement in MMLU accuracy and a 2.23× higher pass@1 on HumanEval, all while incurring a virtually negligible (~0.4%) per-step wall-clock increase and only 2% additional memory overhead.

Abstract:
LLMs have traditionally scaled along dense dimensions, where performance is coupled with nearlinear increases in computational cost. While MoE decouples capacity from compute, it introduces large memory overhead and hardware efficiency challenges. To overcome these, we propose token-indexed parameters as a novel, orthogonal scaling axis that decouple model capacity from FLOPs. Specifically, we introduce ReToken and MoRT, which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight, element-wise operations, incurring negligible FLOPs overhead. Extensive experiments on both dense and MoE backbones, spanning from 190M to 9.8B parameters, demonstrate that our approach consistently reduces validation loss and significantly improves downstream task performance (e.g., +7.3 on ARC-C, +6.3 on GSM8K). Rigorous isoFLOPs analysis further confirms that MoRT fundamentally shifts the quality–compute Pareto frontier, achieving comparable model quality with 35% less compute relative to vanilla MoE architectures, and we validate that token-indexed parameters exhibit a predictable power-law scaling behavior. Moreover, our efficient implementation ensures that the overhead introduced by ReToken and MoRT remains marginal.

Abstract:
Current reinforcement learning (RL) methods for code generation are predominantly optimized on Python, showing weak generalization to other programming languages (PLs). Although leveraging multilingual solutions offers richer semantics and a wider search landscape, naive independent training across languages suffers from optimization imbalance and fails to effectively transfer knowledge from highresource languages. We propose Group Cross-lingual Relative Policy Optimization (GXPO), which forms training groups by generating solutions for the same problem in multiple PLs and jointly optimizes language-specific and cross-language signals, enabling more balanced optimization and improved transfer to low-resource PLs. We additionally introduce Multilingual LiveCodeBench (ML-LCB), extending LiveCodeBench to a unified multilingual evaluation setting. On ML-LCB across 8 PLs, GXPO consistently improves performance, with pronounced gains on low-resource PLs, demonstrating scalable multilingual RL for language-consistent code generation.

Abstract:
Many realworld systems exhibit both context- and time-dependent causal dynamics, where the dynamical system state also influences its context. For instance, soil moisture is driven by precipitation, yet also provides the context for heat-flux realization. We capture such dynamics in Structural Causal Models (SCMs) by introducing endogenous and time-dependent discrete context variables, also allowing for possibly lagged dependencies with the system variables. While context variables are discrete, they may also be proxies of continuous variables. The enabling assumptions for causal discovery of our model are either persistence of the context or sparsity of the context–system dependencies. We design two new PCMCI-based algorithms for causal discovery with endogenous context variables for time series and prove their soundness. A systematic evaluation on synthetic benchmarks and an application to a real-world land-atmosphere feedback problem demonstrate their effectiveness and applicability.

Abstract:
The Tensor Nuclear Norm (TNN), derived from the tensor singular value decomposition, is a widely used lowrank modeling tool that enforces element-wise sparsity on frequency-domain singular values. However, as a direct extension of the matrix nuclear norm, TNN fundamentally assumes single-level spectral sparsity, which is misaligned with the multi-level spectral structures prevalent in real-world data, where low-rankness within frequency components coexists with sparsity across them. To overcome this limitation, we propose the tensor \ell_p-Schatten-q quasi-norm (p,q\in(0,1]), which enables explicit control of dual spectral sparsity by jointly regularizing inter-frequency sparsity and intra-frequency low-rankness. This formulation strictly generalizes TNN and subsumes several existing tensor regularizers by coupling global frequency sparsity with local spectral low-rankness, leading to a fundamentally different modeling principle. We establish the first minimax error bounds under this model and develop an efficient reweighted optimization algorithm for the resulting nonconvex problem. Numerical experiments on noisy and Poisson tensor completion as well as image clustering demonstrate the effectiveness and robustness of our method across reconstruction and representation learning tasks involving complex multi-way data.

Abstract:
Longterm memory empowers LLM-based agents with adaptive reasoning but exposes a critical attack surface---adversaries can inject malicious records to bias agent behaviors. However, existing attacks face a dilemma: effective injections are often visibly malicious and easily detected, while stealthy, benign-looking injections are often less effective in altering agent behaviors. To address this, we propose MemIncept, a memory poisoning attack that can impact agents even in black-box settings using only benign-appearing queries. Unlike prior methods that inject isolated records, MemIncept generates a cooperative set of queries that work together to bias the agent. It achieves this via a bidirectional evolutionary strategy that optimizes the query set from two ends. A forward pass ensures the queries collectively lead the agent to the target outcome, while a backward pass ensures they are semantically close to victim (benign) queries for reliable retrieval. This ``meet-in-the-middle'' approach creates injected records that are both easy to retrieve and effective at steering behavior. Through extensive experiments across diverse agents, we show that MemIncept significantly outperforms single-record attacks, achieving high success rates comparable to explicit attacks while remaining virtually undetectable to both humans and automated filters.

Abstract:
Do the chains of thought (CoT) of reasoning Large Language Models (LLMs) reflect their internal computation? In this paper, we provide evidence of performative CoT, where a model becomes strongly confident in its final answer, but continues generating excess tokens without revealing its internal belief. Our analysis compares activation probing of the model's final answer and early forced answering to a CoT monitor across two large reasoning models (DeepSeekR1 671B & GPT-OSS 120B). We observe difficulty-specific differences for these methods: the gap between the expressed CoT and the model's internal belief is larger for MMLU-Redux questions that are easier and recall-based, and is smaller on more difficult multihop GPQA-Diamond questions. We also study certain inflection points within individual reasoning traces, finding that they correspond to updates in probe confidence. Finally, we leverage our probes to enable confidence-based early exit from CoT that saves up to 80% of tokens on MMLU and 30% of tokens on GPQA while maintaining similar accuracy. This work provides nuance to discussions on CoT faithfulness, and establishes attention probing as an efficient method for detecting performative reasoning and for adaptive computation in reasoning LLMs.

Abstract:
Computational identification of lipidbinding proteins is critical for both fundamental research and therapeutic development. Existing models are typically trained in a fully supervised manner, treating unlabeled samples as negatives. However, missing evidence does not imply non-binding, leading to systematic false negatives. Pocket-level lipid-binding prediction also remains underexplored compared to residue- or sequence-level approaches. To bridge these gaps, we presentLipoPU, a pocket-centric predictor that formulates lipid-binding learning under a ranking-based positive-unlabeled objective, and supports both binary lipid-binding detection and multi-label lipid category prediction. LipoPU learns an attention-based pocket representation that is robust to ambiguous pocket definitions while providing residue-level interpretability. Experiments show consistent gains over supervised baselines and prior pocket-level work, and a structural case study recovers a literature-supported allosteric lipid-binding pocket while highlighting biologically informative residues.

Abstract:
We study mechanism design for the budgetfeasible procurement problem, a natural problem that arises when a buyer wants to procure goods or services from multiple strategic sellers who each have a cost to provide that service, the buyer has a value for each service procured, but is constrained by a budget. In contrast to prior work, which has focused on buyer value maximization for this problem, we solve for optimal and approximately-optimal mechanisms for the objectives of buyer utility (value of procured services minus payments), welfare (value minus production costs), and generalizations of the two. For welfare, we design a simple mechanism that obtains a constant-factor approximation for the prior-free (worst-case) setting. As prior-free mechanisms fail to provide any guarantee for utility, even for a single seller, we consider Bayesian settings, where the buyer has distributional knowledge over sellers' costs. We first provide a utility-optimal mechanism that satisfies the buyer's budget constraintin expectation, then we show how to modify the mechanism to satisfy the budget constraintex-post, for every realization of seller costs, while still obtaining near-optimal utility guarantees. Finally, we generalize our mechanisms to other objectives.

Abstract:
Recent Openvocabulary Object Detection (OVD) approaches adapt CLIP through region-level distillation to improve semantic alignment for novel categories. However, the distilled regional features are often used for both classification and localization, enhancing semantic consistency at the expense of spatial fidelity. To resolve this, we propose Object-level Semantic and Spatial Distillation (OSSD), a two-stage framework that explicitly decouples semantic and spatial feature learning. OSSD first distills object-level semantics from CLIP’s global [CLS] embeddings to enhance region discrimination, and then injects fine-grained spatial and structural priors via spatial distillation from a detector trained only on COCO base categories. Furthermore, we propose a Location Quality Estimation Head (LQEH) that predicts class-agnostic localization quality, complementing objectness confidence to improve the novel-object perception. Extensive experiments show that our method achieves 49.2 AP50 on the OV-COCO benchmark. exceeding the best previous result by 3.6%, On the OV-LVIS benchmark, our method reaches 40.5 mAP on novel categories, outperforming previous state-of-the-art methods.

Abstract:
OpenVocabulary Segmentation(OVS) aims to achieve pixel-level semantic recognition from arbitrary text queries. Existing large-scale visual-linguistic models, such as CLIP, perform well in zero-shot generalization, but their image-level training objectives and real-valued cross-modal alignment mix amplitude and phase information, limiting fine-grained segmentation and often causing blurred boundaries and fragmented structures. Inspired by the ability of electromagnetic wave phase to control interference independently of amplitude, we propose PhaseAlign, an OVS framework based on Complex Phase Alignment (CPA). CPA explicitly decouples the magnitude and phase of visual and textual embeddings in the complex domain, refining effective features for stable cross-modal alignment. To further enhance structural awareness, we introduce spatial-aware cross-modal projection, which models local neighborhood relations via multi-scale spatial contrast normalization, and attention-guided affinity modeling, which leverages pre-trained ViT self-attention to propagate category activations, improving boundary clarity and region integrity. Experiments show that PhaseAlign achieves state-of-the-art performance on multiple OVS benchmarks.

Abstract:
While emerging trainingfree video anomaly detection (VAD) methods offer advantages such as interpretability and ease of deployment, they often suffer from computational inefficiency due to complex memory retrieval mechanisms or high-latency visual language models (VLMs). To address this, we propose PRISM (Parameter-free Recognition Based on Intrinsic Statistical Modeling), a novel framework for efficient open-set anomaly detection with minimal computational cost. PRISM based on a pre-trained multimodal embedding model, introduces differential amplification and whitening mechanisms to statistically suppress common-mode background noise in the embedding space, thereby significantly improving the signal-to-noise ratio of anomalous events. Extensive experiments on three mainstream datasets demonstrate that PRISM achieves state-of-the-art performance Real-time reasoning ability and interpretability. Furthermore, our statistical model provides a theoretical explanation for the performance gap (particularly mean accuracy (AP)) observed in existing training-free methods on complex datasets such as XD-Violence.

Abstract:
Diffusion models achieve stateof-the-art performance in generative modeling but are limited by their reliance on Gaussian noise and the high computational cost of iterative sampling. Star-shaped diffusion addresses the former by introducing a non-Markovian forward process, yet this comes at the expense of temporal coherence in the reverse process. We propose a novel framework that resolves this trade-off by learning a Markovian projection of a star-shaped forward process, and its reversal. This design enables learning over a broad class of exponential models and recovers DDPM as a special case. It is particularly well-suited for knowledge distillation, allowing few-step or even single-step generation. Experiments demonstrate the effectiveness and flexibility of our approach across multiple generative tasks. Code is available at \urlhttps://anonymous.4open.science/w/MStar-Diffusion-B3EE/.

Abstract:
In many machine learning applications, the most relevant items for a query should be efficiently extracted. The relevance function is usually an expensive similarity model making the exhaustive search infeasible. A typical solution is to train another model that separately embeds queries and items to a vector space, where similarity is defined via the dot product or cosine similarity. This allows one to search the relevant items through fast approximate nearest neighbors search at the cost of some reduction in quality. To compensate for this reduction, the found items (candidates) are reranked by the expensive ranking model. In this paper, we investigate an alternative approach to candidate selection that utilizes the scores of the expensive model to improve the representations of queries and items. The idea is to describe each query (item) by its relevance for a set of support items (queries) and use these new representations to obtain query (item) embeddings. We theoretically prove that such embeddings are powerful enough to approximate any complex similarity model (under mild conditions). We also investigate the choice of support items, which is a crucial ingredient of the proposed approach. The experiments on diverse academic and production datasets illustrate the power of our method.

Abstract:
Groupbased reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLMs) and have been rapidly extended to agentic tasks. However, their credit assignment relies heavily on coarse-grained trajectory-level attribution according to final outcomes, making it difficult to capture the contribution of individual steps, such as valuable steps obscured within failed trajectories. To uncover latent information and enable more faithful step-level credit assignment, we propose Graph-based Group Policy Optimization (GraphGPO), which first aggregates all rollout trajectories into a unified state-transition graph and then estimates the distance from each state to the task goal using the global information encoded in the graph. Finally, GraphGPO assigns credit to each edge by estimating a graph-based advantage, based on how much the transition reduces the distance to the task goal. In this way, GraphGPO significantly improves training efficiency and achieves state-of-the-art performance across a range of challenging benchmarks.

Abstract:
We consider the problem of estimating the effect function of a continuous treatment, which maps each treatment value to its average effect over the population. A central challenge in this setting is confounding: treatment assignment often depends on covariates, creating selection bias that makes direct regression of the response on the treatment unreliable. To address this issue, we propose a twostage kernel ridge regression method. In the first stage, we learn a model for the response as a function of both treatment and covariates; in the second stage, we use this model to construct pseudo-outcomes that correct for the distribution shift and then fit a second model to estimate the treatment effect. Although the response varies with both treatment and covariates, the induced effect function obtained by averaging over covariates is typically much simpler, and our estimator adapts to this structure. Furthermore, we introduce a fully data-driven model selection procedure that achieves provable adaptivity to both the unknown degree of overlap and the regularity (eigenvalue decay) of the underlying kernel.

Abstract:
Existing embodied control research demonstrates remarkable performance improvements by scaling training data and model size. We instead explore inferencetime strategy as an alternative axis. Non-deterministic generative models, such as diffusion and autoregressive models, have been widely adopted in the field of embodied control. However, the single-shot inference paradigm limits their performance. In this paper, we propose TapSampling, a plug-and-play framework for inference-time sampling. First, we introduce an Action-VAE to represent actions in a low-dimensional latent space. The Action-VAE maps initial actions from policies into a compressed posterior distribution, from which an arbitrary number of latent samples can be drawn and decoded into candidate actions that approximately follow the true action distribution. Second, we formulate action verification as task-progress outcome prediction and train the verifier by leveraging the intrinsic sequential information of robotic datasets. The predicted scores have clear semantic grounding, enabling interpretable action selection. Furthermore, TapSampling is a policy-agnostic framework. Extensive experiments in both simulated and real-world environments demonstrate that our method effectively improves multiple generalist policies substantially without further finetuning the policy models.

Abstract:
A fundamental tension exists in the largestep inference of diffusion models via their deterministic probability flow ordinary differential equation (PF-ODE) paths, which we formally identify as thecontractivity trap: efficient inference favors large step sizes, while stable convergence requires strong contractivity that limits expressiveness. To address this, we propose SteinDiff, an inference-time stabilization framework based on reference-free Stein corrections. Specifically, SteinDiff introduces a geometry-aware correction mechanism that stabilizes PF-ODE inference trajectories. To this end, we contribute closed-form correction estimators via Stein's identity in the continuous-time setting, enabling the method to adapt to local data geometry. We theoretically demonstrate that SteinDiff reduces integration error even when contractivity is violated and establishes its robustness against discretization-induced distributional shifts. Our analysis further reveals that these corrections act as persistent geometric anchors, providing new insights into the stability of SOTA EDM parameterizations. Extensive experiments demonstrate that SteinDiff significantly mitigates mode collapse and improves generative quality in large-step inference.

Abstract:
Crossmodal knowledge distillation (CMKD) aims to transfer knowledge from a teacher model in one modality to a student model in another modality. Existing CMKD methods have demonstrated strong empirical performance when paired multimodal data with aligned semantics are available, but such paired data are often costly or infeasible to obtain. This paper studies CMKD in the more challenging and practically relevant setting of unpaired data. We establish a distributional relationship between teacher and student models under cross-modal distillation and characterize two fundamental quantities that underlie effective knowledge transfer: feature and label alignments. These quantities capture semantic discrepancy between modalities at the level of representation distributions and prediction distributions, respectively. Guided by this theoretical insight, we propose a principled framework, with theoretical guarantees, that enables effective cross-modal knowledge distillation by aligning distributions rather than individual samples, thereby eliminating the need for data-level pairing. Extensive experiments across a wide range of multimodal benchmarks show that our framework is highly effective in both unpaired and paired data settings, improving significantly over prior work.

Abstract:
Asynchronous reinforcement learning (RL) has shown notable success in accelerating the posttraining of large language models (LLMs). However, its decoupled data generation and training paradigm introduces a fundamental distributional mismatch between data generated by stale behavior policies and current policy, leading to unstable training and degraded performance. To address this challenge, we propose D-ARL, a Distribution-matched Asynchronous Reinforcement Learning framework that selects high-quality asynchronous samples whose distributions are well aligned with the current policy for policy optimization. Specifically, D-ARL maintains a replay buffer that collects samples from the most recent K behavior policies and proposes a variance-guided metric to select distribution-matched data. During training, D-ARL introduces a multi-behavior policy optimization algorithm to leverage the multi-source nature of the selected samples for policy update. Experiments on six widely used reasoning benchmarks show that D-ARL outperforms state-of-the-art asynchronous methods, achieving an average improvement of 6.4% in reasoning performance and 34.7% in sample efficiency.

Abstract:
Realtime satellite video tracking poses distinct challenges, including accommodating high spatial-temporal resolution, dynamic backgrounds, and constrained onboard computational resources. While Discriminative Correlation Filter (DCF)-based methods offer high-speed inference, they suffer from limited accuracy. In contrast, Vision Transformer (ViT)-based trackers achieve strong performance by unifying representation and aggregation in a single-stream design, yet their heavy computational footprint limits practical deployment in real-time satellite scenarios. In this work, we present FATrack, a novel tracking framework that effectively balances tracking accuracy and computational efficiency. At its core is FA-ViT, a lightweight Vision Transformer backbone that introduces foreground-aware token routing, enabling the model to concentrate computation on target-relevant regions while suppressing redundancy. To mitigate semantic degradation caused by token sparsification, we propose the Adaptive Scatter Module (ASM), which selectively reinforces informative tokens via joint spatial-channel attention and sparse structural propagation, thereby enhancing both semantic fidelity and spatial coherence. By synergistically integrating FA-ViT and ASM, FATrack forms a unified architecture that delivers real-time performance with significantly improved tracking precision. Extensive evaluations on multiple satellite video benchmarks demonstrate that FATrack surpasses existing real-time trackers in accuracy and achieves inference efficiency comparable to DCF-based methods, highlighting its potential for practical deployment in large-scale aerial video tracking systems.

Abstract:
Concerns over dataset misuse in deep learning have highlighted the need for effective auditing. Unlike existing intrusive methods that require dataset modifications, which risk model performance and security, we present DataGuard, a nonintrusive framework for quantitative dataset auditing. Specifically, DataGuard integrates three key components: 1) a differential comparison between the target dataset and auxiliary non-training datasets, 2) an information-forensic analysis establishing formal inequalities to distinguish training data; and 3) a multivariate statistical test that translates these discrepancies into rigorous auditing scores. Extensive experiments demonstrate that DataGuard can detect both full and partial dataset usage without false positives while remaining robust under diverse training scenarios, offering a principled, information-theoretic solution for transparent AI development.

Abstract:
VisionLanguage Pre-trained (VLP) models are vulnerable to real-world query noise. Current cross-modal Test-Time Adaptation (TTA) methods often rely on high-confidence predictions, which induces confirmation bias and neglects the informative signals in ambiguous Low-Confidence Queries. To address this, we propose Generative Semantic Refinement and Exclusion-Guided Adaptation (ReEx), a robust retrieval framework that extends adaptation to the entire query stream. Specifically, textual structural noise is rectified by a Generative Semantic Refinement (GSR) module, which employs Confidence-Guided Dynamic Fusion to anchor LLM-based repairs and prevent semantic drift. To exploit ambiguous data, adaptation is driven by Exclusion-Guided Proxy Contrastive Learning (EPCL), which imposes negative constraints via Exclusion Sets of unlikely candidates. Experimental results on COCO-C and Flickr-C demonstrate that ReEx consistently outperforms existing TTA methods, achieving significant robustness gains with a justifiable computational trade-off.

Abstract:
Reasoningintensive retrieval is increasingly important for downstream applications, requiring more than lexical overlap or coarse semantic matching. While prior work mainly relies on Language Models (LMs) to synthesize reasoning-oriented supervision, we posit that it is already latent in LM-based retrievers but suppressed by contrastive overfitting. To elicit this latent reasoning, we introduce ElicitR, a retriever–LM framework with generative regularization that captures nuanced relationships among a query and its candidate documents beyond binary relevance. Concretely, alongside contrastive learning, we regularize the retriever by co-training a small LM on query–positive–negative batches. Next token prediction (NTP) for each text is conditioned on its prefix and the other in-batch texts, with cross-text conditioning weighted by retriever-computed similarities. Using MS MARCO as the only paired query-document supervision and a 135M LM for generative regularization with unlabeled raw-text initialization, ElicitR consistently improves BRIGHT by 16-29% relative across 0.1B–3B retriever scales while maintaining performance on BEIR. At 3B, ElicitR reaches an nDCG@10 of 23.1, substantially outperforming larger models trained with far more curated pairs and proprietary APIs. Further analyses show that ElicitR prevents overfitting, improves retrieval calibration, and remains robust to batch sizes, supporting its practicality.

Abstract:
Rotary Position Embeddings (RoPE) are widely adopted in Transformers to encode positional information, yet standard implementations enforce a uniform frequency schedule and scaling across all attention heads. Using simplified retrieval tasks and length generalization scenarios, we show—both empirically and theoretically—that heads with different functional roles require distinct frequency ranges and scaling factors to operate effectively. Ignoring this structure leads to suboptimal utilization of embedding dimensions and degraded performance, particularly under longcontext settings. To address these limitations, we proposeAdaRoPE, which equips each attention head with learnable rotation frequencies and scaling factors. Pretrained LLM with AdaRoPE consistently outperforms existing RoPE variants, including partial-RoPE and NoPE baselines. For context extension, we further show that uniform frequency and attention scaling, used in methods such as YaRN, are suboptimal. By applying head-specific scaling, AdaRoPE enables better context extension while better preserving short-context performance in both extrapolation setting and long context continued pretrain setting. These results highlight the importance of optimizing rotary position embeddings at the level of individual attention heads.

Abstract:
LoRA and its variants have attracted considerable attention because of their abilities to tune a negligible number of parameters while achieving comparable downstream performance. This success is largely attributed to the intrinsic lowrank structure of model parameter spaces, which allows LoRA to train two projection matrices to project weights into a low-dimensional subspace and then map them back. However, it does not consider how to explore this low-rank subspace sufficiently and may lose the expression ability accordingly. Moreover, when using LoRA to tune convolution layers, a flatten operation is required to convert tensors into matrices. We argue that this will degrade the model's performance. In this paper, we address this issue from a general parameter sub-space perspective: we present a unified Language And Vision Adaption finetuning framework (called LAVA). Specifically, we verify the existence of low-rank subspaces in convolution layers empirically and propose to parameterize the increment of both convolution kernels and matrices as sum of learnable rank-1 components. To improve training stability, we analyze the optimization dynamics of LoRA and incorporate orthogonal regularization into our parameterization, for which we give theoretical proof that it will help reduce the variance of the gradient. We conduct various experiments on different downstreaming tasks to validate LAVA's superiority. For example, when tuning LLaMA2-7b for commonsense tasks, the performance of our LAVA is +1.9% higher than that of LoRA. For metric depth estimation tasks, LAVA only tunes ～1.5% of Depth-Anything (335.3M), and achieves +3.5% \delta_1 accuracy against that of LoRA and +5.6% \delta_1 accuracy against that of SVDiff.

Abstract:
Latent Chainof-Thought (CoT) aims to internalize reasoning into continuous hidden states, promising to transcend the computational bottlenecks of explicit tokens. However, the precise mechanisms ensuring its validity remain opaque. To bridge this gap, we establish an Information-Theoretic Framework that dissects supervision into Trajectory Control and State Alignment. Our analysis identifies structural scaffolding as the fundamental prerequisite for valid latent dynamics, and demonstrate that Outcome Supervision falters due to optimization barriers, while Process Supervision succeeds by minimizing conditional entropy, thereby enforcing trajectory predictability. And we expose a divergence in alignment strategies: rigid Geometric Compression acts as a destructive prior that collapses the reasoning manifold, whereas Generative Reconstruction serves as a flexible semantic tether, optimizing for reconstructibility to preserve the intrinsic dimensionality of the latent space. To quantify these dynamics, we introduce the Unified Latent-MI Probe (ULP), which unveils a strict Information-Performance Binding: reasoning accuracy is deeply correlated with the mutual information retained in the latent chain.Ultimately, we advocate for a paradigm shift from geometric imitation to mutual information maximization to counter the information decay inherent in autoregressive generation.

Abstract:
Prior probingbased analyses show that individual layers specialize in distinct linguistic and semantic functions. A complementary line of work observes that deeper layers enter an alignment-dominated regime, suggesting that residual updates become largely collinear with the hidden state vectors. These findings appear to conflict: layer-wise functional specialization versus apparent representational stagnation in deeper layers. This raises a key question: Is the observed representational stagnation in deeper layers attributable to suboptimal acquisition or encoding of novel features, or are these features learned appropriately but exhibit minimal marginal contribution to the model’s predictive output? We address this by separating model behavior into (i) representational capacity: the richness and spectral diversity of the encoded features, and (ii) accessibility: the extent to which these features are aligned with, and exploited by, the output-relevant subspace. Analyzing hidden-state covariance across depth shows that intermediate layers maintain a broad representational span and rich spectral diversity, indicating that their representational capacity remains largely intact. Yet they project only weakly onto output-relevant subspace, indicating that their accessibility within task-relevant representational dimensions is tightly constrained. To probe their functional relevance, we add a minimal diagnostic pathway that routes intermediate-layer embeddings to the terminal readout. Analysis of the induced behavioral changes indicates that the apparent stagnation is better accounted for by geometric limitations on accessibility, rather than by a deficit in representational capacity. The code is available at https://anonymous.4open.science/r/Capacity

Abstract:
Recent advances in endto-end, multi-task robot policies based on transformer models have demonstrated impressive generalization to real-world embodied navigation tasks. Trained on vast datasets of simulated and real-world trajectories, these policies map multimodal observations directly to action sequences for physical execution. Despite promising real-world capabilities, these models are still data-driven and, therefore, lack explicit notions of behavioral correctness. We address this gap by introducingSafeDec, a constrained decoding framework for autoregressive, transformer-based robot navigation policies that enforces safety specifications expressed as Signal Temporal Logic (STL) formulas. Our method ensures that generated actions provably satisfy STL specifications under assumed dynamics at runtime without retraining while remaining agnostic of the underlying policy. We evaluateSafeDecon tasks from the CHORES benchmark for state-of-the-art embodied navigation policies across hundreds of procedurally generated environments and show that our decoding-time interventions are useful not only for filtering unsafe actions but also for conditional action generation. Videos are available at constrained-robot-fms.github.io.

Abstract:
Tabular data within a domain often exhibit heterogeneous schemas yet shared semantics, posing a key challenge: determining what should remain invariant across tables and what should preserve instancelevel distinctions. Existing token- or row-centric encoders conflate these roles, leading to schema sensitivity or weakened discriminability. We introduce the segment, a header–value pair, as an atomic unit that captures both functional roles and semantic content. Using value entropy, we treat low-entropy segments as domain anchors and high-entropy segments as entity-specific signals. We realize this design through Masked Segment Modeling and Entropy-driven Segment Alignment, which jointly enforce structured header–value coupling and selective semantic alignment. Experiments on in-domain heterogeneous tables demonstrate improved performance on discriminative and generative tasks, yielding stable and interpretable representations.

Abstract:
While Transformerbased architectures have revolutionized neural decoding, existing models often treat codes as generic sequences, ignoring their inherent algebraic properties. In this paper, we take a step toward bridging these two domains by proposing a novel decoding approach that integrates the algebraic structure of cyclic codes into Transformer-based decoders. Leveraging the inherent cyclic properties, we introduce interpretable error correction patterns and inter-node relationship hypotheses that link the structural characteristics of the codes to the model parameters. Building on these insights, we design a plug-and-play, flexibly deployable decoding method tailored for cyclic codes. Experimental results show that our method achieves an average reduction in bit error rate (BER) by an order of magnitude, while also reducing the total number of parameters by approximately 97%. Additional comparative experiments validate our proposed conjectures and highlight a promising pathway for bridging classical coding theory and modern Transformer-based decoding architectures.

Abstract:
The unification of generative details and discriminative semantics presents a structural paradox in diffusionbased representation learning. Early approaches decouple semantics from generation, inevitably compromising representational completeness (i.e., information split). While recent bridge-based methods achieve unification via a tightly coupled mapping, they suffer from information overload. This is because unconstrained reconstruction objectives incentivize the encoder to entangle high-frequency stochastic noise into the latent bottleneck. To solve this, we introduce asymmetric rectified contrastive diffusion autoencoder (ArcDAE), which rebuilds the diffusion bridge as a dynamic sifter. Through imposing a timestep-aware rectification constraint that orthogonalizes the semantic manifold from the stochastic noise space, ArcDAE compels the bottleneck to distill discriminative features while actively shedding high-frequency redundancy. Consequently, our approach eliminates the overload trap without reverting to decoupling. Extensive experiments validate the superiority of our FFHQ-trained ArcDAE, surpassing state-of-the-art methods by up to 6.4% in downstream semantics regression and 9.7% in reconstruction fidelity.

Abstract:
a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic informationseeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this critical gap, we introduce MCP-Persona, the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. MCP-Persona encompasses a diverse set of widely-used applications, ranging from social media platforms like Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents demonstrate their significant struggles with personalized tool use, thereby highlighting the benchmark's crucial role in identifying and addressing these limitations. MCP-Persona is publicly available at \hrefhttps://anonymous.4open.science/r/MCP-Persona-F85Dhttps://anonymous.4open.science/r/MCP-Persona-F85D

Abstract:
The default Gaussian latent in flowbased generative models poses challenges when learning certain distributions such as heavy-tailed ones. We introduce a general framework for learning data-adaptive latent distributions using one-dimensional quantile functions, optimized via the Wasserstein distance between noise and data. The quantile-based parameterization naturally adapts to both heavy-tailed and compactly supported distributions and shortens transport paths. Numerical results confirm the method’s flexibility and effectiveness achieved with negligible computational overhead.

Abstract:
Large language models (LLMs) are increasingly adapted to downstream tasks in resourceconstrained scenarios, making quantization-aware fine-tuning (QAF) a common practice for practical deployment. However, we find that quantized LLMs are substantially more vulnerable to safety alignment degradation during fine-tuning than full-precision models by interpretability analyses. In this paper, we first theoretically reveal that this vulnerability is driven by quantization errors, manifesting as an initial safety shift followed by a distorted optimization path. Based on this insight, we propose Explicit-Safety Quantization-Aware Fine-tuning (ExSQF), which effectively restores model safety while preserving downstream performance. It initializes adapters by combining quantization error with a safety matrix projection to mitigate early safety shifts, followed by post-training refinement that corrects deviations in the optimization path. Extensive experimental results show that ExSQF achieves state-of-the-art safety alignment recovery, even surpassing existing full-precision safety-aware fine-tuning baseline, while effectively preserving model performance.

Abstract:
In this paper, we reveal a significant shortcoming in class unlearning evaluations: overlooking the underlying class geometry can cause information leakage about the forgotten class. We further propose a simple unlearning strategy to mitigate this issue. We introduce Class Membership Inference Attack (CMIA) that uses the probabilities the model assigns to neighboring classes to detect unlearned samples. We find that existing unlearning methods are vulnerable to CMIA across multiple datasets. We then propose a new finetuning objective that mitigates this privacy leakage by approximating, for forget-class inputs, the distribution over the remaining classes that a retrained-from-scratch model would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model’s distribution accordingly. The resulting Tilted REWeighting (TREW) distribution serves as the desired distribution during fine-tuning. We also show that across multiple benchmarks, TREW matches or surpasses existing unlearning methods on prior unlearning metrics. More specifically, on CIFAR-10, it reduces the gap with retrained models by 19% and 46% for U-LiRA and CMIA scores, accordingly, compared to the SOTA method for each category.

Abstract:
Reinforcement learning (RL) is a fundamental framework for sequential decisionmaking, in which an agent learns an optimal policy through interactions with an unknown environment. In settings with function approximation, many existing RL algorithms achieve favorable sample complexity, but often rely on computationally intractable oracles. In this paper, we use supervised learning as a computational proxy to establish a clear hierarchy of commonly adopted RL oracles under low-rank Markov Decision Processes (MDPs). This hierarchy shows that policy evaluation is the most computationally efficient oracle, provided that supervised learning can be efficiently solved. Motivated by this observation, we propose a novel optimistic actor–critic algorithm that relies solely on the policy evaluation oracle. We prove that our algorithm outperforms the existing sample complexity guarantees for low-rank MDPs while avoiding computationally expensive planning or optimization oracles commonly assumed in prior works. We further extend our theoretical results to approximately low-rank MDPs and demonstrate that this setting captures a broad class of real-world environments. Finally, we validate our theoretical results with experiments on several standard Gym benchmarks.

Abstract:
Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We introduce MMClima, a largescale multimodal climate question answering framework with over 104k expert-validated question–answer pairs spanning articles, video transcriptions, and figures across five core climate science domains. MMClima is constructed via automated claim extraction and QA synthesis with human-in-the-loop validation to ensure both scale and reliability. Using MMClima, we benchmark state-of-the-art multimodal language models on tasks requiring factual recall, visual interpretation, and cross-modal synthesis. We additionally fine-tune on the textual split to produce mmclima-70b-txt, a domain-adapted baseline that outperforms strong open- and closed-source models on textual QA. We release the dataset, evaluation pipeline, fine-tuned model weights, and data creation framework to support standardized multimodal evaluation for climate science.

Abstract:
Personalized multisubject video generation is a promising direction within the field of controllable video generation; however, existing methods face challenges in maintaining cross-frame identity consistency and incur high computational overhead. To address these issues, we propose DiasR, an efficient framework that integrates Dual-Modal Identity-Anchored Alignment and a novel Sparse Routing Strategy. The Dual-Modal Identity-Anchored Alignment employs learnable identity queries to align visual and textual modalities with ground-truth subject masks, thereby mitigating cross-frame identity drift. The Sparse Routing Strategy dynamically routes video tokens to relevant subjects and groups them through bucket aggregation, reducing computational overhead and alleviating identity entanglement induced by redundant tokens. We have also constructed MuSA-2M, a large-scale dataset comprising 2 million annotated samples equipped with subject-level masks, which fills the gap in existing multi-subject video datasets. Experiments conducted on the OpenS2V-Eval benchmark demonstrate that our method achieves superior performance in identity consistency, text fidelity, and video naturalness. Notably, it maintains a nearly constant inference time as the number of reference subjects increases, outperforming existing baselines in both efficiency and generation quality for scenarios involving multi-subject interactions.

Abstract:
Large Language Model (LLM) agents use memory to learn from past interactions. However, this reliance on memory introduces a critical security risk: an adversary can inject seemingly harmless records into an agent's memory to manipulate its future behavior. This vulnerability is characterized by two core aspects: First, the malicious effect of injected records is only activated within a specific context, making them hard to detect when individual memory entries are audited in isolation. Second, once triggered, the manipulation can initiate a selfreinforcing error cycle: the corrupted outcome is stored as precedent, which not only amplifies the initial error but also progressively lowers the threshold for similar attacks in the future. To address these challenges, we introduce \emphA-MemGuard (\underlineAgent-\underlineMemory \underlineGuard), the first defense framework for LLM agent memory. The core idea of our work is the insight that memory itself must become both \emphself-checking and \emphself-correcting. Without modifying the agent's core architecture, A-MemGuard combines two mechanisms: (1) consensus-based validation, which detects anomalies by comparing reasoning paths derived from multiple related memories and (2) a dual-memory structure, where detected failures are distilled into ``lessons'' stored separately and consulted before future actions, breaking error cycles and enabling adaptation. Comprehensive evaluations on multiple benchmarks show that A-MemGuard effectively cuts attack success rates by over 95% while incurring a minimal utility cost. This work shifts LLM memory security from static filtering to a proactive, experience-driven model where defenses strengthen over time.

Abstract:
Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate their reasoning abilities due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To address these issues, we introduce MMEReasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, which covers all three types of reasoning (i.e., inductive, deductive, and abductive). We carefully curate the data to ensure that each question effectively evaluates reasoning ability rather than perceptual skills or knowledge breadth, and extend the evaluation protocols to cover the evaluation of diverse questions. Our evaluation reveals substantial limitations of SoTA MLLMs when subjected to holistic assessments of logical reasoning capabilities. Even the most advanced MLLMs show limited performance in comprehensive logical reasoning, with notable performance imbalances across reasoning types. In addition, we conducted an in-depth analysis of approaches such as ``thinking mode'' and Rule-based RL, which are commonly believed to enhance reasoning abilities. We hope the community can pay more attention to the comprehensive reasoning capabilities of MLLMs instead of only focusing on its subset.

Abstract:
Large language models (LLMs) have revolutionized research in software engineering, and among various tasks, LLMbased code synthesis is promising. A recent line of benchmarks aims to evaluate LLM-generated codes in time efficiency, beyond their correctness. However, space, another vital aspect of code efficiency, is rarely evaluated in prior benchmarks. To fill in the gap, this paper introduces BEST, the first benchmark for evaluating the efficiency of LLM-generated codes in both time and space. It comprises 440 coding tasks that are rigorously constructed by experts. In addition, we propose a fine-grained subtask-based evaluation scheme by dividing each task into multiple subtasks, with different input scales and difficulties. Each subtask is then accompanied by an expert-crafted standard implementation as the efficiency baseline, which achieves the Pareto optimum. Building on BEST, we introduce a unified and novel dual-indicator (time and space) metric, named dual@k, generalizing the notion of the standard pass@k metric and building on a careful and novel construction of a weight matrix of subtasks. Through extensive experiments with dual@k across 47 LLMs on BEST, our evaluation demonstrates that while LLMs exhibit weak capabilities in generating time-efficient code, their capabilities in space-efficient code generation are even worse. The benchmark is provided in the supplementary material.

Abstract:
Safetycritical scenarios are essential for evaluating autonomous driving (AD) systems, yet they are rare in practice. Existing generators produce trajectories, simulations, or single-view videos—but they don’t meet what modern AD systems actually consume: realistic multi-view video. We present SMD, the first framework for generating multi-view safety-critical driving videos in the real-world domain. SMD couples a safety-critical trajectory engine with a diffusion-based multi-view video generator through three design choices. First, we pick the right adversary: a GRPO-fine-tuned vision-language model (VLM) that understands multi-camera context and selects vehicles most likely to induce hazards. Second, we generate the right motion: a two-stage trajectory process that (i) produces collisions, then (ii) transforms them into natural evasion trajectories—preserving risk while staying within what current video generators can faithfully render. Third, we synthesize the right data: a diffusion model that turns these trajectories into multi-view videos suitable for end-to-end planners. Videos generated by SMD substantially increase collision rates when stress testing multiple end-to-end planners, and reduce collision rates when incorporated into training, improving planner robustness and safety. Our code and video examples are available at: \hrefhttps://icml-2.github.io/SMD/https://icml-2.github.io/SMD/.

Abstract:
The widespread adoption of large language models (LLMs) has intensified the demand for principled methods to distinguish humanfrom machine-generated text. Watermarking provides a promising avenue, yet existing detectors exhibit sharp performance deterioration under multiple paraphrasing and when applied to shorter texts. We introduce Pattern Stability Score (PSS), a novel detection framework that leverages local statistical features and stability dynamics across paraphrased variants. Specifically, the proposed method combines global and local z-score features with higher-order statistics of run-length patterns, enriched by autocorrelation signals and stability scores computed over paraphrase depth. Numerical evaluations are performed on three benchmark datasets (PG-19, CNN/DailyMail, and WikiText) using multiple LLMs (Llama-3-8B, Qwen2-7B) and paraphrasers (Mistral-7B, Qwen2-7B, Gemma-7B), systematically stress-testing robustness under up to eight rounds of paraphrasing. Compared to prior z-score thresholding baselines and some state-of-the-art deep learning methods, our approach improves detection AUC (area under the receiver operating characteristic curve) by over 10-15 percentage points across different token lengths. Additionally, extensive cross-domain experiments demonstrate that a single universal classifier generalizes across different LLMs, paraphrasers, and text domains without retraining, maintaining above 83.7% AUC even when all components differ from training.

Abstract:
This paper studies \emphpreferencebased reinforcement learning (PbRL), where agents learn from comparative, trajectory-level feedback rather than numeric rewards. While PbRL has seen rapid empirical and theoretical progress, existing analyses are largely confined to restricted settings and fail to jointly capture the outcome-based and comparison-based nature of preference feedback. We prove that under a broad \emphgeneral function approximation framework, PbRL admits a \sqrtT regret guarantee. In particular, we introduce a simple and provably efficient algorithm, \emphRecursive Trajectory-based Preference Q-Learning (RTPQ), and establish its regret bound while explicitly accounting for the trajectory-level and comparative structure of preferences. Our analysis is characterized by a new complexity measure, the \emphDual Episodic Eluder Dimension (DEED), which quantifies the intrinsic difficulty of PbRL. We show that for linear MDPs, the DEED scales as \mathcalO(dH), yielding a regret bound of \tilde\mathcalO(dH\sqrtT\max(H^3/2,\,1/\kappa)), where \kappa is a problem-dependent constant. This bound is near-optimal up to horizon- and problem-dependent factors when compared to standard reward-based linear MDPs. In addition, our framework recovers the best-known regret bounds in the special cases of dueling bandits and standard outcome-based reinforcement learning. Overall, our results provide a general regret guarantee for PbRL with outcome-based preference feedback and broad function approximation.

Abstract:
TestTime Adaptation (TTA) empowers pre-trained models to adapt online to distribution shifts during inference, but such online updates often become unstable in long-horizon deployments. Prevailing approaches attribute this failure to error accumulation from noisy pseudo-labels, relying on heuristics to gate which samples are used for updates. We argue that this statistical view is insufficient: the problem lies not only in the quality of samples but also in the directionality of their gradients. In this work, we identify a geometric failure mode termed manifold erosion. Through spectral analysis, we find that reliable gradients concentrate in a stable low-rank subspace, while gradients from confident mispredictions are high-rank yet exhibit a persistent directional leakage into this protected subspace. This leakage can accumulate coherently and gradually erode core representations, eventually leading to collapse. To address this, we propose Manifold-Aware Gradient Projection (MGP), a geometric intervention that tracks the dominant subspace online and projects gradients onto its orthogonal complement. By blocking the leakage path, MGP decouples stability from plasticity. Extensive experiments on diverse TTA benchmarks demonstrate the long-horizon stability of our method, whereas prior methods often fail.

Abstract:
Nonmonotone submodular maximization is a fundamental problem in machine learning and combinatorial optimization, with a range of applications including text and video summarization, recommendation systems, feature selection, Max Cut problems in graphs, and viral marketing strategies. In this work, we study non-monotone submodular maximization under a cardinality constraint k in the fully dynamic setting, and obtain results that improve upon the previously established approximation guarantees of (0.125 - \epsilon) using \tildeO(\epsilon^-1k^2) oracle queries per update (NeurIPS'20) and 0.171 using \tildeO(\epsilon^-3k^4) oracle queries per update (NeurIPS'25). We present a dynamic algorithm that achieves a 0.262-approximation with worst-case expected update time O(\epsilon^-3\log(k)\log(\epsilon^-1k) + \epsilon^-2k^2\log(k)), where 0 < \epsilon \leq 1 is the error parameter. We also obtain another dynamic algorithm with update time bounded by \textpoly(\epsilon^-1, k) that achieves a 0.277-approximation guarantee.

Abstract:
As LLMs generate increasingly long outputs, effective uncertainty estimation must identify errors at finegrained levels rather than discard entire responses. While such methods exist, evaluating uncertainty at any resolution (token to an entire generation) is challenging and highly sensitive to label imperfections, making zero-noise benchmarks essential; yet, long-form generation benchmarks tend to rely on fallible labels rather than deterministic ground truth. We introduce Single-answer Atomic Long-form Target (SALT), a benchmark of six procedurally generated tasks with single deterministic long textual ground truths, enabling unit-level evaluation of correctness, calibration, and ranking without external judges. Equipped with SALT, our analysis of 50+ LLMs reveals key insights: We identify which confidence functions dominate each uncertainty aspect and show that effective ranking benefits more from coarser evaluation resolutions; SALT further facilitates precise calibration tracking throughout generation, revealing a divergence in the accuracy–calibration relationship, with high- and low-performing models exhibiting degradation (\rho=0.87) and improvement (\rho=-0.92). Finally, we demonstrate that reasoning, via Chain-of-Thought prompting or internalized through training, introduces a trade-off, improving accuracy while degrading confidence ranking. These findings directly impact risk-critical applications requiring reliable error identification and mitigation.

Abstract:
In this work, we reveal that Large Language Models (LLMs) possess intrinsic behavioral plasticity—akin to chameleons adapting their coloration to environmental cues—that can beexposedthrough tokenconditional generation andstabilizedvia reinforcement learning. Specifically, by conditioning generation on carefully selected token prefixes sampled from responses exhibiting desired behaviors, LLMs seamlessly adapt their behavioral modes at inference time (e.g., switching from step-by-step reasoning to direct answering) without retraining. Based on this insight, we proposeToken-ConditionedReinforcementLearning (ToCoRL), a principled framework that leverages RL to internalize this chameleon-like plasticity, transforming transient inference-time adaptations into stable and learnable behavioral patterns. ToCoRL guides exploration with token-conditional generation and keep enhancing exploitation, enabling emergence of appropriate behaviors. Extensive experiments show that ToCoRL enables precise behavioral control without capability degradation. Notably, we show that large reasoning models, while performing strongly on complex mathematics, can be effectively adapted to excel at factual question answering, which was a capability previously hindered by their step-by-step reasoning patterns.

Abstract:
Endto-end (E2E) autonomous driving has recently attracted increasing interest in unifying Vision–Language–Action (VLA) with World Models to enhance decision-making and forward-looking imagination. However, existing methods fail to effectively unify future scene evolution and action planning within a single architecture due to inadequate sharing of latent states, limiting the impact of visual imagination on action decisions. To address this limitation, we propose DriveWorld-VLA, a novel framework that unifies world modeling and planning within a latent space by tightly integrating VLA and world models at the representation level, which enables the VLA planner to benefit directly from holistic scene-evolution modeling and reducing reliance on dense annotated supervision. Additionally, DriveWorld-VLA incorporates the latent states of the world model as core decision-making states for the VLA planner, facilitating the planner to assess how candidate actions impact future scene evolution. By conducting world modeling entirely in the latent space, DriveWorld-VLA supports controllable, action-conditioned imagination at the feature level, avoiding expensive pixel-level rollouts. Extensive open-loop and closed-loop evaluations demonstrate the effectiveness of DriveWorld-VLA, which achieves state-of-the-art performance with 91.3 PDMS on NAVSIMv1, 86.8 EPDMS on NAVSIMv2, and 0.16 3-second average collision rate on nuScenes. Code and models will be publicly released.

Abstract:
Predicting longhorizon trajectories of biological dynamical systems remains challenging due to substantial system heterogeneity. Most existing machine learning approaches are system-specific, requiring retraining for each new system and exhibiting limited generalization across distinct biological regimes. To address this limitation, we create a large-scale benchmark of over 1,000 ODE-based systems biology models spanning diverse organisms, biological processes, and dynamical behaviors. Building on this benchmark, we propose a regime-aware trajectory prediction framework that enables cross-system generalization and uncertainty quantification for unseen systems. Our approach introduces structured initial states derived from biological regime priors, such as growth trends and oscillatory rhythms, into conditional flow matching, replacing the standard Gaussian source distribution. We provide theoretical justification for this initialization and empirically demonstrate state-of-the-art accuracy (31% MAE reduction), well-calibrated uncertainty (17% CRPS improvement), and efficient long-horizon inference across the benchmark.

Abstract:
Controllable molecule generation is crucial for diverse scientific applications, such as drug discovery and materials design. While large language models (LLMs) show great promise, their dense and entangled representations impede precise control over the generation of molecules with bespoke substructures or properties. To address this, we propose Sparse Representation Editing (SpaRE), an interpretabilitydriven framework for fine-grained and precise control in LLM-based molecule generation. The crux of SpaRE is to learn an overcomplete sparse feature space that disentangles LLM representations into a compact set of latent features corresponding to chemically meaningful concepts. Within this space, we can directly manipulate these concept-aligned latent features to achieve (1) local control, by generating target atoms and functional groups at specified positions; and (2) global control, by customizing the overall structural and physicochemical properties within defined ranges. In this way, our framework advances interpretability from post-hoc analysis to actionable generative control. Experiments show that SpaRE can generate chemically desirable molecules under complex constraints in real-world scenarios, while offering mechanistic insights for quantitative structure–property analysis. The code and demo are available at https://github.com/SpaRE-paper/SpaRE.

Abstract:
Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavytailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we proposeTWLA(TernarizedWeights andLow-bitActivations), a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA is a PTQ method that maintains high accuracy under theW1.58A4configuration, while delivering significant inference acceleration. The code is available atTWLA.

Abstract:
Recent work has shown that brainaligned visual representations can emerge even in randomly initialized, high-dimensional neural networks, suggesting that cortical representations may be discovered rather than fully learned through task optimization. However, how such latent brain-relevant representations are stabilized and refined during development remains unclear. Motivated by this perspective and by neuroscientific evidence of activity-dependent synaptic pruning, we study how brain-aligned representations can emerge and be refined from high-dimensional unsupervised spiking systems. We propose a biologically grounded deep spiking neural network (SNN) that integrates unsupervised learning with developmental pruning dynamics. Starting from an overcomplete spiking architecture, the model self-organizes through sensory-driven activity while selectively eliminating weak or redundant synapses, progressively yielding compact and informative representations. Without using labels, the resulting network forms hierarchical visual representations that strongly align with neural responses across multiple areas of the mouse and macaque visual cortex, outperforming supervised and unsupervised ANN and SNN baselines. Synaptic pruning consistently enhances this alignment and further improves robustness under noisy and few-shot recognition settings. By unifying high-dimensional unsupervised spiking representations with activity-dependent synaptic pruning, this work provides a computational account of developmental refinement in visual cortex and bridges recent findings on emergent brain alignment in random networks with biologically grounded models of representation learning and structure formation.

Abstract:
In the pretraining-fine-tuning paradigm, pre-trained encoders have become high-value intellectual property (IP) due to their immense training costs, necessitating robust protection. Existing fingerprinting or watermarking methods typically rely on pre-defined samples and labels, or require intrusive modifications to the training process. However, downstream fine-tuning can significantly alter an encoder's representation and label space, thereby destroying the label consistency of existing methods and rendering them ineffective. Consequently, it is both challenging and urgent to provide a downstream-agnostic, black-box ownership verification mechanism for pre-trained encoders. To address this, we propose a downstream-agnostic, label-only fingerprinting method that leverages Adversarial Shifting to construct stable fingerprint clusters in the encoder’s latent space. By exploiting the inherent output consistency of these clusters, our method remains effective regardless of the specific downstream task or label mapping. Extensive experiments demonstrate that our method maintains superior robustness and stealthiness across various downstream tasks and category scales, providing a practical and reliable IP protection scheme for high-value pre-trained encoders.

Abstract:
Modern LLMpowered autonomous agents increasingly rely on rich user interface (UI) state observations to achieve reliable action grounding in complex digital environments. However, many deployments transmit the full UI state to remote inference servers even when most elements are irrelevant to the current task, which can leak sensitive but unnecessary context such as authentication codes, private notifications, and background application states. We propose MINIM, a trusted local broker that performs privacy-aware minimization on the client side before any observation leaves the device. Grounded in Contextual Integrity (CI), MINIM learns a dual-score representation for each UI element by predicting an inherent sensitivity score (s) and a task-conditioned necessity score (n). These scores drive a trinary disclosure policy that keeps essential elements, abstracts sensitive attributes when needed, and removes task-irrelevant content. We optimize a CI-aware objective that penalizes necessity errors more strongly on high-risk content, enabling aggressive pruning while preserving task-critical information. Experiments on real-world UI observations derived from WebArena show that MINIM reduces task-irrelevant sensitive leakage to 1.1% while preserving 79.8% of task-critical semantic context. Moreover, MINIM retains 95.7% of task-critical interactive affordances, indicating that local sanitization can enforce task-conditioned minimization at both the structural and attribute levels while retaining the signals required for agent actions.

Abstract:
What and how do language models learn during training? When does learning elicit existing knowledge, and when does it primarily teach new capabilities? We find that the amount of generalizable information language models learn during training predicts the origins of their emergent capabilities. Minuscule amounts of informationin many cases, a few bits in a single example---can unlock large fractions of models' maximum performance when capabilities are elicited rather than taught. We quantify these learning regimes using excess description length (EDL), an information-theoretic measure of generalizable information learned during training. We find that elicitation and teaching exhibit distinct EDL signatures that characterize the predominant learning mechanism as information scales: elicitation requires orders of magnitude less information than teaching to comparable performance. We demonstrate that EDL provides a practical tool for quantitatively estimating the maximum amount of predictive information models can compress from data into trainable parameters during learning. These capacity limits describe optimal tradeoffs between data and parameter count that robustly predict when parameter-efficient fine-tuning methods (e.g., LoRA) will underperform full fine-tuning.

Abstract:
The development of longcontext Large Language Models (LLMs) is constrained by the memory bandwidth bottleneck and quadratic complexity of the attention mechanism during decoding. To overcome the inherent trade-offs between the memory overhead of metadata-based metrics and the computational inefficiency of adaptive selection strategies, we present Faster Flash Decoding (FFD), a novel hardware-algorithm co-design framework designed to break the memory wall in long-context decoding. FFD integrates the selector and computer into a fully fused kernel, replacing external metadata indices with content-aware scanning via low-bit quantization. Furthermore, we introduce the top-\delta strategy, which dynamically filters blocks to achieve distribution-adaptive sparsity without global synchronization. As a training-free, plug-and-play solution, FFD enables the reuse of scanning results for computation, achieving up to 11.6x kernel-level speedup at 256k context length and 2.37x end-to-end throughput improvement. Empirical validation on Ruler and Longbench confirms that FFD maintains model accuracy while delivering high-ratio sparsity.

Abstract:
Recent progress in visionlanguage models (VLMs) has driven significant advances in video understanding. However, existing methods often act as naive empiricists, mapping video input directly to output without any mechanism to introspect or challenge inherent bias. In this work, we challenge this paradigm by reframing video reasoning as a Bayesian cognitive process. We propose Video-BCI (Bayesian Cognitive Integration of Self-Prior Hypotheses), a novel framework that first samples a set of Self-Prior Hypotheses to represent the model's intuitive yet potentially biased cognitive state, and then guides the VLMs to perform a critical integration of these priors. This process encourages the model to challenge erroneous majority consensus in cases of high information divergence and to distill superior reasoning chains from its own prior space. The integration is driven by a composite Cognitive Utility Function comprising two intrinsic learning signals: Dialectical Uncertainty Signal (DUS) and Process Tracing Signal (PTS). The DUS incentivizes correct, non-majority judgments by quantifying both the conflict (entropy) among priors and their consensus-challenging strength. The PTS guides the model to trace and learn from reasoning paths within its own priors that lead to better answers, enabling self-driven procedural knowledge distillation. Extensive experiments on six mainstream benchmarks show that Video-BCI achieves new state-of-the-art (SOTA) results across the board. For example, it surpasses the previous best on the MMVU benchmark by 3.8%. Our code will be made publicly available.

Abstract:
Integrating VisionLanguage Models (VLMs) into robotics has facilitated the development of generalizable Vision-Language Action (VLA) policies. However, unified discrete frameworks lag behind decoupled continuous designs due to limitations in action chunking and temporal modeling. To address this, we introduce RoboOmni, a unified multi-modal next-token prediction framework. Challenging the assumption that continuous modeling is essential for high-performance manipulation, RoboOmni demonstrates that actions are just another modality capable of being effectively modeled discretely. At the core of our method is Multi-Token Action Prediction (MTAP), which integrates action chunking directly into the discrete tokenizer. This design resolves temporal modeling bottlenecks and significantly reduces distribution shift between training and inference. By preserving the native VLM training and inference pipeline, RoboOmni naturally benefits from large-scale multimodal co-training and modern decoding optimizations. Extensive evaluations on the CALVIN, SimplerEnv, and real-world platforms confirm that RoboOmni establishes new state-of-the-art performance, significantly outperforming diffusion-based baselines such as \pi_0. Notably, combining our proposed MTAP with the FAST tokenizer achieves a 94.4% average success rate on CALVIN, while the Bin tokenizer implementation attains a 27× inference speedup compared to OpenVLA.

Abstract:
Optimizationbased adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening detectors based on static global or windowed perplexity statistics. We cast adversarial suffix detection as an \emphonline change-point detection problem over the token-level next-token entropy stream. Using the fixed system prompt to estimate a robust baseline via the median and median absolute deviation, we standardize user-token entropies and monitor them with a one-sided CUSUM statistic. The resulting detector is model-agnostic, training-free, operates online, and localizes the onset of adversarial suffixes. On a benchmark of 724 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter) and 765 benign prompts from a TyDiQA+OpenOrca mixture with controlled post-prefix perplexity, CPD consistently outperforms perplexity baselines; on LLaMA-2-7B it reaches AUROC 0.90 and F1 0.82. At an operating point with \approx 10% benign false-positive rate, CPD detects 74% of suffix attacks and concentrates 87% of its triggers inside the adversarial suffix. By comparison, windowed perplexity detects 35--43% and frequently fires on boundary-straddling windows. Finally, we show CPD Online can act as a lightweight gate for LLaMA Guard, reducing guard invocations by 17--22% on a high-volume stream dominated by benign prompts while preserving guard-level detection quality.

Abstract:
Conventional wisdom suggests that reasoning models fail when problems exceed their capabilities. However, we find that frontier reasoning models sometimes possess the necessary capabilities to solve problems but fail due to premature selfdoubt -- a phenomenon informally known as context anxiety. We provide the first systematic study of context anxiety, demonstrating that it arises, in part, from a model's inability to accurately estimate the tokens required to complete a task. We also show that context anxiety leads to material efficiency losses when models operate under perceived constraints. Building on this analysis, we further show that models can learn alternative strategies for solving long-horizon problems without exhibiting context anxiety, suggesting that performance improvements may be achievable not through scaling model capabilities, but by improving models' ability to accurately assess and adapt to their own limitations.

Abstract:
Spatial transcriptomics (ST) links gene expression with tissue morphology but remains expensive and lowthroughput, motivating surrogates that infer expression from routine histology. Whole-slide H&E-to-ST inference pairs a gigapixel image with gene measurements at a sparse, irregular set of locations, making multiscale modeling challenging without incurring dense-grid overhead or quadratic token mixing. We propose HiST, a hierarchical sparse transformer that treats measured locations as a lattice-indexed sparse field and builds a dyadic encoder--decoder directly on the active tissue footprint. HiST combines sparse window attention for local geometric correspondence with resolution-changing operators for rapid multiscale context integration. For a fixed window size, the dominant runtime and memory scale with the number of observed locations rather than the dense slide area. To mitigate slide-specific acquisition variation, HiST adds a bottlenecked global conditioning pathway via a \emphslide calibration token that summarizes slide-level context and conditions local representations. On a multi-organ benchmark spanning diverse tissues and acquisition sources, HiST improves predictive performance over recent baselines while reducing runtime and peak memory.

Abstract:
Largescale, multi-subject widefield calcium imaging provides unprecedented access to brain-wide cortical dynamics. However, the high dimensionality, complex spatiotemporal structure, and substantial task-irrelevant activity in widefield recordings have largely restricted modeling efforts to single-session analyses, limiting scalability and generalization. While cross-subject foundation-style modeling has been explored for some neural modalities, cross-subject models for widefield calcium imaging have not yet been demonstrated, and subject-invariant zero-shot behavior decoding remains elusive in neurofoundation modeling more broadly. Towards foundation modeling of widefield data, we introduce WiCAT, a multi-subject model that leverages self-supervised pretraining to both outperform single-session models and enable zero-shot behavior decoding on unseen subjects. WiCAT introduces an atlas-grounded tokenization scheme without session-specific components and learns globally shared spatiotemporal representations. Across multiple widefield datasets, the pretrained model supports lightweight downstream decoding, transfers across subjects, tasks, and datasets, and outperforms baseline models. Notably, the model also achieves robust zero-shot continuous behavior decoding and left-out brain region reconstruction on unseen subjects.

Abstract:
Multistep retrosynthesis planning is a fundamental challenge in organic chemistry, defined by its enormous search space. Existing methods typically formulate it as a Markov Decision Process (MDP) with a fixed choice of transition model (i.e., a single-step retrosynthesis model), and focus on improvinghow to searchthrough better policies and value functions. However,how the transition space itself is navigatedremains largely unexplored. This limitation is particularly urgent given our observation of pronouncedskill disparityamong single-step prediction models: different models exhibit substantially different performance across molecule states. Motivated by this observation, we introduce RetrOrchestrator, an LLM-powered agent that explicitly accounts for model skill disparity by reframing retrosynthesis planning as a Partially Observable Markov Decision Process (POMDP). By regarding each single-step prediction model as a tool, we further propose a scaffold-aware reinforcement learning algorithm to optimize navigation policy within the transition space. As a result, RetrOrchestrator jointly searches which molecule to expand and which single-step model to apply for the molecule at the current step. Empirically, RetrOrchestrator significantly outperforms static baselines on the Retro-190 benchmark, achieving a state-of-the-art 94.21% success rate as well as a Pareto front in both wallclock time and number of model queries.

Abstract:
Achieving crosstask generalization remains a critical challenge in Multi-Agent Reinforcement Learning (MARL), fundamentally relying on effective inductive biases. However, existing entity-level biases often overlook collaborative patterns, whereas task-level biases lack sufficient coverage for novel scenarios. To address this, we introduce a role-level inductive bias as an intermediate abstraction that integrates entity-level flexibility with task-level inter-agent collaboration. To instantiate this, we propose Gaussian-mixture-model-based Transferable Role discovery (GTR). Specifically, GTR constructs a structured role space to ensure diverse role assignment, further achieves role decoupling via regularization, and ultimately utilizes these roles for efficient generalization. Empirical results demonstrate that GTR achieves superior zero-shot and few-shot transfer performance on unseen tasks compared to state-of-the-art methods.

Abstract:
Chainof-Thought (CoT) prompting improves large language models (LLMs) on difficult reasoning tasks, but it generates long natural-language rationales that are poorly optimized towards higher-level machine efficiency and intelligence. We propose Communicative Language Symbolism Routing (CLSR), a test-time framework in which multiple LLM agents autonomously invent, evolve, and share compact Language Symbolism Frameworks (LSFs), and a latent-free router adaptively selects and composes these languages per query to optimize the accuracy--token budget trade-off. Unlike prompt optimization that refines surface instructions, CLSR treats each LSF as a reusable symbolic protocol and improves it through an evolutionary loop. In inference, the router may invoke a single low-cost LSF call, ensemble multiple dialects with aggregation, or execute a multi-round composition protocol on harder queries. Across challenging benchmarks, CLSR reduces latency-oriented token completion by 3-6× compared to standard CoT while maintaining accuracy, outperforming other token-reduction and prompt optimization baselines. We further theoretically (i) yield an information-theoretic lower bound relating accuracy and tokens under arbitrary symbolism, and (ii) characterize the CLSR protocols as a generalization of program-execution pipelines.

Abstract:
Conditional flow matching has emerged as a powerful generative modeling framework that learns a vector field to transport an initial distribution toward a target data distribution. However, theoretical understanding of its outof-distribution (OOD) performance under unseen conditions remains limited. In this work, we establish a rigorous geometric formulation to decompose the source of generalization error. We treat the conditional task as a map from the condition space to the Wasserstein space and derive a generalization bound under a coarse embedding assumption. The resulting decomposition separates OOD error into three tractable components:Interpolation Sparsity,Geometric Distortion, andIn-Distribution Fit. Our empirical evaluation confirms that this framework demonstrates three key functions: (1) it acts as a diagnostic tool that tracks the dynamics of generalization during training; (2) it identifies dataset-specific failure modes (e.g., topological gaps, geometric instability); and (3) it enables mathematically motivated interventions that yield predictable gains by minimizing specific terms.

Abstract:
Object insertion has emerged as a promising augmentation paradigm to solve the label scarcity and longtail distributions in remote sensing. It aims to generate training samples by synthesizing target instances onto real backgrounds. However, existing methods have three critical issues: (i) Semantic placement inconsistency, (ii) Radiometric inconsistency with illumination and atmospheric conditions, and (iii) Textural discontinuity. To cope with these issues, we propose a physics-aware method, called "Plan, Decouple, Assimilate" (PDA), for generating high-fidelity training samples. In the planning stage, the Planning (P) module automatically generates geometrically bounding boxes. In the generation stage, we design a dual-module model to generate the target instance: the Decoupling (D) module employs Asymmetric Spectral Adaptation Decoupling to disentangle structural identity from environmental illumination, while the Assimilation (A) module utilizes Neighborhood-Aware Texture Assimilation to harmonize the local manifold. By strategically integrating these modules, PDA enforces multi-level consistency spanning global geometry to local micro-textures. Extensive experiments verify that PDA consistently outperforms existing state-of-the-art methods in generative quality, reducing whole-image FID by 15.7%, and substantially improves downstream detection performance, boosting average mAP50 by 15.9% over the real-data baseline.

Abstract:
Generating simulationready tabletop scenes from task instructions is an intriguing and promising research direction in the field of Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, inevitably yielding object collisions or floating due to LLMs’ inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics–physics dual-system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, a fine-tuned LLM trained on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, which ensures the physical plausibility of scenes while preserves semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE successfully generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly enhances the physical validity of scenes over prior art.

Abstract:
Offpolicy evaluation (OPE) is essential for deploying reinforcement learning in safety-critical settings, yet existing estimators such as importance sampling and doubly robust (DR) often exhibit prohibitively high variance when rewards are sparse. In this work, we introduce Reward-Shaping Control Variates, a new family of unbiased estimators that leverage potential-based reward shaping to construct additional zero-mean control variates. We prove that shaped estimators always yields valid variance reduction, and that combining shaping-based and Q-based control variates strictly expands the variance-reduction subspace beyond DR and its minimax variant MRDR. Empirically, we provide a systematic regime map across synthetic chains, a cancer simulator, 5 single-stock and 1 multi-stock DOW-30 trading environments and an ICU-sepsis benchmark showing that shaping-based OPE consistently outperforms DR in sparse-reward settings, while a hybrid estimator achieves state-of-the-art performance across sparse, noisy, and misspecified environments. Our results highlight reward shaping as a powerful and interpretable tool for robust OPE, offering both theoretical guarantees and practical improvements in domains where standard estimators fail.

Abstract:
Biomedical imaging data presents enormous potential for deep learning models to predict invaluable properties, such as diseases and drug effects. However, unavoidable alterations of the technical conditions causebatch effects: variations between groups of samples that are not due to any biological signal of interest. Batch effects greatly hinder the generalization abilities of deep learning models, preventing their practical use in the real world. Unsupervised Domain Adaptation (UDA) methods have been proposed to mitigate batch effects, but they usually assume that the data is comprised of only one source domain and one target domain, whereas biological datasets are comprised of multiple domains, both at training and at inference time. While Batch Normalization–based testtime and meta-learning adaptation methods offer a promising mechanism for domain alignment, we show that existing approaches exhibit degraded performance under the usual inference scenarios of small target batch sizes and label shift. We address these limitations by leveraging negative control samples, which are consistently present in every experimental batch in biological datasets, as stable context for adaptation. We propose CS-ARM-BN, a meta-learning BN adaptation method that uses controls both during training and inference to stabilize domain statistics. We perform a suite of experiments of Mechanism-Of-Action (MoA) classification, a crucial task for drug discovery, on the large JUMP-CP imaging dataset. Our experiments show that CS-ARM-BN substantially improves robustness to batch size and class distribution shifts, enabling practical use of deep learning models for biomedical images.

Abstract:
Large Language Models have shown strong generalization across natural language tasks but remain underexplored for longitudinal clinical profiles. In sports antidoping, biological profiles are analyzed to support early detection of prohibited substance use and identification of anomalous biological patterns, both of which require joint modeling of temporal dynamics and metabolic relationships. We propose STT-LLM, a structural-temporal tokenization framework that adapts LLMs to longitudinal clinical analysis without modifying their backbone architectures. STT-LLM constructs biologically grounded structural-temporal embeddings and transforms them into LLM-compatible tokens via specialized tokenizers that explicitly encode pathway structure and temporal evolution. We evaluate STT-LLM on real-world longitudinal datasets from athletes, showing consistent improvements over native LLM tokenization strategies in sequence prediction and anomaly detection. In addition, we present a case study where STT-LLM provides contextual reasoning that aligns more closely with expert assessments compared to baseline models. These results highlight tokenization as a key bottleneck and opportunity for adapting LLMs to clinical data.

Abstract:
SecuringContextual Integrity(CI) is critical for privacypreserving Large Language Model (LLM) agent execution. However, existing agents struggle to balance the agility of direct generation against the prohibitive latency of CI-constrained thinking. To address this, we proposePrivGate, a framework that selectively invokes explicit reasoning based on internal privacy signals. Our approach is grounded in the discovery of aPrivacy Manifold, where models linearly encode privacy sensitivity within their residual streams, even during non-compliant generation. Leveraging this structure,PrivGateemploysLatent Gating, a training-free mechanism that triggers explicit reasoning only when high latent risk is detected, thereby optimizing the efficiency-privacy trade-off by minimizing unnecessary compute. On the real-world PrivacyLens benchmark,PrivGateachieves an out-of-distribution AUROC of 0.97 in risk identification, confirming the universality of the discovered manifold. End-to-end evaluations show thatPrivGateachieves a 70% reduction in privacy leakage with less than 5% overhead, offering a practical pathway to reconcile rigorous CI requirements with the performance demands of LLM agents.

Abstract:
Outdoor LiDAR generation has shown strong potential for autonomous driving and largescale 3D perception. However, existing approaches remain computationally intensive and primarily static, lacking explicit modeling of temporal dynamics. This limitation weakens spatiotemporal coherence and reduces the realism of 4D LiDAR generation. We propose a hierarchical recoupling generation framework that explicitly disentangles and reconstructs large-scale geometry and motion within a unified hierarchical structure. First, we design a multi-resolution feature scaffold that predicts time-correlated unsigned distance fields and spatial gradients, enabling hierarchical decomposition of 4D dynamics into static and motion-varying components. Next, to achieve compact yet expressive modeling, we introduce a neural contourlet representation that prunes redundant scaffolds into minimal directional bases, efficiently capturing essential geometric and motion cues. Finally, we progressively re-couple these hierarchical components to generate realistic and temporally coherent 4D LiDAR data. Extensive experiments demonstrate that our method outperforms baselines in both quality and consistency, achieving 3.3%, 25.0%, 17.8% improvements in FRD, MMD, and JSD, respectively, over the strong competitors, LiDMs and RangeLDM.

Abstract:
Graph Neural Networks (GNNs) are increasingly deployed on sensitive relational data, from social networks to healthcare records. However, their outputs can leak private graph structure, enabling linkstealing attacks that infer whether a connection between two entities existed in the training graph. While prior work demonstrates high average performance for such attacks, privacy is fundamentally a worst-case property, not an average one. The key question is whether an adversary can reliably compromise even a small set of critical links under strict precision constraints. We revisit posterior-only link-stealing attacks in a strict black-box setting and show that they remain effective at extremely low false-positive rates, revealing tail-risk vulnerabilities that current evaluations overlook. We further find that intra-class vulnerabilities are suppressed by geometric bottlenecks that collapse discriminative directions in posterior space. Building on this insight, we propose a geometry-aware reconditioning method that reshapes intra-class distances, substantially improving separability without harming reliability. Across six real-world graphs and multiple GNNs, this diagnostic correction achieves up to 2x higher success on intra-class pairs than generic attacks, redefining link-privacy evaluation as a tail-risk problem and revealing that posterior leakage remains substantially under-measured in current GNN deployments.

Abstract:
Graph anomaly detection (GAD) is a fundamental task in graph learning. However, most existing methods rely on the homophily assumption, which posits that connected nodes tend to share the same labels. This assumption often fails in the presence of edge heterophily, leading to degraded performance. We first observe that downweighting heterophilic edges, relative to the original or randomly weighted graphs, results in a more concentrated spectral energy distribution, thereby facilitating the learning of discriminative spectral embeddings. Moreover, existing methods typically embed graphs in Euclidean spaces, neglecting the importance of heterophily in manifold spaces. Motivated by these observations, we propose HSMAD, a novel framework for GAD. It consists of two key components: the Heterophily-Weighted Spectral Filtering module, which reconstructs the Laplacian using heterophily-based edge weighting for spectral filtering, and the Heterophily-Routed Manifold Update module, which routes neighborhood messages to the appropriate manifold for node feature updates, enabling curvature-adaptive representation learning. These spectral and geometric representations are jointly leveraged for anomaly detection. Extensive experiments on six real-world datasets show that HSMAD achieves state-of-the-art performance across the average F1-Macro, AUROC, AUPRC, and G-Mean. Specifically, the average F1-Macro score improves by 2.66% over the best-performing method.

Abstract:
Addressing Constrained Online Convex Optimization (COCO), we introduce CLASP (Convex Losses And Squared Penalties), a framework that minimizes cumulative loss together with squared constraint violations. We propose two variants of CLASP, CLASPI and CLASP-F, allowing for a joint or separate handling of the static decision set and the time-varying constraints, a decoupling flexibility that affords simpler implementations when projections onto the static decision set are easy. Our theoretical analysis departs from prior work by fully leveraging the variety of \emphcutter operators, and contraction properties such as the strongly quasi-nonexpansiveness, a proof strategy not previously applied in this setting. For convex losses, both CLASP algorithms achieve regret O\left(T^\max\\\beta,1-\beta\\\right) and cumulative squared penalty O\left(T^\\1-\beta\\\right) for any \beta \in (0,1). Most importantly, for strongly convex problems, we provide the first logarithmic guarantees on both regret and cumulative squared penalty: In the strongly convex case, both CLASP algorithms guarantee that the regret is upper bounded by O( \log T ) and the cumulative squared penalty is also upper bounded by O( \log T ).

Abstract:
Speech restoration in realworld conditions is challenging due to compounded distortions and mismatches between input and desired output rates. Most existing systems assume a fixed and shared input–output rate, relying on external resampling that incurs redundant computation and limits generality. We address this setting by formulating speech restoration under decoupled input–output rates, and propose TF-Restormer, a query-based asymmetric modeling framework. The encoder concentrates analysis on the observed input bandwidth using a time–frequency dual-path architecture, while a lightweight decoder reconstructs missing spectral content via frequency extension queries. This design enables a single model to operate consistently across arbitrary input–output rate pairs without redundant resampling. Experiments across diverse sampling rates, degradations, and operating modes show that TF-Restormer maintains stable restoration behavior and balanced perceptual quality, including in real-time streaming scenarios.

Abstract:
Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive onpolicy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, involving a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization. and (3) analysis on the effectiveness of probability and quality-driven dynamic divergence and fallback strategy. We empirically validate the performance gain of \modelname on a set reasoning benchmarks and the efficiency saving of GPU hours from 22% up to 43% of the sampling design for the trained models, meanwhile showing up to 40% reduction at trajectory-level and 35% at token-level sampling compute for the existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute.

Abstract:
LLM selfexplanations are often presented as a promising tool for AI oversight, yet their faithfulness to the model's true reasoning process is poorly understood. Existing faithfulness metrics have critical limitations, typically relying on identifying unfaithfulness via adversarial prompting or detecting reasoning errors. These methods overlook the predictive value of explanations. We introduceNormalized Simulatability Gain(NSG), a general and scalable metric based on the idea that a faithful explanation should allow an observer to learn a model's decision-making criteria, and thus better predict its behavior on related inputs. We evaluate 18 frontier proprietary and open-weight models, e.g., Gemini 3, GPT-5.2, and Claude 4.5, on 7,000 counterfactuals from popular datasets covering health, business, and ethics. We find self-explanations substantially improve prediction of model behavior (11-37% NSG). Self-explanations also provide more predictive information than explanations generated by external models, even when those models are stronger. This implies an advantage from self-knowledge that external explanation methods cannot replicate. Our approach also reveals that, across models, 5-15% of self-explanations are egregiously misleading. Despite their imperfections, we show a positive case for self-explanations: they encode information that helps predict model behavior.

Abstract:
Safe reinforcement learning (Safe RL) seeks to optimize longterm performance while ensuring adherence to safety constraints. However, most existing approaches address safety in a simplified manner, typically by linearly combining rewards and costs, which provides limited guidance when safety and performance interact in complex, nonlinear ways. We present USB-RL (Unsupervised Score-Balanced Reinforcement Learning), a model-based framework that learns implicit safety–performance preferences directly from experience. Our approach infers a monotone partial-order score through unsupervised pairwise comparisons of long-horizon outcomes, capturing nuanced trade-offs without relying on manually tuned cost weights. The learned score guides model-based policy optimization by dynamically balancing safety and performance, enabling flexible and adaptive multi-step planning in imagination-based control. Across diverse safety benchmarks, USB-RL achieves strong returns while substantially reducing safety violations, demonstrating stable and interpretable safety–performance trade-offs.

Abstract:
Integrating graph knowledge into Large Language Models (LLMs) via passive representation faces critical bottlenecks: limited context windows, unreliable numerical computation, and structural hallucinations. To solve this, we proposeGRASP(Graph Reasoning via Agentic Solving and Probing), shifting the paradigm from passive ingestion to proactive agentic exploration. By interleaving Neighbor Retrieval for ondemand probing with Code Interpreter as a deterministic solver, GRASP enables LLMs to autonomously navigate and compute over complex topologies. We employ a staged reinforcement learning strategy (GRPO) that transitions from visible tuning to a structure-blind environment, forcing the agent to develop genuine topological awareness. Evaluated on multi-domain graph reasoning benchmarks, our 4B model achieves a 53.06% average performance boost, surpassing SOTA baselines like DeepSeek-V3.2 and successfully generalizing to unseen tasks, with high potential for tackling sampling on million-node graphs and solving Hard-level LeetCode graph problems.

Abstract:
Does \emphComprehending the main idea of a 2hour movie and \emphCounting the birds appearing in a 15-second clip really warrant the same video processing pipeline? We present Task-Aware Mechanism (TAM), a hybrid-gated Mixture-of-Experts (MoE) vision tower that adapts frame count and resolution to the user query and video length. TAM introduces a compact 0.1B text-only \emphInductor trained on our TA-116K dataset to infer task types, enabling task-aware visual budgeting and routing: a soft-gated MoE vision encoder for stability, and hard-gated resolution-specific projectors/pipelines for efficient specialization. Built on Qwen2-7B, TallVA-8B-A7B achieves state-of-the-art performance among models with comparable LLMs across diverse video benchmarks and remains competitive against stronger-LLM baselines, showing that task-aware visual budgeting makes video understanding more holistic. The code is included in the supplementary material.

Abstract:
Reward feedback learning (ReFL) is effective for both textto-image (T2I) and text-to-video (T2V) generation with image reward models (RMs). However, image RMs are misaligned with temporal objectives of T2V, motivating ReFL with video reward models. Nevertheless, directly deploying video RMs is impractical due to their large parameter size and the prohibitive cost of memory. To address this, we propose VELR: an efficient framework that employs ensemble latent reward models (LRMs) to predict rewards directly in latent space, bypassing expensive backpropagation through VAE decoders and video RMs. Specifically, we introduce the ensemble technique for the LRM, which enhances capacity, quantifies uncertainty, and mitigates reward hacking. VELR achieves a reduction of up to 150GB in memory, requiring as little as 12.4% of the memory compared to standard ReFL. Experiments on OpenSora-1.2, CogVideoX-1.5, and Wan-2.1 with large-scale video RMs demonstrate that VELR achieves comparable performance as standard ReFL and enables efficient and robust video RM-based ReFL at scales previously unattainable.

Abstract:
Partial Differential Equations (PDEs) play a fundamental role in scientific computing, and recent efforts have sought to extend the success of foundation models to PDE solving. However, multiphysics PDE pre-training faces the unique challenge of disentangling dynamic heterogeneity to learn universal, elementary patterns that generalize to new PDEs. Additionally, cross-physics transfer lacks a theoretical framework for interpretability—specifically, understanding which pre-trained operator knowledge is effectively transferred to target PDEs. To bridge these gaps, we introduce the theory of neural operator splitting, which decomposes PDE evolution into a modulated global spectral operator and sparse local constitutive mechanisms. A key innovation is Origo, which provides a neural operator bank that enables the identification of operator-level generalization patterns. Extensive experiments demonstrate strong zero-shot generalization and mechanism-level interpretability on unseen PDEs.

Abstract:
Recent studies have attempted to finetune time-series foundation models to enhance a target dataset's forecasting performance. However, these approaches proceed without a clear criterion for identifying complex datasets that require fine-tuning due to performance degradation in zero-shot forecasting. To distinguish datasets that are more challenging from standard benchmarks, we introduce data-driven temporal complexity and multichannel complexity. Temporal complexity captures the difficulty of identifying distinct patterns by quantifying spectral entropy in the frequency domain, while multichannel complexity captures inter-channel dependencies by measuring the channel information flow impacting predictive uncertainty. These metrics serve as effective proxies for performance gains achievable through fine-tuning. Based on the two metrics, we develop Time-PEFT, a parameter-efficient fine-tuning framework that incorporates a frequency adapter for top-k filtering and a channel adapter for multichannel modeling. Time-PEFT is shown to significantly improve forecasting performance by up to 2.51 times compared with existing fine-tuning techniques on complex datasets.

Abstract:
Providing nonconservative uncertainty quantification for function estimates derived from noisy observations remains a fundamental challenge in statistical machine learning, particularly for applications in safety-critical domains. In this work, we propose novel non-asymptotic probabilistic uniform error bounds for kernel-based regression. Compared to related bounds in the literature that are restricted to (conditionally) independent sub-Gaussian noise, our bounds allow to consider a broad class of non-Gaussian distributions, such as sub-Gaussian, bounded, sub-exponential, and variance-bounded noise. Moreover, our results apply to correlated and uncorrelated noise. We compare our proposed error bounds with existing results in terms of the induced uncertainty region and their performance in safe control, demonstrating the tightness of the proposed bounds.

Abstract:
With the evolution of generative models, deepfakes have achieved nearperfect semantic realism, leaving forensic traces only in subtle structural anomalies. However, existing single-view paradigms often fail to generalize, as dominant semantic features overwhelm subtle artifact cues within entangled representations. This imbalance leads to overconfident yet brittle predictions—a phenomenon we term the Semantic Masking Effect. To address this challenge, we propose \Reliable Multi-View Evidential Learning for Deepfake Detection under a "Divide-and-Conquer" strategy. In the "Divide" phase, we employ Geometric View Purification to decompose the entangled representation space through principled geometric projection. This process suppresses semantic interference within artifact-sensitive representations, forming the foundation for independent semantic and artifact views. In the "Conquer" phase, we leverage Uncertainty-Aware Evidential Learning to synthesize these distinct views. By explicitly modeling the "epistemic conflict" between semantic and artifact cues, this mechanism provides calibrated uncertainty estimates instead of forcing rigid deterministic decisions. Extensive experiments across multiple benchmarks demonstrate that our method consistently outperforms existing approaches in generalization performance, while providing reliable uncertainty estimation for trustworthy deepfake detection.

Abstract:
Wasserstein metrics are increasingly adopted as similarity scores for images. We consider the sensitivity of Wasserstein metrics with respect to pixelwise additive noise when the images are treated as discrete measures on the pixel grid. We derive finite-sample expectation bounds for a Gaussian noise model. Among other results, we prove that the error in the signed 2-Wasserstein distance scales with the square root of the noise standard deviation. This is favorable compared to the Euclidean metric that scales linearly, and thus provides a theoretical basis for the benefits of optimal transport distances in noisy settings. We present experiments that support our theoretical findings and point to a peculiar phenomenon where increasing the level of noise can decrease the Wasserstein distance. A case study on cryo-electron microscopy images demonstrates that the Wasserstein metric can capture the geometry of the data manifold in high noise settings even when the Euclidean metric fails.

Abstract:
Generalization remains a central challenge in offline reinforcement learning (RL), where policies are trained solely from static datasets and must perform reliably under distribution shift. While most existing offline RL methods focus on reducing training loss using standard optimizers such as Adam, the role of loss landscape geometry particularly sharpness - has received little attention. Sharpness-Aware Minimization (SAM) has recently shown strong generalization benefits in supervised learning by favoring flatter minima. However, directly applying SAM to offline RL is non-trivial: unlike supervised settings with ground-truth labels, offline RL relies on bootstrapped targets, making sharpness estimation noisy and often destabilizing optimization. In this paper, we revisit offline RL from an optimization perspective and investigate how sharpness-aware optimization can be made effective in this setting. We propose Q bound weighted SAM (Q-SAM), a robust and scalable framework that treats sharpness as a weighted objective and selectively prioritizes samples that are most suitable for sharpness-aware optimization based on Q bounds. By aligning the SAM objective with the characteristics of bootstrapped value estimation, Q-SAM amplifies the benefits of sharpness minimization while preserving training stability. Extensive experiments on standard offline RL benchmarks demonstrate that Q-SAM consistently improves generalization performance across diverse datasets and algorithms. Our results highlight the importance of loss sharpness in offline RL and suggest optimizer design as a promising direction for developing more robust offline RL methods.

Abstract:
Brain network classification is pivotal for diagnosing neurological disorders, yet clinical interpretability and the identification of discriminative biomarkers fundamentally rely on precise functional parcellation. However, existing graph learning models for brain network analysis typically suffer from a critical limitation termed spectral unreachability. This stems from the widely used coupled encoderpooling architecture, where the inherent representation smoothing property of graph encoders (including Graph Neural Networks and Graph Transformers) inevitably corrupts the high-frequency topological signals essential for delineating sharp functional boundaries. To solve this issue, the Hierarchical Spectral Parcellation Network (HiSP-Net) is proposed. Adopting aproject-then-alignphilosophy, HiSP-Net structurally decouples partition learning from representation smoothing. Specifically, this model is constructed as a hierarchy of Spectral Parcellation blocks. Within each block, node (or module) representations are mapped directly via a topology-agnostic projection into a partition space to preserve high-frequency details, while a Topology-Aware Alignment mechanism enforces spatial coherence using a joint structural objective. Extensive evaluations on real-world datasets show the capability of HiSP-Net in achieving superior classification performance and extracting interpretable functional biomarkers. The source code is publicly available at https://anonymous.4open.science/r/HiSP-Net-demo-0F62/

Abstract:
Vision Transformers increasingly incorporate extra tokens beyond patch tokens—from class tokens for aggregation to register tokens for artifact mitigation. While effective for their intended purposes, these tokens typically lack semantic structure. We ask a more ambitious question: Can we design regularization constraints that transform extra tokens into disentangled representations, enabling them to decompose images into semantic parts (e.g., heads, bodies, legs) without explicit supervision? We propose XTRA, an intuitive yet powerful framework that augments Vision Transformers with dedicated ``factor tokens'' and enforces disentanglement via a novel Minimum Volume Constraint (MVC). A multistage aggregation process further enforces these factor tokens into semantically pure components, preventing token collapse that often occurs when training with MVC alone. On ImageNet-1K, XTRA achieves superior disentanglement (8.4× improvement in SEPIN@1 over DINOv2) while simultaneously improving representation quality: KNN accuracy improves by 5.8% and linear-probe accuracy by 2.3%.

Abstract:
Estimating causal quantities traditionally relies on bespoke estimators tailored to specific assumptions. Recently proposed Causal Foundation Models (CFMs) promise a more unified approach by amortising causal discovery and inference in a single step. However, in their current state, they do not allow for the incorporation of any domain knowledge, which can lead to suboptimal predictions. We bridge this gap by introducing methods to condition CFMs on causal information, such as the causal graph or more readily available ancestral information. When access to complete causal graph information is too strict a requirement, our approach also effectively leverages partial causal information. We systematically evaluate conditioning strategies and find that injecting learnable biases into the attention mechanism is the most effective method to utilise full and partial causal information. Our experiments show that this conditioning allows a generalpurpose CFM to match the performance of specialised models trained on specific causal structures. Overall, our approach addresses a central hurdle on the path towards all-in-one causal foundation models: the capability to answer causal queries in a data-driven manner while effectively leveraging any amount of domain expertise.

Abstract:
Linear attention architectures based on the Delta rule, such as DeltaNet and RWKV7, combine Transformers' training scalability with RNNs' inference efficiency and can provably solve regular language tasks. However, due to their fixed-size state, these models fundamentally struggle to capture the recursive, hierarchical structures that are intrinsic to natural languages. To bridge this gap, we introduce DeltaStack, a novel architecture that augments the associative memory of DeltaNet with a lightweight, differentiable stack. Unlike prior stack-augmented approaches that rely on sequential recurrence, DeltaStack formulates stack operations as linear delta-rule updates. This novel formulation enables a hardware-aware implementation that is fully parallelizable over sequence length, preserving the training efficiency of linear transformers. Theoretically, we prove that DeltaStack extends the expressivity of DeltaNet to model both regular and hierarchical languages. Empirically, our method outperforms DeltaNet and Stack-Attention on comprehensive formal language benchmarks. Furthermore, a 340M-parameter DeltaStack model trained on 15B tokens surpasses strong DeltaNet baselines in both perplexity and zero-shot downstream performance.

Abstract:
Quadratically regularization has emerged as a potential alternative to the popular entropic regularization in computational optimal transport, offering the theoretical advantage of producing sparse couplings through its hinge density structure. Despite recent progress in onedimensional setting and general upper bounds, fundamental questions about the localization rate of QOT optimizers around the Monge coupling have remained open. In this work, we establish a general lower bound showing that the support of the QOT optimizer cannot concentrate around the Monge graph faster than order \varepsilon^\frac1d+2 in the directed Hausdorff distance, matching the conjectured optimal exponent under standard regularity assumptions in Wiesel & Xu (2025). We also show that the QOT value gap controls the mean-squared deviation \mathbb E_\pi_\varepsilon||y-T(x)||^2 by the scale of \varepsilon^\frac2d+2. As a corollary, in the affine Brenier regime, which includes Gaussian-to-Gaussian transport, we derive a sharp pointwise tube bound of order \varepsilon^\frac1d+2 by reducing the problem to self-transport and applying recent self-transport sparsity results. Finally, we validate our theoretical bound with synthetic experiment in high dimensions setting.

Abstract:
Finding flat minima in the loss landscape is a key strategy for Domain Generalization (DG). However, its effectiveness is often limited by two crucial challenges. 1) Domain Shift: Existing methods like SharpnessAware Minimization (SAM) apply a uniform optimization strategy across all domains, overlooking the differences of the learning difficulties among multiple domains and thus performing poorly on challenging domains. 2) Anisotropic Sharpness: By perturbing parameters along a single gradient direction, SAM and its variants ignore multi-directional flatness, making the model converge to minima that remain sharp in other directions. The combined challenges make it more difficult for the model to find truly robust solutions in multi-domain scenarios. To overcome these limitations, we propose the Dual Adaptive Sharpness-Aware Minimization (DA-SAM), which comprises two key modules: Dynamic Adaptive Scaling (DAS) module and Adaptive Multi-Directional Flattening (AMDF) module. First, to tackle the domain shift problem, the DAS module computes the real-time loss on each domain to adaptively generate domain-specific scaling factors that guide the generation of perturbation directions. Second, the AMDF module calculates local flatness by generating multiple directions to simulate perturbations in the parameter space. Based on the learned local flatness metric, it dynamically adjusts the perturbation step size to guide the model parameters to be away from anisotropic sharp regions. Crucially, DAS provides domain-level guidance that makes AMDF’s multi-directional geometric exploration more targeted and effective. Extensive experiments on five DG benchmarks demonstrate the effectiveness of our DA-SAM algorithm.

Abstract:
The core challenge of TestTime Reinforcement Learning (TTRL) lies in estimating rewards without access to ground-truth supervision. Existing TTRL methods predominantly rely on majority voting to generate pseudo-labels, under the assumption that the most frequent answer among sampled trajectories is correct. However, we observe that this assumption frequently breaks down in complex reasoning tasks, where correct solutions often constitute a logical minority. As a result, rare yet correct trajectories are systematically undervalued by majority-voting-based approaches. To address this limitation, we propose Self-Reflective Test-Time Reinforcement Learning (SR-TTRL), a novel framework that leverages self-reflective verification to produce high-fidelity pseudo-labels. Specifically, given multiple sampled trajectories for a problem, SR-TTRL first groups trajectories according to their final answers and selects one representative from each group to form a candidate pool. Each candidate trajectory is then summarized to preserve its core reasoning steps while reducing verbosity. Finally, the model performs self-reflection over the candidate pool, critically evaluating and selecting the most plausible trajectory as the pseudo-label. Empirically, SR-TTRL achieves substantially higher pseudo-label fidelity and sample efficiency than prior majority-voting-based TTRL methods. Extensive experiments across diverse benchmarks and model families demonstrate that SR-TTRL consistently outperforms majority-voting baselines and significantly improves generalization to novel problems. For example, SR-TTRL improves the Pass@1 accuracy of Qwen3-8B on AIME24 from 29.1 to 55.8 (a gain of +26.7), exceeding standard TTRL by an additional +9.1.

Abstract:
Active mapping enables embodied agents to understand and interact in previously unseen environments. However, most methods struggle to achieve zeroshot generalization to large-scale scenes and lack support for language instructions. We propose a VLM-based active mapping method that achieves zero-shot mapping while facilitating language-driven human–agent interaction. First, we introduce a 360-BEV representation that integrates omnidirectional semantics with BEV-aligned geometric structure to enhance scene understanding. Second, we develop a candidate waypoint generation strategy that allows the VLM-driven agent to select informative 2D waypoints in image space and back-project them into executable metric actions in 3D space, enabling the VLM to plan in its strongest modality. Third, we design a VLM-based depth-first exploration agent that decomposes the scenes into explorable regions, selects informative waypoints within each region, and organizes them into a topological tree. The agent follows the depth-first exploration policy to achieve thorough coverage of large-scale scenes. Without task-specific training, our method outperforms the strongest baseline, improving coverage and AUC by approximately 13.25% and 14.00%, respectively, while enabling language-conditioned interaction.

Abstract:
Tensorvalued prediction is fundamental to geometric deep learning, yet uncertainty quantification (UQ) for such outputs remains an open challenge. While E(3)-equivariant neural networks excel at point estimates, they lack rigorous confidence measures. We introduce a general framework for E(3)-equivariant UQ, modeling the full predictive distribution where both mean and covariance preserve rotational symmetry. Our approach decomposes the covariance into irreducible representations \mathrmSym^2(\rho_c) \cong 2×(l=0) \oplus 2×(l=2) \oplus 1×(l=4). By mapping from the flat Lie algebra \mathfraksym(6) to the curved SPD manifold via matrix exponentiation, we strictly ensure positive-definite covariances while maintaining exact equivariance. Furthermore, we formulate a Log-Euclidean Equivariant Scoring Objective (LE-ESO)---a robust surrogate loss based on the Multivariate Laplace distribution---providing mathematical robustness to heavy-tailed errors and guaranteed stability. Extensive validation on ModelNet40 (inertia tensors) and large-scale materials science benchmarks (dielectric tensors) demonstrates that our method achieves competitive performance and provides physically consistent, symmetry-preserving uncertainty estimates with reliable OOD detection capabilities.

Abstract:
Motivated by the recency effect in online learning, we study algorithms for singlepass \emphsliding-window streaming multi-armed bandits (MABs) in this paper. In this setting, we are given n arms with unknown sub-Gaussian reward distributions and a parameter W. The arms arrive in a single-pass stream, and only the most recent W arms are considered valid. The algorithm is required to perform pure exploration and regret minimization with \emphlimited memory, \textcolorreddefined as the number of stored arms. The model is a natural extension of the streaming multi-armed bandits model (without the sliding window) that has been extensively studied in recent years. We provide a comprehensive analysis of both the pure exploration and regret minimization problems with the model. For pure exploration, we prove that finding the best arm is hard with sublinear memory while finding an \emphapproximate best arm admits an efficient algorithm. For regret minimization, we explore a new notion of regret and give sharp memory-regret trade-offs for any single-pass algorithms. We complement our theoretical results with experiments, demonstrating the trade-offs between sample, regret, and memory.

Abstract:
Spiking Neural Networks (SNNs) emulate the spiking behavior of biological neurons and are promising for energyefficient neuromorphic computing. A widely used strategy to train SNNs is to convert pretrained Artificial Neural Networks (ANNs), where the accuracy and efficiency are determined by the spike encoding scheme. Traditional methods based on spike count or timing severely underutilize the available encoding space, leading to large accuracy degradation under low-timestep constraints. More expressive alternatives involve complex dynamics, which hinder scalability and practical deployment. To address these challenges, we propose Temporal Weighted Encoding (TWE). Through a simple recursive integration, spikes are implicitly assigned exponentially decaying weights, drawing an analogy to a temporal bit sequence. We systematically analyze the temporal mismatch caused by this weight pattern and propose temporal relaxation and threshold relaxation to resolve this issue, enabling fast and accurate activation encoding. Extensive experiments demonstrate that TWE achieves negligible conversion loss with significantly fewer timesteps, offering a scalable and efficient solution for SNN deployment.

Abstract:
Reinforcement learning (RL) excels on tasks with verifiable rewards, but in openended tasks, the reliability of reward models remains a key challenge. Existing solutions either depend on costly proprietary LLM-as-a-Judge systems or opaque scalar reward models that lack interpretability. Recent works on generative reward models offer a promising alternative, but they remain constrained by static evaluation criteria, fragmented evaluation paradigms, and limited multilingual support. To address these challenges, we introduce MixReward, a large-scale multilingual dataset spanning six domains and 103 languages, containing both pairwise and listwise data, and propose UniRRM, a unified reasoning reward model supporting multiple languages and evaluation paradigms. UniRRM uses a staged reasoning chain to dynamically generate task-generic and instruction-specific criteria, enabling fine-grained, input-adaptive judgments while maintaining consistency across languages. Experiments demonstrate that UniRRM-8B and UniRRM-14B achieve performance close to the state-of-the-art for models of comparable size across multiple benchmarks, and are effective for unseen evaluation paradigms. In addition, ablation studies validate the reliability and effectiveness of UniRRM.

Abstract:
As pretrained models evolve rapidly, transferring fine-tuning knowledge to updated models without retraining has become a critical challenge. Most existing methods reuse parameter updates, yet the same dataset can induce substantially different updates across base models due to mismatched local loss landscapes, making such transfer unstable. We instead adopt a Bayesian-updating perspective: a base model defines a prior, while fine-tuning contributes a task-update factor that is prior-agnostic, thereby making it feasible to reuse the update across base models. Specifically, we formalize a reusable task-update factor by requiringinvariance across base modelsanda fixed-dimensional parameterization. Our main theoretical result shows that such reusable factors exist when the variational family is a half-space, and it is already maximal among convex families. In particular, an ideal regime arises when the priors and their Bayesian posteriors remain within a shared exponential family, as it always admits a reusable update factor. Building on this existence, we propose Bayesian Task UpdateTransfer(BTransfer), which extracts a reusable task-update factor from a single fine-tuning run and applies it to a new prior. For deep networks, we implementBTransferwith a ``lift–transfer–return'' pipeline: 1) lift model parameters to distributions; 2) transfer the extracted task-update factor in the exponential family distributions; and 3) return the updated posterior distribution to parameter space. Extensive experiments demonstrate that our approach effectively reuses fine-tuning knowledge across models without post-training.

Abstract:
Existing EEG models are limited by electrode heterogeneity and rigid "channelfirst" architectures that treat sensors as independent features. We propose Brain Signal Rendering (BSR), which reinterprets EEG as a physical projection of neural activity and transforms raw signals into geometry-aware Spectrum Videos. By utilizing VideoMAE for self-supervised pre-training, BSR learns robust, layout-agnostic spatiotemporal representations that preserve neural topology. We further introduce subject-level few-shot learning and cross-montage fine-tuning to rigorously evaluate generalization across subjects and electrode configurations. Experiments show that VideoMAE model integrated with the BSR framework significantly outperforms state-of-the-art spectrum based methods, providing a scalable and data-efficient foundation for generalizable EEG modeling.

Abstract:
The scalability of world models is currently bottlenecked by the scarcity of action annotations. While selfsupervised latent action learning offers a potential solution, existing single-view paradigms—relying on information bottlenecks or Vector Quantization (VQ)—often conflate superficial 2D pixel displacements with the underlying physical-spatial dynamics of an action. Consequently, these methods remain highly susceptible to view-dependent noise, such as camera shake. We introduceMuCoLA(Multi-viewConsistentLatentAction learning), a framework that learns robust, view-invariant action representations by enforcing semantic consistency across synchronized video streams. MuCoLA utilizes a Student-Teacher network with DINO-style self-distillation to align action distributions across viewpoints, effectively filtering high-frequency visual noise while preserving motion semantics. Theoretical analysis reveals that our multi-view objective functions as a spectral filter, isolating agent dynamics from environmental nuisances. Empirically, MuCoLA significantly outperforms baselines in action regression, video reconstruction, and downstream visual control tasks. Furthermore, we demonstrate that MuCoLA exhibits favorable scaling properties with respect to model capacity and data volume, paving the way for large-scale action-free world modeling.

Abstract:
In this paper, we present a novel federated independence testing method that addresses both theoretical and practical challenges arising from client heterogeneity. We begin by revisiting existing federated independence testing methods and showing why they fail to provide valid guarantees or maintain statistical power under data distributional shift across clients. Building on this analysis, we develop a copulabased marginal alignment technique together with a stacking-based aggregation strategy that amplifies intra-client dependence while mitigating inter-client variation, resulting in a theoretically sound and powerful global test. For practicality, we further accelerate the aggregation step and incorporate a privacy-preserving mechanism. On the theoretical side, we prove both the correctness of our method and the validity of the test. Empirically, we conduct extensive experiments on both synthetic and real-world datasets, which demonstrate the superiority of our solution over existing methods.

Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable success in instructionfollowing tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle with fine-grained visual grounding due to semantic entanglement in visual patch representations, where individual patches blend multiple distinct visual elements, making it difficult for models to focus on instruction-relevant details. To address this challenge, we propose MoDA (Modulation Adapter), a lightweight module that enhances visual grounding through instruction-guided channel-wise modulation. Unlike token-level methods such as Q-Former that perform additive feature selection, MoDA operates at the channel level through multiplicative modulation on already-aligned features, enabling fine-grained control over which embedding dimensions are relevant for each instruction. MoDA applies cross-attention between language instructions and pre-aligned visual features, generating dynamic modulation masks without architectural modifications or additional supervision. We evaluate MoDA on recent baselines including LLaVA-MoRE (2025) across 12 diverse benchmarks spanning visual question answering, vision-centric reasoning, and hallucination detection, including recent 2024 benchmarks (MMVP, CV-Bench, MMStar, RealWorldQA). MoDA achieves substantial improvements of +12.0 points on MMVP and +4.8 points on ScienceQA, outperforming baselines on all 12 benchmarks with minimal overhead (<1 FLOPs).

Abstract:
Diffusionbased large language models generate text by gradually filling in masked tokens, yet they remain slow because they usually decode only a few tokens per step. Parallel decoding, which unmasks multiple tokens simultaneously, promises acceleration but often degrades quality when too many tokens are predicted at once. We identify the root cause: when decoding is viewed as iterative token grouping, overly permissive grouping places interdependent tokens in the same step, violates the conditional independence assumption, and amplifies reliance on noisy context even when the top prediction is already correct. We introduce SPEED, a framework that enlarges safe parallel groups through complementary training and inference designs. At training time, a sharpened teacher distillation objective selectively aligns the student to teacher-correct positions using a temperature-scaled KL term together with a masked language modeling loss, producing a student that assigns more probability mass to correct token identities and elevates more positions above the decoding threshold. At inference time, Slow–Fast Decoding partitions tokens by sensitivity to revealed context using token-wise Jensen–Shannon Divergence computed with and without access to the preceding block, decoding high-sensitivity tokens jointly while deferring low-sensitivity tokens until sufficient context resolves them. Through extensive experiments, our framework attains up to 12.2× speedup on LLaDA-8B-Instruct and 6.7× on Dream-7B-Instruct with accuracy close to greedy decoding across standard reasoning and code benchmarks.

Abstract:
Existing symbolic regression approaches primarily focus on learning explicit inputoutput mappings, often neglecting relational structures among data instances. This paper introduces Contrastive Symbolic Regression (CSR), a feature-construction-based symbolic regression approach that integrates evolutionary feature construction with contrastive learning to shape a representation space where geometric proximity reflects similarity in the target space. CSR employs a contrastive objective that optimizes a linear transformation of constructed features, with a closed-form solution for aligning the feature space with the target space. The constructed features are applied to K-nearest neighbor regression, where we propose an efficient leave-one-out cross-validation (LOOCV) method to address standard LOOCV's computational expense, along with a linear-rank weighted K-nearest neighbor variant for adaptive selection of the neighborhood size and faithful assessment of representation quality during evolution. A determinantal point process-based ensemble selection mechanism further enhances robustness by jointly considering model quality and diversity. Extensive experiments on 58 real-world regression datasets demonstrate that CSR consistently surpasses both traditional symbolic regression and modern machine learning counterparts, highlighting CSR as a promising direction for interpretable and effective regression modeling.

Abstract:
Large Language Models (LLMs) are increasingly vulnerable toPrompt Injection (PI)attacks, where adversarial instructions hidden within retrieved contexts hijack the model's execution flow. Current defenses typically face a critical tradeoff:prevention-basedfine-tuning often degrades general utility via the "alignment tax", whiledetection-basedfiltering incurs prohibitive latency and memory costs. To bridge this gap, we proposeRedVisor, a unified framework that synthesizes the explainability of detection systems with the seamless integration of prevention strategies. To the best of our knowledge, RedVisor is the first approach to leverage fine-grained reasoning paths to simultaneouslydetectattacks andguidethe model's safe response. We implement this via a lightweight, removable adapter positioned atop the frozen backbone. This adapter serves a dual function: it first generates an explainable analysis that precisely localizes the injection and articulates the threat, which then explicitly conditions the model to reject the malicious command. Uniquely, the adapter is active only during this reasoning phase and is effectively muted during the subsequent response generation. This architecture yields two distinct advantages: (1) it mathematically preserves the backbone's original utility on benign inputs; and (2) it enables a novelKV Cache Reusestrategy, eliminating the redundant prefill computation inherent to decoupled pipelines. We further pioneer the integration of this defense into the vLLM serving engine with custom kernels. Experiments demonstrate that RedVisor outperforms state-of-the-art defenses in detection accuracy and throughput while incurring negligible utility loss.

Abstract:
Openvocabulary 3D segmentation is crucial for real-world applications, yet existing methods are constrained by fragmented masks and inconsistent captions in dataset generation, and by multi-stage pipelines prone to error propagation. We present SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset with 846K instances from 15K scenes, and SpaCeFormer (Space-Curve Transformer), a proposal-free segmentation architecture. Our data pipeline leverages multi-view mask clustering to produce geometry-consistent 3D instances and employs multi-view VLM prompting for view-consistent captions. On the modeling side, SpaCeFormer combines spatial window attention with Morton curve serialization for spatially coherent features, and a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200, our approach achieves 11.1 zero-shot mAP, a 2.8× improvement over prior proposal-free methods while requiring only 0.21 seconds per scene.

Abstract:
Incomplete multiview clustering (IMVC) aims to uncover shared cluster structures from data with partially observed views. Although recent imputation-free methods based on variational inference demonstrate robustness to missing views, they commonly rely on a conditional independence assumption across views, which fails to capture the inherently structured and potentially correlated nature of multi-view data. In this paper, we propose a variational framework that explicitly goes beyond this assumption by introducing a learnable cross-view correlation structure. Specifically, we explicitly model and learn correlations between views by utilizing the covariance structure of posterior estimation errors. To facilitate robust and efficient learning, the correlation matrix is parameterized through a normalized Cholesky decomposition, ensuring positive definiteness and enabling the entire model to be trained jointly through a unified variational objective. Extensive experiments on multiple IMVC benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches across a wide range of missing-view settings. These results highlight the effectiveness of adaptive correlation modeling in variational IMVC, demonstrating the need to go beyond the independence assumption in IMVC.

Abstract:
Fairness is a crucial concern for generative models, which not only reflect but can also amplify societal and cultural biases. Existing fairness notions for generative models are largely adapted from classification and focus on balancing the probability of generating samples from each sensitive group. We show that such criteria are brittle, as they can be met even when different sensitive groups are modeled with widely varying quality. To address this limitation, we introduce a new fairness definition for generative models, termed as \emphequalized generative treatment (EGT), which requires comparable generation quality across all sensitive groups, with quality measured via a reference fdivergence. We further analyze the trade-offs induced by EGT, demonstrating that enforcing fairness constraints necessarily couples the overall model quality to that of the most challenging group to approximate. This indicates that a simple yet efficient min–max fine-tuning method should be able to balance f-divergences across sensitive groups to satisfy EGT. We validate this theoretical insight through a set of experiments on both image and text generation tasks. We demonstrate that min-max methods consistently achieve fairer outcomes compared to other approaches from the literature, while maintaining competitive overall performance for both tasks.

Abstract:
Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised finetuning (SFT) to ingest labeled data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial capabilities are already present in pre-trained LRMs but require alignment through principles of internal logical coherence. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal Chain-of-Thought (CoT) process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers—reward functions that check for geometric and semantic consistency under transformations like flipping or swapping the order of objects in the question—and optimizing them via our new OT-GRPO strategy, a minimal-consistency matching variant of group relative policy optimization, we demonstrate that models can self-correct their spatial logic. Our results show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and domains.

Abstract:
Existing machine unlearning methods primarily focus on removing data influence after training completes, which is effective for many scenarios, but a complementary capability is needed when removal requests arise during ongoing training. We propose LMCleaner, an efficient and certified \emphonline unlearning framework that can process unlearning requests at any training step without waiting for training completion. Our key insight is that influence propagation can be decomposed into a trust region where linear approximation is accurate, and a residual that concentrates in a lowdimensional subspace and can be efficiently masked by calibrated noise. Building on this insight, we design an influence propagation truncation mechanism that treats mini-batch influence as atomic units, computes influence within a truncation window for efficient removal, and injects subspace-aware noise for certified privacy. Our theoretical analysis proves that the truncation residual decays exponentially with window size and that the unlearned model is (\varepsilon, \delta)-indistinguishable from retraining. Experiments demonstrate that LMCleaner achieves over 100× computational savings compared to baselines while maintaining model utility and defending against membership inference attacks.

Abstract:
Large VisionLanguage Models (LVLMs) exhibit sophisticated reasoning but remain susceptible to object hallucination. Deviating from the prevailing attention intensity assumption, we reveal a deeper dynamic structural misalignment: hallucination is triggered at decision-critical steps where specific attention heads, acting as risky mediators, decouple from visual evidence to lock onto language priors. This establishes a pathological shortcut that bypasses visual grounding. To dismantle this, we propose Fox (Faithfulness and Observational-flow via eXpression-rectification), a training-free inference-time framework. Fox diagnoses structural misalignment using a visual attention entropy probe to localize risky mediators unsupervisedly. We then execute a targeted causal intervention via numerical logit saturation to physically sever the shortcut path. Finally, a conflict-gated cooperative decoding strategy reconciles interventional faithfulness with observational fluency. Extensive experiments demonstrate that Fox achieves SOTA performance, outperforming SID by 29.1% while preserving linguistic richness and general reasoning capabilities.

Abstract:
We investigate learning dynamics in competitive newsvendor games, a class of continuousaction games with strategic substitutes. Despite established equilibrium properties, convergence of independent learning algorithms in repeated general-sum play remains uncertain. We analyze structural properties under complete and incomplete information, deriving closed-form equilibria for a symmetric complete-information benchmark with perfect substitution. Our main theoretical contribution proves strict monotonicity in both complete-information and Bayesian models with private costs, ensuring equilibrium uniqueness and ruling out unstable dynamics. This provides convergence guarantees for variational-inequality-based algorithms. Numerical experiments using deep reinforcement learning agents with Proximal Policy Optimization empirically demonstrate convergence to Nash and Bayesian Nash equilibria, verified by equilibrium checks. These results establish a foundation for applying deep reinforcement learning in competitive inventory management.

Abstract:
Diffusion policies offer a powerful and expressive parameterization for continuous control. Yet, their integration with reinforcement learning remains conceptually and algorithmically challenging. In this work, we address this gap by introducing a noisyspace action-value (Q-)function that assigns values to diffusion latents through the distribution of executed actions induced by the denoising process. We show that this construction admits a precise semantic interpretation and derive a noisy-space policy gradient (NSPG) in which value estimates for noisy latents are computed exclusively using clean action-space values. Building on this result, we formulate a KL-regularized policy improvement over noisy latents and show that the resulting objective admits a diffusion-compatible regression form, avoiding backpropagation through the denoising process. Experiments on the D4RL benchmark demonstrate that semantically grounded value gradients provide a principled, effective foundation for training diffusion policies in offline reinforcement learning.

Abstract:
With the rapid emergence of personal AI agents based on Large Language Models (LLMs), implementing them ondevice has become essential for privacy and responsiveness. To handle the inherently personal and context-dependent nature of real-world requests, such agents must ground their generation in device-resident personal context. However, under tight memory budgets, the core bottleneck is what to store so that retrieval remains aligned with the user. We propose EPIC (Efficient Preference-aligned Index Construction), which focuses on user preferences as a compact and stable form of personal context and integrates them throughout the RAG pipeline. EPIC selectively retains preference-relevant information from raw data and aligns retrieval toward preference-aligned contexts. Across four benchmarks covering conversations, debates, explanations, and recommendations, EPIC reduces indexing memory by 2,404×, improves preference-following accuracy by 20.17%p, and achieves 33.33× lower retrieval latency over the best-performing baseline. In our on-device experiment, EPIC maintains a memory footprint under 1 MB with 27.9 ms/query retrieval latency in streaming updates. The code is available at

Abstract:
Despite rapid advances in textto-video generation, state-of-the-art generative models still suffer from producing temporally incoherent and unrealistic motion for videos. The key weakness of existing works is that they commonly treat videos as frame sequences and directly adopt Flow Matching objectives, which are originally designed for images. This practice fails to explicitly model motion priors or temporal dependencies, resulting in suboptimal dynamics that may appear incoherent and unrealistic. To solve this problem, we propose Temporal-aware Flow Matching (TFM), a novel training paradigm that embeds inter-frame constraints into the flow objective, leading to temporally coherent motion modeling in video generation. More specifically, the proposed TFM enforces temporal correlations across frames while retaining the desirable properties of Flow Matching, and further introduces a residual-type loss that aligns naturally with this new flow. We theoretically prove that models trained with TFM are able to exhibit remarkably enhanced temporal perception ability and better capture motion dynamics. Notably, TFM imposes no additional cost during inference and is applicable to any model using Flow Matching. Extensive experiments demonstrate that our TFM can significantly improve motion realism across diverse motion types. Generated videos are presented at https://tfm-2026.github.io.

Abstract:
Transformers have demonstrated remarkable incontext learning (ICL) capabilities. The strong ICL performance of transformers is commonly believed to arise from their ability to implicitly execute certain algorithms on the context, thereby enhancing prediction and generation. In this work, we investigate how transformers with softmax attention perform in-context learning on linear classification data. We first construct a class of multi-layer transformers that can perform in-context logistic regression, with each layer exactly performing one step of normalized gradient descent on an in-context loss. Then, we show that our constructed transformer can be obtained through (i) training a single self-attention layer supervised by one-step gradient descent, and (ii) recurrently applying the trained layer to obtain a looped model. Training convergence guarantees of the self-attention layer and out-of-distribution generalization guarantees of the looped model are provided. Our results advance the theoretical understanding of ICL mechanism by showcasing how softmax transformers can effectively act as in-context learners.

Abstract:
Over the years, the unsupervised and supervised learning research directions of tensor networks (TNs) have mainly developed in parallel. In this paper, we provide a view for their cooperative advancement through a novel mixed tensor network learning (MTNL) framework that unifies the two fields. Specifically, inspired by supervised TN learning tasks, multiple TNs are fused in a deepnetwork style in MTNL, enhancing the expressive power for the unsupervised TN learning tasks. We then develop a more flexible TN structure search prior with theoretical guarantees for learning multiple TN structures, aligning with trends in many supervised learning setups. More interestingly, by combining these components within a Bayesian framework, we show that MTNL induces a lightweight uncertainty quantification mechanism that is theoretically guaranteed by its connection to the dropout-based counterpart problem, making the mechanism a potential alternative for large-scale learning problems. Finally, we demonstrate the effectiveness of the MTNL framework on tensor recovery, parameter-efficient fine-tuning, and tensor regression experiments.

Abstract:
SimulationBased Inference (SBI) is critical for scientific discovery, with generative models offering a promising path toward efficient inference. However, existing methods struggle with effective multimodal modeling. They often rely on brute-force fusion strategies that ignore the structural disparities between parameters and observations, thus limiting estimation fidelity. In this work, we introduce FUSE (Feynman-Kac steered mUlti-modal flow matching for efficient Simulation-based posterior Estimation). Unlike prior work, FUSE employs a dual-track architecture that preserves the distinct features of multimodal inputs while facilitating dynamic interaction. Additionally, we propose an FK-steered sampling strategy that leverages intermediate observation likelihoods to guide the generative trajectories, effectively improving the sample quality during inference. Our approach outperforms state-of-the-art baselines on standard SBI benchmarks, producing posteriors that closely match ground-truth MCMC. Furthermore, in a real-world exoplanet orbital estimation task, FUSE successfully resolves complex parameter degeneracies that challenge existing methods, highlighting its potential to accelerate complex scientific discoveries in astrophysics and beyond.

Abstract:
Neural Forecasters (NFs) have become a cornerstone of Longterm Time Series Forecasting (LTSF). However, recent progress has been hampered by an overemphasis on architectural complexity at the expense of fundamental forecasting principles. In this work, we revisit the principles of LTSF. We begin by formulating a Variance Reduction Hypothesis (VRH), positing that generating and combining multiple forecasts is essential to reducing the inherent uncertainty of NFs. Guided by this, we propose Boosted Direct Output (BDO), a streamlined paradigm that synergistically hybridizes the causal structure of Auto-Regressive (AR) with the stability of Direct Output (DO), while implicitly realizing the principle of forecast combination within a single network. Furthermore, we address the critical validation-test generalization gap by employing parameter smoothing to stabilize optimization. Extensive experiments demonstrate that these trivial yet principled improvements enable a direct temporal MLP to outperform recent, complex state-of-the-art models in nearly all benchmarks, without relying on intricate inductive biases. Finally, we empirically verify our hypothesis, establishing a dynamic performance bound that highlights promising directions for future research. The code for review is available at: \urlhttps://anonymous.4open.science/r/ReNF-A151.

Abstract:
Autonomous vehicles operating in openworld environments must continually adapt to rare long-tail scenarios while preserving previously acquired driving skills. However, existing trajectory planning approaches struggle with this stability-plasticity trade-off, as they rely on static models or rigid rule-based controllers that cannot robustly handle evolving and complex traffic dynamics. Against this background, we proposeNOMAD, a lifelong trajectory planning framework that integrates non-parametric Bayesian memory with diffusion-based trajectory generation, enabling continuous adaptation to long-tail scenarios without catastrophic forgetting. Our method maps continuous scene contexts to a dynamically growing set of discrete memory clusters, which guide a conditional diffusion model to function as a mixture of experts specialized for diverse driving behaviors. To retain past knowledge during incremental learning, we introduce a generative replay mechanism that synthesizes pseudo-experiences from previously learned memory clusters. Extensive closed-loop evaluations on the nuPlan benchmark demonstrate that our approach achieves state-of-the-art performance on long-tail scenarios, improving the interPlan score by9.4%over the strongest baseline, while maintaining competitive performance on regular driving benchmarks. Moreover, our method exhibits robust continual learning capability, achieving the highest average closed-loop score with positive backward transfer when adapting to sequentially introduced long-tail scenarios.

Abstract:
Medical image segmentation requires balancing global context with computational efficiency, where selfattention mechanisms suffer from quadratic \mathcalO((HW)^2 C) complexity. We propose S2M-Net, a parameter-efficient architecture (4.7M parameters) that achieves computational savings through Spectral--Spatial Token Mixing (SSTM). SSTM achieves \mathcalO(HWC^2) complexity through efficient combination of \mathcalO(HWC \log(HW)) frequency-domain processing and \mathcalO(HWCd) bottlenecked spatial gating (d=16), exploiting spectral concentration where >93% of energy is captured by K=32 low-frequency components (～0.8% of the spectrum at 352×352 resolution). This design avoids self-attention's prohibitive \mathcalO((HW)^2C) attention map computations while preserving global receptive fields. To handle geometric diversity, we introduce Morphology-Aware Adaptive Segmentation Loss (MASL), which automatically modulates five loss objectives based on per-sample morphological descriptors (tubularity, compactness, irregularity, and scale). Evaluation across 15 datasets spanning 8 modalities demonstrates competitive performance, obtaining the best performance on 14 of 15 datasets, with statistically significant improvements (p < 0.0033, Bonferroni-corrected) on 7 challenging tasks (complex morphology, class imbalance, and multi-class segmentation), and clinically meaningful gains (0.5--1.6% Dice) on 8 mature benchmarks. Notably, S2M-Net achieves 83.43% Dice on EndoVis17 multiclass instrument segmentation (+8.69% over TransUNet and +9.14% over the best baseline UMamba at 74.29%), while using 12.8× fewer parameters (4.7M vs.\ 60M).

Abstract:
We introduce a practical, blackbox framework termed Detection Augmented Learning (DAL) for the problem of piecewise stationary bandits without knowledge of the underlying non-stationarity. DAL accepts any stationary bandit algorithm with order-optimal regret as input and augments it with a change detector, enabling applicability to all common bandit variants. Extensive experimentation demonstrates that DAL consistently surpasses all state-of-the-art methods across diverse non-stationary scenarios, including synthetic benchmarks and real-world datasets, underscoring its versatility and scalability. We provide theoretical insights into DAL's strong empirical performance, complemented by thorough empirical validation.

Abstract:
Multiinstance partial-label learning (MIPL) is a recently proposed learning paradigm to address tasks that multi-instance bags are associated with a candidate label set comprising one ground-truth label and several false positive labels. Existing MIPL methods rely on simple instance level information, and can hardly find the key instances under noisy labels. In this paper, we propose a novel algorithm termed AGOPMIPL, i.e., Average Gradient Outer Product based Multi-Instance Partial-Label Learning to address the problem. AGOP derives a data-dependent metric in the embedding space by computing the outer product of classifier gradients, which stretches discriminative feature dimensions and facilitates more accurate key instance identification. Moreover, AGOP aggregates gradient information across all training samples, providing inherent robustness to label noise. Additionally, we introduce a progressive label disambiguation strategy that gradually refine the learning targets. Experimental studies on benchmark and real-world datasets demonstrate the superiority of AGOPMIPL over existing MIPL methods.

Abstract:
AI Scientists have shown promising progress across multiple stages of the research pipeline, among which automatic scientific paper writing remains a formidable challenge. The Introduction writing is especially challenging, which demands not only linguistic fluency, but logical soundness and verifiable faithfulness. Most AIassisted methods treat the task as text generation instead of reasoning and structuring, leading to severe drawbacks, e.g., hallucinating citations. To address this, we first formulate the Content-Conditional Introduction Generation (CCIG) task, which requires grounding the Introduction in the paper's core evidence. We then propose LECTOR, a novel Logic-Expression Co-Reinforcement Learning framework that can strictly follow the scientist's logic, add high-quality citations and keep structured expressions. LECTOR first constructs a logic-reasoning graph from the paper's main body to serve as a verifiable logical blueprint. Subsequently, it employs a Logic-Expression Co-Rewarding mechanism to jointly optimize for both the graph's structural fidelity and the final narrative's quality. We conduct a dataset from Nature Communications papers to assess our method. Extensive experiments show consistent improvements in both logic fidelity and Introduction generation quality metrics, e.g., Graph Quality (+26.7%), Citation Quality (+8.6%), and Paper Consistency (+3.3%). The datasets, code, and pretrained models shall be released.

Abstract:
Recent years have witnessed a surge of interest in federated learning. In particular, federated continual learning (FCL) emerged as an effective approach that enables clients with evolving, nonstorable data to engage in collective learning. Among FCL approaches, replay-based methods excel by mitigating data storage constraints through synthetic data generation. However, existing replay-based methods overlook spatial and temporal information inherent in FCL settings, leading to suboptimal model performance. For instance, spatial variation in COVID-19 prevalence across hospitals in different states (e.g., Delta surging in Florida vs. Omicron in New York) and the temporal evolution of regional outbreaks are critical information for accurately distinguishing between COVID variants. This paper presents GFedCL to address this limitation. GFedCL is a new FCL approach that (1) constructs spatial- and temporal-aware relational graphs with attention mechanisms, and (2) uses the graphs, combined with generative adversarial learning, to generate high-quality synthetic data. GFedCL can generate synthetic data that matches the expectation of real data distribution while preserving privacy with theoretical guarantees. GFedCL consistently outperforms state-of-the-art FCL methods, gaining 27.95% improvement on TinyImageNet.

Abstract:
In adaptive experiments, statistical inference is essential for reliable decisionmaking and scientific discovery. Often in these settings, collecting labeled data is expensive, but decision-makers have access to large unlabeled datasets and strong pretrained AI models that can generate outcome predictions. Effectively leveraging these predictions in online experiments poses fundamental challenges for statistical inference: AI models may be misspecified, and data collected under adaptive policies are inherently non-i.i.d., invalidating classical inference techniques. To address these challenges, we propose a Prediction-Powered Adaptive Inference (PPAI) estimator that integrates unlabeled data, predicted labels, and adaptively collected labeled data through a single estimating equation. We establish asymptotic normality of the PPAI estimator under mild conditions on the data-collection policy, enabling valid confidence intervals and hypothesis tests for a broad class of Z-functionals. The method incorporates a data-driven tuning mechanism that adaptively weights AI predictions according to their informativeness, guaranteeing that the resulting asymptotic variance is no worse than that of the labeled-only baseline, and is strictly smaller when predictions are informative. Numerical experiments further support the theory, illustrating efficiency gains with informative AI predictions and robust performance when predictions are inaccurate.

Abstract:
Stochastic programming is often challenged by epistemic uncertainty, where critical probability distributions are poorly characterized or unknown due to a lack of data. To address this, we pioneer a novel framework for stochastic programming that minimizes an upper confidence bound (UCB) on the expected random cost, acting as a robustnessseeking strategy. Our central contribution is the Average Percentile Upper Bound (APUB), a new statistical construct that serves as both a statistically rigorous upper bound for population means and an approximate risk metric for sample means. We rigorously prove the asymptotic correctness and consistency of APUB, establishing a reliable foundation for data-driven decision-making. We also develop practical solution methods, including a bootstrap sampling approximation method and an L-shaped method, to solve APUB optimization problems, with a specific focus on two-stage linear stochastic optimization with random recourse. Empirical demonstrations on a two-stage product mix problem reveal the significant benefits of our APUB optimization framework, which fortifies the process against epistemic uncertainty while reinforcing key decision-making attributes like reliability and consistency.

Abstract:
Recent advances in generative diffusion and flowmatching models have revolutionized molecular design, enabling the creation of novel proteins, small molecules, and RNA sequences with unprecedented fidelity. Yet, these models remain intuitive rather than intelligent—they generate without reasoning. ThinkProteo reimagines generative science by introducing reasoning-guided diffusion models that think step-by-step, akin to how a scientist hypothesizes, tests, and refines molecular ideas. By embedding chain-of-thought (CoT) reasoning into the continuous generative trajectory, ThinkProteo transforms diffusion into a process of thought: each denoising step becomes an interpretable act of molecular reasoning guided by structural, energetic, and functional objectives. This framework bridges symbolic reasoning and physical generation, yielding models that not only design molecules but also explain why they work. We envision ThinkProteo as a foundation for cognitive generative chemistry—uniting the creativity of diffusion models with the deliberation of human reasoning to accelerate the discovery of safe and effective therapeutics.

Abstract:
Generative distillation significantly accelerates textto-image (T2I) generation by compressing multi-step trajectories into few-step student models while preserving perceptual quality. However, existing distillation methods prioritize efficiency and output fidelity, often overlooking the preservation of critical properties inherent to the original trajectory. In this work, we identify a key lost property: sensitivity to initial noise, the absence of which impairs downstream control methods that rely on noise-based optimization and manipulation. We trace this deficiency to standard distillation objectives, which enforce pointwise output alignment. This inadvertently flattens the input-output landscape and suppresses the local geometric structure present in the teacher model. To address this, we propose Geometry-Aware Distillation (GAD), a sensitivity-preserving framework that explicitly aligns the local functional behavior of the teacher and student. GAD enforces geometric consistency by matching Jacobian-vector products with respect to input noise, ensuring the student faithfully reproduces the teacher’s differential response to perturbations. Extensive experiments across multiple T2I paradigms and noise-driven control tasks demonstrate that GAD significantly recovers sensitivity and improves diversity, while maintaining high visual fidelity.

Abstract:
In correctnesssensitive scenarios, it is crucial for Large Language Models (LLMs) to strictly follow the provided evidence. However, even with reference texts, models often suffer from hallucinations, especially when processing long contexts. Existing work attempts to reinforce the use of citations through Retrieval-Augmented Generation (RAG) or post-hoc methods, while citations remain a probabilistic output rather than a foundation for the generated content. To address this, we propose Guidance, which aims to correct outputs and naturally incorporate citations during the LLM decoding phase. Specifically, we first build a structured fact pool (Prefix-Tail pairs) from the documents. Then, during inference, Guidance predicts the model's intent using a lookahead strategy. When it detects a match with a context prefix, it automatically replaces the output with the verified fact and its citation. This approach is training-free and can be plugged into general-purpose or citation-fine-tuned LLMs. Experiments on LongBench-Cite demonstrate that Guidance improves the citation F1 score by 11.2% over state-of-the-art baselines. The source code is available at: https://anonymous.4open.science/r/Guidance-D870/.

Abstract:
Diffusion language models (DLMs) can match or surpass similarly sized autoregressive language models on language understanding and reasoning. However, their maskand-denoise pretraining relies on heuristic random masking, which fails to target the most informative tokens. Consequently, the model spends significant computational effort on redundant or trivial tokens. To address this, we propose InfoDLM, an adaptive DLM pretraining framework that reformulates mask selection as an active, feedback-driven process. InfoDLM targets tokens that offer the highest measurable information gain during mask selection. Specifically, we: (1) introduce a Trainable Information-Gain (TIG) signal to quantify information gain of each masking configuration; (2) develop a feedback mechanism that adapts the masking policy to the model’s evolving state with a maturity indicator; and (3) jointly optimize the DLM and masking policy through an interleaved training flow with minimal computational overhead. Across reasoning-oriented benchmarks, InfoDLM achieves up to 13% improvement in reasoning accuracy over a small variant of LLaDA under comparable pretraining budgets.

Abstract:
We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our endto-end formal infrastructure implement the missing contents in latest Lean library, including a complete development of Gaussian Lipschitz concentration, the first formalization of Dudley’s entropy integral theorem for sub-Gaussian processes, and an application to least-squares regression with a sharp rate. The project was carried out using a human–AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, resulting in approximately 30,000 lines of human-verified Lean 4 code produced over 500 hours of supervised development. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation for future developments in machine learning theory. The code is provided in the supplementary materials.

Abstract:
Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, precollected data distributions. In this paper, we introduce MAGIC, a novel multi-turn multi-agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a co-evolution, where the attacker's ever-changing strategies continuously uncover long-tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves novel, previously unseen combinatorial strategies through iterative RL training, underscoring our method’s substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model.

Abstract:
Identifying the drivers of causal mechanism shifts, distinguishing functional changes from noise alterations, known as dissection, is a critical yet underexplored problem in data science (e.g., biomedical science and manufacturing). This paper introduces a more general and unified framework, the function and noise separation framework (FANS), that detects and dissects shifts in non-additive, non-linear Structural Causal Models (SCMs) beyond existing additive noise models. Our approach is grounded in a theoretical independence criterion, where function shifts induce a statistical dependence between a node's parents and residual noise. Building on this foundation, we develop a practical two-stage algorithm to efficiently detect and dissect these shifts without retraining. Furthermore, we address the complex challenge of simultaneous function and noise shifts, introducing a formal assumption to resolve their inherent non-identifiability. Our results are corroborated by simulations. Our code is available at https://anonymous.4open.science/r/FANS-CFEB/.

Abstract:
We study the task of learning the structure of a dsparse Gaussian graphical model on n variables from a single trajectory of Glauber dynamics. Beyond algorithmic considerations, many applications present temporally correlated observations rather than i.i.d. samples. Moreover, in the classical i.i.d. setting, polynomial-time structure learning from a sublinear in n number of samples is suspected to be computationally hard without additional assumptions on the precision matrix. Motivated in part by this, we design the first polynomial-time algorithm that recovers the conditional-independence graph from a single Glauber trajectory, with a trajectory-length guarantee that does not depend on the mixing time. Technically, our algorithm has three components. First, we estimate the conditional variances and rescale the trajectory to reduce to the unit-diagonal case, without changing the underlying graph. Second, we design a local edge test that extracts adjacency information from short update windows by isolating pairwise influence. Third, we aggregate these local statistics using a robust median-based estimator, and prove accuracy despite contamination and temporal dependence arising from a single trajectory.

Abstract:
The Massart noise condition is a central model in Probably Approximately Correct (PAC) learning theory. Its importance lies in it being an interpolation condition between realizable and the agnostic settings, under which one can attain faster rates than latter, and, under strict conditions, recover the rates of the former. Despite its importance, the Massart condition has not yet been fully explored in emerging extensions of statistical learning theory beyond the classical PAC framework. In this work, we present two such extensions. First, we revisit the transductive empirical risk minimization (TERM) algorithm of (Hanneke & Moran, 2026) and derive sharper excess error bounds under Massart noise using offset Rademacher techniques and local metric entropy introduced by (Zhivotovskiy & Hanneke, 2018). We then leverage this analysis to obtain new optimal sample complexity bounds for PAC learning with partial concept classes and complete the characterization of universal learning rates under Massart noise.

Abstract:
Most multiobjective reinforcement learning (MORL) methods either condition on a fixed preference weight w or consider episodic robustness where an adversary selects a single w per episode. We study a time-consistent robustness model with reactive preferences: after each transition, an opponent chooses the next weight w_t+1 after observing s_t+1, and incurs a switching cost \lambda D_\Phi(w_t+1\mid w_t) based on a Bregman divergence. This yields a Bellman–Isaacs recursion with an inner weight minimization at every backup. We prove the induced operator is a contraction and derive a Bellman-residual certificate that turns approximation error into a uniform bound on robust performance. We develop practical solvers in both tabular and deep settings using Bregman-prox inner updates and a stabilized fixed-point iteration. To evaluate robustness without optimistic critic reuse, we introduce BR-K, testing policies against K independently trained best-response preference adversaries. Across MO-Gymnasium benchmarks, our approach consistently improves WRR under strong step-wise opponents over preference-conditioned baselines while keeping DRIFT smoothly controllable via \lambda.

Abstract:
Reinforcement learning from verifiable rewards (RLVR) has become an important paradigm for enhancing the reasoning capabilities of large language models, while it also involves a persistent tradeoff between optimization stability and learning efficiency. Tokenlevel importance weighting supports fine-grained credit assignment, but it often introduces high variance and unstable parameter updates, whereas sequence-level optimization provides more stable learning dynamics while failing to fully exploit informative local signals. We introduceTrust-GatedPolicyOptimization (TGPO), an efficient policy optimization framework that integrates two complementary mechanisms, namelysequence anchorsandinformation gates. TGPO aligns token-wise updates with a stable sequence-level reference, which reduces the influence of extreme local likelihood fluctuations on the gradient, and a trust-based information gate adaptively modulates the contribution of token-level signals. By retaining and reweighting gradients from imperfect trajectories rather than excluding them, TGPO improves gradient utilization and sample efficiency while maintaining stable optimization behavior. Empirical results across seven mathematical reasoning datasets and multiple model scales show that TGPO consistently enhances learning efficiency and overall performance in outcome-supervised reinforcement learning settings.

Abstract:
In Mixtureof-Experts (MoE) training, sparse routing, i.e., activating only the top-K experts per token, is essential for balancing convergence speed and computational cost. However, existing works typically choose K empirically, without theoretical guidance. To address this gap, we characterize the convergence behavior of MoE training using stochastic optimization theory. Specifically, we derive a convergence upper bound of \mathcalO\left(\frac1+M/K\sqrtT\right), where T is the number of training iterations and M is the total number of experts per MoE layer. This result guarantees convergence and shows that increasing K can accelerate training. By further fixing the total computational budget R (in FLOPs), we obtain a refined bound of \mathcalO\left(\sqrt\fracKR + \fracM\sqrtK R\right), which is convex in K and implies the existence of an optimal K^\in[1,M] that achieves the best convergence performance. Extensive experiments validate our theoretical analysis under diverse settings.

Abstract:
As LLMbased agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we introduce TRIP-Bench, a long-horizon benchmark grounded in realistic travel-planning scenarios. TRIP-Bench leverages real-world data, offers 18 curated tools and 40+ travel requirements, and supports automated evaluation. It includes splits of varying difficulty; the hard split emphasizes long and ambiguous interactions, style shifts, feasibility changes, and iterative version revision. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with performance dropping below 10% on hard subsets. We further propose GTPO, an online multi-turn reinforcement learning method with specialized reward normalization and reward differencing. Applied to Qwen2.5-32B-Instruct, GTPO improves constraint satisfaction and interaction robustness, outperforming Gemini-3-Pro in our evaluation. We expect TRIP-Bench to advance practical long-horizon interactive agents, and GTPO to provide an effective online RL recipe for robust long-horizon training.

Abstract:
Highquality data is essential for modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic data offers a practical substitute for downstream development, and large language models (LLMs) have emerged as powerful engines for generating it. However, existing private text generation methods are severely inefficient: they are data-intensive, computationally slow, and often require large private corpora or batch sizes to achieve usable quality. We introduce EPSVec, a differentially-private lightweight alternative that steers LLM generation usingdataset vectors-directions in activation space that capture the distributional gap between private data and public priors. EPSVec extracts and sanitizes steering vectors just once and then performs standard decoding. This decouples the privacy budget from generation, enabling arbitrarily many synthetic samples without additional privacy cost and yielding strong fidelity even in low-data regimes. Furthermore, we enhance our method by utilizing pretrained (base) models and introducing fixed-shot prompting to boost generation diversity and fidelity. Our experiments demonstrate that EPSVec outperforms existing baselines in distributional alignment and downstream utility, particularly in low-data regimes, while significantly reducing computational overhead.

Abstract:
Resilience against Byzantine attackers and faster convergence on sparse networks are critical for decentralized optimization, yet existing methods fail to achieve both simultaneously. Existing DSGDbased Byzantine-resilient methods suffer from high transient complexity of \mathcalO\left((1-\lambda)^-6\right), where 1-\lambda denotes the spectral gap of the network. While bias-correction methods such as Exact Diffusion can improve topology dependence, directly combining them with robust aggregators can lead to error accumulation. To address this issue, we introduce the scaled dual ascent (SDA) within the augmented Lagrangian framework for decentralized optimization, which mitigates error accumulation by scaling the dual update steps. Based on this, we propose BRED, which integrates Byzantine-robust Exact Diffusion with the SDA framework. We prove that BRED attains linear speedup, and achieves transient complexity of \mathcalO\left((1-\lambda)^-2\right) when the Byzantine fraction \delta is small. We further propose the momentum variant BRED-M, which reduces the Byzantine-affected transient complexity from \mathcalO\left(\delta^2(1-\lambda)^-6\right) to \mathcalO\left(\delta^2(1-\lambda)^-4\right). Empirical results on benchmark datasets demonstrate the efficacy of the proposed methods across diverse network topologies.

Abstract:
Existing online changepoint detection (CPD) methods rely on fixed-dimensional Euclidean summaries, implicitly assuming that distributional changes are well captured by moment-based or feature-based representations. They can obscure important changes in distributional shape or geometry. We propose a geometry-aware CPD framework that treats streaming batch data as a stochastic process on the 2-Wasserstein space. Our method detects changes in the law of this process by mapping each empirical distribution to a tangent space relative to a pre-change Fréchet barycenter, yielding a reference-centered local linearization of 2-Wasserstein space. This representation enables sequential detectors by adapting classical multivariate monitoring statistics to tangent fields. We provide theoretical guarantees and demonstrate, via synthetic and real-world experiments, that our approach detects complex distributional shifts with reduced detection delay at matched \mathrmARL_0 compared with moments-based and model-free baselines.

Abstract:
Optimizing OS kernels for specific applications is vital for peak performance, yet existing LLMbased methods struggle with a semantic mismatch between generalized reasoning and low-level system behaviors. As a result, these static, open-loop approaches suffer from runtime blindness, configuration fragmentation, and search drift, ultimately failing to unlock the system’s performance potential. To address this, we propose PerceptOS, an autonomous framework that shifts the paradigm to perception-driven tuning. PerceptOS integrates: (1) a Perception Module that aligns raw telemetry into high-fidelity semantic fingerprints; (2) a Global Search Module utilizing a Bi-level Hierarchical Induction Tree (BHIT) for global navigation and efficient pruning; and (3) a Posterior Enhancement Module to suppress hallucinations via trajectory synthesis. Experiments across Redis, Apache, PostgreSQL, and RAG show that PerceptOS achieves significant performance breakthroughs by optimizing kernel configurations, reaching 296.6% of default Redis throughput and surpassing SOTA baselines by 32.6% within only 15 iterations. By establishing a perception-driven closed-loop, PerceptOS provides new insights for fully automated, large-scale system optimization.

Abstract:
Multivariate time series (MTS) forecasting critically depends on modeling intervariable dependencies, yet existing paradigms face a trade-off: channel-isolation strategies can suffer from information fragmentation in strongly coupled systems, whereas channel-interaction methods often introduce spurious interactions among irrelevant variables. To address this challenge, we propose Coherent Resonance Interaction with Spectral Priors (Crisp). Crisp adopts the principle that effective information exchange should occur only between variables with compatible oscillatory patterns. Concretely, we derive spectral priors in the frequency domain to construct dynamic resonance topologies. With a differentiable, adaptive, and strictly sparse blocking mechanism, Crisp forces attention weights for spectrally inconsistent neighbors to be exactly zero. In addition, we introduce a spectral-gated feature filtering module to refine variable representations using intrinsic spectral characteristics. Extensive experiments demonstrate that Crisp significantly outperforms 20+ baselines. Our code is available at Anonymous GitHub.

Abstract:
Reward shaping can accelerate reinforcement learning, but in sparsereward \emphoffline multi-agent RL it is often brittle: dense intrinsic rewards may alter the underlying Markov game, while world-model guidance can amplify model bias. We find that shaping becomes reliable when it is (i) activated only after \emphstatistically validated learning plateaus and (ii) constrained to \emphpotential-based shaping, which preserves the task optimum. Motivated by this, we propose \textscSOLAR, a simulate--evaluate--shape framework. A learned world model enables low-cost rollouts to test plateaus; once a plateau is detected, we inject shaping in the form r+\gamma\Phi(s')-\Phi(s) with adaptively updated potentials; and we attenuate shaping using uncertainty-aware throttling in unreliable regions. We provide theoretical analysis on policy invariance and on the deviation of plateau decisions under model error, and establish stability for the resulting two-timescale adaptation. Experiments on sparse-reward offline MARL benchmarks show consistent gains in stability and final performance across dataset qualities.

Abstract:
Bayesian additive spanning tree (BAST) is an useful tool for interpretable, nonparametric regression on complex constrained domains. It improves upon the performance of Bayesian additive regression trees (BART) by replacing axis-aligned splits through binary tree components by cuts on a spanning tree components, enabling the formation of contiguous splits that respect the underlying complex structure. While BAST is effective for learning on constrained spaces, it still relies on hard partitions, albeit on spanning trees, which limits its ability to represent smoothly varying functions on constrained domains. We propose Probabilistically-routed Bayesian additive spanning trees (PR-BAST), a principled relaxation that replaces hard cuts on spanning tree components with probabilistic routing along spanning tree components. PR-BAST represents the regression surface as an additive ensemble of such spanning tree-aligned smooth components. Conditional on a fixed spanning tree, each component in PR-BAST induces a Gaussian random field with a sparse, tree-structured precision matrix, enabling scalable posterior computation via sparse linear algebra. We theoretically establish that PR-BAST yields strictly faster posterior contraction rates compared to BAST under graph-smooth truth. Experiments on synthetic and real datasets demonstrate that PR-BAST consistently improves accuracy over BAST and other baselines, while retaining the interpretability of tree-based models.

Abstract:
Medical large visionlanguage models (Med-LVLMs) have recently achieved remarkable progress in vision–language comprehension and medical image segmentation. However, existing models still struggle to unify these two capabilities, which is essential for achieving clinically reasoning that connects visual findings with semantic interpretation. We present MedSIGHT, a unified framework that equips Med-LVLMs with structured, pixel-level understanding for grounded visual comprehension. MedSIGHT introduces a novel Region Perceiver module that produces region-centric tokens, encoding spatial information directly into representation space of the language model. We further propose a medical region codebook into the LLM vocabulary, allowing the model to generate discrete region codes as symbolic representations of anatomical and pathological regions. These codes are decoded through the Region Perceiver to reconstruct segmentation mask, achieving end-to-end spatial grounding. Lastly, MedSIGHT combines Region Perceiver, Codebook and LLM using our proposed progressive training strategy to gradually aligns these modules stably. Trained on only 72K multimodal instruction pairs, MedSIGHT achieves state-of-the-art performance across diverse imaging modalities on both medical comprehension and segmentation tasks.

Abstract:
Offline Federated Deep Reinforcement Learning (FDRL) methods aggregate multiple clientside offline Deep Reinforcement Learning (DRL) models, each trained locally, to facilitate knowledge sharing while preserving privacy. Existing offline FDRL methods assign client weights during global aggregation using either simple averaging or Q-values, but they neglect the combined consideration of Q-values and policy inconsistency, the latter of which reflects the distributional discrepancy between the learned policy and the policy from offline data. This causes clients with no significant advantages in one aspect but obvious disadvantages in the other to disproportionately affect the global model, thereby degrading its capabilities in that aspect. During local training, clients in existing methods are compelled to fully adopt the global model, which negatively impacts clients when the global model is weak. To address these limitations, we propose a novel Federated Learning (FL) framework that can be seamlessly integrated into current offline FDRL approaches to improve their performance. Our method considers both policy inconsistency and Q-values to determine the weights of client models, with the latter adjusted by a scaling factor to avoid significant numerical discrepancies with the former. The aggregated global model is then distributed to clients to facilitate their learning from the global model. The impact of the global model on the local models is reduced when a client's model performance exceeds that of the global model, thereby mitigating the influence of a weaker global model. Experiments on the Datasets for Deep Data-Driven Reinforcement Learning (D4RL) demonstrate that our method enhances six state-of-the-art (SOTA) offline FDRL methods in terms of return and D4RL score.

Abstract:
Independent Metropolis–Hastings (IMH) algorithms are widely used in Bayesian inference, but their efficiency deteriorates when proposal distributions are constructed from inaccurate or approximate models. We introduce ProximalIMH, an IMH method that enhances proposal distributions through a proximal posterior correction. Given an approximate posterior sample, each proposal is generated by minimizing a quadratically regularized surrogate objective, producing a local correction that balances fidelity to the exact model with stability around the approximate state. We analyze the resulting proposals from an optimization and probabilistic perspective, showing how the proximal correction improves alignment between approximate and exact posteriors and leads to improved acceptance and mixing behavior. The proposed framework applies to both linear and nonlinear forward operators, and is particularly well suited to Bayesian inverse problems where exact posterior sampling is computationally prohibitive. Numerical experiments on inverse problems with approximate forward models, including nonlinear operators, demonstrate that Proximal-IMH consistently outperforms existing IMH variants while retaining their simplicity and scalability.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated exceptional proficiency in standard text extraction, but they encounter significant challenges when confronting realworld implicit text. Such content typically contains malicious information, intentionally concealed through physical deformation, visual camouflage, or cognitive suggestion. These concealment techniques circumvent content moderation systems and pose severe risks to user safety. To bridge the research gap in text recognition under real-world adversarial scenarios, we define the task of Implicit Text Reasoning and introduce ImpText-Bench, a meticulously constructed benchmark. Extensive evaluations on this benchmark reveal significant vulnerability in current systems; even advanced proprietary models achieve a maximum Text Match Score of only 35.79%. In response, we propose ImpText-Reader, a tool-augmented framework. It employs a three-stage training strategy utilizing capability-boundary data to collaboratively optimize tool selection and semantic reasoning, thereby effectively extracting hidden text. Extensive experiments demonstrate that our approach achieves SOTA performance, significantly enhancing model robustness in adversarial environments.

Abstract:
Allin-one image restoration seeks a single model that can recover images degraded by diverse and spatially non-uniform corruptions. However, many unified Transformers rely on fixed patch partitioning: task/degradation condition is injected only into the backbone blocks after tokenization, leaving the embedding and reconstruction stages insensitive to local degradation variations. In contrast to previous approaches, we present Flexible Image Transformer (FIT) that explicitly models degradation awareness across the \emphentire pipeline, from patch sampling to pixel reconstruction. Specifically, FIT employs a lightweight Degradation Encoder to predict a global degradation vector \mathbfg and a spatial degradation map \mathbfM from local degradation severity, which jointly condition the patch embedding and unembedding through adaptive deformation. Moreover, to improve robustness across degradation types, we introduce a task-token dropout strategy that regularizes task conditioning during training. On five standard benchmarks (BSD68, Rain100L, SOTS, GoPro, and LOLv1), FIT achieves state-of-the-art performance with 30.72 dB average PSNR on the five-degradation setting and 32.83 dB on the three-degradation setting, outperforming recent unified restoration methods by +0.5～1.1 dB. Moreover, the learned offsets provide a direct handle for visualizing degradation-aware spatial adaptation.

Abstract:
Large language models (LLMs) are increasingly deployed in highstakes domains, where free-text explanations such as chain-of-thought and post-hoc rationales are used to justify model outputs. Yet it remains unclear whether these explanations aresufficient, i.e., if they contain enough information to explain the model’s output-generating process. We generalize classical sufficiency from feature attributions to arbitrary explanations and prove that explanation sufficiency is inherently relative to an input distribution, which must be explicitly defined for LLM explanations. We propose using the LLM itself to generate alternative inputs conditioned on an explanation, capturing its beliefs about possible inputs. We formalizeself-consistent sufficiencyas a goal for free-text explanations and introduce an information-theoretic metric, SCSuff, that enables evaluation of free-text explanations without relying on predefined biases or shortcuts. Our experiments show that SCSuff aligns with targeted perturbation tests where applicable and demonstrate that explanation sufficiency can vary with the input distribution. We further find that SCSuff is uncorrelated with model size, accuracy, or uncertainty, suggesting that improving self-consistent sufficiency requires approaches beyond scaling or standard performance optimization.

Abstract:
While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in general visual understanding, they suffer from a fundamental geometric fragility: standard visual representations often degrade rapidly under changes in viewpoint and viewing distance. Our analysis identifies that existing paradigms, whether relying on inputlevel fusion or latent reconstruction, remain entangled with the view-dependent pixel grid, failing to decouple intrinsic 3D structure from extrinsic camera pose. To address this, we introduce AffIn-Space, a framework that enforces strict affine invariance to enable robust spatial understanding. Unlike implicit learning approaches, AffIn-Space introduces a two-stage explicit decoupling mechanism. First, it employs explicit geometric resampling by utilizing decomposed affine quantities (derived from pose features) to spatially align 3D features to a canonical state before fusion. Second, within the MLLM, we implement affine-invariant constraints via an orthogonal projection mechanism, which mathematically strips away pose-dependent noise from the hidden states while retaining recoverable geometric semantics through conditional reconstruction. Extensive experiments on VSI-Bench, ScanQA, SQA3D, Scan2Cap, and EmbodiedScan demonstrate that AffIn-Space achieves state-of-the-art performance. Code and detailed instructions will be publicly released. Crucially, our approach exhibits superior stability against affine perturbations, validating the effectiveness of explicitly modeling geometric invariance for complex spatial tasks. Code will be made available. Extensive experiments show that AffIn-Space achieves state-of-the-art performance on spatial reasoning tasks (VSI-Bench, SQA3D and Scan2Cap), and on spatial grounding tasks (ScanRefer and EmbodiedScan), demonstrating the effectiveness of affine invariant representations for complex spatial understanding.

Abstract:
Dynamic programming (DP) algorithms are central to sequence alignment, parsing, and timeseries analysis, yet their non-differentiability has limited integration into end-to-end learned systems. We present d2p, a unified differentiable framework covering twelve fundamental DP algorithms across alignment, edit distance, and parsing families. By replacing hard max/min with temperature-scaled soft operators, each algorithm yields a smooth log-partition function. Our main theoretical contribution is a complete characterization of mixed second-order derivatives: cross-Jacobians of posterior marginals with respect to algorithm parameters (gap penalties, edit costs, temperature), with natural covariance interpretations under the induced Gibbs distribution. Our GPU-accelerated implementations achieve up to 10,000× speedups over naive PyTorch, with full support for torch.compile, automatic mixed precision, and variable-length batching. We demonstrate d2p on protein structure alignment, where neural network encoders (ProteinMPNN, GVP, IPA) achieve 0.75 F1 versus 0.51 for discrete structural alphabets substitution matrices and 0.21 for amino acid sequence substitution matrices alone, showing that differentiable DP enables learning continuous representations that discrete methods cannot capture.

Abstract:
Toolaugmented language model agents have shown great potential in solving complex real-world tasks, but a key challenge remains balancing planning flexibility with the reliability required for production deployment. Existing approaches either execute tools sequentially without parallelism or generate unconstrained code, hindering optimization and verification. Additionally, agents that learn from experience often suffer from skill library pollution, where unverified abstractions degrade performance over time. We propose EvoC2F, a framework that redefines tool orchestration through program compilation and verified continuous learning. By constraining plan generation to a well-defined intermediate representation with explicit semantic annotations, EvoC2F enables provably correct optimizations, parallelism, and fault tolerance, while ensuring soundness guarantees. Our verification-gated code-to-function evolution process ensures that learned skills undergo rigorous testing before library admission. Experiments across diverse benchmarks demonstrate that EvoC2F outperforms existing methods, reducing latency and establishing a robust foundation for building reliable, evolving autonomous agents. Our code and datasets are available at https://anonymous.4open.science/r/EvoC2F-1DEF/.

Abstract:
Autonomous CLI agents can now execute hundreds of actions across multihour sessions: writing code, executing shell commands, browsing the web, and managing cloud infrastructure, all with minimal human oversight. Does greater autonomy invite greater risk? We introduce ANCHOR, an automated auditing framework that stress-tests CLI agents on illegal tasks grounded in public US court cases. ANCHOR deploys an adversarial auditor agent fine-tuned on dark personality data using supervised and reinforcement learning. This auditor roleplays persistent malicious users who decompose tasks, reframe requests upon refusal, and adapt strategies across multi-turn interactions. Evaluating frontier models, we find that while they often refuse illegal tasks when prompted directly, compliance rates reaches 100% under persistent malicious interaction. When agents comply, they frequently exceed user requests, autonomously building infrastructure for large-scale harm, including catastrophic risk scenarios such as large-scale financial fraud and bioweapon development. These findings demonstrate that current alignment techniques are insufficient for autonomous agents and underscore the need for safety evaluations against persistent, adaptive malicious users.

Abstract:
Large language models (LLMs) are vulnerable to malicious inputs, posing serious risks in highstakes applications. While existing detection-based methods have shown strong empirical performance, they fail to explicitly control the false acceptance rate (FAR), a critical safety metric in sensitive scenarios. This limitation is further compounded by two key challenges: the absence of access to known malicious samples and the dynamic, online nature of real-world data streams. To address these challenges, we propose OSCS, a novel framework designed to enable online FAR control without relying on malicious calibration samples. OSCS leverages detection scores from existing defenses and employs recursive density estimation to infer benign likelihoods directly from the test stream. This approach allows OSCS to make real-time accept/reject decisions while adhering to a user-specified FAR threshold. We provide theoretical guarantees demonstrating that OSCS effectively controls the FAR, with only a vanishing excess term under mild conditions. Extensive experiments on backdoor attacks and jailbreak scenarios further validate OSCS's effectiveness, showing that it consistently achieves robust FAR control across a variety of tasks and attack settings. These results underscore the practicality of OSCS for ensuring safety in critical LLM deployments.

Abstract:
Anomaly detection in tabular data poses significant challenges due to heterogeneous feature types—mixing numerical, categorical, and textual attributes, which complicate learning meaningful representations of normality. Recent work has applied large language models (LLMs) to this problem by serializing table rows as text sequences, yet these approaches rely on oneshot supervised fine-tuning that offers limited signal to tighten the model's description of normality. We propose DiSPaT, a self-play fine-tuning framework that strengthens the model's understanding of normal data. Building on the theoretical foundation of f-divergence minimization, we derive a tight approximation connecting our training objective to reducing the distributional gap between real normal data and model-generated samples. DiSPaT operates through an alternating optimization: at each iteration, the current policy generates synthetic samples that serve as pseudo-anomalies, while a critic discriminator learns to distinguish these from real normal samples; this signal drives policy updates that progressively align the model distribution with the true normal-data distribution. Extensive experiments on diverse benchmarks demonstrate that DiSPaT consistently outperforms prior LLM-based methods, deep learning approaches, and classical unsupervised detectors for tabular anomaly detection.

Abstract:
How do we map numeric values to representations in transformerbased sequence processing, particularly in electronic health record (EHR) data? We systematically compare discrete, continuous, and hybrid value embeddings using synthetic arithmetic tasks embedded within real-world EHR data, as well as real-world clinical prediction tasks. Our study reveals trade-offs between numeric precision, optimisation stability, and architectural flexibility. We find that representations that explicitly model value-concept interactions perform best on precision-sensitive arithmetic tasks when architectural constraints permit. Hybrid token-based approaches that retain numeric values but apply binning prior to projection provide a more robust and broadly applicable alternative, with the optimal number of bins following a simple empirically derived power-law in dataset size. Across tasks, models consistently exhibit reliable “good enough” numeric computation rather than exact arithmetic, while clinical gains from incorporating laboratory values are task-dependent. This suggests that robustness and deployability often outweigh maximal numeric precision in practice, motivating hybrid token-based approaches as a practical default.

Abstract:
Knowledge distillation from powerful reasoning models underpins the development of Small Language Models (SLMs). A prevailing assumption in this paradigm is that training data with higher perceived quality, often defined by rigorous logic and superior reward scores, monotonically enhances downstream performance. In this paper, we identify a counterintuitive Quality-Utility Paradox across diverse model families(Qwen2.5, LLaMA-3, DeepSeek): data refined by a superior Synthesis Oracle consistently underperforms the SLM's self-generated Rejection Sampling (RFT) data, despite achieving higher reward scores. We argue that Oracle models introduce an intrinsic representation bias that shifts training data into a distribution incompatible with the target SLM, where the SLM allocates limited computational capacity to stylistic imitation rather than logical reasoning. We utilize a Style-Aligned Refinement strategy to correct logical errors and strictly preserve the SLM's native syntax. Our experiments demonstrate that maintaining native syntax effectively mitigates syntactic adaptation costs, enabling distilled models to match or even surpass self-generated baselines. These findings underscore the necessity of syntactic alignment and advocate for model-aware reward designs that prioritize distributional compatibility alongside logical rigor. Our datasets and code will be publicly available.

Abstract:
Conditional independence testing (CIT) is essential for reliable scientific discovery. It prevents spurious findings and enables controlled feature selection. Recent CIT methods have used machine learning (ML) models as surrogates of the underlying distribution. However, modelagnostic approaches require a train-test split, which reduces statistical power. We introduce Semi-knockoffs, a CIT method that can accommodate any pre-trained model, avoids this split, and provides valid p-values and false discovery rate (FDR) control for high-dimensional settings. Unlike methods that rely on the model-X assumption (known input distribution), Semi-knockoffs only require conditional expectations for continuous variables. This makes the procedure less restrictive and more practical for machine learning integration. To ensure validity when these expectations are estimated, we present two new theoretical results: (i) stability for regularized models trained with a null feature and (ii) the double-robustness property.

Abstract:
VisionLanguage Models (VLMs) achieve remarkable performance on multimodal tasks but remain highly vulnerable to adversarial examples, making transferable attacks essential for realistic robustness evaluation. Recent Adversarial Evolution Triangle (AET) methods improve transferability by interpolating over a simplex formed by clean and historical adversarial samples, yet rely on finite random sampling to approximate effective perturbation distributions, which is unstable under limited budgets. In this paper, we propose Dirichlet Distributional Gradient Aggregation (DDGA), a distribution-aware adversarial attack framework that explicitly models and optimizes perturbations over the AET simplex. DDGA parameterizes simplex mixing weights with a learnable Dirichlet policy and optimizes the expected adversarial objective via policy gradient, replacing heuristic sampling with principled distributional optimization. Moreover, we exploit the closed-form covariance of the learned distribution to construct orthogonal perturbations that enhance gradient diversity. Extensive experiments on image-text retrieval and image captioning demonstrate that DDGA consistently outperforms state-of-the-art transfer-based attacks across multiple VLM architectures.

Abstract:
Recent progress in largescale self-supervised learning has improved dense geometric prediction, but it remains unclear whether such scaling yields inference-time physical plausibility checks. We propose Scrambled Edges, a controlled counterfactual that injects salient edge-like cues while violating surface continuity, illumination coherence, and occlusion ordering. With energy-matched and structure-matched controls, we isolate the effect of unsupported edge evidence from high-frequency energy and edge sparsity. Across CNN/ViT/SSL depth predictors on NYU Depth v2 and KITTI, Scrambled Edges induce up to 3.2× larger deviation from clean predictions than energy-matched noise. The resulting Geometric Collapse propagates globally: even with oracle knowledge of the corrupted region, output-level repair recovers only 47%, with substantial error outside the mask. These findings provide controlled behavioral evidence that current dense predictors lack reliable mechanisms to quarantine physically unsupported edge cues, motivating explicit plausibility scoring and selective cue integration.

Abstract:
Geotemporal understanding, the ability to infer location, time, and contextual properties from visual input alone, is a core aspect of human intelligence and underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision–language models (VLMs) have made progress in image geo-localization using salient cues like landmarks or road signs, their ability to reason about temporal signals and physically grounded spatial cues remains underexplored. To address this gap, we introduceTimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs.TimeSpotconsists of 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. The benchmark further includes spatial–temporal reasoning tasks that probe physical plausibility and cue integration under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show consistently low performance, particularly for temporal inference, and while supervised fine-tuning yields measurable gains, it remains insufficient, highlighting the need for new approaches to achieve robust, physically grounded geo-temporal understanding. By jointly evaluating spatial and temporal inference with diagnostic rigor,TimeSpotprovides a principled framework for assessing physically grounded, real-world geo-temporal reasoning. We will releaseTimeSpotupon acceptance.

Abstract:
Current LLM evaluations often conflate benchmark performance with intrinsic model capability. This is misleading, as observed outcomes arise from the entire evaluation system, including datasets, prompting methods, decoding parameters, and the software–hardware stack, rather than the model alone. When this system is underspecified, attribution becomes unreliable; in practice, evaluation choices alone can induce accuracy swings of up to 70%. This attribution challenge is compounded by the openended nature of LLM evaluation. Questions span languages, domains, and usage styles, forming highly variable and implicitly shifting datasets. Consequently, strong performance on static benchmarks may reflect alignment with surface patterns rather than robust underlying capability. Prior studies either focus on individual components, overlooking their interactions, or investigate manually curated and small-scale question variants, lacking a holistic perspective, precluding precise attribution of intrinsic model capabilities amidst the confounding influences. To address these limitations, we propose LLM evaluatology, a principled framework that grounds LLM evaluation in a causally informed system design. By jointly modeling evaluation components and structured question variations, it enables interpretable, reproducible, and causally faithful assessment of model capability, establishing clear conditions under which evaluation results are meaningful and trustworthy.

Abstract:
The increasingly realistic and diverse generative data has led some deepfake detection methods to shift towards learning robust real content, e.g., via reconstructionbased tasks. However, most existing approaches rely primarily on prevalent continuous modeling (e.g., GMMs, VAEs, Diffusion Models) to construct a continuous latent manifold of real data, with the aim of improving the generalization capability, while overlooking a critical issue, i.e., such continuity may facilitate the interpolation of forgery artifacts, consequently causing ambiguity in detection. To alleviate this problem, we integrate discrete modeling into the feature space of the CLIP vision encoder, striking a balance between continuous manifold modeling and discrete representation. By incorporating a learnable vector quantized codebook, the real latent manifold is discretized, imposing a more stringent information bottleneck that reduces the likelihood of embedding generative artifacts. In order to further enhance the generalization of discrete modeling, we propose an adaptive tangent space projection mechanism that yields a continuous relaxation of the discrete real distribution within a controllable range. With these components, our method constructs a real distribution that is both tightly constrained and broadly generalizable, enhancing robustness to unseen forgeries. Extensive experiments on diverse datasets demonstrate the effectiveness of our method.

Abstract:
Safety alignment of large language models is fragile: even small finetuning perturbations elastically revert behaviors toward those of the pre-training, with degradation inversely proportional to the size of the alignment set. We ask how to achieve safety alignment with \emphminimal augmentation. To this end, we model augmentation as a set of group actions on sequences and formalize robustness gains as a normalized, monotone submodular function over transformations. We then leverage submodular optimization to select minimal augmentations that provably improve robustness. Experiments confirm that our approach efficiently restores safety alignment while minimizing the overhead of augmentation.

Abstract:
We study compute reduction in neural networks through a unified partial versus full computation view, captured by oneshot magnitude pruning in the static regime and early exit in the adaptive regime. In an asymptotic single-neuron model, we prove a concentration theorem for one-shot magnitude pruning with explicit rates. We also introduce the conditional perceptron for early exit and show that its excess generalization error decays as a power of the compute gap, with an exponent that grows to infinity as the alignment between partial and full computations tends to one. We then extend the analysis to deep networks, characterizing how pruning-induced distortions accumulate with depth and deriving a corresponding compute–accuracy tradeoff for frozen-backbone early exit under a neural network Gaussian process model. Numerical simulations corroborate the predicted scaling laws.

Abstract:
Training large language models (LLMs) for nonverifiable tasks—such as creative writing, dialogue, and ethical reasoning—remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation—the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable other agents' solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.

Abstract:
Deep learning has been widely regarded as a powerful tool for Koopman operator theorybased modeling, as it provides a promising architecture for data-driven learning of observable functions. To fully leverage this advantage, a well-designed training paradigm is required. However, the existing training paradigms typically either incur high optimization complexity or hinder effective end-to-end training, limiting modeling accuracy and training efficiency. To address this issue, we propose a differentiable quadratic programming (QP)-embedded deep Koopman framework (QPKO). In QPKO, a QP problem, which comprises a one-step accuracy-oriented objective function and a set of multi-step accuracy-oriented constraints, is formulated to introduce a mapping from observable functions to the global linear model. By doing so, the global linear model no longer needs to be treated as an independent trainable component, thereby effectively reducing optimization complexity. This QP-based mapping is implemented as a differentiable and computationally efficient module by leveraging OptNet (a differentiable QP layer), enabling effective end-to-end training. Experiments on four nonlinear dynamical systems show that QPKO achieves satisfactory improvements in modeling accuracy, training efficiency, and control performance.

Abstract:
Endto-end online map generation is a core component of autonomous driving perception systems. However, in complex traffic scenes, Bird’s-Eye-View (BEV) with vectorized mapping suffers from limitations such as target misclassification, spatial localization drift, and ambiguous semantic segmentation. Introducing uncertainty quantification can alleviate these problems, so we propose MapUQ, a robust BEV vectorized mapping method guided by uncertainty-aware optimization. Specifically, we quantify uncertainty at the feature level to enhance semantic perception, apply an error-driven dynamic receptive field adaptation mechanism at the decoding stage to enforce geometric consistency, and leverage negative sample information at the output head to improve lane classification accuracy. Experimental results on the nuScenes and Argoverse 2 datasets show that our method outperforms prior approaches in AP across three road types, achieving an average improvement of 1.5% over the baseline with marginal computational overhead. In addition, our method surpasses the baseline on uncertainty metrics such as ECE and NLL, significantly improving robustness and mapping accuracy in complex scenarios. Our code has been released at github: https://anonymous.4open.science/r/MapUQ-D287.

Abstract:
Constrained multiobjective reinforcement learning aims to discover a diverse set of feasible trade-offs, yet scalarization and signed, normalized group-relative advantages can be brittle under objective-scale drift, near-ties, and feasibility scarcity. We propose constrained projection policy optimization (CoPro), which alternates between an E-step moment projection and an M-step policy projection. In the E-step, we solve a Kullback-Leibler (KL)-regularized, moment-constrained projection over each sampled group to compute a nonnegative reweighting distribution (q) that promotes feasible Pareto-front (PF) progress, preserves feasibility anchors, and suppresses ambiguous near-ties. This E-step admits a closed-form exponential-family solution and guarantees strictly positive probability mass on feasible anchors whenever feasible candidates appear in the group. In the M-step, we project the policy toward q via weighted maximum likelihood with a trust-region regularizer, yielding a PF-aligned update direction from comparisons without hand-crafted reward shaping. Empirically, CoPro improves feasible PF quality and robustness on constrained multi-objective benchmarks for large language model tool use and analog circuit design tasks. Code is available at https://anonymous.4open.science/r/CoPro-8A95/README.md.

Abstract:
Embodied agents replan frequently to recover from execution drift, partial observability, and coordination hazards. In many LLMbased planners, each replanning call consumes an accumulated textual context that grows over time and across agents (history, failures, summaries, and messages). Once this context becomes large, replanning latency develops heavy tails and can miss real-time deadlines even when task success remains high---a failure mode that is hard to detect from average latency or success alone. We present BRACE, a controller that formulates replanning for embodied agents as a budgeted control loop. At each replanning trigger, BRACE decides whether to replan, selects a replanning mode, and allocates an explicit token budget together with a latency service-level objective (SLO), while accounting for the overhead of optional efficiency modules. As a reusable component, we introduce E-RECAP, a cost-aware progressive token pruning method that predicts token utility and prunes replanning contexts across transformer layers while preserving critical head and tail tokens. On Habitat-Lab navigation with growing multi-agent context, E-RECAP reduces tokens per replanning call by 71-76% and end-to-end replanning latency by 2.1-2.6x with minimal impact on success or SPL. In Meta Habitat, BRACE combined with E-RECAP reduces SLO violation rates from 85.5% to 4.7% without degrading task success. Results across three embodied platforms demonstrate that tail-aware, per-call budgeting is an effective and practical design principle for replanning systems.

Abstract:
We present an online, distributionfree framework for controlling the Conditional Value-at-Risk (\operatornameCVaR), extending conformal tail risk control to non-stationary and adversarial environments. Unlike classical risk control methods, which rely on stationarity or linearity of expectation, our approach provides provable safety guarantees for a nonlinear tail risk functional under arbitrary data generating processes that may drift or shift strategically over time. By leveraging deep connections between conformal tail risk control, parameter-free online learning, and the variational representation of \operatornameCVaR introduced by Rockafellar and Uryasev, we develop a novel procedure for online \operatornameCVaR control with adversarial regret guarantees. The proposed method operates without assumptions on the underlying data-generating process, making it broadly applicable in modern high-stakes deployment settings. We prove that the realized empirical \operatornameCVaR is always controlled at the target level, and that the resulting control is asymptotically tight up to a vanishing \tildeO(1/\sqrtT) conservatism gap. We demonstrate the effectiveness of our approach on portfolio risk management and toxicity mitigation for Large Language Models (LLMs), where rare but catastrophic failures dominate system risk.

Abstract:
While parallel decoding is central to the efficiency of Diffusion Large Language Models (dLLMs), current strategies are often hindered by overly conservative confidence thresholds. These thresholds, necessitated by the Joint Probability Dependence Error (JPDE), result in redundant denoising iterations and suboptimal inference speeds. To overcome this, we propose DCLeap, a training-free framework that enables reliable acceleration of dLLMs in the moderate-confidence regime. DC-Leap introduces a Dynamic Contiguous Verification strategy that integrates strictly-ordered causal constraints into the parallel decoding process. By progressively validating token dependencies, this mechanism effectively neutralizes the JPDE, enabling reliable acceleration with near-lossless performance. Furthermore, DC-Leap incorporates the draft-guided decoding mechanism, where the draft helps extend the context by leaping forward across multiple tokens, providing look-ahead context and retaining the structural benefits of bidirectional attention during inference. Extensive experiments on standard benchmarks demonstrate that DC-Leap achieves substantial speedups, up to 53.19× on MBPP for long-sequence generation, and up to 105.02× when combined with KV-Cache with comparable generation quality. Code and models will be made publicly available.

Abstract:
VisionLanguage-Action models have recently shown promising progress in embodied robotic manipulation, yet their generalization to diverse open-ended embodied tasks is often hindered by execution failures. While prior work has explored failure handling, existing approaches still suffer from two fundamental limitations: coarse-grained failure correction and unreliable failure prevention. These limitations lead to brittle decision-making when VLA models are deployed in novel tasks and environments. To address them, we propose NeurVLA, a neural-symbolic framework that jointly addresses failure correction and prevention via neural-symbolic reasoning and further internalizes these failure-handling capabilities into VLA models. Experiments demonstrate that NeurVLA achieves strong performance and robust generalization across diverse tasks. Code is provided in the supplementary material.

Abstract:
LLMaugmented database analytics face a major bottleneck in the costly prefill phase. Although relational tables inherently contain repeated attribute values, standard row-by-row processing produces fragmented prompt layouts that obscure shared prefixes, thereby minimizing opportunities for prefix KV cache reuse and constraining system efficiency. Existing solutions typically employ heuristic or exhaustive search methods to reorder prompt layouts, but these approaches can be inefficient and may not leverage the structural properties of relational tables. We address this challenge by formulating prefix-cache-aware prompt layout optimization as a problem rooted in the isomorphism between prefix-cache reuse and the radix tree topology induced by the relational data distribution. Building on this perspective, we introduce a practical greedy tree-shaping algorithm that efficiently selects row and column orderings to maximize prefix overlap. Our approach, SOLO, improves prefill throughput by up to 90.3% under fixed prefix-cache budget. Moreover, it reduces planning overhead by up to 242× compared to state-of-the-art baselines.

Abstract:
Neural operators learn mappings between function spaces, enabling fast PDE surrogates. Despite architectural diversity, these methods often share a common input representation: raw coordinatevalue pairs. This ignores the differential structure that defines the underlying physics. We study whether derivative features can improve neural operator learning. Through Picard iteration on mild solutions, we show that derivatives of the input naturally enter PDE solution operators, and we prove that providing them improves approximation rates substantially. But the optimal derivative order is not what one might expect. We show that the statistically optimal order is strictly less than the PDE order, for any finite sample size. This gap arises from a bias-variance tradeoff: higher-order derivatives carry more information but amplify noise. We characterize this tradeoff in closed form and show that learning the derivative order from data achieves automatic spectral regularization. We introduce \partial-NO (derivative-augmented neural operators), a simple augmentation that provides learnable fractional derivative features to any neural operator backbone. Across benchmarks, this consistently improves accuracy. Learned orders reflect dominant PDE structure while adapting to finite-sample constraints, confirming the theory.

Abstract:
Streaming video understanding (SVU) must answer queries that arrive asynchronously while visual tokens stream continuously under strict GPUmemory and query-time latency budgets. A key challenge is delayed evidence: decisive cues may appear briefly, yet many subsequent updates occur before the query arrives, increasing the risk that those cues are evicted or diluted under bounded memory. We propose ProtoKV, a constant-footprint SVU memory that represents far history as a fixed-capacity summary state rather than retaining token instances. ProtoKV keeps an exact near-window KV cache and aggregates older content into a semantic–spatial prototype bank with residual statistics. At query time, each prototype is exposed through a bounded pseudo-token interface that is drop-in compatible with standard attention. Under matched budgets and comparable query-time cost, ProtoKV improves accuracy by up to 12.5 points over token-retention baselines on SVU benchmarks, with gains that grow as evidence delay increases.

Abstract:
Datadriven dynamics prediction often fails under environmental shifts, while traditional fine-tuning remains computationally prohibitive for hardware-constrained or data-scarce applications. We propose DynaDiff, a generative meta-learning framework that transitions the paradigm from gradient-based tuning or modulation to direct weight-space generation. Specifically, we first abstract expert weights as novel weight graphs, utilizing multi-head attention to explicitly capture topological coupling within weights. Subsequently, we design a functional loss to ensure that the generated models achieve consistency with expert models in physical behavior. Finally, we develop a dynamics-informed prompter that extracts cross-domain physical and spectral features from observation sequences to condition the diffusion model. Experiments demonstrate that DynaDiff boosts average prediction accuracy by 10.78% over competitive baselines. Furthermore, by pre-constructing a model zoo of expert predictors, we amortize the fine-tuning overhead into a one-time offline cost, significantly boosting deployment efficiency in new environments.

Abstract:
In domains such as healthcare and marketing, learning optimal individualized dosing policies to maximize utility is crucial, yet high experimental costs impose strict budget constraints, necessitating efficient active policy learning. Existing active learning methods in causal inference primarily focus on binary treatments and effect estimation, leaving continuous dosing and policy optimization underexplored. To address this gap, we propose an active learning framework tailored for optimal policy learning. Exploiting the inherent structure of doseresponse curves, we theoretically show that the policy optimization regret is bounded by the expected posterior gradient variance at the estimated optimal doses. Motivated by this result, we introduce Gradient Variance Active Learning for Individualized Dosing (GVALID), a batch acquisition strategy that greedily selects samples to minimize target gradient variance for efficient policy learning. Experiments demonstrate that GVALID achieves superior performance under strict budget constraints.

Abstract:
We benchmark time series decomposition as a standalone evaluation task. While decomposition outputs are widely used to interpret trend and periodic structure, their quality is often assessed informally, and no unified benchmark exists for comparing component recovery under controlled generative mechanisms. We introduce a synthetic evaluation suite with explicit trend and cycle taxonomies, a unified interface covering representative decomposition families, and complementary metrics capturing distinct error modes (shape, phase, and spectral fidelity). Across stationary periodic regimes, STLfamily methods are near-ceiling; under non-stationary periodicity (frequency drift, regime switching), fixed-period priors induce phase degradation, while subspace/time-frequency methods better preserve seasonal consistency (adaptive spectral methods may require tuning). We further extend the benchmark with a downstream scientific-discovery track---symbolic regression on decomposed components---showing that a decompose-then-regress pipeline materially improves recoverability and reduces expression complexity, linking decomposition quality to structure discovery. We release a pip-installable package and a lightweight web interface to make the benchmark and results easily accessible.

Abstract:
Batch inference for agentic workloads stresses the GPU key–value (KV) cache in a sustained and cumulative manner, often causing severe throughput degradation well before memory capacity is exhausted. We identify this phenomenon as middlephase thrashing, a previously under-characterized pathology in which cache efficiency collapses as long-lived agents accumulate state over time. We argue that mitigating this pathology requires moving beyond reactive, request-level cache management to proactive, agent-level admission control. Drawing inspiration from congestion control in distributed systems, we view the KV cache as a shared resource whose efficient utilization depends on feedback-driven regulation. Based on this insight, we present PACE, a lightweight control layer that regulates agent admission to bound aggregate cache pressure while preserving execution continuity. PACE adapts a cache-aware control algorithm to dynamically adjust the number of active agents using runtime cache signals. Across large models and real-world agent workloads, PACE prevents middle-phase thrashing and improves batch inference throughput by up to 4.09× on Qwen3-32B and 1.90× on DeepSeek-V3, while remaining compatible with existing LLM serving systems.

Abstract:
Reconstructing atomistic crystal structures from a single noisy STEM projection is an illposed inverse problem: multiple lattices can explain similar contrast, and purely feed-forward models cannot verify physical validity. We presentAutoMat, a failure-aware agenticcontrollerthat performs inference-time hypothesis search withclosed-loop verificationto convert Scanning Transmission Electron Microscopy (STEM) images into simulation-ready crystal structures and downstream properties. AutoMat composes perception and physics modules—pattern-adaptive denoising, physics-guided template retrieval (as a fallback), symmetry-constrained atomic reconstruction, and MLIP-based relaxation/validation—and triggers rollback-and-retry when verification fails. For systematic evaluation, we introduceSTEM2Mat-Bench, a benchmark dataset containing 450+ annotated samples. Performance is assessed using lattice root-mean-square deviation (RMSD), formation energy mean absolute error (MAE), and structure matching accuracy. Results demonstrate that AutoMat outperforms existing approaches including SOTA models, specialized domain tools, and closed-source multimodal large models. This work establishes a direct pathway from microscopic characterization to atomic-scale modeling, addressing a fundamental challenge in materials science.

Abstract:
The advancements of Large Language Models (LLMs) are primarily attributed to massive pretraining data, which also introduces risks like privacy leakage and data contamination. Therefore, it is crucial to determine whether an LLM has been trained on a given target text. Existing detection methods primarily rely on local statistics of isolated tokens (e.g., those with the lowest probabilities), neglecting the probability dynamics during the token generation process. In this paper, we shift the detection paradigm from a local token to a global sequence perspective, grounded in the core intuition that memorized sequences exhibit volatility patterns distinct from those generated via inference. We propose Adaptive Entropic Convolutional Analysis (AECA), a framework that conceptualizes the probability sequence as a dynamic signal, integrating calibration with convolutional filtering to effectively capture memorization signals. Extensive experiments demonstrate that AECA surpasses stateof-the-art baseline methods, achieving an average AUC improvement of up to 1.5%, with its advantage being particularly pronounced in long-text scenarios.

Abstract:
Traditional large language model (LLM) alignment methods are based on Reinforcement Learning From Human Feedback (RLHF), which learns a single reward model (implicitly or explicitly) from pairwise comparison data. This approach implicitly assumes homogeneous preferences across human labelersan assumption that is violated in practice. As a result, the learned reward model in RLHF is generally misspecified: Prior work shows that it is inconsistent with the population-average utility, incurring large distortion, and that recovering this utilitarian objective is provably impossible in the worst case. In this work, we show that the average utility is recoverable under a mild assumption. Our method, the Sign Estimator, simply replaces the standard cross-entropy loss function with a notion of binary classification loss and yields a reward model that is ordinally consistent with the population-average utility. We further establish a fast finite-sample convergence rate of O(n^-1/3) , which provides, to our knowledge, the first consistent estimator for heterogeneous preferences that does not suffer from the curse of dimensionality.

Abstract:
Progressive Learning (PL) reduces pretraining computational overhead by gradually increasing model scale. While prior work has extensively explored depth expansion, width expansion remains significantly understudied, with the few existing methods limited to the early stages of training. However, expanding width during the mid-stage is essential for maximizing computational savings, yet it remains a formidable challenge due to severe training instabilities. Empirically, we show that naive initialization at this stage disrupts activation statistics, triggering loss spikes, while copy-based initialization introduces gradient symmetry that hinders feature diversity. To address these issues, we propose SPARKLING (balancing Signal Preservation And symmetRy breaKing for width-progressive LearnING), a novel framework for mid-stage width expansion. Our method achieves signal preservation via RMS-scale consistency, stabilizing activation statistics during expansion. Symmetry breaking is ensured through asymmetric optimizer state resetting and learning rate re-warmup. Extensive experiments on Mixture-of-Experts (MoE) models demonstrate that, across multiple width axes and optimizer families, SPARKLING consistently outperforms training from scratch and reduces training cost by up to 35% under 2× width expansion.

Abstract:
Graph coarsening is a fundamental dimensionality reduction technique for scaling large graphs while preserving structural and feature information. However, most existing coarsening methods are designed for static graphs and do not extend well to dynamic settings where nodes, edges, and connectivity patterns evolve over time. Recomputing a coarsened graph from scratch after every update is often infeasible, which limits scalability and realtime applicability. To address this, we propose a unified framework for coarsening discrete-time dynamic graphs by incrementally updating the coarsening mapping matrix. The framework initializes from any static coarsening technique and then efficiently incorporates real-world graph events, including node additions, node deletions, and edge modifications. We instantiate this framework with two optimization based incremental update algorithms tailored to different dynamic regimes, one focusing on efficiently integrating growth related changes and another handling broader topology evolution with adaptive reassignment. We derive fast and scalable solvers with convergence guarantees, and provide theoretical guarantee via \epsilon-similarity bounds that quantify and control quality degradation in the coarsened graph. Extensive experiments under realistic dynamic scenarios show substantial improvements in runtime and memory, delivering significant speedups while maintaining or improving downstream task performance, including graph neural network accuracy.

Abstract:
Highstakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news. While directly training on this data leads to performance drops, carefully curating questions creates a valuable training resource. We use the resulting dataset, OpenForesight, to post-train Qwen3 thinking models. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval, and an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing between May to August 2025. Our specialized model, OpenForecaster-8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find calibration improvements from forecasting training generalize across popular benchmarks. We will open-source our models, code, and data to make LLM based forecasting research broadly accessible.

Abstract:
Representation alignment (REPA) has been investigated to accelerate diffusion training, but we observe that regularizing intermediate representations in diffusion Transformers (DiT) may implicitly entangle latents and limit generative capacity. To address this issue, we propose ReGen, a hierarchical multiprompt representation generation framework that jointly estimates multiple vector fields for both representations and data within a single diffusion model. We further introduce generalized flow matching (GFM) to improve the generalization of conditional flow matching (CFM). We validate ReGen on single-stage waveform diffusion models including neural audio codec and Wave-VAE. ReGen significantly improves waveform generation quality from highly compressed latent representations at 12.5 Hz. We also present ReGenVoice, a latent diffusion model (LDM)-based text-to-speech model that achieves strong speech intelligibility (WER) and speaker similarity (SIM) with a small dataset. Moreover, operating the LDM at 6.25 Hz with rich semantic and acoustic latent representation enables efficient training and sampling, requiring only 1 day of training on 4 GPUs and fast inference with an RTF of 0.08.

Abstract:
Model Context Protocol (MCP) enables Large Language Model (LLM) agents to interact with external tools, but this extensibility introduces significant supply chain vulnerabilities that enable covert privacy exfiltration. Prior studies have revealed privacy leakage in MCPenabled agents via indirect prompt injection; however, existing attacks are typically misaligned with the agent's tool-usage context and rely on rigid templates, resulting in recognizable patterns that are readily flagged by existing defenses. In this work, we exploit the observation that privacy exposure is inherently scenario-dependent, to associate certain privacy items with specific tools. We introduceSOPE, a Scenario-aware and zerO-click Privacy Exfiltration framework that transforms any benign MCP server into its privacy-exfiltrating variants.SOPE(1) identifies privacy items that are appropriate to the tool usage, (2) embeds privacy-probing instructions into tool-invocation prompts, and (3) achieves zero-click data transmission via code-level modifications. We evaluateSOPEacross 27,216 test cases, where 324SOPE-transformedreal-worldservers attacking four benchmark and three commercial agents withninestate-of-the-art defenses. Results demonstrate thatSOPEremains highly effective and robust, highlighting critical protocol-level safety gaps in the agent ecosystem.

Abstract:
As ClassifierFree Guidance (CFG) has proven effective in conditional diffusion model sampling for improved condition alignment, many applications use a negated CFG term as a Negative Prompting (NP) to filter out unwanted features from samples. However, simply negating CFG guidance creates an inverted probability distribution, often distorting samples away from the marginal distribution. Inspired by recent advances in conditional diffusion models for inverse problems, here we present a novel method to achieve guidance toward the given condition using contrastive loss. Specifically, our guidance term aligns or repels the denoising direction based on the given condition through contrastive loss, achieving a similar guiding effect to traditional CFG for positive conditions while overcoming the limitations of existing negative guidance methods. Experimental results demonstrate that our approach effectively injects or removes the given concepts while maintaining sample quality across diverse scenarios, from simple class conditions to complex and overlapping text prompts.

Abstract:
Federated Learning (FL) often suffers from a tradeoff between global model performance and client-level fairness due to data heterogeneity, which often leads to inconsistent performance of the globally trained models, resulting in unfair outcomes among users. Existing fair FL algorithms face a trade-off: they either sacrifice global model performance to promote fairness or fall short of achieving optimal fairness. In this paper, we propose a novel framework that bridge this trade-off by integrating information-theoretic principles with model alignment. Specifically, we leverage the Maximum Entropy Principle to derive an analytic, closed-form solution for fair aggregation weights, ensuring significant fairness enhancements with minimal computational overhead. To maintain the global model performance, we further employ a step-wise model alignment strategy that synchronizes gradient directions across heterogeneous clients, effectively mitigating the drift induced by local updates. Theoretical analysis proves that our method guarantees convergence even in non-convex settings. Importantly, we push the theoretical frontier of federated fairness by extending performance variance analysis to generalized regression, providing broader guarantees. Extensive experiments on five datasets demonstrate that our approach consistently outperforms state-of-the-art methods, achieving superior fairness without sacrificing global accuracy.

Abstract:
Discrete flow matching (DFM) provides a principled framework for generative modeling on discrete state spaces via continuoustime Markov chain dynamics. In practice, sampling for DFM commonly employs discretizations such as \tau-leaping, yet efficient sampling methods under a limited number of function evaluations (NFE) remain less studied. To address this gap, we propose the Time-Reparameterized Cumulative Intensity Extrapolation (TR-CIE) sampler, which aims to improve sampling quality when function evaluations are restricted. TR-CIE consists of two components. First, a schedule-based time reparameterization rescales the time grid according to the noise schedule. Under standard factorized DFM rate parameterizations, this transformation of variables absorbs the schedule-dependent growth term and mitigates stiffness near the terminal sampling stage. Second, we introduce a cumulative-intensity extrapolation updating rule. By reusing cached model outputs from the previous step as a history term, this improves the approximation of stepwise cumulative intensities on the resulting non-uniform time grid. We provide a theoretical analysis that bounds the local approximation error of cumulative intensities and establishes convergence results. The resulting sampler requires one NFE per step and introduces no additional model evaluations compared to the standard \tau-leaping sampler. Extensive experiments on synthetic tasks, text generation, and text-to-image benchmarks demonstrate that our method improves sampling quality under limited NFE.

Abstract:
Federated stochastic multilevel compositional minimax optimization supports a growing number of machine learning applications. However, the interplay of multi-level compositional structure, minimax formulation, and federated setting poses significant optimization challenges, resulting in slow convergence rates for existing algorithms. In this paper, we propose a novel federated algorithm, LS^2MC-GDA, that leverages smoothing techniques and variance reduced stochastic compositional gradients. To support our theoretical analysis, we introduce a stage-wise extension of LS^2MC-GDA, which serves to bridge the gap between different stationarity measures. We establish that our algorithm achieves a sample complexity of O(\kappa^3/2/N\epsilon^3) and a communication complexity of O(\kappa/\epsilon^2), substantially improving existing theoretical results in terms of the condition number \kappa and the solution accuracy \epsilon and achieving a linear speedup with respect to the number of workers N. Finally, experimental results validate the effectiveness of our approach.

Abstract:
Most existing time series forecasting models are trained with backpropagation, which often brings high computational cost and limited transparency, so it can be hard to understand why a model makes a given prediction. This paper presents FIPN, a forward selforganizing interpretable polynomial network for time series forecasting. FIPN grows its architecture layer by layer and avoids backpropagation. Each neuron couples a fuzzy-rule antecedent with a Fourier-enhanced polynomial consequent: fuzzy clustering softly partitions the input space and produces interpretable rule weights for local regimes, while the consequent operates directly on the original features and uses Fourier functions to capture periodic and frequency-related structure. Forward growth can lead to redundancy, collinearity, and overfitting as depth increases, so FIPN introduces regularized node scoring, node-level dropout, and persistent access to raw inputs at every layer to stabilize closed-form estimation and improve generalization. Experiments on long-horizon forecasting benchmarks show that FIPN achieves competitive accuracy with a compact model size, and the learned fuzzy rules provide consistent, structure-based explanations. These results suggest that forward self-organizing polynomial networks offer a practical balance among accuracy, efficiency, and interpretability for long-term time series forecasting.

Abstract:
VisionLanguage Models (VLMs) are costly at inference time because they must process long sequences of visual tokens. Existing token pruning methods often degrade under high compression by blindly discarding information, breaking spatial structure or collapsing diversity. We propose SpecFlow, a training-free framework that shifts the paradigm from destructive pruning to conservative condensation, strictly enforcing spatial coverage and statistical conservation to ensure stability. Treating visual tokens as nodes in a kNN graph, SpecFlow (i) computes a stable importance field via spectral heat flow to preserve structural coherence, (ii) allocates budgets via adaptive spatial partitioning to guarantee coverage, and (iii) aggregates discarded information into coreset sinks to maintain statistical conservation. The method is plug-and-play, requires no fine-tuning, and is compatible with FlashAttention. Experiments confirm that our SpecFlow outperforms SOTA methods across tasks, VLM architectures, and pruning ratios. Notably, LLaVA-1.5 with SpecFlow retains 95.6% of original performance despite pruning 88.9% of visual tokens, offering an exceptional efficiency-accuracy balance.

Abstract:
Zerothorder (ZO) optimization offers a more memory-efficient alternative to first-order methods for fine-tuning large language models (LLMs). Recent ZO methods, exemplified by LOZO, estimate gradients within low-rank subspaces to align with the low-rank structure of LLM gradients. However, these methods rely on randomly generated subspaces of a fixed rank, which provides no guarantee of alignment with the actual dominant subspaces of the gradients; essentially, they remain ZO gradient descent with stochastic subspace sampling. To more effectively exploit the low-rank nature of LLM gradients, we propose LOZO+, an efficient ZO fine-tuning algorithm for LLMs that incorporates greedy Low-Rank subspace selection. Specifically, LOZO+ leverages loss-based feedback to assess alignment between candidate directions and the dominant low-rank gradient subspaces, and employs an adaptive thresholding criterion to retain only directions yielding substantial gradient descent, thereby steering ZO optimization toward more effective convergence. Importantly, we establish a theoretical framework that characterizes the convergence behavior of LOZO+, formally prove its superiority over existing methods. Extensive experiments demonstrate that LOZO+ consistently outperforms existing ZO methods and achieves performance competitive with FO algorithm, while retaining the memory efficiency inherent to ZO optimization.

Abstract:
In many realworld networks, relationships are inherently directional, yet most graph neural networks (GNNs) assume undirected edges, and naïve adaptations of undirected GNNs to directed graphs amplify oversmoothing and gradient pathologies that cap model depth. Unitary graph convolutions (UniConv) provably prevent representational collapse and oversmoothing, but cannot incorporate edge directionality or edge features. In this paper, we introduce adirectedunitary GNN withedge features (Dune), which retains these guarantees while overcoming UniConv’s limitations by incorporating edge directionality and edge features. Dune keeps gradient norms bounded at any number of layers, allowing it to benefit from neural network depth, unlike existing directed GNNs. The same unitary operator can be embedded in hybrid architectures with graph transformers, where its wavelike propagation supplies positional information and reduces the importance of random-walk or Laplacian-based encodings. We prove that Dune avoids exponential oversmoothing that plagues existing directed GNNs and empirically show that it achieves state-of-the-art performance on 12 directed-graph benchmarks while remaining trainable beyond 100 layers, improving performance by up to 18 percentage points over strong baselines. Our results establish unitary convolutions as a scalable, geometry-aware foundation for deep learning on directed graphs.

Abstract:
Engineering construction automation aims to transform natural language specifications into physically viable structures, requiring complex integrated reasoning under strict physical constraints. While modern LLMs possess broad knowledge and strong reasoning capabilities that make them promising candidates for this domain, their construction competencies remain largely unevaluated. To address this gap, we introduce BuildArena, the first physicsaligned interactive benchmark designed for language-driven engineering construction. It takes a first step towards engineering automation using LLMs. Technically, it contributes to the community in two aspects: (1) an extendable task design strategy spanning static and dynamic mechanics across multiple difficulty tiers; (2) a 3D Spatial Geometric Computation Library for supporting construction based on language instructions. On nine frontier LLMs, BuildArena comprehensively evaluates their capabilities for language-driven and physics-grounded construction automation.

Abstract:
Disentanglement learning aims to separate the underlying factors of variation (FoV) to improve generalization. However, most FoVbased latent-vector-centric methods impose objective-driven constraints at a bottleneck, and it is difficult to translate disentanglement into consistent gains on downstream tasks without inductive bias. Motivated by architectural approaches complementary to vector-centric objectives for downstream tasks, we propose theOrthogonal Subspaces Projection(OSP) layer, a plug-and-play module that integrates into intermediate layers and promotes FoV separation by projecting latent features into mutually orthogonal subspaces. Across diverse domains and tasks, models equipped with the OSP layer improve disentanglement quality and generalization in downstream tasks, including computer vision (classification, detection, and segmentation), natural language processing (word analogy), and fine-tuning settings on large backbones.

Abstract:
Counterfactual explanations (CFs) help understand machine learning models by identifying minimal input changes that would lead to alternative model outcomes. Recent work demonstrates their utility for reconstructing blackbox models, enabling third-party auditing of opaque decision systems for fairness and accountability. Still, CF-based reconstruction may suffer from decision boundary shifts, overfitting, and restrictive assumptions requiring online query access to target platforms. We propose REconstruction via Counterfactual-Aware waSserstein opTimization (RECAST) under limited data and restricted access, a behavioral surrogate model based on Wasserstein barycenteric prototypes. Our approach addresses decision boundary shifts by incorporating CFs as informative, though less representative, samples for both classes, maintaining high surrogate fidelity in low-sample regimes without requiring online access during reconstruction. To enhance fairness auditing, our method enables systematic group fairness diagnostics. Experiments on real-world datasets and various setups show that RECAST effectively achieves high fidelity and query efficiency, as well as stable results even when the access is limited and noisy.

Abstract:
Most multimodal learning assumes redundant views (such as image–caption pairs), yet many applications require combining complementary modalities that provide distinct evidence (such as an Xray and medical history). We term this setting Complementary Multimodal Classification (CMC). In CMC, existing explainable-by-design methods often force an accuracy–interpretability trade-off because single shared similarity metrics fail under asymmetric, class-conditional evidence. To address this, we propose Coherent Alignment of Multimodal Prototypes (CAMP). CAMP enforces coherent multimodal reasoning by aligning class-wise evidence via optimal transport and imposing geometric constraints to counter modality dominance and representation collapse. We provide theoretical guarantees showing that these mechanisms eliminate such degeneracies without restricting expressivity. Empirically, across 17 public CMC datasets, CAMP matches or exceeds large (>100M parameter) AutoML baselines with fewer than 1M trainable parameters, and when fine-tuned end-to-end it achieves state-of-the-art performance. To the best of our knowledge, this work is the first modality-agnostic prototype-learning framework designed for complementary multimodal tasks.

Abstract:
Recently, many empirical work has shown that, in machine learning, the noise distribution of stochastic gradients often exhibits heavy tails when stochastic optimization methods are employed. Most existing theoretical analyses of heavytailed stochastic methods rely on various convexity and smoothness assumptions and our knowledge of how heavy-tailed stochastic methods behave in the setting of weakly convex optimization is still limited. In the weakly convex setting, this paper derives new upper bounds on the convergence of the stochastic gradient method (SGD) under heavy-tailed noises. In particular, for vanilla SGD, we establish an in-expectation convergence guarantee on the bounded constrained domain under the assumption of bounded p-th central moment (p-BCM) of the gradient noise, and a high-probability guarantee on the unbounded domain when the noise follows a heavy-tailed sub-Weibull distribution. By equipping SGD with the gradient clipping (Clip-SGD), we demonstrate that it achieves high-probability convergence in the unbounded domain under the p-BCM gradient noise. All of our high-probability convergence bounds depend on the failure probability only through polynomial-logarithmic factors. Finally, we present numerical experiments to validate our theoretical findings.

Abstract:
Hypergraph Neural Networks (HNNs) model highorder interactions effectively but rely on costly node annotations, motivating Hypergraph Active Learning (HAL). However, many HAL pipelines adapt graph-based querying through clique expansion, which introduces structural bias and can cause \emphranking collapse, making utilities overly determined by hyperedge cardinalities rather than informative high-order relations. We proposeHIAL(HypergraphInfluence-basedActiveLearning), a training-free framework that reformulates HAL as influence maximization directly on hypergraphs. HIAL employs a High-Order Interaction (HOI)-aware propagation mechanism that modulates influence flow using within-hyperedge feature consistency, capturing both feature sensitivity and topological reachability while preserving HOI semantics. We prove the resulting objective is monotone and submodular, enabling an efficient greedy solver. Experiments on eight benchmarks demonstrate that HIAL consistently outperforms strong baselines across diverse hypergraph domains.

Abstract:
Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with finegrained understanding tasks. In this work, we proposeProcedurally Generated Tasks (PGT)a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generate additional dense supervision that disentangles visual grounding capability from semantic priors. Extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks show that PGT yields remarkable gains across diverse architectures. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data results in improvements of up to +20% on the What’sUp benchmark and +13.3% on CV-Bench-2D, while maintaining general perception capabilities. Moreover, finetuning state-of-the-art MLLMs on PGT data leads to boosts of up to +5.5% on What’sUp and +8.3% on CV-Bench-2D. These findings demonstrate that PGT effectively address the bottleneck of fine-grained perception, revealing that many spatial reasoning deficits stem from inadequate supervision signals rather than inherent architectural or resolution limitations.

Abstract:
Despite progress of Multimodal Large Language Models (MLLMs) in biomedical visual question answering (VQA), existing benchmarks provide limited assessment of their scientific reasoning capabilities. Most datasets adopt singleimage question construction and outcome-oriented evaluation, where correctness is judged by answer plausibility rather than alignment with experimental evidence. Such formulations fail to capture the evidence-constrained, multi-step nature of biomedical reasoning, and obscure whether models can derive conclusions through causal interpretation of experimental observations. To address these critical gaps in reasoning evaluation, we propose a principled benchmark construction framework that reconstructs scientific reasoning paths directly from biomedical literature. By jointly modeling clusters of experimentally related images together with their captions and context, the framework generates tightly coupled question–reasoning–answer triples that require multi-image integration and explicit evidence-driven inference. Based on this framework, we introduce SORBE (Scientific Observation & Reasoning for Biomedical Evaluation), a large-scale multi-image biomedical VQA benchmark designed to evaluate evidence alignment and multi-step experimental reasoning. Under a process-oriented evaluation metric, state-of-the-art biomedical-specialized MLLMs exhibit substantial performance degradation, revealing systematic limitations in evidence grounding and causal reasoning that are not reflected by existing benchmarks.

Abstract:
Recent formal reasoning systems achieve IMOlevel performance, but create a fragmented landscape: algebra and number theory use Lean, while geometry relies on domain-specific languages with limited formal guarantees. This fragmentation increases the trusted computing base and hinders unified model development. Existing geometry-in-Lean efforts (LeanEuclid, LeanGeo) introduce custom axiom systems incompatible with standard \mathlib, and their small scale (<1K problems) prevents large-scale training. However, native \mathlib autoformalization of geometry poses unique challenges: explicating implicit diagrammatic assumptions (e.g., topological configuration and non-degeneracy)---unlike existing custom systems that defer validity checks to external solvers---and adapting to \mathlib's small, rapidly-evolving geometric constructs. We present \method, a framework that addresses these challenges through a four-stage pipeline---constraint explication, configuration anchoring, formalization mapping, and iterative repair---to automatically formalize geometry in native \mathlib. We construct OMNI-Geometry (768 competition problems) and Numina-Geometry (177,597 problems), the largest geometry formalization dataset in Lean. Human evaluation shows 48.89% TOP1 and 73.33% TOP5 accuracy. Training Goedel v2 on our formalizations improves proof success from 13.6% to 15.1%, validating dataset quality for unified neural theorem proving.

Abstract:
Despite the high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues. Interpreting their inner workings is therefore essential for safe deployment. Sparse autoencoders (SAEs) shed light on decomposing languagemodel representations into concepts. However, adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by stimulus-efficient probing in systems neuroscience, we propose ViSAE, a compact diagnostic toolbox for interpreting the internal mechanisms of ViTs. Specifically, we introduce a probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20× over ImageNet and interpretation accuracy by 28.7% over existing concept sets. We further develop an algorithm that automatically interprets SAE features at scale and causally traces cross-layer interactions to recover ViT inner workings using concept circuits. Our method supports auditing spurious correlations and failure modes, and boosts worst-group accuracy on WaterBirds by 48.2% through concept steering.

Abstract:
While diffusion models have advanced anomaly detection, their reliance on multistep noise schedules introduces significant computational complexity. In this paper, we demonstrate that the generative capability of diffusion is not required for tabular one-class anomaly detection. We revisit core principles of denoising without targeting data generation and present a deep-learning approach that streamlines these objectives into a fixed-noise formulation. Unlike standard denoising autoencoders that rely on reconstruction error, our method utilizes a preconditioning with an explicit linear reference channel. We train a denoising predictor to recover clean samples from perturbed observations and derive anomaly scores from the expected deviation under repeated perturbations. We theoretically motivate this with a stability proxy by a first-order approximation rather than just distance to the manifold. On the well-established ADBench benchmark, our approach achieves improvements over existing methods of 1.22% in AUCROC and 1.13% in AUCPR, the most informative and threshold-independent metrics. Our approach emphasizes structural simplicity and efficiency, indicating that a single-step, stability-based objective outperforms complex generative schedules for tabular data.

Abstract:
Embodied agents operating in decentralized and partially observable environments have attracted growing attention in recent years. However, existing large language model (LLM)–based agents often exhibit behaviors that are misaligned with their partners or inconsistent with the environment state, leading to inefficient cooperation and poor task success. To address this challenge, we propose a novel framework, Learning Laws for Cooperation (LLawCo), that enables embodied agents to autonomously align with both their partners and task objectives. Our framework allows agents to reflect on past failures to extract misaligned behavioral patterns, which are used to derive highlevel behavioral laws (e.g., “Talk when necessary”, “Wait for partner”). These laws are explicitly incorporated into the agents’ chains of thought via supervised fine-tuning, aligning their reasoning with task requirements and the behavior of other agents. To evaluate our approach, we introduce PARTNR-Dialog, a large-scale multi-agent communicative and cooperative planning benchmark built on the PARTNR environment. Experiments on existing tasks and our new benchmark demonstrate significant improvements in cooperative efficiency and task success rates. Across four backbone LLMs, our method achieves average success rate improvements of 4.5% on the PARTNR-Dialog benchmark and 6.8% on the TDW-MAT benchmark over state-of-the-art open-source communicative agent frameworks.

Abstract:
Collaborative perception (CP) significantly extends the sensing range of connected and autonomous vehicles (CAVs). However, its reliance on data fusion among multiple CAVs makes it inherently vulnerable to adversarial attacks from malicious participants. Existing defenses primarily rely on outputlevel consensus, assuming that malicious messages manifest as statistical outliers, while suffering from poor adaptability to environmental noise. This makes them vulnerable to stealthy adversarial attacks and prone to high false positive rates. To address this challenge, we shift the defense paradigm from superficial output-level consensus to deeper consistency within the internal feature space. Guided by this principle, we propose \textttCerberus, a novel defense framework against adversarial attacks in CP systems by leveraging multi-dimensional consistency in the feature space. By quantifying conflicts in topological structure, semantic direction, and energy distribution within feature maps, \textttCerberus effectively detects adversarial perturbations and provides dynamic protection against adversarial attacks. Experimental results demonstrate that \textttCerberus significantly outperforms state-of-the-art methods, effectively limiting the attack success rate to as low as 0.05% while restoring the mAP to 0.88.

Abstract:
Large Language Models have significantly advanced autonomous agents through their sophisticated perception and execution capabilities. Despite effective, agents still struggle with robust decisionmaking due to passive learning from similar experiences that often confound correlation with causality. Inspired by the Scientific Method, we propose a Cycle-of-Science framework that autonomously explores potential causal pathways through an iterative loop of Hypothesis, Experiment, and Validation, enabling agents to identify truly effective causal dependencies. To be specific, we first leverage causal knowledge to guide the initial hypotheses generation. These hypotheses are then analyzed through experiments using counterfactual samples. Afterward, we perform causal analysis to quantify effects of interventions, deriving well-validated hypotheses for next agent steps. Finally, we introduce adaptive threshold calibration that modulates causal validation based on policy uncertainty. Experiments on benchmarks demonstrate that our method achieves superior performance over state-of-the-art approaches.

Abstract:
The training of neural networks often entails objective functions that are not globally Lsmooth. For these functions, it is both theoretically and practically difficult to reply to the question: what is the largest possible step size that ensures the convergence of gradient descent (GD)? We address this longstanding open question in deep learning by providing a unifying definition of "large'' step sizes that requires only local Lipschitz (or even Hölder) continuity of the gradient. We design first-order adaptive methods that provably yield large step sizes and show that they operate at the edge of stability (EoS) right from the start of the training. In particular, the loss decreases nonmonotonically and the product between the step size and sharpness, i.e., the largest eigenvalue of the hessian, stays above the EoS threshold of 2 throughout training. Using our method, we are also able to minimize the sharpness all the way down to its global minimum. Contrary to expectation, we find that encountering globally-flat regions too early in the training may both slow down convergence and jeopardize the generalization ability of the network. Exploiting a self-stabilization argument, we allow GD to enter slightly sharper valleys and turn unsuccessful training runs into very successful ones.

Abstract:
Reconstructing hyperspectral images from compressive measurements is challenging due to a fundamental mismatch between locally reliable observations and globally entangled structures induced by spectral dispersion. This study formalizes this issue as a local–global dissonance in representation learning for CASSI systems. To resolve it, we propose a Hierarchical ScaleReconciling Architecture (HSRA) that enforces local sufficiency and global consistency in a principled, scale-aware manner. HSRA combines multi-kernel token mixing, latent window interactions, and hierarchical multi-granularity spatially shifted attention to progressively reconcile physical constraints across scales. Embedded into a deep unfolding framework as a physically grounded learned prior, Extensive experiments on benchmarks demonstrate that HSRA achieves consistent and significant improvements over state-of-the-art methods.

Abstract:
Reasoningcapable large language models (LLMs) have recently been adopted as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. Through controlled comparisons between reasoning and non-reasoning judges, we show that explicit reasoning substantially improves judgment accuracy on tasks requiring structured verification (e.g., math and coding), while offering limited or even negative gains on simpler evaluations and incurring significantly higher computational cost. These findings motivate that reasoning should be used selectively rather than universally, with awareness of possible distribution shift. We propose a Robust Adaptive Cost-Efficient Router (RACER), which dynamically selects between reasoning and non-reasoning judges under a fixed budget by formulating routing as a constrained distributionally robust optimization problem. RACER explicitly accounts for distribution shift via a KL-divergence uncertainty set, admits an efficient primal–dual algorithm, and enjoys theoretical guarantees including uniqueness of the optimal policy and linear convergence. Extensive experiments show that RACER achieves superior accuracy–cost trade-offs under distribution shift.

Abstract:
Collaborative perception enhances environment understanding through multiagent information sharing, but its performance in real-world scenarios is constrained by heterogeneous sensor modalities and model architectures. Recent protocol-based two-stage methods alleviate this problem by mapping heterogeneous features into a shared protocol space; however, independently trained modality-specific converters often generate modality-specific pseudo-protocol distributions, leading to semantic inconsistency and error accumulation, which is particularly pronounced in scenarios with large modality discrepancies. To address this issue, we propose CauseCollab, a causal unified and modality-agnostic network. CauseCollab formulates representation learning in the protocol space from a causal perspective, explicitly disentangling semantic factors from modality-specific statistical confounders via causal metric learning. Meanwhile, CauseCollab adopts context-guided Unified Converter for heterogeneous modalities to ensure cross-modal semantic consistency. In addition, integrating new modalities only requires training adapters with minimal parameters. Extensive experiments on the OPV2V and DAIR-V2X datasets demonstrate that CauseCollab achieves state-of-the-art performance, with more significant gains in scenarios involving large modality gaps.

Abstract:
Deep neural networks have recently advanced infrared and visible image fusion (IVIF), but most existing methods rely on sophisticated yet redundant designs, which hinder realtime deployment on mobile devices with limited compute and memory. In this paper, we present MobileFusion, an extremely lightweight and effective convolutional framework that achieves high-quality fusion under strict resource constraints. MobileFusion leverages a re-parameterizable multi-branch convolution module to promote cross-modal interactions during training while collapsing into a single-path operator for fast inference. It further incorporates a lightweight attention module to enhance context awareness, together with a re-parameterized feed-forward network to improve feature expressiveness. Extensive experiments demonstrate that MobileFusion delivers a favorable trade-off between fusion quality and computational efficiency, enabling real-time and high-quality IVIF on resource-constrained platforms.

Abstract:
Despite remarkable progress in textto-image diffusion models, accurately generating the specified number of objects remains a persistent challenge. We identify the initial noise as a primary determinant of spatial layout formation, with early-stage cross-attention serving as the key mechanism that mediates the propagation of noise-induced structures throughout the denoising process. We formalize this phenomenon as the Noise-Induced Layout Prior. Leveraging this insight, we propose a novel training-free framework for object counting in diffusion models. Our approach consists of two key components: (1) a \emphCount-Aware Noise Adjustment Strategy, which explicitly manipulates the initial latent noise to align layout formation with the target object count, and (2) an \emphAttention-Guided Layout Consistency Strategy, which performs test-time optimization on early-stage cross-attention to further stabilize layout formation during denoising. Extensive experiments on both single-category and multi-category benchmarks demonstrate that our method consistently outperforms strong diffusion baselines and state-of-the-art object count control methods in terms of counting accuracy and image quality.

Abstract:
In this paper, we study a class of nonconvex optimization problems known as multi-affine quadratic equality constrained problems, which appear in various applications--from generating feasible force trajectories in robotic locomotion and manipulation to training neural networks. Although these problems are generally non-convex, they exhibit convexity or related properties when all variables except one are fixed. Under mild assumptions, we prove that the alternating direction method of multipliers (ADMM) converges when applied to this class of problems. Furthermore, when the "degree" of non-convexity in the constraints remains within certain bounds, we show that ADMM achieves a linear convergence rate. We validate our theoretical results through practical examples in robotic locomotion.

Abstract:
The surge of large language model (LLM) applications on personal devices imposes massive, bursty workloads on cloud serving infrastructure. While prefilldecode disaggregation improves throughput and scalability, memory-bound decode instances often suffer from persistent load imbalance, as output lengths are unknown when requests arrive at the cloud. To address this, we propose MAPS, a memory-aware predictive scheduling framework tailored for disaggregated LLM serving. MAPS performs device-assisted speculative output-length prediction overlapped with cloud-side prefilling, incurring negligible latency overhead. To handle generation uncertainty, MAPS applies uncertainty-aware calibration to derive output length upper bounds with target coverage, enabling safe scheduling decisions. Building on these bounds, MAPS employs a hierarchical global-local scheduling strategy to mitigate inter-decoder queue buildup and intra-decoder head-of-line blocking. Extensive experiments on two real-world workloads and two LLMs show that MAPS significantly outperforms three state-of-the-art systems, reducing average end-to-end latency by 42.6% and tail latency by up to 84.8%.

Abstract:
Shapley values are a widely used tool for attributing importance and interactions among input variables in blackbox models, but their computation involves a function defined over an exponentially large space of subsets. We propose TN-SHAP-G, a framework that exploits structure in graph-structured inputs to compute Shapley values and higher-order interaction indices efficiently. Given a predictor and a fixed masking scheme, TN-SHAP-G learns a compact, graph-aligned multilinear surrogate that approximates the masked-input behavior, represented as a tensor network whose topology mirrors the input graph. Once trained from a small number of oracle queries, the surrogate enables deterministic recovery of first- and higher-order Shapley indices via the multilinear extension, without additional model queries or Monte Carlo variance. Experiments on molecular benchmarks show that the learned factorization closely matches exact Shapley values on small graphs and scales efficiently to larger graphs where sampling-based methods become infeasible.

Abstract:
Hallucination detection is critical for ensuring the reliability of large language models (LLMs) in contextbased generation. Prior work has explored intrinsic signals available during generation, among which attention offers a direct view of grounding behavior. However, existing approaches typically rely on coarse summaries that fail to capture fine-grained instabilities in attention. Inspired by signal processing, we introduce a frequency-aware perspective on attention by analyzing its variation during generation. We model attention distributions as discrete signals and extract high-frequency components that reflect rapid local changes in attention. Our analysis reveals that hallucinated tokens are associated with high-frequency attention energy, reflecting fragmented and unstable grounding behavior. Based on this insight, we develop a lightweight hallucination detector using high-frequency attention features. Experiments on the RAGTruth and HalluRAG benchmarks show that our approach achieves performance gains over verification-based, internal-representation-based, and attention-based methods across models and tasks.

Abstract:
Retinalike event cameras and brain-inspired Spiking Neural Networks (SNNs) demonstrate exceptional energy efficiency through bio-inspired sensing and computation. While SNNs are naturally well-suited to the asynchronous nature of event data, their practical applications face the following challenges: sensitivity to noise, dense representations that disrupt spike pathways, and insufficient multi-scale feature perception. To address the aforementioned challenges, we propose a bio-vision-inspired object detection method motivated by biological (bio) vision systems. First, at the micro level, this paper proposes a noise-filtering STATNF-Neuron architecture to address the current sensitivity of basic neurons to noise. Based on STATNF-Neurons, the paper introduces two bio-vision-inspired macro-structures: Events-to-Spikes Representation (E2S), which preserves spiking characteristics while mimicking the memory and noise-filtering abilities of retinal neurons; Bidirectional Multi-Scale Spiking Network (BiSNet), which simulates cortical information flow pathways to integrate multi-scale features in both directions, enhancing the network's ability to perceive information at multiple scales. Extensive experiments show that the proposed bio-vision-inspired method achieving state-of-the-art performance. Notably, it reaches 96.1% accuracy on NCAR, 63.5% mAP\textsubscript50 on N-Caltech101, and 69.1% mAP\textsubscript50 on Gen1.

Abstract:
Repositorylevel automated program repair (APR) requires long-horizon reasoning over interdependent decisions. However, most LLM-based approaches reconstruct repair reasoning independently for each issue, failing to reuse successful patterns from prior repairs, even though real-world repositories contain many related issues with shared structure or constraints. Existing methods typically rely on forward exploration, which operates under outcome uncertainty, incurs substantial inference-time overhead, and can drift from the final correct patch. We propose Conditional Reasoning Distillation (ConRAD), which leverages in-repository resolved issues by reconstructing repair reasoning backward from verified patches and distilling outcome-consistent, stage-wise repair reasoning plans. Injected at inference time, these plans guide fault localization and patch generation, replacing open-ended exploration with constrained inference without fine-tuning or search. On SWE-Bench Lite, ConRAD improves Pass@1 by 10.4% (GPT-4o), 8.6% (DeepSeek-V3), and 10.3% (GPT-5), demonstrating a scalable inference-time alternative to forward exploration for long-horizon APR.

Abstract:
Alignment of large language models (LLMs) with human values has recently garnered significant attention, with prominent examples including the canonical yet costly Reinforcement Learning from Human Feedback (RLHF) and the simple Direct Preference Optimization (DPO). In this work, we demonstrate that both RLHF and DPO can be interpreted from the perspective of mutual information (MI) maximization, uncovering a profound connection to contrastive learning. Within this framework, both RLHF and DPO can be interpreted as methods that performing contrastive learning based on the positive and negative samples derived from base model, leveraging the Donsker–Varadhan (DV) lower bound on MI (equivalently, the MINE estimator). Such paradigm further illuminates why RLHF may not intrinsically incentivize reasoning capacities in LLMs beyond what is already present in the base model. Building on the perspective, we replace the DV/MINE bound with the Jensen–Shannon (JS) MI estimator and propose the Mutual Information Optimization (MIO). Comprehensive theoretical analysis and extensive empirical evaluations demonstrate that MIO mitigates the latestage decline in chosen-likelihood observed in DPO, achieving competitive or superior performance across various challenging reasoning and mathematical benchmarks

Abstract:
Semanticrich features from Vision Foundation Models (VFMs) have been leveraged to enhance Latent Diffusion Models (LDMs). However, raw VFM features are typically high-dimensional and redundant, increasing the difficulty of learning and reducing training efficiency for Diffusion Transformers (DiTs). In this paper, we propose Repack then Refine, a three-stage framework that brings the semantic-rich VFM features to DiT while further accelerating learning efficiency. Specifically, the RePack module projects the high-dimensional features onto a compact, low-dimensional manifold. This filters out the redundancy while preserving essential structural information. A standard DiT is then trained for generative modeling on this highly compressed latent space. Finally, to restore the high-frequency details lost due to the compression in RePack, we propose a Latent-Guided Refiner, which is trained lastly for enhancing the image details. On ImageNet-1K, RePack-DiT-XL/1 achieves an FID of 1.82 in only 64 training epochs. With the Refiner module, performance further improves to an FID of 1.65, significantly surpassing latest LDMs in terms of convergence efficiency. Our results demonstrate that packing VFM features, followed by targeted refinement, is a highly effective strategy for balancing generative fidelity with training efficiency.

Abstract:
Estimating treatment effects from observational text is increasingly practical with Large Language Models (LLMs). However, applying causal representation learning directly to highdimensional LLM embeddings faces a fundamental barrier: empirical Wasserstein matching suffers from the curse of dimensionality, rendering standard generalization guarantees effectively vacuous. We propose SPIKED-CFR, a framework bridging this gap by assuming a Spiked Confounding Structure, where treatment selection bias concentrates in a low-dimensional subspace of the semantic representation. We develop Wasserstein Projection Pursuit, a minimax objective that adversarially learns an orthogonal projection on the Stiefel manifold to identify and balance only this confounding subspace while preserving prognostic information. Under a spiked confounding structure, we show the projected discrepancy can be estimated at a rate governed by the intrinsic dimension k \ll D, and we derive a tighter PEHE generalization bound that depends on k rather than the ambient embedding dimension. Experiments on four semi-synthetic benchmarks and four real-world clinical benchmarks demonstrate improved accuracy and robustness over strong baselines. Code is available at \urlhttps://anonymous.4open.science/r/SpikedCFR-7E13.

Abstract:
Longvideo temporal reasoning remains challenging for Large Video Language Models (LVLMs). Recent reasoning-enhanced models apply reinforcement learning with outcome supervision to improve temporal understanding. However, outcome-only rewards cannot distinguish whether a model arrived at the correct answer through valid temporal reasoning or fabricated claims, a fundamental limitation that undermines trustworthiness. We observe a key structural correspondence: in videos, events form \emphtemporal traces, ordered sequences of how entities interact over time; in model reasoning, \emphreasoning traces capture step-by-step temporal claims. Correct temporal reasoning requires the latter to mirror the former. This correspondence enables us to \emphverify reasoning traces against the video's temporal structure. We introduce Temporal Reasoning Traces (TRT), a unified representation that indexes ordered event chains from videos and serves as a verification oracle for model reasoning. Building on TRT, we propose trace-grounded process supervision: during reinforcement learning, each temporal claim in the model's reasoning trace is programmatically verified against TRT,rewarding grounded reasoning and penalizing fabrications. Unlike neural reward models that may themselves err, our verification is fully deterministic. Extensive experiments show the effectiveness of our model, achieving state-of-the-art performance.

Abstract:
Mainstream methods for multivariate timeseries forecasting largely follow the Direct-Mapping paradigm. They learn a unified mapping from history to the future in the observation space to fit value-level dependencies. However, real-world systems often undergo distribution shifts and regime changes. In such cases, a unified mapping can exhibit response lag around turning points, causing error accumulation within the switching window and reducing forecasting reliability. To address this issue, we propose L-Drive, a change-aware forecasting framework. L-Drive introduces a Latent-Context, to explicitly characterize high-level dynamics evolving over time, and uses gating to modulate increment representations. This provides more timely change cues and improves adaptation to changing segments. In addition, it incorporates patch-shared relative positional basis functions to strengthen intra-segment structural modeling and reduce overfitting caused by absolute-position memorization. Extensive experiments validate the effectiveness of L-Drive and show a better overall trade-off between forecasting accuracy and computational efficiency.

Abstract:
Reinforcement learning with verifiable rewards (RLVR) on foundation models has led to significant improvements in math and code generation. Extending these gains to openended domains remains challenging: ground-truth verification is unavailable, human annotation is expensive, and learnt reward models are prone to reward hacking. We introduce Rubric Curriculum RL (RcRL), a self-improvement method for creative short-fiction writing that requires no new data, human annotations, or stronger teacher models. RcRL exploits the generation-verification gap: it is easier to judge whether work is creative than to produce something creative. While this gap exists across open-ended domains, exploiting it for RL is challenging due to reward hacking. During training, we use pairwise preferences against a curriculum of rubric criteria, which provides a more stable signal than absolute scoring while reducing reward hacking against a stationary objective. Unlike baseline methods, which plateau or collapse within a few dozen steps, our approach preserves output entropy and shows improvements over 1000+ training steps. In human evaluations, RcRL-trained models achieve a 70.5% win rate and demonstrate consistent gains across multiple creative writing benchmarks and judges.

Abstract:
Scorebased diffusion models have demonstrated remarkable empirical success in learning high-dimensional distributions, particularly those exhibiting low-dimensional and multi-modal structures. However, theoretical understanding of their statistical efficiency remains limited. Existing theories typically rely on strong regularity assumptions, such as uniformly bounded densities or globally smooth score functions, which fail to capture such intrinsic structures. In this work, we study the sample complexity of diffusion models in learning distributions supported on a union of low-dimensional subspaces. Assuming that data within each subspace follows a subgaussian distribution, we show that diffusion models require at most \widetildeO(\varepsilon^-k \vee 2) samples to achieve \varepsilon error in 1-Wasserstein distance, where k is the intrinsic dimension. This near-optimal rate depends only on the intrinsic dimension and significantly improves upon prior works that suffer from the curse of dimensionality. Notably, our analysis applies to a broad collection of distributions without requiring smoothness or log-concavity assumptions. These results provide rigorous evidence for the effectiveness of diffusion models in learning low-dimensional, multi-modal distributions.

Abstract:
Time series forecasting can be viewed as a generative problem that requires both semantic understanding over contextual conditions and stochastic modeling of continuous temporal dynamics. Existing approaches typically rely on either autoregressive large language models (LLMs) for semantic context modeling or diffusionlike models for continuous probabilistic generation. However, neither method alone can adequately model both aspects simultaneously. In this work, we propose CoGenCast, a hybrid generative framework that couples pre-trained LLMs with flow-matching mechanism for effective time series forecasting. Specifically, we reconfigure pre-trained decoder-only LLMs into a native forecasting encoder–decoder backbone by modifying only the attention topology, enabling bidirectional context encoding and causal representation generation. Building on this, a flow-matching mechanism is further integrated to model temporal evolution, capturing continuous stochastic dynamics conditioned on the autoregressively generated representation. Notably, CoGenCast naturally supports multimodal forecasting and cross-domain unified training. Extensive experiments on multiple benchmarks show that CoGenCast consistently outperforms previous compared baselines. Code is available.

Abstract:
Although Reinforcement Learning FineTuning (RLFT) applied to Vision-Language Models (VLMs) substantially enhances multimodal reasoning capabilities, their prohibitive training cost limits broad adoption. Surprisingly, most existing methods simply port Large Language Model (LLM) RLFT techniques to VLMs, while ignoring a intrinsic property of multimodal models: their dynamic text–vision alignment. We ask a new question: Can this intrinsic alignment be turned into a training signal that makes VLM RLFT more efficient? We analyze how a VLM plans to attend, actually attends, and ideally should attend during reasoning, and derive two lightweight metrics from these patterns. Predictive View Accuracy (PVA) estimates sample difficulty, and Reasoning View Accuracy (RVA) reflects the quality of chain-of-thought (CoT) reasoning. These alignment signals enable automated data curriculum and dense reasoning supervision. We introduce FOCUS-RL, a plug-and-play framework that can be seamlessly integrated into any VLM and dramatically boosts RLFT training efficiency. FOCUS-RL achieves 2.5 x – 4 x faster convergence over vanilla GRPO and consistent accuracy gains (+4.4 on average) across six different benchmarks and multiple VLM families.

Abstract:
Functional magnetic resonance imaging (fMRI) provides dynamic measurements of human brain activity at high spatial resolution and depth, but its use is constrained by high cost, limited accessibility, and strict acquisition requirements. Synthesizing fMRI data from more accessible, noninvasive modalities such as electroencephalography (EEG) offers a promising alternative, enabling inference of deep brain dynamics from low-cost scalp recordings in naturalistic settings. Despite recent progress, existing EEG-to-fMRI translation methods typically rely on region-specific models and offer limited support for subject-level and dataset-level heterogeneity, restricting their generalizability. We propose UniEFS, a unified EEG-to-fMRI synthesis model that enables full-brain fMRI reconstruction while accommodating varying demographic and physiological contexts within a single model. Our approach leverages a pretrained fMRI decoder to embed rich spatial priors and introduces condition-aware prompt tokens that encode subject-level and experimental metadata, enabling effective handling of heterogeneous datasets. We extensively evaluate our model performance on eyes-closed resting-state data and demonstrate that it can reliably reconstruct temporally-resolved whole-brain fMRI activity, with strong potential to generalize to task-based fMRI and clinical populations in a zero-shot manner.

Abstract:
Recent advancements in multimodal reward models (RMs) have substantially improved posttraining for visual generative models. However, current RMs face inherent limitations:(1)visual inputs consume large context budgets, forcing fewer frames and causing a loss of details; and(2)all visual information is packed into the initial prompt, exacerbating forgetting during chain-of-thought reasoning. To overcome these issues, we introduceVR-Thinker, a thinking-with-image RM equipped with visual reasoning operation and a configurable visual memory window. This allows the RM to actively acquire visual evidence, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline:(i)Cold Startwith curated visual chain-of-thought data to distill basic operation formatting;(ii)select samples with correct judgments, then conductRejection sampling Fine-Tuningon these high-quality traces to further enhance reasoning; and(iii)applyGroup Relative Policy Optimization(GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks: a 7BVR-Thinkerachieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video.

Abstract:
Endto-end data science agent workflows involve tightly coupled sub-processes with strong dynamic dependencies, posing a challenging long-horizon orchestration problem. Existing frameworks primarily rely on static, chain-like execution plans, which are prone to error propagation from early stages—often causing reasoning chain collapse and task failure, resulting in fragile inference and poor cost-effectiveness. To address these issues, we propose \textR^3DAO, a reactive data agent orchestration framework based on feedback-driven topology evolution, aiming to build a dynamic evolutionary closed-loop of "hierarchical exploration, iterative recovery, and empirical convergence." First, we introduce a dynamic hierarchical task network that recursively decomposes global intent into macro-logical anchors and micro-operators, enabling low-cost exploration through dimensionality reduction in the logical space. Second, we establish a reactive topology reconfiguration mechanism that leverages semantic reflection to map execution anomalies into diagnostic signals, replacing costly global resets with localized topological optimization for resilient self-healing. Finally, semantic experience distillation implements a dual-loop accumulation that compresses long-horizon trajectories into structured prior, steering execution efficiency toward the optimal regime. Evaluations on the MLE-bench show that \textR^3DAO achieves a 77.36% improvement in success rate over advanced R\&D-Agent while maintaining competitive task scores. Notably, \textR^3DAO compresses the average execution time by 36× and limits token consumption to just 104k per task, showcasing superior reliability, efficiency, and cost-effectiveness.

Abstract:
Multimodal Large Language Models often suffer from object hallucinations, where generated outputs are inconsistent with the visual evidence. This issue is typically attributed to the overreliance on language priors, which can override the visual context. Recent training-free decoding strategies address this by penalizing language priors. However, these methods overlook the dual nature of language priors, where they can be both helpful and harmful depending on the alignment with visual evidence. In particular, blindly suppressing language priors often disrupts the model’s semantic manifold, leading to performance degradation, a phenomenon we term Manifold Departure. To address this, we propose Manifold-Guided Adaptive Projection (MGAP), a geometry-aware, training-free decoding method that mitigates hallucinations while preserving representation structure. MGAP first constructs a language-prior subspace from blind hidden states (null-image inputs) via SVD. During decoding, MGAP projects each multimodal hidden state onto this subspace and applies a consistency-aware gate to adaptively attenuate only the projected prior component, yielding a subspace-selective update that largely preserves the orthogonal semantic components. Extensive experiments on POPE and CHAIR show that MGAP outperforms prior decoding baselines, achieving stronger hallucination suppression without sacrificing coherence.

Abstract:
Deep learning approaches typically require prohibitive amounts of data to approximate strict Exact Cover Problems, while existing neurosymbolic methods often face training infeasibility and scalability bottlenecks. To bridge this divide, we propose the Hypergraph Optimization Network (HONet), an end-to-end framework integrating a topologically complete Deep Residual Hypergraph Encoder with a differentiable Equality-Constrained Quadratic Programming layer. By adopting a "Fixed Polytope" paradigm guided by the Geometric Consistency Loss, HONet explicitly shapes the objective landscape, forcing the valid discrete solution to align with the unique global energy minimum. Empirical results show that HONet rapidly achieves 100% accuracy on 9 × 9 Sudoku using limited samples, exhibiting superior data efficiency over baselines while maintaining exceptional robustness in highly sparse regimes and additional tasks.

Abstract:
We present a systematic study and a benchmark of inferencedriven linkage risk in LLM-based agents, where identities are reconstructed by aggregating fragmented, individually non-identifying signals. Historically, data linkage attacks required substantial domain expertise and bespoke feature engineering, but agentic reasoning can collapse this barrier, expanding privacy risk beyond memorization or direct leakage. We evaluate this risk across three settings: reproducing classical Netflix and AOL linkage attacks with agents, introducing a controlled de-anonymization benchmark that varies fingerprint structure and task intent, and analyzing real-world unstructured traces. Agents re-identify 79.2% of users on the Netflix dataset (vs. 56.0% for the classical heuristic), and on our benchmark, identity hypotheses emerge even under benign task framing and rise sharply under explicit re-identification requests. A privacy-aware system prompt can reduce leakage dramatically for some models (e.g., from 19/20 to 1/20 cases) but incurs utility loss, showing that mitigating inference-driven linkage often comes at a measurable utility cost.

Abstract:
ParameterEfficient Fine-Tuning (PEFT) is essential for adapting Large Language Models, yet existing methods typically struggle to balance model capacity with computational efficiency. Standard approaches often enforce rigid low-rank constraints, while dynamic alternatives incur significant memory overheads. To resolve this dilemma, we propose Spectral Bridge Variational Inference (SBVI), a geometric framework that reformulates LoRA not as static parameter optimization, but as a continuous Wasserstein gradient flow on the manifold of Gaussian measures. Rather than fixing the rank at initialization, SBVI governs the singular value evolution via a stochastic differential equation driven by a thermodynamic competition between task gradients and adaptive entropic friction. This mechanism induces a spectral bifurcation that automatically prunes redundant noise modes while amplifying signal-rich components, naturally discovering a layer-wise optimal rank distribution. We derive a scalable algorithm with linear complexity using factorized Riemannian retractions and an Empirical Bayes friction update. Experiments on reasoning and coding benchmarks demonstrate that SBVI achieves state-of-the-art performance, offering superior accuracy and memory efficiency compared to existing static and dynamic adaptation methods.

Abstract:
Recent image generators produce photorealistic content that undermines the reliability of downstream recognition systems. As visual appearance cues become less pronounced, appearance-driven detectors that rely on forensic cues or high-level representations lose stability. This motivates a shift from appearance to behavior, focusing on how images respond to controlled perturbations rather than how they look. In this work, we identify a simple and universal behavioral signal. Natural images preserve stable semantic representations under small, structured perturbations, whereas generated images exhibit markedly larger feature drift. We refer to this phenomenon as robustness asymmetry and provide a theoretical analysis that establishes a lower bound connecting this asymmetry to memorization tendencies in generative models, explaining its prevalence across architectures. Building on this insight, we introduce Robustness Asymmetry Detection (RA-Det), a behavior-driven detection framework that converts robustness asymmetry into a reliable decision signal. Evaluated across 14 diverse generative models and against more than 10 strong detectors, RA-Det achieves superior performance, improving the average performance by 12.92%. The method is data- and model-agnostic, requires no generator fingerprints, and transfers across unseen generators. Together, these results indicate that robustness asymmetry is a stable, general cue for synthetic-image detection and that carefully designed probing can turn this cue into a practical, universal detector.

Abstract:
Sequential finetuning on multiple datasets is ubiquitous, but the training order of sources can measurably change downstream performance; testing both orders roughly doubles compute. We model a single gradient step on a dataset as a nonlinear operator and show that non-commutativity induces order-dependent effects governed by a commutator (Lie-bracket) term. For two sources A,B and target domain E, this yields a directional score \sigma_AB^(E) = \langle g_E, H_B g_A - H_A g_B \rangle that predicts whether A \to B or B \to A yields lower L_E. We evaluate g_E at a reference point capturing the shared drift of both orders (Trotter scoring) and develop a theory-driven \eta-autopilot that selects step sizes from pilot data by balancing signal-to-noise against higher-order stability constraints. On four LLMs and a diffusion UNet, our planner achieves 81–94% overall sign accuracy and 82–100% on highest-impact decisions, enabling practical transfer-order planning without manual hyperparameter tuning.

Abstract:
Existing benchmarks for biological language models (BLMs) inadequately capture the challenges of realworld applications, often lacking realistic out-of-distribution (OOD) scenarios, evolutionary depth, and consistency in measurement. To address this, we introduce TadABench-1M, a new benchmark based on a wet-lab dataset of over one million variants of the therapeutically relevant TadA enzyme, purpose-built to embody these three essential attributes. Generated across 31 rounds of wet-lab evolution, it offers unparalleled evolutionary depth and naturally presents a stringent OOD challenge. To ensure measurement consistency across this extensive campaign, we developed Seq2Graph, a scalable graph-based algorithm that systematically unifies multi-batch experimental data. Our high-fidelity benchmark highlights a critical finding: while state-of-the-art BLMs excel on a standard random split of the data (Spearman’s ρ ≈ 0.8), they fail dramatically on a realistic temporal prediction task (ρ ≈ 0.1). This stark performance gap validates the importance of our benchmark’s design principles and suggests that evolutionary depth is critical for building models with realistic utility.

Abstract:
Computational approaches have become central to Protein–Protein Interaction (PPI) research, complementing experimental techniques that remain costly and incomplete. While modern deep learning methods capture diverse biological signals and hold promise in expanding the known interactome, empirical validation remains a critical bottleneck due to its long and expensive procedures. To address this challenge, we introduce the problem of PPI candidate ranking, aiming to prioritize interactions for experimental testing. We propose a novel framework that leverages domain knowledge through interpretabilityguided ranking and further refines prioritization by integrating complementary sources of evidence, including interaction scores, structural plausibility, and biomedical language features. Evaluations on a large-scale dataset constructed from successive STRING releases demonstrate that our approach yields significant improvements over two state-of-the-art PPI prediction models, providing more accurate and biologically coherent rankings.

Abstract:
Preference learning has garnered extensive attention as an effective technique for aligning diffusion models with human preferences in visual generation tasks. However, existing alignment approaches such as DiffusionDPO suffer from two fundamental challenges: training instability caused by high gradient variances at various timesteps and high parameter sensitivities, and off-policy bias arising from the discrepancy between the optimization data and the policy model's distribution. Our first contribution is a systematical analysis of the diffusion trajectories across different timesteps and identify that the instability primarily originates from early timesteps with low importance weights. To address these issues, we propose SIPO, a Stabilized and Improved preference Optimization framework for aligning diffusion models with human preferences. Concretely, a key gradient, \emphi.e., DPO-C&M is introduced to facilitate stabilize training by clipping and masking uninformative timesteps. Followed by a timestep aware importance re-weighting paradigm to fully correct off-policy bias and emphasize informative updates throughout the alignment process. Extensive experiments on various baseline models, including image generation models on SD1.5, SDXL, and video generation models CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B, demonstrate that our SIPO consistently promotes stabilized training and outperforms existing alignment methods, with meticulous adjustments on parameters. Overall, these results highlight the importance of timestep-aware alignment and and provide valuable guidelines for improved preference optimization in diffusion models.

Abstract:
To overcome computation burden of traditional computational fluid dynamics (CFD) simulations, researchers have explored different architectures to develop physicsinformed simulation methods. Among them, graph neural networks (GNN) are most suitable for adopting CFD meshes, which are extensively used in engineering and industrial applications. However, classical GNNs propagate information among neighbour nodes, which highly restrict information exchange within the network. To address this issue, graph rewiring methods have been developed for generic graph problems, but not particular for fluid simulation. PIORF, introducing edges connecting distant nodes, is the first graph rewiring method to do so, and previous experiments have demonstrated its effectiveness against state-of-the-art generic rewiring methods. Nevertheless, in this work, we found that simply connecting all 2-hop nodes can provide competitive performance with PIORF. This result raises three questions: 1) Is physics-informed rewiring really useful for improving flow predictions? 2) Should we consider just local connection, instead of connecting distant nodes? 3) Do we need to change the connections based on input flow for rollout simulations? By thoroughly adopting physical fluid principles, we propose a simple yet very efficient method, Flow Alignment Rewiring (FLARE) technique, which connects 2-hop nodes only when the node direction aligns with input flow direction. Hence, FLARE is a physics-informed local rewiring method, different from PIORF and well-aligned with fluid physics. Extensive numerical experiments on flows over a cylinder and single and tandem airfoil under different flow conditions and deep network architectures demonstrate that FLARE outperforms PIORF and various 2-hop rewiring approaches by a significant margin.

Abstract:
Modern decision transformers, trained similarly to LLMs, can achieve strong indistribution performance in complex sequential domains like chess, but it remains unclear to what extent they reason systematically about rules and strategy. We study the reasoning capabilities of a 270M-parameter chess transformer trained via behavior cloning on standard chess. To investigate its abilities, we construct out-of-distribution test sets ---including board states and variants never seen during training---designed to reveal failures of systematic generalization. Our analysis shows that the model exhibits robust rule-based reasoning, consistently generating legal moves in novel configurations, but its strategic reasoning is more limited. The model generates high-quality moves on curated OOD puzzles and shows basic strategy adaptation in full games. It underperforms symbolic AI algorithms that rely on explicit search, although the performance gap is smaller when playing against human users on Lichess. Moreover, the training dynamics reveals distinct phases in how the model learns to respect the fundamental constraints, suggesting an emergent compositional understanding of the game.

Abstract:
Multimodal fusion is commonly implemented via symmetric token interaction, implicitly allowing information to flow in both directions. Undermodality imbalancewhen an auxiliary stream is substantially noisier than a designated primary stream---such symmetry creates abackflow channelthat injects auxiliary noise into the primary representation and amplifies errors across iterative refinement stages. We formulate fusion in this regime asdirected refinement with one-way safety: the primary modality defines a guidance field, while only auxiliary representations are iteratively purified, and primary perturbations induced by the auxiliary stream are explicitly bounded. We proposeHamiltonian Asymmetric Fusion(HAF), a lightweight unrolled refinement block that updates auxiliary tokens with momentum regularization and gated driving. The refinement force is instantiated by FFT-based spectral global correlation and modulated by a shared learnable spectral response to emphasize reliable frequency components with minimal parameters; a leaky momentum gate and a stable integrator improve multi-step refinement stability. We provide guarantees of auxiliary error contraction and bounded primary perturbation, which symmetric fusion operators do not satisfy under imbalance. Experiments on six RGB--D SOD benchmarks show consistent gains and substantially more graceful degradation under controlled auxiliary corruption.

Abstract:
Neural Flows efficiently model irregular multivariate time series by directly learning ODE solution trajectories with neural networks, bypassing stepby-step numerical solvers. Despite their efficiency, many existing approaches treat variables independently, leaving inter-variable interactions underexplored. Moreover, their one-step mapping makes interaction modeling inherently challenging, as it removes the iterative refinement of interactions during learning. To address this challenge, we propose one-step Graph-Structured Neural Flows (GSNF), which introduce two auxiliary-trajectory self-supervision strategies to strengthen interaction learning: (i) interaction-aware trajectory generation via re-initialization, which induces trajectory divergence to expose graph-induced interactions, with a theoretically derived lower bound on divergence; and (ii) reverse-time trajectory generation, which enforces forward–backward consistency to regularize graph learning, enabled by flow invertibility. Experiments on five real-world datasets show that GSNF achieves state-of-the-art classification performance with highly competitive training time and memory usage.

Abstract:
Motivated by variational inference methods, we propose a zerothorder algorithm for solving optimization problems in the space of Gaussian probability measures. The algorithm is based on an interacting system of Gaussian particles that stochastically explore the search space and self-organize around global minima via a consensus-based optimization (CBO) mechanism. Its construction relies on the Linearized Bures–Wasserstein (LBW) space, a novel parametrization of Gaussian measures we introduce for efficient computations. We establish well-posedness and study the convergence properties of the particle dynamics via a mean-field approximation. Numerical experiments on variational inference tasks demonstrate the algorithm’s robustness and superior performance with respect to gradient-based method in presence of non log-concave targets.

Abstract:
Temporal systems often exhibit nonstationary behaviour, such as seasonal climate variation or glucose fluctuations in patients with type-1 diabetes. One way to model non-stationarity is through discrete latent regimes, i.e., stationary segments of time. Such systems induce a Markov Switching Model (MSM), a class of Hidden Markov Models with autoregressive dependencies among latent regimes and observed variables. Identifying latent regimes is challenging in the presence of frequent regime switches and nonlinear and non-Gaussian dynamics, particularly when there are instantaneous effects between the variables, e.g., due to slow rates of measurements. In this work, we establish the identifiability of both latent regimes and regime-dependent causal structures under temporal regime dependencies, nonlinear lagged and instantaneous effects, and independent noise from the exponential family. Our identifiability theory subsumes non-temporal mixtures of causal models. Furthermore, we introduce FlowMSM, a regime detection framework that can be paired with any stationary causal discovery method to recover regime-dependent causal structures. Experiments on synthetic benchmarks and a financial economics dataset demonstrate the effectiveness of our approach to detect latent regimes and discover causal structures from non-stationary time series.

Abstract:
Battery lifetime early prediction is crucial for safety assessment and decision planning. However, earlystage degradation signals are extremely weak and difficult to distinguish from noise. Existing methods primarily rely on denoising or signal decomposition, which risks losing critical degradation cues. In nature, most organisms exhibit binaural effect, exploiting differences between left and right auditory inputs to enhance perceptual reliability. Inspired by this, we propose DITING, a weak degradation listener for battery lifetime early prediction. DITING first employs optimal-transport-based selective matching to extract a robust health template from initial cycles for degradation representation. To manifest degradation signals from noise, we further design a tri-coupled degradation manifestation mechanism. By exploiting the randomness of noise, matched responses under symmetric coupling suppress stochastic fluctuations. Conversely, cumulative deviations driven by degradation propagate through the coupling to form stable bilateral discrepancies. This design effectively amplifies weak cues in the early stage for lifetime prediction. Experiments on multiple datasets demonstrate that DITING achieves state-of-the-art performance and provides more reliable early support for full-lifecycle battery management. The code is available at https://anonymous.4open.science/r/Batt_DITING.

Abstract:
Pretrained foundation models have demonstrated remarkable success in many domains, enabling a unified backbone to generalize across diverse downstream tasks. However, extending this paradigm to graph learning remains challenging due to the intrinsic mismatch between graph data and fixed architectural designs. In this work, we show that this limitation can be overcome via recurrent graph models. To achieve this, we conduct a systematic theoretical analysis, rigorously deriving step dependence as a necessary and sufficient condition for an adaptively convergent recurrent process. Building on this foundation, we propose AdaR, an Adaptive Recurrent graph model, empowering flexible test-time computing on various datasets without changing model parameters. To enable adaptive inference, AdaR explicitly encodes normalized step information and representation–target relations into the recurrent updates. To ensure convergence of the recurrent process, AdaR employs gradient-based supervision signals that guide representation updates throughout the recurrence. Empirical results demonstrate that AdaR consistently outperforms strong baselines in both inductive and transductive settings. Codes are provided in the supplementary material.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under realworld visual corruptions. While existing robustness enhancement approaches exist, they are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark (R-Bench) and maintains superior performance under adversarial corruptions on general VQA benchmarks (MMMB, MMStar, RealWorldQA). Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. Code, demo, and models will be open-sourced soon.

Abstract:
Diffusion models deliver stateof-the-art image quality but are expensive to deploy. Post-training quantization (PTQ) can shrink models and speed up inference, yet residual quantization errors distort the diffusion distribution (the timestep-wise marginal over \vx_t), degrading sample quality. We propose a distribution-preserving framework that absorbs quantization error into the generative process without changing architecture or adding steps. Deformable Noise Scheduler (DNS) reinterprets quantization as a principled timestep shift, mapping the quantized prediction distribution \vx_t back onto the original diffusion distribution so that the target marginal is preserved. Unlike trajectory-preserving or noise-injection methods limited to stochastic samplers, our approach preserves the distribution under both stochastic and deterministic samplers and extends to flow-matching with Gaussian conditional paths. It is plug-and-play and complements existing PTQ schemes. Empirically, our method consistently enhances generation quality across diverse backbones and existing PTQ baselines. Notably, when further quantizing the FP16 LoRA branch of SVDQuant to enable fully integer inference, our approach effectively mitigates the performance drop, reducing FID from 27.16 to 26.22.

Abstract:
Timeseries anomaly detection (TSAD) with multimodal large language models (MLLMs) is an emerging area, yet a persistent challenge remains: MLLMs rely on coarse time-series heuristics but struggle with multi-dimensional, detailed reasoning, which is vital for understanding complex time-series data. We present AnomSeer to address this by reinforcing the model to ground its reasoning in precise, structural details of time series, unifying anomaly classification, localization, and explanation. At its core, an expert chain-of-thought trace is generated to provide a verifiable, fine-grained reasoning from classical analyses (e.g., statistical measures, frequency transforms). Building on this, we propose a novel time-series grounded policy optimization (TimerPO) that incorporates two additional components beyond standard reinforcement learning: a time-series grounded advantage based on optimal transport and an orthogonal projection to ensure this auxiliary granular signal does not interfere with the primary detection objective. Across diverse anomaly scenarios, AnomSeer, with Qwen2.5-VL-3B/7B-Instruct, outperforms larger commercial baselines in classification and localization accuracy, particularly on point- and frequency-driven exceptions. Moreover, it produces plausible reasoning traces that support its conclusions.

Abstract:
Autoregressive generative models for irregularly sampled clinical timeseries data are increasingly used for zero-shot risk forecasting. Prior work typically adopts a single fine-grained discretization of time, where tokens are generated at one fixed, pre-determined, temporal resolution. We demonstrate that zero-shot accuracy for a given task varies depending on the temporal dynamics of the task in question, where performance will be low when the temporal dynamics is not well-matched to temporal resolution of the generative model. We then propose MoRGen (Mixture-of-Resolutions Generation), which fuses zero-shot generative experts trained at multiple resolutions, to improve zero-shot performance across tasks with very different temporal dynamics. Across multiple horizons and outcomes on three independent clinical datasets, MoRGen achieves lower binary-cross entropy (BCE) and statistically significant AUROC gains over autoregressive generative models that forecast tokens at a fixed temporal resolution.

Abstract:
LiDAR point cloud anomaly detection is critical for autonomous system safety, yet most existing methods rely only on visible measurements, overlooking occlusion as a structured consequence of the LiDAR sensing process. We argue that anomalies are characterized not only by what is observed, but also by the spatial voids they create, which alter occlusion patterns and volumetric visibility. We propose Counterfactual OcclusionVisibility Anomaly Learning (COVAL), a framework that intervenes on volumetric visibility during training. Using physics-conformed synthetic anomaly construction, COVAL generates paired factual and counterfactual observations with identical scene geometry but different occlusion patterns. Then, we introduce two complementary objectives: Visibility-Variant Counterfactual Reconstruction, which models occlusion-induced missing regions, and Visibility-Invariant Counterfactual Consistency, which enforces stable representations across visibility changes. Together, these objectives isolate anomaly-induced structural missingness and in turn refine representation of normal scenes, thus improving anomaly sensitivity at test time. Experiments on standard LiDAR anomaly segmentation benchmarks show that COVAL achieves state-of-the-art performance.

Abstract:
Graph prompt tuning has shown great potential in graph learning by introducing trainable prompts to enhance the model performance in conventional singledomain scenarios. Recent research has extended graph prompts to improve Graph Foundation Models (GFMs) by few-shot tuning auxiliary prompts. Despite their progress, most existing methods embed source-domain information into prompts, which serve either as input to GFMs or encoded during model pre-training. Such prompt entanglement with specific source domains and GFM pre-training strategy restricts their generalisability to other domains and different GFMs. Furthermore, existing GFM prompts merely rely on few-shot tuning for adaptation, neglecting the rich information in unlabelled target domain test data. Motivated by these insights, this paper aims to empower GFMs with pre-training-agnostic test-time graph prompt tuning, named GFMate. GFMate introduces centroid and layer prompts applied after pre-training on target domains, avoiding entanglement with specific source domains and model pre-training. In addition, a test-time complementary learning objective is devised to exploit both labelled and unlabelled target domain data for effective test-time prompt tuning. Extensive experiments on 12 benchmark datasets demonstrate the superior performance and efficiency of GFMate, achieving improvements of up to 30.63%. Code will be released upon acceptance.

Abstract:
Implicit incontext learning (ICL) has newly emerged as a promising paradigm that simulates ICL behaviors in the representation space of large language models (LLMs), aiming to attain few-shot performance at zero-shot cost. However, existing approaches largely rely on injecting shift vectors into residual flows, which are typically constructed from labeled demonstrations or task-specific alignment. Such designs fall short of utilizing the structural mechanisms underlying ICL and suffer from limited generalizability. To address this, we propose In-Context Routing (ICR), a novel implicit ICL method that internalizes generalizable ICL patterns at the attention logits level. It extracts reusable structural directions that emerge during ICL and employs a learnable input-conditioned router to modulate attention logits accordingly, enabling an efficient train-once-and-reuse framework. We evaluate ICR on 12 real-world datasets spanning diverse domains and multiple LLMs. The results show that ICR consistently outperforms existing implicit ICL methods that require task-specific retrieval or training, while demonstrating robust generalization to out-of-domain tasks where they struggle. These findings position ICR to push the boundary of the practical value of ICL.

Abstract:
Longtailed recognition fundamentally suffers from optimizer blindness where the optimization process mistakenly conflates the magnitude of gradient accumulation with the scarcity of semantic information. Existing strategies relying on static frequency-based priors fail to correct this bias and result in state blindness regarding supervision and micro-level blindness regarding parameter updates. To address these limitations, we propose the AES framework to establish a dynamic and state-aware correction system across the entire learning lifecycle. We specifically introduce Adaptive Residual Supervision loss to act as a real-time reality check for supervision completeness via precision shielding. We also propose Entropy-aware PCGrad to resolve parameter-level conflicts by quantifying task specificity through gradient entropy. Additionally, we devise Sample-level Conflict Arbitrated Fusion to serve as a dynamic inference arbiter that routes predictions based on instance difficulty. Extensive experiments on CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018 demonstrate that our method consistently achieves state-of-the-art performance by effectively balancing head-class stability and tail-class discrimination. Code is available at Supplement.

Abstract:
Effective skill grounding is essential for deploying reusable skills in embodied agents, as even minor embodiment or environmental differences can render an entire skill incompatible. This challenge is particularly pronounced in embodied settings, where agents must operate in dynamic, partially observable environments without access to large language models (LLMs). In this setting, reliance on LLMs is impractical, while small language models (sLMs) remain insufficient for the effective skill grounding required for reliable longhorizon control. We present RECENT, a refactoring-centric agent framework that enables efficient skill grounding with sLMs by decoupling skill semantics from embodiment- and environment-specific execution binding. By representing skills as executable code, RECENT preserves the semantic intent encoded in a skill’s control structure while grounding it by modifying only execution bindings through localized refactoring, rather than regenerating code from scratch. We evaluate RECENT across diverse skill grounding scenarios spanning multiple robot embodiments in dynamic environments, demonstrating robust long-horizon performance when deployed with an sLM. Across all scenarios, RECENT achieves the best performance among sLM-based Code-as-Policies (CaP) methods and matches the task performance of LLM-based CaP.

Abstract:
Realworld data is rarely clean; it is plagued by severe class imbalance (long-tailed distributions) and label corruption. Current solutions lean heavily on ''black-box" meta-learning to re-weight samples. However, this paradigm introduces a fatal circular dependency: it relies on pristine, balanced validation sets to guide the optimization, which are essentially non-existent in the wild. We propose ProMeCD, a self-referential framework that breaks this dependency by recasting optimization as an autonomous control problem. Instead of training an opaque neural meta-learner, we employ a transparent proportional-integral controller. The system monitors ''cognitive entropy'' that is a metric derived from von Mises-Fisher gradient statistics to assess learning uncertainty. To resolve the scalar ambiguity between tail and noisy samples, ProMeCD employs a decoupled control strategy: it boosts tail classes via integral accumulation of magnitude deficits when directional consistency is high, while suppressing noise via proportional feedback when consistency collapses. Theoretically, we prove that this mechanism guarantees convergence and formally prevents the minority initial drop, ensuring monotonic improvement for rare classes. Crucially, ProMeCD is fully white-box and validation-free. Experiments on CIFAR-LT, iNaturalist, CIFAR-N, and mini WebVision confirm that ProMeCD is not merely efficient; it outperforms the recent meta-learner FMW-Net by over 10% in severe imbalance settings, proving that explicit control theory offers a superior path to handling imperfect data.

Abstract:
Research idea innovation is a fundamental engine of scientific progress, yet it remains difficult to generate and evaluate in a scalable and controllable way. This challenge lies in its inherently openended and multi-objective nature, where ideas should balance novelty, plausibility and feasibility. While recent LLM-based approaches have made progress through carefully designed prompts or agent pipelines, they are constrained by predefined, static ideation workflows. To address this limitation, we propose MindFlow, a framework that explicitly formulates ideation as a graph-structured Flow in Mind, which is composed of modular thinking operators and modeled by a probabilistic mind supernet. Given a research topic, a controller dynamically samples thinking flows to generate candidate ideas. This open-ended problem is optimized using a tournament-based relative ranking, enabling the controller to progressively favor higher-quality thinking flows. We further introduce an evaluation protocol that jointly assesses problem finding and problem solving, going beyond title- or abstract-only judgments. Across diverse topics, MindFlow shows its superiority as an explicit, controllable and optimizable research idea innovator.

Abstract:
Continual Learning (CL) continually performs parameter updates, posing a significant challenge to backdoor persistence. In this paper, we reveal that the most advanced attack relies on an implicit assumption that taskcritical neurons remain stable across task learning; however, it does not hold in class-incremental learning (CIL). This exposes a critical research gap: the backdoor persistence in CIL is still an open question. Inspired by the function stability despite neuron instability, we discover that the CIL models preserve task knowledge in shallow, structurally invariant subspaces. Motivated by the findings, we propose PBTO, the first persistent and targeted backdoor attack in CIL. PBTO trains a surrogate model on proxy tasks to obtain the parameter trajectory. Then, it optimizes a universal trigger that ensures misclassification to the target label across all model states and anchors trigger embeddings in shallow layers. Experimental results verify that PBTO maintains effectiveness even after learning multiple tasks, while existing methods degrade to below 10%.

Abstract:
This paper explores unsupervised disentangled representation learning from a functional perspective. We define latent concepts as factors that influence observations through locally orthogonal directions, formalized as an orthogonality constraint on the Jacobian of the generative mapping. We prove that this condition yields identifiability of general nonlinear generative models, without requiring statistical independence or causal assumptions, provided the latent domain admits all combinations of factor values. Experiments with orthogonalityregularized normalizing flows empirically confirm the theory, demonstrate reliable recovery of ground-truth factors, and shed light on the success of VAEs. These findings challenge the prevailing impossibility claims for unsupervised disentanglement and provide a principled alternative foundation.

Abstract:
Thelearnwareparadigm aims to enable users to leverage numerous existing highperforming models instead of building machine learning models from scratch. A learnware consists of a submitted model together with aspecificationderived from the developer’s training data. As the key component, a specification should characterize the capabilities of the model, enabling it to be adequately identified and reused, while preserving the developer's original data. In this paper, we present the first formal study of the risks that arise when a specification is attached to a model, as opposed to releasing the model alone. We develop a game-theoretic framework and, by combining variational inference with geometry analysis, provide quantitative estimates of the resulting risk of specification. Our analysis provides theoretical guarantees on the data protection ability for the commonly adopted RKME specification. Finally, we prove that with a properly chosen size of specification, releasing the specification alongside the model introduces almost no additional risk of exposing the raw data, while still retaining sufficient information for effective learnware identification.

Abstract:
We develop a meanfield theory of dropout as a perturbation of critical signal propagation at the edge of chaos. Dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at critical initialization. We derive critical and crossover scaling laws for correlation decay and establish that smooth activations and kinked (ReLU-type) activations constitute distinct universality classes, with different critical exponents and a universal two-parameter scaling collapse in detuning and dropout strength. The distinction traces to the analytic structure of the correlation map: smooth activations admit a Taylor expansion near perfect alignment, while kinked activations develop a branch point with universal non-analyticity. As a corollary, the framework yields principled dropout schedules that maximize effective correlation length under fixed budget. We validate the theoretical predictions in MLPs and Vision Transformers, where the predicted schedules outperform constant dropout, illustrating the practical utility of the mean-field approach.

Abstract:
ComputerAided Design (CAD) generation is typically formulated as a sequence modeling task over parametric tokens. Recent studies introduce visual information through additional visual inputs or rendering of the final generated programs. However, these methods provide no intermediate visual feedback, hindering the association of individual tokens with their geometric effects. In this work, we propose an Argument-induced 3D Point Loss (A3PL) that maps argument tokens to corresponding 3D points, enabling dense token-wise geometric supervision. To reduce learning complexity and invalid sequences, we further introduce a Grammar-constrained Operator (GCO) that leverages the structured nature of CAD programs to regulate sequence generation. We evaluate our approach on five CAD generation tasks with diverse input modalities, including text, Scalable Vector Graphics (SVG) sketches, point clouds, and CAD sequences. Our approach improves generation accuracy and program validity across different input modalities. Code and dataset are made publicly available.

Abstract:
Threedimensional (3D) shape classification plays a central role in computer vision and computer-aided design (CAD), underpinning applications in intelligent manufacturing, automated inspection, and digital engineering. Despite recent progress with 3D CNNs and graph-based approaches, existing methods often overlook the geometric-topological regularities and symmetry principles intrinsic to CAD boundary representations (B-reps). To address this challenge, we introduce EquiCAD, a symmetry-aware learning framework that integrates equivariant representations with graph-based reasoning. By leveraging group-theoretic decomposition of curve and surface descriptors, EquiCAD enforces consistent SO(3)/O(3)-equivariance while preserving rich geometric details. The model further exploits hierarchical message passing to capture interactions between local features and global structure. Experimental results across multiple datasets, including SolidLetters, Parts, the Machining Feature benchmark, and our newly constructed Features dataset, demonstrate substantial improvements over prior state-of-the-art approaches, particularly on industrially relevant shapes with fine-grained attributes. These findings highlight the value of symmetry-aware modeling for robust and generalizable 3D shape analysis.

Abstract:
Geometric properties can be leveraged to stabilize and speed reinforcement learning. Existing examples include encoding symmetry structure, geometryaware data augmentation, and enforcing structural restrictions. In this paper, we take a novel view of RL through the lens of order theory and recast value function estimates into learning a desired poset (partially ordered set). We propose \emphGCR-RL (Geometric Coherence Regularized Reinforcement Learning) that computes a sequence of super-poset refinements -- by refining posets in previous steps and learning additional order relationships from temporal difference signals -- thus ensuring geometric coherence across the sequence of posets underpinning the learned value functions. Two novel algorithms by Q-learning and by actor--critic are developed to efficiently realize these super-poset refinements. Their theoretical properties and convergence rates are analyzed. We empirically evaluate GCR-RL in a range of tasks and demonstrate significant improvements in sample efficiency and stable performance over strong baselines.

Abstract:
As Large Language Model (LLM) agents are increasingly deployed in complex environments, multiturn interaction attacks have become a significant security challenge. Existing detection methods typically rely on historical context. However, this retrospective logic struggles to identify deep malicious intents that are split across turns to hide future risks. Inspired by speculative decoding, we propose the Speculative Safety Honeypot (SSH) framework. SSH uses a multi-agent simulation system composed of small LLMs to build an action-level speculate-and-verify workflow. In the speculation stage, SSH predicts future behaviors of the target agent and asynchronously builds a trajectory tree to expose potential risks in advance. In the verification stage, the system uses the target agent's real actions to calibrate and prune the trajectory tree, effectively reducing false positives. As a plug-and-playable component, SSH provides existing detectors with rich decision redundancy beyond the current interaction slice. By judging risk based on the evolution of the entire trajectory tree rather than a single point in time, the system reduces the reliance on the absolute precision of individual detection components. This improves the defense resilience and the warning lead-time of agent systems against complex temporal attacks.

Abstract:
Federated learning increasingly faces deletion requests that require clientlevel unlearning without sacrificing model quality, yet a client’s influence is often deeply entangled after many rounds of aggregation. We aim to make unlearning fast, stable, and predictable by reducing the gap to leave-one-out retraining under realistic heterogeneity. We propose Influence-Disentangled Federated Training (IDFT), which instruments standard FedAvg with training-time influence logging: each round’s updates are decomposed into shared covarying directions and a client-separable residual trace, and an entanglement-aware shrinkage suppresses non-removable components. Deletion then becomes a single subtraction followed by a short anchored repair, yielding a stability-style characterization of retrain fidelity driven by the unremoved residual. Across representative benchmarks, IDFT consistently attains the lowest retrain gap (Avg.\ Gap) on multiple dataset--architecture pairs and improves the fidelity--cost frontier, matching retrain-level forgetting with substantially lower communication/compute than history-heavy baselines. These results suggest a practical pathway to unlearning-friendly federated systems by designing for removability during training rather than relying solely on post-hoc corrections.

Abstract:
Hamiltonian Monte Carlo and underdamped Langevin Monte Carlo are stateof-the-art methods for taking samples from high-dimensional distributions with a differentiable density function. To generate samples, they numerically integrate Hamiltonian or Langevin dynamics. This numerical integration introduces an asymptotic bias in Monte Carlo estimators of expectation values, which can be eliminated by adjusting the dynamics with a Metropolis-Hastings (MH) proposal step. Alternatively, one can trade bias for variance by avoiding MH, and select an integration step size that ensures sufficiently small asymptotic bias, relative to the variance inherent in a finite set of samples. Such \emphunadjusted methods often significantly outperform their adjusted counterparts in high-dimensional problems where sampling would otherwise be prohibitively expensive, yet are rarely used in statistical applications due to the absence of an automated way of choosing a step size. We propose just such an automatic tuning scheme that takes a user-provided asymptotic bias tolerance and selects a step size that ensures it. The key to the method is a relationship we establish between the energy change during integration and the asymptotic bias. We show that this procedure rigorously bounds the asymptotic bias for Gaussian target distributions. We then numerically show that the procedure works beyond Gaussians. To demonstrate the practicality of the proposed scheme, we provide a comprehensive comparison of adjusted and unadjusted samplers on Bayesian inference problems and on a statistical physics model in more than one million parameters. With our tuning scheme, the unadjusted methods achieve close to optimal performance, significantly and consistently outperforming their adjusted counterparts.

Abstract:
Large language models (LLMs) have opened new paradigms in optimization modeling by enabling the generation of executable solver code from natural language descriptions. Despite this promise, existing approaches typically remain solverdriven: they rely on single-pass forward generation and apply limited post-hoc fixes based on solver error messages, leaving undetected semantic errors that silently produce syntactically correct but logically flawed models. To address this challenge, we propose SAC-Opt, a backward-guided correction framework that grounds optimization modeling in problem semantics rather than solver feedback. At each step, SAC-Opt aligns the original semantic anchors with those reconstructed from the generated code and selectively corrects only the mismatched components, driving convergence toward a semantically faithful model. This anchor-driven correction enables fine-grained refinement of constraint and objective logic, enhancing both fidelity and robustness without requiring additional training or supervision. Empirical results on seven public datasets demonstrate that SAC-Opt improves average modeling accuracy by 7.7%, with gains of up to 21.9% on the ComplexLP dataset. These findings highlight the importance of semantic-anchored correction in LLM-based optimization workflows to ensure faithful translation from problem intent to solver-executable code.

Abstract:
Unified multimodal models capable of both understanding and generation have achieved remarkable strides. However, despite their unified designs, existing evaluations typically assess understanding and generation capabilities in isolation, overlooking the synergy between comprehension and generation. To bridge this gap, we introduceUnison, a comprehensive benchmark comprising 2,169 highquality unified task samples, designed to evaluate joint understanding and generation in unified multimodal models. Unison offers three key strengths:1) Comprehensive Dimensions: Unison encompasses internal consistency, understanding-guided generation, generation-guided understanding, and mutual enhancement to enable holistic evaluation.2) Diagnostic Evaluation: it provides both unified and decoupled tracks for understanding and generation, allowing fine-grained attribution of failure modes and quantitative analysis of the gains from unified modeling.3) Human Alignment: we also train Unison-Judge, an evaluation model well aligned with human judgments to achieve reliable assessment. Based on systematic evaluations of state-of-the-art models on Unison, we uncover critical limitations in current unified multimodal systems and highlight promising directions for future research. Unison will be publicly released to facilitate evaluation and advance this field.

Abstract:
We study blackbox detection of machine-generated text under practical constraints: the scoring model (proxy LM) may mismatch the unknown source model, and per-input contrastive generation is costly. We propose SurpMark, a reference-based detector that summarizes a passage by the dynamics of its token surprisals. SurpMark discretizes surprisals into interpretable states, estimates a state-transition matrix for the test text, and scores it via a generalized Jensen–Shannon (GJS) gap between the test transitions and two fixed references (human vs. machine) built once from existing corpora. Theoretically, we derive design guidance for how the discretization bins should scale with data and provide a principled justification for our test statistic. Empirically, across multiple datasets, source models, and scenarios, SurpMark consistently matches or surpasses baselines, demonstrating strong robustness across domains and generators; our experiments on hyperparameter sensitivity exhibit trends that our theoretical results help to explain.

Abstract:
Generative models have emerged as a powerful paradigm for AI planning, yet their performance remains constrained by training data distribution. One approach is to improve generated solutions during inference by scaling testtime compute. A more efficient alternative is to optimize the inferential process itself. In this paper, we show that a modified version of a classical Open-Closed List (OCL) search provides just such an efficient inferential procedure. Our algorithm synergizes two learned components: a generative model that performs fast rollouts from specific reasoning paths and a value model that manages which of many possible reasoning lines to follow. We present novel contributions in exploration control and how learned models are integrated within the OCL framework. Experimental evaluation across multiple combinatorial planning domains shows that our approach consistently outperforms baseline search algorithms in both computational efficiency and solution quality.

Abstract:
Large language models finetuned on instruction–code pairs may memorize and subsequently leak sensitive training data. Existing differentially private (DP) code generation methods primarily protect code snippets while assuming prompts are public, which fails in realistic scenarios where prompts may also contain sensitive information. When prompts cannot be explicitly learned or used during generation, code synthesis suffers from severe utility degradation and reduced diversity. To address these challenges, we propose PrivCode++, the first work to explore DP code generation under where both prompts and code snippets are considered sensitive in LLM fine-tuning. PrivCode++ introduces a two-stage DP framework with a Privacy-Free Latent Conditioning module, enabling effective DP fine-tuning and data synthesis without direct access to sensitive prompts or code. Extensive experiments show that PrivCode++ achieves substantially higher utility than baselines, remains competitive with the method with relaxing privacy assumptions, and provides stronger privacy guarantees.

Abstract:
Despite the success of parameterefficient fine-tuning (PEFT) methods in reducing parameter-related overhead, fine-tuning large language models (LLMs) is still bottlenecked by significant memory and computational demands. In this paper, we propose TokenDrop, a token-level importance-aware backpropagation skipping method that reduces activation memory and accelerates LLM fine-tuning by skipping backward computations for less informative tokens. TokenDrop evaluates token importance based on the magnitude of residual updates during the forward pass, enabling lightweight, gradient-free importance estimation. Furthermore, we introduce cumulative token selection to preserve gradient continuity across layers and lazy selection scheduling that defers token selection to facilitate globally informed importance scoring under memory constraints. Across a range of experiments, TokenDrop achieves up to 42.9% reduction in memory usage and up to 1.50× training speedup, while preserving accuracy and outperforming existing backpropagation-skipping baselines. The code is available at https://anonymous.4open.science/r/tokendrop_official-B469.

Abstract:
Imageto-point cloud registration is often challenged by viewpoint changes, cross-modal discrepancies, and repetitive textures, which induce scale ambiguity and consequently lead to erroneous correspondences. Recent detection-free methods alleviate this issue by leveraging multi-scale features and transformer-based interactions. However, they still suffer from attention drift across layers and intra-scale inconsistencies, hindering precise registration. Inspired by human behavior, we propose a "Focus-Sweep'' paradigm and develop a Hierarchical Focus-Sweep Interaction Module within an SSM-based framework to enhance multi-level cross-modal feature association. In addition, we introduce a Dynamic Layer Allocation Strategy that adaptively determines the iteration depth to better exploit geometric constraints and improve matching robustness. Extensive experiments and ablations on two benchmarks, RGB-D Scenes V2 and 7-Scenes, demonstrate that our approach achieves state-of-the-art performance.

Abstract:
Natural language interfaces can greatly benefit the accessibility and usability of optimization modeling, and recent advances in large language models (LLMs) show promise in automatically translating textual problem descriptions into executable solver formulations. However, a key challenge for existing approaches is to ensure that the inferred formulation correctly implements the intended task, even if it may execute without errors. We introduce \textscVeriSimpl, a solver–LLM framework for robust naturallanguage-to-optimization formalization. Our approach is based on the idea of \emphsimplification-based verification, where the optimization solver is leveraged to generate simplified diagnostic queries about a candidate formulation to allow the LLM to tractably reason about the correctness of the formulation with respect to the task description. We present such simplification strategies along different dimensions with respect to problem constraints and decision variables, which allow the LLM to reason locally under fixed global contexts. Evaluations on a range of optimization benchmarks show how our approach provides consistent improvements in accuracy over existing methods, while also providing a novel high-precision self-verification signal.

Abstract:
Quantizationaware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%–10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.

Abstract:
Multiobjective protein design is essential for meeting the complex demands of synthetic biology. To adapt to shifting multi-functional targets without the prohibitive cost of retraining, test-time scaling has emerged as a flexible, training-free alternative. However, current test-time diffusion methods face critical challenges: i) ineffective learning from interaction history leading to repetitive design errors, ii) over-reliance on successful cases as the reward signal, and iii) difficulties in balancing multi-objective functional trade-offs . To address these limitations, we propose MoMST, a framework for Multi-objective protein design via Memory-aware Self-contrastive learning with Test-time scaling in diffusion models. At test time, we develop a memory bank to extract generalizable reasoning experience from historical iterations. Building on this powerful experience learner, we derive rich residue-level relative preference signals from both successful and failed cases via self-contrastive learning for guiding protein generation. To ensure balance among competing multi-objective functions, we present an inference-time Pareto alignment strategy to resolve objective conflicts. Evaluations on both single-objective and complex multi-objective tasks demonstrate that MoMST exhibits remarkable performance.

Abstract:
Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical tradeoff between low-bit efficiency and performance. Residual binarization promises hardware-friendly, matmul-free inference by stacking binary (\pm1) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term inter-path adaptation: during Quantization-Aware Training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and crippling the model's expressive capacity. While prior work relies on heuristic workarounds (e.g., path freezing) that limit model capacity, we propose RaBiT, a novel quantization framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared full-precision weight, ensuring each path corrects its predecessor's error. This process is stabilized by a robust initialization that prioritizes functional preservation over mere weight approximation. RaBiT redefines the 2-bit accuracy-efficiency frontier: it achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a 4.49× inference speed-up over full-precision models on an RTX 4090.

Abstract:
Dualencoder vision-language models (VLMs) expose a similarity interface that enables zero-shot retrieval but fails compositional constraints: queries like “umbrella and no person” retrieve images containing both, even when concept detection is reliable. We trace this to an interface-levelBag-of-Conceptseffect, where similarity scores approximate mean pooling of concept evidence regardless of operators, although operator-dependent signals exist in text embeddings, they are too weak or misaligned to affect rankings. Fine-tuning cannot fully resolve this failure because the bottleneck lies in how similarity aggregates evidence, not in what encoders represent. We proposefactored inference, which separates evidence extraction from constraint execution, and introduce LCSE (Logic-Constrained Score Editing), a training-free method that executes constraints externally using concept scores from frozen encoders. We also introduce FACTOR-Bench, where LCSE achieves 85.5% accuracy versus 73.2% for the best fine-tuned baseline, 90.7% when applied to SigLIP 2, and improves NegBench COCO MCQ accuracy from 27.2% to 65.2% while preserving retrieval performance.

Abstract:
Bayesian optimization (BO) is a method commonly used for policy search in problems with lowdimensional policy parameterizations. While it is generally considered data-efficient, existing BO approaches are agnostic to the sequential structure of the optimization objective induced by policy roll-outs. Thereby, valuable information is discarded that could improve the convergence of BO. We address this inefficiency by developing and rigorously analyzing a novel approach for BO that relies on a temporal difference learning formulation for discounted infinite-horizon value functions based on Gaussian process (GP) regression. We derive learning error bounds for the proposed temporal difference GPs, such that we can exploit upper confidence bounds to analyze the cumulative regret of our BO approach. This analysis is further refined by bounding the maximal information gain for our temporal difference GP model. In a comparison with relevant baseline methods, we demonstrate the practical advantages of our method.

Abstract:
LLM agents have achieved strong performance in toolaugmented reasoning, but most remain largely stateless: after each episode, the agent discards interaction traces and does not accumulate reusable strategies. Prior work either stores raw trajectories for case-based reuse or relies on external teacher models to write reflections, which limits generalization or leaves the agent’s policy unchanged. We introduce EvolveR, an experience-driven framework that allows an agent to improve using its own interaction history. EvolveR maintains an experience base of distilled strategic principles derived from past trajectories. In an offline phase, the agent self-distills successful and failed trajectories into concise principles, applies semantic deduplication, and assigns each principle an empirical utility score for maintenance and pruning. In an online phase, the agent retrieves top-ranked principles to guide reasoning and tool usage, generating new trajectories. We then perform policy evolution with reinforcement learning on these experience-conditioned trajectories, reinforcing behaviors that effectively retrieve and apply useful principles. We demonstrate the effectiveness of EvolveR on complex multi-hop question-answering benchmarks, where it achieves superior performance over strong agentic baselines. Our work presents a comprehensive blueprint for agents that learn not only from external data but also from the consequences of their own actions, paving the way for more autonomous and continuously improving systems.

Abstract:
Segmenting small and sparse structures in largescale images is fundamentally constrained by pixel-level, lattice-bound computation and extreme class imbalance--dense, full-resolution inference scales poorly and forces most pipelines to rely on fixed regionization or downsampling, coupling computational cost to image resolution and attenuating boundary evidence precisely where minority structures are most informative. We introduceSEMIR(Semantic Minor-Induced Representation Learning), a representation framework that decouples inference from the native grid by learning a task-adapted inference space.SEMIRtransforms the underlying grid graph into a compact, boundary-aligned graph minor through parameterized edge contraction, node deletion, and edge deletion, while preserving an exact lifting map from minor predictions to lattice labels. Minor construction is formalized as a few-shot optimization problem that replaces hand-tuned preprocessing with aboundary-alignment objective: minor parameters are learned by maximizing agreement between predicted boundary elements and class-agnostic semantic edges under aboundary Dice criterion, and the induced minor is annotated with scale- and rotation-robust geometric and intensity descriptors and supports efficient region-level inference via message passing on a graph neural network (GNN) with relational edge features. We benchmarkSEMIRon three tumor segmentation datasets—BraTS2021,KiTS2023, andLiTS2017—where targets exhibit high structural variability and distributional uncertainty, providing a stringent testbed forstructure-adaptive inference.SEMIRyields consistent improvements inminority-structure Diceat practical runtime, positioningminor-induced representationsas a principled alternative to pixel-centric segmentation in challenging, high-variability visual domains.

Abstract:
Preferencebased reward modeling is a core component of RLHF and DPO pipelines. In practice, the humans providing preference feedback are rarely an i.i.d. sample: recruitment and exposure often follow social, institutional, or spatial structure, inducing non-uniform inclusion probabilities that correlate with graph centrality. We formalize preference learning with network-sampled annotators and show that identity-agnostic scalar reward modeling implicitly represents an inclusion-weighted welfare, over-representing structurally central communities when the inclusion distribution q differs from a designer-chosen target weighting \pi. We propose Graph-Preference Learning, which combines (i) a graph-personalized reward model that shares statistical strength across neighboring annotators and (ii) graph-balanced aggregation that computes stabilized importance weights to target \pi. Our analysis characterizes the induced welfare represented by the learned aggregate reward and bounds its deviation from the target in terms of weight mismatch, reward-model approximation, and finite-sample effects. Experiments on synthetic graphs and a semi-synthetic case study on the LMArena preference dataset, where biased inclusion is induced via graph-based sampling, demonstrate up to 62% reduction in target-welfare recovery error and 17% reduction in cross-language performance gaps under biased inclusion.

Abstract:
In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD), which maximally utilizes available hardware, avoids having to wait for slow workers. However, with constant step sizes, the convergence of ASGD is nonetheless negatively effected by slow workers due to large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping ``stabilizes'' training. In this work, we provide a theoretical justification for this behavior, as we show that clipping removes the dependence of the maximum delay in the oracle complexity. We employ a subWeibull model of gradient noise which generalize sub-Gaussian and sub-exponential disitributions to more heavy-tailed distributions, motivated by empirical observations in deep learning. We show convergence in expectation, and for the first time in asynchronous optimization, convergence with high probability.

Abstract:
Realworld physical sensing exhibits complex, heterogeneous noise patterns that deviate significantly from idealized simulation, posing a fundamental bottleneck for sim-to-real transfer. Existing sensor modelings typically treat depth noise as a monolithic black-box process, overlooking the distinct physical mechanisms that govern different error modalities. In this work, we introduce a physics-grounded paradigm that disentangles monolithic noise into two complementary modalities: sensing invalidation and measurement inaccuracy, enabling a tailored treatment of noise sources based on their physical origins. Building on this insight, we propose PRISM (Physics-Reasoned Implicit Sensor Modeling), a tripartite framework that distills 3D Visual Foundation Model features as rich spatial-semantic priors for physics-based reasoning. To address the inherent sparsity and class imbalance of invalidation regions, we develop Hierarchical Positive-Prioritized Supervision, integrating multi-scale positive-weighted objectives with a positive-preserving dynamic hard mining strategy to enforce precise artifact delineation. Extensive benchmarks demonstrate that PRISM achieves state-of-the-art fidelity in noisy depth synthesis. Furthermore, downstream robotic experiments show that PRISM facilitates a 93.8% average success rate in the real world, marking a significant improvement over monolithic baselines.

Abstract:
While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pretraining or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.

Abstract:
We investigate whether JointEmbedding Predictive Architectures (JEPA) can learn useful representations of U.S. equity markets. We jointly train a permutation-invariant tokenizer that maps each trading day's unordered per-asset features to a fixed set of learned factor tokens, together with a temporal JEPA using masked prediction to obtain a compact daily market-state embedding. Our evaluations show that these embeddings are strongly associated with second-moment market structure (realized volatility, correlation concentration, effective factor dimensionality) and weakly associated with market direction. The embedding helps predict gradual recovery dynamics but not sudden stress onsets. Without any text supervision, latent regimes show statistically significant alignment with news-topic shifts.

Abstract:
Humans abstract experiences into structured representations to facilitate pattern inference and knowledge transfer. While the hippocampalentorhinal (HPC-MEC) circuit is known to represent both spatial and conceptual spaces, the mechanisms for concurrently extracting abstract structures from continuous, high-dimensional dynamics remain poorly understood. We propose a brain-inspired hierarchical model that simultaneously infers latent transitions and constructs a predictive visual world model. Our architecture employs an inverse model for structural extraction alongside an HPC-MEC coupling model that dissociates relational structures (MEC) from integrated episodic scenes (HPC). Using rotation dynamics as a benchmark, we demonstrate the model's capacity for structural abstraction. By leveraging velocity-driven path integration, the framework enables robust prediction and structural reuse across diverse contexts, thereby achieving structural generalization. This work provides a novel computational framework for understanding how brain-inspired, self-supervised learning of world models facilitates the acquisition of reusable abstract knowledge.

Abstract:
Scene Graph Generation (SGG) paradigms predominantly model relationships as static, unidirectional mappings (s \to o), effectively treating objects as passive recipients of actions. This formulation suffers from an inherent unidirectional bias, violating the physical reality that visual interactions are intrinsically reciprocal. Consequently, existing models often fail to maintain logical selfconsistency when the reasoning anchor shifts from the subject to the object. To rectify this cognitive deficiency, we establish the Mutual-Perspective Inverse Relations (MPIR) principle, positing that a robust visual representation must satisfy logical consistency across dual perspectives. Guided by this principle, we propose the Reciprocal Perspective Calibration (RPC) framework, a model-agnostic framework that operationalizes MPIR via a novel Adaptive Inverse-Relation Augmentation (AIRA) strategy. Furthermore, we introduce Hypernym-Guided Prompts (HGP) to bridge the gap between semantic context and computational efficiency in vision-language models, enabling precise modeling of inverse relations. Extensive experiments demonstrate that RPC not only achieves competitive performance on standard benchmarks but also significantly enhances the model's capability to understand inverse relations, as verified by a new inverse consistency evaluation protocol, demonstrating the cognitive robustness of our method.

Abstract:
Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, enabling parallel sequence generation and flexible token generation orders. Machine unlearning plays a critical role in mitigating legal and ethical risks by removing the influence of specific training examples from trained models. While unlearning has been extensively studied for autoregressive language models, its applicability to DLMs remains unexplored. The architectural differences of DLMs raise new challenges for effective and robust unlearning that are not addressed by existing methods. In this paper, we present the first comprehensive study of unlearning for DLMs. Through systematic empirical analysis, we show that unlearning performance in DLMs is highly sensitive to generation hyperparameters, highlighting the need for evaluation across diverse generation settings. We further observe that DLMs tend to reproduce unlearned information when target inputs are embedded within informative contexts, due to their ability to incorporate both prefix and suffix conditioning, which increases vulnerability to elicitation attacks and weakens the robustness of existing unlearning methods. To design a robust unlearning method, we propose an adversarial reinforcement learning framework, where a context generator adversarially produces informative contexts to elicit unlearned knowledge, while the DLM is optimized to suppress undesired recall. We further introduce novel components to address credit assignment and stability issues in this adversarial learning setup. Extensive experiments demonstrate that our method significantly improves unlearning effectiveness while preserving model utility. Our code is available at: https://anonymous.4open.science/r/dllm_unlearning771D/

Abstract:
Asynchronous reinforcement learning enables highthroughput training but introduces policy lag, where experiences are collected under stale policy weights. We identify a key phenomenon in code generation: gradient variance scales exponentially with task difficulty under staleness, because hard tasks have narrow solution spaces corresponding to sharp loss landscape curvature (high Hessian eigenvalues). We formalize this as a staleness budget optimization problem and prove that the optimal allocation follows an exponential decay: \eta^(d) = \eta_base \cdot e^-\lambda d where \lambda = \alpha/2 is half the Hessian growth rate. Building on this theory, we propose ACEAS (Adaptive Curriculum with Execution-Aware Async Scheduling), combining bandit-based curriculum selection, execution-aware staleness budgets, and curriculum-staleness coupling derived from first principles. Our mechanistic analysis validates the theoretical predictions: the "safe zone" of gradient coherence follows the derived exponential boundary. On code generation benchmarks, ACEAS achieves over 2× higher throughput than synchronous baselines while improving Pass@1 from 39.7% to 60.1%, demonstrating that principled staleness control grounded in loss landscape geometry enables efficient asynchronous curriculum learning.

Abstract:
Polarization information is valuable for many computer vision applications. However, in hazy environments, polarization information is severely attenuated due to the degradation of captured polarized images. Existing dehazing methods struggle to effectively restore polarization information, as singleimage methods are unaware of polarization, and polarization-based methods are constrained by the traditional polarization models. These deficiencies lead to inaccurate polarimetric signatures and physical inconsistencies in scattering environments. To overcome these limitations and achieve the joint restoration of scene radiance and polarization information, we propose DuRP, a dual-stage physics-embedded learning framework. Specifically, we derive generalized polarization physics models that relax the ideal assumptions of traditional theory to provide a more precise foundation for the joint restoration of polarimetric and amplitude information. We then design a dual-stage neural network to estimate latent physical parameters through differentiable operators, ensuring that both the polarimetric state and radiance are accurately recovered. Experimental results show that DuRP achieves state-of-the-art performance in joint restoration and significantly enhances polarization-based vision tasks.

Abstract:
Scene Graph Generation (SGG) connects visual perception with structured reasoning, but is limited by scarce annotations and the longtailed distribution of relational predicates. Training-free methods based on vision-language models (VLMs) reduce supervision requirements, yet often rely on flat graph structures and produce hallucinated relations. We present HSGG, a training-free framework for open-world Hierarchical Scene Graph Generation, whose inference proceeds in two steps. First, bidirectional hierarchical entity perception combines top-down object expansion with bottom-up attribute reasoning to construct multi-level scene hierarchies capturing part–whole semantics. Second, geometry-guided relation reasoning infers valid relations from these structured entities: geometry-aware relation filtering first prunes spatially implausible object pairs using 2D proximity, depth cues, and object scale, and geometry-grounded contrastive relation decoding then suppresses hallucinated predicates by contrasting predictions from a visually grounded expert against a hallucination-prone geometric prior, ensuring relations are both geometrically consistent and semantically coherent. Experiments show that HSGG generalizes effectively to unseen objects and predicates without training, substantially reduces relational hallucinations, and consistently improves downstream reasoning performance.

Abstract:
Notification dispatch plays a critical role in largescale user engagement, involving complex trade-offs across notification timing, presentation style, multiple outcomes, and constraints. In this paper, we formulate it as a constrained optimization over bundle treatments that jointly specify timing and presentation style, aiming to maximize incremental Daily Active Users (DAU) subject to platform-level budget and device vendor-specific quota constraints. The problem is challenging due to multi-dimensional, small-effect uplift estimation and large-scale constrained optimization. To address these challenges, we propose Bundle Uplift Optimization with Pruned Lagrangian-based Relaxation (BUOPLR), a two-stage notification dispatch method that decouples uplift estimation from constrained decision-making. BUOPLR first learns bundle-level, multi-outcome small uplift through an architecture that captures cross-treatment and cross-outcome relationships, and then performs scalable assignment by restricting the decision space and applying Lagrangian relaxation to a small set of global constraints. Offline experiments show BUOPLR outperforms state-of-the-art methods, and online A/B tests increase DAU by 0.5%. BUOPLR is now deployed on a major Internet platform serving over 100 million users daily.

Abstract:
We present a comprehensive benchmarking dataset and empirical scalinglaw analysis for neural network wavefunctions by matching them to a wide spectrum of famous many-body target wavefunctions. The dataset, WF-Bench, spans multiple distinct regimes of strongly correlated quantum matter, including topological states, Wigner crystals, and superconducting wavefunctions, providing a diverse and challenging test bed for neural-network wavefunction expressivity. We introduce a systematic and reproducible benchmarking protocol for target wavefunction matching, enabling consistent performance evaluation across different neural network wavefunction architectures. By using wavefunction fidelity as the uniform metric, we discover empirical scaling laws that characterize how representability depends on system size and key model parameters, including number of determinant and model depth. By applying our benchmark protocol on Psiformer and Ferminet, we show that WF-Bench establish a unified dataset-driven framework for evaluating and comparing neural network wavefunctions and for guiding the design of future architectures.

Abstract:
Diffusion Transformers (DiTs) have driven substantial progress in image and video generation but suffer from prohibitive computational costs. Feature caching accelerates inference by reusing intermediate representations. Existing methods rely on historical features for implementation simplicity, yet suffer from severe error accumulation at high acceleration ratios. To address this limitation, we investigate the nature of the requisite feature correction. We demonstrate that the optimal calibration update is characterized by a shared lowrank subspace across diverse prompts. Guided by this structural insight, we propose LearniBridge, a learnable calibration mechanism for feature caching that bridges multiple timesteps through lightweight LoRA updates. This mechanism enables effective calibration requiring only 3-5 training samples. Extensive experiments on image and video generation show that LearniBridge achieves up to 5.87×, 5.75×, and 4.10× acceleration on FLUX, HunyuanVideo, and WAN 2.1, respectively. On WAN 2.1, it improves VBench by 1.28% over the previous SOTA at 4.10× acceleration. Our code is included in the supplementary material and will be released on GitHub.

Abstract:
Machine learning interatomic potentials (MLIPs) have brought substantial gains in the extrapolation capability in computational chemistry. However, most equivariant models are typically built with spherical tensors (STs), and it remains unclear whether it is the only practical design principle, or whether irreducible Cartesian tensors (ICTs) can offer distinct advantages by operating directly in the Cartesian space that naturally aligned with atomistic coordinates and tensor targets. In this work, we introduce the Cartesian3j and Cartesian-nj symbols, which serve as direct analogues of the Wigner-3j and Wigner-nj symbols defined for spherical tensor coupling. We further extend the e3nn library to support ICT products, and use this framework to build Cartesian counterparts of MACE, NequIP, and Allegro, allowing the first controlled comparison where architectures are held fixed and only the tensor basis is changed. Leveraging the ICTs and Cartesian-based architecture, a universal interatomic potential is trained and demonstrated competitive performance on a widely used public benchmark for materials discovery against SOTA ST models.

Abstract:
Social media platforms enable largescale cross-lingual communication, yet translating user-generated content (UGC) remains challenging due to its informal style, culture-laden expressions, and interaction-driven nuances. While recent LLMs have advanced translation quality, existing benchmarks and metrics often overlook whether translations preserve intended meaning and cultural resonance in real-world contexts. In this work, we introduceCULTURE-MT, a benchmark for social media translation that explicitly emphasizesCULturalTransmission andUGC-specific emotionREsonance. CULTURE-MT comprises 1,002 Chinese-to-English UGC notes spanning 14 domains, systematically categorized into four types based on culture-loaded symbols and linguistic styles. We also construct UGC-oriented training data to fine-tune Qwen3-8B and Qwen3-32B as strong baselines. We propose thecultural effectivenesscriterion and train a related JUDGER model that jointly assesses expression accuracy and cultural adaptability. Evaluating 15 models, we find that standard automatic metrics are largely insensitive to cultural effectiveness. Our work establishes a comprehensive framework for evaluating and advancing UGC translation, and will provide an open evaluation platform to support future research in culturally effective UGC translation.

Abstract:
Diffusion models have achieved stateof-the-art performance in image generation tasks. However, training powerful diffusion models remains time-consuming, which limits their practical deployment. In this paper, we revisit the learning dynamics of diffusion models through the lens of spectral bias, a phenomenon in which deep neural networks prioritize learning low-frequency modes. Through an empirical analysis of diffusion training, we observe that diffusion models exhibit a dual spectral bias. First, over training iterations, they fit low-frequency components earlier than high-frequency details. Second, along the diffusion timesteps, early denoising steps mainly reconstruct coarse low-frequency content, while high-frequency details emerge in later steps. Motivated by this observation, we propose Frequency-aware Patch Selection (FaPS), a general and fast training method for diffusion models that can be applied to both UNet and DiT backbones. Specifically, FaPS introduces a frequency-aware gating that adaptively selects image patches based on their frequency information and focuses computation only on the selected patches. Since the selection decisions are discrete and thus non-differentiable, we model the gating as a stochastic policy network and optimize it end-to-end using a policy gradient method. Our experiments demonstrate that FaPS achieves up to \mathbf3× faster training while maintaining comparable or superior generation quality, and improves the performance of diffusion models in limited-data settings.

Abstract:
Conventional dynamic SLAM approaches typically treat dynamic objects as outliers based on predefined categories, creating perceptual blind spots that limit the comprehensive environmental perception required for embodied agents. Although integrating Gaussian Splatting into SLAM enables holistic scene representation, it introduces an optimization paradox: without categorical priors, flexible dynamic primitives rapidly overfit static residuals. This phenomenon undermines the self-supervised error signals necessary for distinguishing motion. In response, we present De4D-SLAM, a novel framework designed for decoupled 4D reconstruction from monocular video. Our approach features a Gradient-Isolated Decoupling strategy, which leverages static reconstruction residuals to supervise a Spatially-Aware Kolmogorov-Arnold Network (SA-KAN), ensuring robust, category-agnostic motion segmentation. Additionally, we propose a Flow-Induced Initialization prior to stabilize the non-convex optimization of 4D Gaussian primitives using dense optical flow. Extensive evaluations on the TUM and Bonn benchmarks demonstrate that De4D-SLAM achieves state-of-the-art performance in both tracking and dynamic reconstruction, successfully reconciling the tension between robust localization and high-fidelity 4D mapping.

Abstract:
Auditory attention decoding (AAD) based on Electroencephalography (EEG) aims to identify the attended speaker in multispeaker environments. However, existing methods typically overlook the crucial phase information of EEG signals, which limits their ability to distinguish structured neural patterns from random noise in the frequency domain and hinders robust decoding. To address these issues, this paper proposes a Phase-aware Complex Refinement Network (PCRNet) for AAD, which consists of a Temporal Context Calibration (TCC) module and a Dual-Domain Integration (DDI) module. Specifically, the TCC module captures long-range temporal dependencies through multi-scale temporal attention mechanism, while the DDI module employs a phase-guided spectral filtering strategy to dynamically suppress noise-dominated frequencies and refine the real and imaginary components separately. This design enables effective phase recalibration and enhances the discriminability of target features in the complex domain. Experimental results on three public datasets demonstrate that PCRNet outperforms state-of-the-art (SOTA) methods, particularly under challenging ultra-short 0.1-second windows.

Abstract:
We study a simple yet principled modification of classical Qlearning that clips the value estimate in the Bellman backup by a threshold \lambda. The resulting algorithm, clipped Q-learning, is motivated by a key theoretical insight: the clipped Bellman backup is an unbiased one-sample estimation of a robust Bellman operator arising naturally from a transition-regularized MDP framework. This formulation corresponds to optimizing performance against a specific class of adversarial dynamics perturbations at the test time that reallocate transition probability mass away from high-value states, thereby inducing conservative but stable decision making. Under this interpretation, clipped Q-learning can be viewed as tracking the fixed point of the robust Bellman equation and learning policies that hedge against adversarial dynamics shifts at test time. We analyze two clipped Q-learning variants with an optimistic exploration bonus and establish polynomial regret guarantees, demonstrating statistical efficiency. Beyond the tabular setting, our framework suggests that value clipping is a modular mechanism that can be incorporated into general value-based RL algorithms with function approximation. As a proof of concept, we evaluate a clipped Double DQN algorithm on a control task and observe robustness improvements consistent with our theoretical predictions.

Abstract:
Computeruse agents (CUA) automate tasks specified with natural language such as "order the cheapest item from Taco Bell", by generating sequences of calls to tools such as click, type, and scroll on a browser. Current CUA implementations follow a sequential fetch-screenshot-execute loop where each iteration requires an LLM call, resulting in high latency and frequent errors from incorrect tool use. We present agent JIT compilation, an alternative that compiles task descriptions directly into executable code that is free to include LLM calls, tool calls, and parallelization. Our approach comprises three components: (1) JIT-Planner, which generates multiple code plans from a task, validates each against tool specifications, and selects the minimum-cost candidate; (2) JIT-Scheduler, which explores parallelization strategies via Monte Carlo cost estimation from learned latency distributions; and (3) an invariant-enforcing tool protocol specifying pre/postcondition state requirements that reduces the rate of generating plans with incorrect tool use. Evaluation across 5 applications demonstrates that JIT-Planner achieves 10.4× speedup and +28% accuracy improvement over Browser-Use, while JIT-Scheduler achieves 2.6× speedup and +9% accuracy improvement over OpenAI CUA.

Abstract:
Longhorizon motion prediction under external commands is challenged by latent decision uncertainty, where the internal states governing future behavior are unobservable and evolve stochastically over time. This issue is particularly pronounced in biological agents, whose motion trajectories reflect decision-making processes rooted in underlying cognitive states. To address these challenges, we propose CogSDE, a formulation of the controlled stochastic differential equation (SDE) for modeling instruction-driven latent decision dynamics. The drift term in the SDE incorporates a dual-channel control modulation mechanism, enabling external commands to modulate the latent state evolution. The diffusion term in the SDE adopts a state-dependent operator to model intrinsic uncertainty in latent decision dynamics. Furthermore, we establish a theoretical upper bound on the long-horizon prediction of CogSDE through dissipativity-based analysis. Experiments demonstrate that CogSDE consistently improves predictive accuracy in long-horizon motion generation. Importantly, predicted trajectories remain well aligned with control instructions over extended horizons, a property widely recognized as challenging in long-horizon motion prediction.

Abstract:
Affective and cognitive disorders manifest as distributed, timevarying brain network dynamics across regions, channels, and time, challenging robust representation learning from EEG/sEEG for clinical diagnosis. We proposeRECTOR(MaskedRegion–Channel–Temporal Modeling), an end-to-end self-supervised framework that unifies joint region-channel-temporal representation learning beyond fixed anatomical priors. At its core,RECTOR-SAis a hierarchical, block-sparse self-attention induced by Adaptive Functional Partitioning that evolves region structures from static anatomical definitions to adaptive functional regions. The self-supervision is driven byMasked Topology and Representation Learning, which jointly optimizes three complementary objectives: Masked Predictive Modeling, Topological Structure Modeling, and Cross-View Consistency. Across diverse benchmarks, RECTOR sets a new state-of-the-art in EEG emotion recognition and sEEG task-engagement classification. Crucially, its strong robustness to missing channels and cross-montage generalization underscores its potential for large-scale pre-training on heterogeneous EEG/sEEG, providing interpretable insights at both region and channel levels.

Abstract:
Recent progress in language model reasoning capabilities has revived a classic goal: characterizing which algorithms such systems can implement, or equivalently, which formal languages they can recognize. A growing research program studies RNNs, LSTMs, SSMs, and related architectures via reductions to finitestate automata and regular languages. Yet many results are not well-posed: they rely on implicit or incompatible semantic assumptions, invoking associativity and real-arithmetic techniques while also assuming floating-point arithmetic, which is finite and non-associative. Moreover, many proofs are highly architecture-specific and hard to transfer across closely related models. We address these issues with a unifying algebraic framework for a broad class of RNN language models, formally translating them to wreath products of transformation semigroups. By separating universal algebraic structure from contingent choices such as numerical semantics and wiring, the framework yields a disciplined workflow for rigorous expressivity analysis under realistic assumptions. We illustrate its value by rederiving and correcting representative expressivity claims from the literature under explicit deterministic finite-precision semantics.

Abstract:
Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such underutilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that, sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long-context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is submitted.

Abstract:
Training large language models via selfplay often suffers from a persistent iteration-collapse, where performance initially improves but subsequently regresses as training iterations increase. We analyze this phenomenon as arising from cross-iteration degeneration, where the task-generation distribution becomes increasingly confined to a narrow subset of familiar (seen) problems, weakening the effective learning signal and destabilizing training. To address this issue, we propose a plug-in approach that augments existing self-play pipelines with a one-class novelty reward. A Seen Detector trained on a historical buffer of previously used training problems identifies in-support instances and discourages redundant generation by the questioner, thereby steering exploration toward under-explored yet learnable regions. Experimental results show that the proposed method mitigates iteration-collapse during iterative training and yields consistent improvements.

Abstract:
Recent advances in groupbased reinforcement learning (RL) greatly improve LLMs' ability in text reasoning. Yet, these methods lack sufficient modeling of multimodal information, leading to significant reasoning hallucination. In this work, we propose POLIA, a novel group-based RL method with visual-object-level intrinsic advantage for multimodal reasoning. POLIA introduces two advantage computation stages over candidate answers and visual objects, respectively. The answer-level extrinsic advantages are computed based on the extrinsic rewards of a group of candidate answers. Moreover, we compute an intrinsic advantage for each visual object based on its confidence score and reference relations with final answers. Intuitively, the intrinsic advantage of an object reflects its potential contribution to the correct answer. This two-stage advantage computation ensures an accurate credit assignment mechanism over multimodal reasoning sequences with multiple visual objects. Experimental results on diverse multimodal reasoning benchmarks show that POLIA significantly outperforms open MLLMs and strong baselines. Our code is available at https://anonymous.4open.science/r/POLIA-AB1D.

Abstract:
Although Multimodal Large Language Models (MLLMs) demonstrate strong omnimodal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni along with standard audio-visual and video-only benchmarks show that OFF improves future forecasting performance and generalization. Code and data are available at https://anonymous.4open.science/r/FutureOmni_anon-1F01/.

Abstract:
It is widely recognized that, after generative pretraining, Transformer FeedForward layers implicitly function as semantic memory, encoding linguistic and factual knowledge, while the contexts in key–value (KV) cache contain raw events, serving as the source of models' episodic memory. In this work, we show that a same group of Transformer FeedForward-layer parameters can both be semantic and episodic memory, which is retrievable without explicitly attending to the related KV cache. To realize this idea, we introduce Hypermem, a hypernetwork that recurrently maps contexts into targeted updates of FeedForward parameters. We post-train the hypernetwork using continuation and random-access associative memory objectives, eliminating the need for test-time training. Extensive experiments demonstrate that our approach outperforms related methods, including MemoryLLM and generative adapter, on memory retrieval, long-context question answering, and personalization benchmarks, establishing a new state of the art for hypernetwork-based memory mechanisms. Our results suggest that directly bridging data and parameters provides a viable direction for exploring next-generation foundation models with more flexible and persistent memory capabilities.

Abstract:
Spatial–temporal time series forecasting is challenging due to complex lead–lag dependencies, which are often ignored or inadequately modeled by existing methods. Thus, we propose LagLLM, the first LLMempowered framework that explicitly models lead–lag dependencies by unifying data-driven dynamics modeling and knowledge-driven semantic reasoning. Specifically, LagLLM constructs a lead–lag graph by integrating learnable embeddings, spatial proximity, and prompt-guided reasoning from a frozen LLM, which can capture lead-lag dependencies informed by underlying data structure and semantic knowledge. In addition, LagLLM introduces structural token sorting based on the graph, which can make a fine-turned LLM explicitly perceive directional and delayed interactions. Experiments on eight real-world datasets show that LagLLM achieves the state-of-the-art performance with improved accuracy, robustness, and interpretability. The code is available at https://anonymous.4open.science/r/LagLLM.

Abstract:
A probability distribution \mathbbP is a measure on a \sigmaalgebra, assigning mass to sets rather than points. This poses a challenge for the training of neural networks, which often struggle to reconstruct the global topology of continuous manifolds from sparse samples. We mitigate this issue by innovatively transforming \mathbbP into a continuous function via spectral methods, providing theoretical guarantees for the convergence of the learned distribution to the true distribution. Specifically, we introduce a network, PriSpecNet, with a single-function evaluation (1-NFE) Pseudoinverse Lévy Inversion (PiLI) solver that regards generation as a fast analytical problem, eliminating the need for iterative numerical integration. By reformulating the generation in the spectral domain, we bypass the computationally expensive sampling trajectory while maintaining full compatibility with the stochastic interpolants. We test our PriSpecNet in two applications. For time series, it unifies generation and forecasting, outperforming state-of-the-art (SOTA) baselines with Context-FID reductions of 50.0%, 41.5%, 80.6%, and 63.1% on Sines, Solar, ETTh, and Stock benchmarks, respectively, also decreasing forecasting MSE by 29.8% on Solar and 23.8% on Stock. For vision on ImageNet 256 × 256, 1-NFE PiLI achieves a competitive FID of 1.66 using only 26 Gflops, representing a 170 × reduction in total Gflops compared to the 4,436 Gflops required by the advanced 25-NFE DPM-Solver++.

Abstract:
Diffusion models typically operate in fixeddimensional metric spaces, whereas geometric molecular data vary in dimensionality because molecules differ in size (number of atoms). A common adaptation in diffusion models for geometric molecular generation is to use architectures that handle variable-sized inputs, such as graph neural networks and transformers.However, these approaches ignore that molecular size also sets the spatial scale of atomic coordinates, which induces inconsistent generative trajectories across sizes.In 3D molecular diffusion, generation can be seen as forming a coarse structure first and then refining atomic positions. Larger molecules form coarse structures earlier than smaller ones because their spatial scales are larger relative to the noise. This makes the generative process inconsistent across sizes, with trajectories driven by molecular size rather than by a unified generative pattern. We are the first to identify and analyze this size-induced inconsistency by decomposing denoising dynamics, showing how spatial scale shapes formation of both 3D structure and atom types. Based on this, we propose Scaling the Prior (StP), which rescales the prior distribution by molecular size to normalize learning and generation across sizes, harmonize denoising trajectories, and enable consistently high-quality molecules.

Abstract:
Identifying highutility candidates from massive discrete spaces under expensive evaluations is a recurring challenge across the sciences, with structure-based drug discovery as a prominent example. While surrogate-based optimization can increase sample efficiency by reducing the number of expensive evaluations, modern molecular libraries have reached billions to trillions of compounds, making full-library surrogate inference itself a major computational bottleneck. We introduce BOBA, a bandit-guided surrogate optimization framework that eliminates full-library inference by adaptively allocating computation across partitions of the action space. By treating partitions as arms in a multi-armed bandit, BOBA concentrates inference and evaluations on empirically promising partitions while maintaining principled exploration. Experiments on real-world synthesis-on-demand libraries demonstrate that optimism-under-uncertainty bandits, combined with meaningful action space partitioning, are essential for effective allocation of inference and evaluations. Our findings reveal a tunable tradeoff between screening performance and surrogate inference cost, which supports practical optimization over current libraries, and establishes a viable route to ultra-large library virtual screening.

Abstract:
Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to realworld sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.

Abstract:
We present VSA2, a frontier trainable sparse attention for video DiTs. VSA2 includes a variety of new architectural features and training procedures that we apply across all stages of DiT development cycle including pretraining, RL, and inference -- to produce a 5.8B DiT with comparable or better quality than a full attention counterpart. Architecturally, VSA2 introduces a fine-grained router that improves the precision of identifying critical tokens and supports dynamic computation by allowing each query to attend to a variable number of key–value pairs. On training, we identify a Hard-to-Easy Curriculum, where models trained under high sparsity and later evaluated with lower sparsity during inference not only generalize effectively, but also outperform models trained with full attention in motion quality. VSA2 is also flexible: it can replace full attention during the middle of progressive low-to-high resolution pretraining, rebasing early-stage full-attention checkpoints. Experiments show that VSA2 reduces attention computation by half over VSA with lower loss. On 720P videos, it accelerates attention by 8.9x and end-to-end generation by 4.62x compared to the FlashAttention3 baseline, while achieving comparable or better video quality.

Abstract:
While AIgenerated hallucinations pose considerable risks, the underlying cognitive mechanisms by which humans can successfully recognize or be misled by these hallucinations remain unclear. To address this problem, this paper explores humans' neural dynamics to characterize how the brain processes hallucinated content. We record EEG signals from 27 participants while they are performing a verification task to judge the correctness of image descriptions generated by a multi-modal large language model (MLLM). Based on an averaged event-related potential (ERP) study, we reveal that multiple cognitive processes, e.g., semantic integration, inferential processing, memory retrieval, and cognitive load, exhibit distinct patterns when humans process hallucinated versus non-hallucinated content. Notably, neural responses to hallucinations that were misjudged versus correctly judged by human participants showed significant differences. This indicates that misjudged AI-generated hallucinations failed to trigger the standard neurocognitive fact verification pathway. The detailed code can be accessed openly through https://anonymous.4open.science/r/EEG-Hallucination-anonymous-F125.

Abstract:
Large visionlanguage models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal reasoning tasks. However, recent research shows that they are susceptible to adversarial examples. Existing LVLM attack methods are generally deployed in the white- or black-box setting, which severely rely on full-model gradients or elaborated transfer strategies, resulting in large resource costs. To this end, this paper focuses on a more efficient gray-box attack setting by solely accessing LVLM's vision encoder. Instead of using target images as the adversarial guidance, our main goal is to perturb the visual feature to best match more natural attacker-chosen target texts. Specifically, we develop a global semantic alignment module to project the visual features onto the SVD-structured subspace spanned by the textual semantics. We also propose to align detailed visual features with multi-context semantic texts extended by LLMs over discrete distributions via optimal transport. Extensive experiments demonstrate the superiority of the proposed method, while our attack is further proven to achieve great transferability across various LVLMs with CLIP-aware transfer designs.

Abstract:
Recent Multimodal Large Language Models (MLLMs) have advanced crossmodal reasoning by extending Chain-of-Thought (CoT) prompting to visual tasks. However, existing methods still rely heavily on explicit textual reasoning steps, leading to information loss, unstable perception–reasoning interaction, and high computational cost. Inspired by human cognition, we argue that effective visual reasoning emerges from a dynamic interplay between perception and latent thought, rather than a purely linear verbalization process. Motivated by this insight, we propose Latent-Driven Progressive Visual Reasoning (LDPVR), a framework that formulates multimodal reasoning as a Markov Chain of Recursive State Simplification, where explicit textual states are progressively refined under the guidance of latent transitions. Central to LDPVR is Interleaved Latent Grounding, which leverages latent semantic intent to actively retrieve fine-grained visual evidence and drive robust state evolution, enabling the model to iteratively reduce uncertainty before committing to simplified textual states. To optimize this process, we introduce a three-stage curriculum combining supervised fine-tuning, latent-text distillation, and reinforcement learning via Group Relative Policy Optimization (GRPO). Experiments on six multimodal reasoning benchmarks demonstrate that LDPVR improves reasoning accuracy while maintaining low inference latency. Code will be made public.

Abstract:
Posttraining adaptation has become the central paradigm for leveraging large language models (LLMs) in recommendation. While recent preference optimization methods, such as Direct Preference Optimization (DPO), enhance pairwise preference discrimination, they remain vulnerable to \emphreward hacking: models exploit imperfections in reward signals, leading to inflated training metrics without genuine recommendation gains. We analyze this issue from a gradient perspective and formalize the concept of the \emph\varepsilon-insensitive region region, where pairwise updates exert little influence on the ordering between positives and unsampled negatives. Under the Bradley–Terry model, we further show that these regions can occupy a substantial fraction of the preference space, inevitably leading to misaligned rankings. To address this issue, we propose Simulated Preference Optimization for Reward-hacking mitigation using Pseudo-negatives (SIRIUS). Our framework introduces pseudo-negative samples to enrich contrastive signals and reduce the prevalence of \varepsilon-insensitive regions. Extensive experiments on three public benchmarks show that \our consistently improves ranking quality and effectively mitigates reward hacking, providing both theoretical and practical insights for advancing LLM-based recommendation. Our code is available at \urlhttps://anonymous.4open.science/r/C557-id

Abstract:
Existing deep learning models have demonstrated potential in Diabetic retinopathy (DR) diagnosis, but they still suffer from three key challenges: reliance on singlesource inputs, opaque and untraceable reasoning processes, and the absence of a mechanism for result verification. Thus, we propose a provenance-enabled concept-based framework for multi-view DR diagnostic (ProConMV), which integrates DR lesion masks, clinical text and multi-view data, utilizing multimodal prompt analysis and visual-text concept interaction to learn the interpretable multi-source input. During the reasoning stage, the proposed framework introduces lesion concepts for causal reasoning chains combining clinical guidelines, and adds doctor intervention for human-machine collaboration. For dynamic fusion decision and verification in multi-view DR diagnosis, we derive via generalization theory that incorporating each view’s lesion concept uncertainty and grading uncertainty reduces the generalization error upper bound. Accordingly, we design a dual uncertainty-aware module to enable provenance-based verification, ultimately enabling verifiable analysis of DR diagnostic results. Extensive experiments conducted on two public multi-view DR datasets demonstrate the effectiveness of our method.

Abstract:
The success of RL for LLM posttraining stems from an unreasonably uninformative source: a single bit of information per rollout as binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study natural language feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, tools, and automated judges routinely critique LLM outputs. Towards leveraging text feedback at scale, we formalize a multi-turn RL setup where text feedback is available during training but not at inference. Therefore, models must learn to internalize the feedback in order to improve their test-time single-turn performance. To do this, we propose two methods: Self Distillation, which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling, which predicts the feedback as an auxiliary objective. We provide theoretical analysis on both methods, and empirically evaluate on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.

Abstract:
Large Language Models (LLMs) often produce hallucinated outputs, which limit their reliability in highstakes applications. Conformal prediction can provide guarantees on the correctness and factuality of LLM outputs, but existing approaches rely on the exchangeability assumption, which rarely holds in online settings where user queries and interests change over time. To solve this problem, in this paper, we proposePACE(ProactiveAdaptiveConformal InferencE), a novel framework that sequentially updates the time-varying target miscoverage parameter with a dynamic step size to maintain valid coverage under online distribution shifts. PACE is motivated by the theoretical connections between expected miscoverage error and key factors such as distribution shifts and instantaneous parameter error. It integrates two complementary signals: (1) a proactive shift detection to estimate the magnitude of distribution shifts, and (2) a reactive error that scales updates according to the local coverage gap. Extensive experiments on synthetic and real-world datasets demonstrate that PACE consistently outperforms advanced adaptive baselines. It reduces the deviation from the target error rate by up to 60% in QA tasks and accelerates coverage recovery by over 2.5x during abrupt shifts, ensuring stable factuality guarantees without compromising utility and stability.

Abstract:
Traditional evaluations measure a learning algorithm's final performance on an i.i.d. test set, reducing learning to a single aggregate score. This approach obscures a fundamental question: to what extent does learning from a specific example generalize to others? Such persample generalization—akin to learning by analogy in human cognition—captures how far the knowledge extracted from one example can transfer, yet remains invisible to standard benchmarks. We introduce the Generalization Spectrum, an evaluation framework designed to expose this hidden dimension. For each training example, we construct a controlled suite of test variants arranged by increasing transfer distance—from exact recall to implementation transfer across languages, context transfer under complete narrative re-framing, category-matched in-domain problems, and an unpaired baseline. By tracking performance across these distances, we reveal not just whether an algorithm learns, but how far that learning extends. We instantiate this framework on competitive programming, using a synthetic generation pipeline seeded with recent problems to mitigate contamination. Across ICL, SFT, RFT, and RL, we find two levers that shape generalization radius: (i) the learning algorithm---how to learn from a fixed set of training instances---where RL yields markedly stronger near-transfer than SFT/RFT under matched memorization and learns more transferable structure; and (ii) the learning content---what extra signal to provide or reformat given the same seeds---where abstract ICL demonstrations and on-policy SFT targets yield more reliable transfer than concrete code and off-policy supervision.

Abstract:
Semantic Typography aims to visualize the meaning of an input word through the form of a character, while preserving its legibility. Existing vectorbased methods, which primarily rely on text-driven optimization like Score Distillation Sampling (SDS), often produce glyphs that lack rich semantic details. Furthermore, these approaches struggle to maintain the overall structural integrity of the glyphs and frequently suffer from visual artifacts caused by intersections, compromising both readability and aesthetic quality. To address these challenges, we propose VecDesigner, a novel optimization-based method for vector semantic typography. Specifically, we introduce Visual-Guided Score Distillation Sampling (VGSDS), which leverages text-related reference images as visual guidance to infuse the glyphs with richer and more concrete semantic details. To preserve legibility and structural integrity, we design a vector-based Procrustes loss to constrain the overall deformation of the glyph. Concurrently, we effectively mitigate the intersection problem by imposing positional relationship constraints on the control points. Comprehensive experiments demonstrate that VecDesigner outperforms existing methods in both semantic expression and structural preservation, generating high-quality, expressive, and clearly legible semantic glyphs.

Abstract:
AlphaZeroandMuZerohave demonstrated superhuman performance across a range of strategic tasks. Yet their reliance on maximizing expected returns limits their use in realworld settings, where even high-return policies may incur rare but catastrophic failures. We introduceRiskZeroto address this limitation; the firstMuZero-family method for risk-sensitive decision-making, and planning withzeroprior knowledge of environment dynamics.RiskZerolearns distributional quantities to estimate trajectory-level risk, guiding search toward policies that explicitly avoid rare but severe outcomes. We establish theoretical convergence to optimal, stationary risk-sensitive policies and validate our approach on environments designed to test risk-sensitive learning from pixels, as well as on larger-scale combinatorial tasks. Across all settings,RiskZeroconsistently outperforms state-of-the-art risk-sensitive baselines, and improves sample efficiency, providing a general framework for safer and reliable model-based reinforcement learning under uncertainty.

Abstract:
Personalized federated LoRA fine tuning has become a key approach to addressing data heterogeneity in distributed fine tuning of large language models (LLMs). Existing methods typically assume homogeneous personalization needs across clients, relying on dual LoRA or personalized calibration schemes. However, they fail to account for the heterogeneity of local personalization requirements and the conflicting optimization objectives in dual LoRA, limiting scalability and performance. To address this, we propose HetCPFLoRA, a customizable heterogeneous federated LoRA fine tuning algorithm inspired by the decoupling properties of LoRA parameters. We employ a single adapter fine tuning scheme to mitigate conflicts between personalized and generalized optimization, decouple LoRA into generalized and personalized subspaces for local customization, and use SVD compression to integrate cross client generalized knowledge. During inference, we introduce an OOD oriented dynamic mechanism to adjust the weighting between personalized and generalized decoupling knowledge, improving performance on user data. Extensive experiments on two public benchmark datasets show that Het-CPFLoRA outperforms state of the art methods in both personalization and generalization across heterogeneous scenarios. The code will be released as an open-source project.

Abstract:
Large language model (LLM)driven multi-agent systems (MAS) typically require multiple model invocations and complex coordination during inference, and their execution strategies directly affect system accuracy, latency, and computational cost. Parallel execution provides a means to improve inference-time efficiency. From the perspective of inference-time execution, this paper models parallelism in multi-agent systems as two distinct levels of decision processes: replica parallelism, which explores multiple complete solution paths at the task level, and structural parallelism, which enables concurrent execution within a single solution path through task decomposition. However, the roles of different forms of parallelism and their interrelationships still lack systematic study in terms of unified organization and coordination. We therefore propose TIPEX, a controllable execution framework that unifies these two levels of parallelism and coordinates their roles within the inference process under a unified execution semantics while supporting systematic combinations and analyses of different parallel strategies and parameter configurations. Systematic experiments on the GAIA benchmark demonstrate that inference-time parallelism can significantly improve accuracy and reduce end-to-end latency at the cost of increased token consumption. Further analysis shows that replica and structural parallelism exhibit complementary effects across task complexities, with tasks of intermediate difficulty benefiting most from their coordination, while overly aggressive parallel strategies do not necessarily yield better performance.

Abstract:
Largescale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier's score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality set. Importantly, we find that training on CQF-selected data can outperform training directly on the high-quality set, even when the latter is sufficiently large. This finding alone is particularly striking, given the substantial effort and cost recently devoted to augmenting high-quality data. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well as the low-quality one. Finally, we introduce an optimization-driven notion of data quality and demonstrate that it can be reliably estimated using small-scale proxy experiments. Altogether, our results both elucidate the mechanisms behind CQF and deepen our understanding of data selection methods widely used in practice.

Abstract:
Foundation models (FMs) pretrained on largescale sequence data have emerged as a promising paradigm for RNA biology, yet the mechanisms underlying their transferability remain unclear. In this work, we conduct a systematic investigation of transfer learning in RNA FMs across diverse structural and functional tasks. Our results demonstrate that frozen representations from pretrained RNA FMs are not universally transferable, and that the hierarchical feature reuse paradigm prevalent in computer vision does not generally extend to RNA FMs. Instead, pretraining primarily benefits downstream tasks by providing a favorable optimization initialization when pretraining and downstream objectives are well aligned, which accelerates convergence toward flatter minima associated with improved generalization. Overall, our findings characterize pretraining as an optimization prior whose effectiveness is governed by task alignment and model capacity, offering principled guidance for future RNA FMs.

Abstract:
Reliable evaluation of large language model (LLM) agents depends critically on benchmark validity. However, agent benchmarks are increasingly complex and often contain hidden flaws arising from interactions among user instructions, environments, tools, groundtruth trajectories, and evaluation protocols. These issues confound model errors with benchmark artifacts, undermining leaderboard-based comparisons. Manual auditing does not scale to this setting, while existing automated methods are not designed to systematically capture semantic and contextual issues across interacting benchmark components. We propose the COBA (COmponent-based Benchmark Auditing) pipeline, an automated pipeline for diagnosing and filtering validity issues in agent benchmarks. Our pipeline decomposes agent tasks into four standardized components—User, Environment, Ground Truth, and Evaluation—and operationalizes a component-level issue taxonomy using hybrid rule-based detectors and taxonomy-guided LLM evaluation, augmented with an adversarial rebuttal stage to reduce false positives. Across six widely used agent benchmarks, COBA achieves strong alignment with expert judgments, with F1 scores between 0.791 and 0.874. The pipeline complements manual verification of \tau^2-bench by identifying issues missed due to benchmark complexity and generalizes effectively to previously unseen benchmarks with minimal adaptation. Our analysis shows that benchmark flaws are widespread and materially affect agent evaluation outcomes, demonstrating that component-based automated auditing provides a scalable foundation for more reliable and interpretable agent evaluation.

Abstract:
Testtime adaptation (TTA) has emerged as a popular paradigm for improving the performance of vision–language models (e.g., CLIP) on downstream tasks. Among existing CLIP-based TTA methods, Test-Time Prompt Tuning (TPT) is a pioneering work that optimizes textual prompts using multiple test-time augmentations and remains a strong baseline to date. In this work, we revisit TPT and reveal that its optimization can be interpreted as implicitly learning from self-generated pseudo labels. Building on this perspective, we propose a unified self-ensembling frameworkUSEthat jointly refines the optimization and inference stages. During optimization, we introduce a simple yet effective self-ensemblingSEstrategy that emphasizes the test image itself over its augmented views adaptively to obtain more reliable pseudo labels. To fully exploit the potential of augmentation, we further apply the same strategy at inference time, unifying the objectives of both stages. Notably,SEcan also act as a lightweight training-free TTA method. Extensive experiments across multiple datasets demonstrate thatSEandUSEoutperform their counterparts, respectively. Furthermore,SEyields consistent performance improvements when integrated with existing TTA methods.

Abstract:
Multimodal Continual Instruction Tuning (MCIT) aims to exploit the incrementally accumulated knowledge to process multimodal inputs of diverse tasks, where parameter routing is an important technology. Existing advanced methods typically rely on sample to task center similarity and crossmodal fusion with equal weight during routing. However, such solutions face two fundamental flaws: (1) Within each modality, sample to task center distance is sub-optimal for routing since the abundant intra-task diversity information is underleveraged. (2) Different modalities exhibit varying reliability across tasks, where the modality with inter-task ambiguity can easily misguide the routing result. To address these problems, we propose Hyperbolic Uncertainty-aware Modality-Balanced Routing (Hyper-LLaVA) to improve parameter routing capacity based on cross-modality task feature uncertainty modeling. Specifically, to improve intra-modality task matching, Hyper-LLaVA accesses the sample to task distribution similarity in the Hyperbolic space. Besides, to alleviate the degradation brought by unreliable modality, Hyper-LLaVA quantifies the task matching ambiguity within each modality to achieve adaptive balancing between task matching across modalities. Based on the complementary intra- and inter-modality task matching enhancement, our Hyper-LLaVA outperforms state-of-the-art approaches by large margins.

Abstract:
As Multimodal Large Language Models (MLLMs) continue to be trained, the availability of public data diminishes, limiting the possibility for further training and adaptation. However, private data remains an underutilized yet valuable resource. Federated Learning (FL) enables decentralized training on private data, yet extending it to MLLMs is challenging: heterogeneous client modalities induce architectural incompatibility, and fullparameter fine-tuning of billion-scale models incurs prohibitive communication costs. Parameter-efficient methods like LoRA alleviate these issues but introduce aggregation inconsistency, as averaged low-rank updates fail to recover the true global update faithfully. To address these issues, we propose UniFLoW(Universal multi-modal Federated LoRA fine-tuning framework With Analytical Aggregation), a unified federated framework that leverages pre-trained large language models and a multi-modal Encoder architecture, and our proposed Federated Aggregating Analytical Low-Rank AdaptionFedA^2-LoRA). UniFLoW effectively utilizes fragmented client-side multi-modal data while FedA^2-LoRA ensuring consistent aggregation. And modality-specific encoders and a II stage training strategy ensure effective integration of diverse modalities without overfitting. Experiments on text, image, and speech demonstrate that UniFLoW enables scalable, communication-efficient, and aggregation-consistent federated fine-tuning, with FedA^2-LoRA achieving state-of-the-art performance compared to existing FedLoRA approaches. We envision UniFLoW as a promising solution to the growing scarcity of public data.

Abstract:
Just as LEGO pieces can be assembled into an unlimited variety of structures, heterogeneous federated learning (HFL) can be viewed as the assembly of diverse model components. Inspired by this analogy, we reformulate HFL as a LEGOlike assembly game. The central challenge in HFL lies in learning across heterogeneous model architectures, which hinders direct parameter sharing. To address this challenge, we propose to decompose models into a set of standardized, modular components—analogous to LEGO pieces, and then to learn these components collaboratively across clients. We refer to these components as model blocks. This paper investigates how to learn and assemble them under predefined composition rules to construct heterogeneous models. Based on this perspective, we develop a novel federated learning framework, termed LEGO-FL, which enables flexible model construction while preserving collaborative learning. We evaluate the proposed method through small-scale experimental studies and demonstrate its feasibility. Finally, we discuss potential extensions of LEGO-FL to large-scale federated settings and more complex model architectures.

Abstract:
One of the most common complaints about large language models (LLMs) is their prompt sensitivityi.e., the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly-used styles of prompting: those using instructions, which describe the task in natural language, and those using in-context examples, which provide few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, all the prompts we study engage the same common underlying mechanisms. Specifically, we identify task-specific heads that are interpretable in vocabulary space--- which we dub lexical task heads---and show that these heads are shared across prompting styles and are essential to triggering subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.

Abstract:
The attention mechanism with softmax normalisation is a foundational component of Transformerbased large language models. However, with very long contexts, attention scores are known to diminish, raising fundamental questions about token distinguishability and how it can be preserved. In this work, we provide a formal characterisation of token distinguishability in attention as a function of context length and embedding dimension. We introduce Aitchison distance to quantify relative differences among attention probabilities, and show that, with Gaussian queries and keys, even in the long-context regime, token distinguishability converges to a finite, non-zero limit rather than vanishing. Leveraging the linear relationship between temperature scaling and Aitchison distance, we derive a theoretical lower bound of \Omega(\sqrt\log L) on the logit scaling required to produce a sharp attention distribution. Finally, we demonstrate that Aitchison distance provides a principled and practical alternative to entropy for monitoring training and inference, as it captures the full compositional structure, including the smaller components of the attention probabilities.

Abstract:
We study simultaneous confidence bounds for aggregated effects over downwardclosed subset families of independent statistical tests. The bounds are obtained by bootstrap calibration of the maximum normalized aggregated effect over the relevant subset family, yielding valid post-hoc inference for data-selected subsets and tighter bounds than classical methods that protect all linear contrasts. A central challenge is that the required maximization is a nonlinear combinatorial optimization problem whose exact solution is essential for correct coverage. We address this challenge by casting the problem as a densest subgraph optimization and reformulating it as a linear program, or as a mixed-integer linear program when downward-closed linear constraints are imposed, enabling efficient and exact evaluation. We further characterize the growth regime of the number of tests for which the bootstrap calibration remains valid and illustrate the method on several machine learning applications.

Abstract:
Most existing cultural evaluation frameworks for large language models (LLMs) focus on matching model outputs to groundtruth answers, primarily measuring factual cultural awareness. This overlooks whether models internalize broader cultural structure and pluralism. We introduce a spectral-analysis-based framework that captures large-scale macrostructural patterns in models' cultural knowledge and evaluate eight LLMs across nine cultural domains spanning all five of Newmark's cultural dimensions and 170 countries. Comparing with human data, we find that instruction-tuned models align more closely with human cultural structure than older models, while increased model size does not consistently improve performance. Finally, simulation-based experiments show that our proposed spectral metric better predicts a model's ability to serve users from unfamiliar cultural backgrounds than existing ones.

Abstract:
Value factorization eases nonstationarity in MARL, but its static coordination assumptions hinder generalization on long-horizon tasks with shifting dependencies. Prior VQ-VAE methods abstract trajectories yet miss time-varying inter-agent dependencies. We present TACTIC, a CTDE framework with three advances: (i) hierarchical goal decomposition to guide exploration under sparse rewards; (ii) dynamic sparse coordination graphs that adapt dependencies via variance-based TD-error pruning; and (iii) a semantic-conditioned VQ-VAE that discretizes trajectories into coordination classes and maps them to graph-level edge decisions, while also conditioning local policies. A pretrained, frozen goal predictor decouples task recognition from control, preventing gradient interference across coordination abstractions. On SMAC and SUMO, TACTIC delivers state-of-the-art coordination and transfer under sparse rewards and dynamic task structures.

Abstract:
LowRank Adaptation (LoRA) enables efficient federated fine-tuning of segmentation foundation models for medical imaging. However, most federated LoRA methods adopt a uniform aggregation rule, which breaks under the encoder–decoder asymmetry in medical segmentation: the encoder is dominated by appearance shifts, while the decoder is dominated by supervision variations. This mismatch entangles shared anatomy with site-specific biases and harms generalization. To address this, we propose Inverse Asymmetric Tuning (IAT). IAT aligns adaptation with heterogeneity sources by selectively personalizing module-specific adaptation components in the encoder to absorb acquisition-driven appearance shifts and in the decoder to accommodate site-dependent supervision, while retaining a shared pathway for transferable consensus. However, structural separation alone is insufficient under LoRA’s bilinear parameterization, where multiplicative coupling can still cause site-specific updates to leak into the shared update direction. We therefore introduce a Subspace Orthogonality Regularizer that penalizes shared–local collinearity in the effective update space, mitigating leakage without increasing communication. Extensive experiments demonstrate consistent improvements over strong federated LoRA and parameter-efficient FL baselines.

Abstract:
Diffusion models are effective generative frameworks with strong representation learning capabilities, yet the intrinsic properties that govern their semantic structure and generalization remain poorly understood. Drawing inspiration from selfsupervised representation learning (SSL), we introduce an evaluation framework that decomposes diffusion features into a perturbation invariant component and a residual component induced by noise and augmentations. From this decomposition we derive the Invariant Contamination Ratio (ICR), a Fisher-based metric that measures how residual, augmentation-sensitive energy contaminates invariant signal in feature space. We use this framework to analyze both discriminative and generative behavior. On the representation side, we find invariance peaks at intermediate noise levels, which also yield the best downstream classification performance. On the generative side, we study how training transitions from genuine generalization to memorization in data-limited regimes, and find that \mathrmICR serves as a sensitive training time indicator of the early learning phenomenon: rising residual energy along Fisher directions marks the onset of memorization, detectable from training features alone without external evaluators or held-out test sets. Overall, our results show diffusion models can be monitored from a self-supervised perspective via the geometry of their learned representations.

Abstract:
We study infinitehorizon discounted Markov decision processes (DMDPs) under a generative model. Motivated by the Algorithms with Advice framework (Mitzenmacher and Vassilvitskii, 2022), we propose a novel framework to investigate how black-box predictions of the transition matrix can enhance sample efficiency in solving DMDPs and improve sample complexity bounds. We focus on DMDPs with N state–action pairs and discount factor \gamma. We first provide an impossibility result showing that, in the presence of predictions with unknown accuracy, no sampling policy can compute an \epsilon-optimal policy with a sample complexity better than \tildeO((1-\gamma)^-3 N \epsilon^-2), which matches the state-of-the-art minimax sample complexity bound without prediction. In complement, we design an algorithm based on minimax optimization techniques that leverages predictions of the transition matrix without requiring knowledge of the prediction error. Our algorithm achieves a sample complexity bound that depends on the prediction error and is uniformly better than \tildeO((1-\gamma)^-4 N \epsilon^-2), the previous best result derived from convex optimization methods. In some cases, our bound even improves upon the state-of-the-art \tildeO((1-\gamma)^-3 N \epsilon^-2), despite not having access to the prediction quality.

Abstract:
Cross multidomain log anomaly detection aims to train a unified model applying in multiple heterogeneous systems, alleviating the annotation cost and scalability bottlenecks of traditional cross single-domain approaches. However, existing methods face two fundamental challenges: (i) geometric proximity alone is insufficient to certify normality, and (ii) forcibly aligning distributions across domains can induce negative transfer. To address these issues, we propose MPFM (Cross Multi-Domain Prototype Flow Matching for Log Anomaly Detection), grounded in the principle that anomalies are samples that cannot be stably generated by the normal data-generating mechanism. Specifically, MPFM employs a shared–private prototype system to disentangle cross-domain commonalities from domain-specific patterns, introduces domain-conditioned flow matching to perform anomaly detection by integrating structural and dynamical evidence, and further leverages prototype-drift-driven hard example mining to improve robustness near decision boundaries. Experiments on HDFS, BGL, Thunderbird, and Spirit demonstrate that MPFM delivers substantial gains under multi-domain joint training.

Abstract:
With the advance of generative AI, the textto-image (T2I) model has the ability to generate various contents. However, T2I models still can generate unsafe contents. To alleviate this issue, various concept erasing methods are proposed. However, existing methods tend to excessively erase unsafe concepts and suppress benign concepts contained in harmful prompts, which can negatively affect model utility. In this paper, we focus on eliminating unsafe content while maintaining model capability in safe semantic meaning interpretation by optimizing the concept erasing reward (CER) with reinforcement learning. To avoid overly content erasure, we introduce the safe adapter to project partial text embedding for efficient concept regulation in cross-attention layers. Extensive experiments conducted on different datasets demonstrate the effectiveness of the proposed method in alleviating unsafe content generation while preserving the high fidelity of benign images compared with existing state-of-the-art (SOTA) concept erasing methods. In terms of robustness, our method outperforms counterparts against red-teaming tools. Moreover, we showcase the proposed approach is more effective in emerging image-to-image (I2I) scenario compared with others. Lastly, we extend our method to erase general concepts, such as artistic styles and objects. Disclaimer: This paper includes discussions of sexually explicit content that may be offensive to certain readers.

Abstract:
Autonomous computer use agents that powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, realworld execution environments are far from ideal: pop-up dialogs, resolution changes, and competing applications frequently interfere with agent perception and control. We introduce AgentHijack, a benchmark designed to evaluate the robustness and reliability of computer-use agents under common corruptions, where the uncertainties in dynamic environment disrupt the execution flow without direct adversarial intent. Specifically, AgentHijack introduces 9 configurable common corruptions to replicate realistic imperfect scenarios. We evaluate a variety of desktop tasks that utilize MLLM-based agents and discover that even minor instances of corruption can result in substantial performance degradation, which emphasizes the fragility of agents and underscores the necessity of robustness evaluation. Afterward, we propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness. We anticipate that our work will underscore the importance of GUI agents robustness and inspire more follow-up research efforts in this direction.

Abstract:
Decoderonly large language models (LLMs) struggle with table reasoning because tables must be serialized, obscuring row- and column-level structure. Prior graph and hypergraph approaches encode structure with an external encoder, but their gains are often inconsistent under autoregressive decoding. We analyze how tabular structure is represented inside decoder-only LLMs and find that row and column relations concentrate in a small subset of layers and attention heads. Based on this observation, we propose HInT, which injects hypergraph-derived structural features directly into these structural layers. HInT constructs a table hypergraph over cells and headers, performs lightweight message passing, and fuses the resulting features with token hidden states via token-level gated fusion, while preserving standard autoregressive computation. Experiments across diverse table reasoning tasks show consistent improvements over text-only baselines and prior (hyper)graph-based methods.

Abstract:
Many scientific problems require inferring unobserved mechanistic latent states from indirect observations. While classical approaches, including expectationmaximization, do not scale to combinatorially large spaces, deep learning approaches such as variational autoencoders typically form artificial latent states rather than reconstructing the mechanistic ground-truth states. Here, we introduce GReinSS, a policy learning framework that uses dynamically rescaled rewards to learn latent state distributions that maximize the observed data likelihood. We show that GReinSS accurately reconstructs simulated latent sets and latent graphs, outperforming alternative policy learning and generative modeling baselines. Additionally, GReinSS reconstructs isoforms from real short-read RNA sequencing data that better match orthogonal long-read sequencing detected isoforms than the standard RSEM algorithm. Overall, GReinSS is a principled and practically effective approach for generative modeling and inference of combinatorial latent states from indirect observations.

Abstract:
Largescale Set Cover Problems (SCP) with millions of variables and complex cost structures require high-quality solutions within seconds, yet remain beyond the reach of exact solvers and pose severe generalization challenges for neural methods. Such problems necessitate decomposition into bounded subproblems; however, when the induced subproblem topology differs from that observed during training, existing neural approaches often fail to transfer reliably. We introduce Generative Large Neighborhood Search (GLNS), which reframes neighborhood selection as generation using a discrete diffusion model. Our key insight is that the diffusion denoising trajectory exposes variables exhibiting high prediction instability across timesteps and identifies regions where local repair yields downstream improvement. GLNS exploits this trajectory-level signal to construct high-impact neighborhoods via a localized, bounded-complexity generative sampling procedure, enabling robust neighborhood selection without retraining. As a result, GLNS transfers effectively across cost regimes and instance scales within SCP. Under tight and equal wall-clock budgets, GLNS consistently outperforms established neural baselines and achieves competitive performance with state-of-the-art MIP solvers. These results demonstrate trajectory-guided generation as a scalable framework for large-scale SCP and suggest potential relevance to other constrained optimization settings.

Abstract:
As AI models continue to develop powerful capabilities, it becomes critical that we are able to verify that their output is aligned with our intentions. A recent line of work focuses on verification via debate, a model of interactive proofs where two competing powerful provers, or AI models, debate each other to convince a weak verifier, or a human, of the correctness of their claim. However, debate assumes that the two AI models possess equal abilities and that one of them is truthful, which may not be realistic. In this work, we showhow to avoid debate: we initiate the study ofsingleproverinteractive proofs for AI safety. Prior results in single-prover interactive proofs do not immediately carry over to the AI safety setting because they do not work when the computation has access to an oracle, such as to human judgment or an external database such as the web. We present doubly-efficient single-prover interactive proofs for oracle-aided computations (also known as relativizing proofs), in the settings where (1) the computation is robust, in the sense that the output does not change if at most a small fraction of the answers to oracle queries are incorrect, or (2) the oracle is a low-degree polynomial. These results suggest that interactive verification is possible even without debate, under structured or noise-tolerant oracle access.

Abstract:
Interactive world models continually generate video by responding to a user's actions, enabling openended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience, and presents significant obstacles to down-stream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera and renderer. This allows us to synthesize new frames with persistent spatial memory and consistent geometry. Our approach achieves substantial improvements in spatial memory, 3D consistency, and long-horizon video generation quality over existing methods, producing coherent and evolving 3D worlds.

Abstract:
Optimization problems are fundamental across science and industry, including planning, scheduling, and resource allocation. While LLMs show promise in automating optimization, they struggle to bridge the gap between realworld requirements and both mathematical formulations and effective heuristic designs. Furthermore, the field lacks a unified framework that spans problem formulation and heuristic discovery for NP-hard settings. To address these challenges, we propose OptMaster, a unified framework that spans optimization from formulation to heuristic discovery, structuring the process as a Directed Acyclic Graph (DAG) where each node represents a candidate solution. The DAG architecture enables cross-branch knowledge transfer when search progress stagnates. Within each node, we further replace textual self-reflection with independently generated verification code, grounding the evaluation in deterministic computation to suppress hallucinations. OptMaster achieves competitive performance across two optimization paradigms. In Formulation Intelligence, OptMaster achieves state-of-the-art accuracy across the three most challenging benchmarks in the field. In Heuristic Discovery, OptMaster surpasses the best known solutions on Circle Packing (n=26, 32) and achieves a cut of 9,590 on Gset70 with significantly reduced time and search budgets.

Abstract:
Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis, by mixing within the image plane egomotion and environmental dynamics, they exhibit physical inconsistencies, such as morphing or vanishing objects, especially over long time horizons. In this paper, we propose FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior works that treat the world as a sequence of image-based features, FR3D explicitly decouples the 3D evolution of the scene from the agent's trajectory, treating the inferred ego-motion as a latent proxy for action. This disentanglement resolves the ambiguities between self-motion and world-motion, ensuring geometric consistency into the future. Furthermore, we introduce a teacher-student distillation strategy that leverages the spatial "common sense" of off-the-shelf foundation models, leading to robust zero-shot generalization. Extensive experiments demonstrate FR3D's strong performance for future dynamic 3D reconstruction from monocular observations across multiple datasets, even 2 seconds into the future.

Abstract:
Video anomaly understanding (VAU) relies on sparse, contextdependent cues. However, existing passive paradigms suffer from observational aliasing, where static sampling fails to disambiguate semantically distinct events. To overcome this, we propose Anom\text-\pi, a closed-loop framework that reconceptualizes video understanding as an active sequential decision-making process within a dynamic environment. Inspired by human video-reviewing behavior, this framework unifies internal cognitive reasoning and strategic evidence acquisition into an interleaved policy, utilizing temporal atomic operators such as local backtracking, temporal expansion, and fine-grained sampling to endow the model with perceptual proactivity. To learn such complex interaction strategies under video-level weak supervision, we design Interactive Direct Preference Optimization (iDPO) to achieve trajectory-level policy alignment, guided by an Active Evidence Inquiry (AEI) utility that balances task success, informative evidence acquisition, and interaction cost. This approach enables the agent to learn to actively disambiguate hypotheses while suppressing redundant exploration. Extensive experiments demonstrate that our framework, with only 2B parameters, achieves highly competitive performance, significantly outperforming state-of-the-art large-scale VAU models in complex scenarios.

Abstract:
Longcontext Large Language Model inference is severely bottlenecked by the massive Key-Value (KV) cache, yet existing sparse attention methods often suffer from static fixed-budget (Top-k) retrieval or rely on proxy scores that are computationally expensive and biased. To address these limitations, we propose RaBitQCache, a novel sparse attention framework that utilizes randomized rotated binary quantization and high-throughput binary-INT4 arithmetic to efficiently estimate attention weights. Our proxy score serves as an unbiased estimator with a proven error bound, enabling adaptive Top-p retrieval that dynamically adjusts the token budget based on actual attention sparsity. We further implement a hardware-aware system with asynchronous pipelining and lazy updates to mask overhead. Evaluations demonstrate that RaBitQCache significantly accelerates inference and reduces memory I/O while preserving generation quality compared to state-of-the-art baselines.

Abstract:
While promptbased parameter-efficient continual learning mitigates catastrophic forgetting by isolating task-specific prompts, this isolation also limits later tasks from improving earlier ones, leaving backward knowledge transfer underexplored. We address this limitation by proposing Selective bAckward refinement for positive Backward knowledge transfER (SABER), a replay-free framework that enables controlled backward transfer in prompt-based continual learning. SABER explicitly determines when backward refinement is beneficial using complementary task-correlation criteria based on prompt-gradient geometry and loss-distribution similarity, and how to perform refinement safely by restricting updates to non-interfering directions in the prompt parameter space. We provide theoretical guarantees showing that SABER’s backward refinements are interference-free and induce non-increasing loss under mild conditions. Extensive experiments across multiple continual learning benchmarks with different task orders, and diverse pretrained backbones, including T5-Large, LLaMA, and Qwen, demonstrate that SABER consistently achieves positive backward transfer while maintaining strong overall average performance.

Abstract:
Lowprecision arithmetic is pervasive in neural network training and deployment, yet its effect on prediction confidence, not just accuracy, remains unexamined. We show that the softmax function amplifies logit-space quantization errors in an input-dependent manner: confidence distortion scales with the product of precision-dependent error bound \epsilon and logit norm, peaking when the model is confident but not saturated. This explains why identical models report different confidence values across precisions, a phenomenon we term Precision Split. During training, the same mechanism causes gradient underflow: when logit margins exceed a precision-dependent threshold, gradients vanish and samples silently stop contributing to learning. Since logit norm serves as a computable proxy for precision-induced risk, we propose Precision-Aware Confidence Scaling (PACS), which applies sample-adaptive temperature inversely related to this risk, with sub-one-percent overhead and no full-precision computation required. On ImageNet with mixed-precision ResNet-50, PACS reduces Expected Calibration Error from 5.82% to 1.92% while maintaining accuracy, with consistent improvements across architectures, precision formats, and modalities.

Abstract:
Geometric analysis fundamentally distinguishes between extrinsic and intrinsic perspectives. The dominant paradigm in current 3D representation learning relies on either extrinsic spatial structures or highlevel semantics, struggling to capture the essence of shape identity and underlying manifold topology. To bridge this gap, we introduce a novel 3D representation learning paradigm, namely PRISM, for Pre-training, which learns isometric embeddings by Recovering the Intrinsic Surface geodesic Metric. PRISM incorporates a topology-enforcing objective that explicitly constrains the structure of latent space, alongside a specialized two-stage training recipe mitigating sample imbalance inherent in the distribution of geodesic distances. Experiments demonstrate that our approach shows satisfactory accuracy, robustness, and high efficiency in geodesic distance prediction and achieves superior performance across diverse downstream tasks, including shape recognition, surface parameterization, and non-rigid correspondence. Our code will be made publicly available.

Abstract:
Multitoken prediction (MTP) architecture is widely adopted in LLMs. MTP blocks can be appended to the tail of model to predict additional tokens. However, when training with pipeline parallel, MTP leads to more pipeline bubbles and deteriorates the pipeline efficiency. Based on in-depth analysis of MTP architectures and loss functions, we have identified the parallel nature of the MTP blocks, and leverage it for superior pipeline scheduling. We propose AdaHC, an adaptive pipeline scheduling framework for accelerating LLMs training with MTP block(s). AdaHC splits the output heads into chunks and reassembles the chunks to generate balanced pipeline stages, and performs adaptive activation forwarding to preserve the numerical equivalence. Experimental results show that AdaHC improves the training throughput of SOTA LLMs with diverse MTP configurations by 1.35× on average. This work paves a new direction for practical pipeline training.

Abstract:
It is increasingly common in machine learning to use learned models to label data and then employ such data to train more capable models. The phenomenon of weakto-strong generalization exemplifies the advantage of this two-stage procedure: a strong student is trained on imperfect labels obtained from a weak teacher, and yet the strong student outperforms the weak teacher. In this paper, we show that the potential improvement is substantial, in the sense that it affects the scaling law followed by the test error. Specifically, we consider students and teachers trained via random features ridge regression (RFRR). Our main technical contribution is to derive a deterministic equivalent for the excess test error of the student trained on labels obtained via the teacher. Via this deterministic equivalent, we then identify regimes in which the scaling law of the student improves upon that of the teacher, unveiling that the improvement can be achieved both in bias-dominated and variance-dominated settings. Strikingly, the student may attain the minimax optimal rate regardless of the scaling law of the teacher---in fact, when the test error of the teacher does not even decay with the sample size.

Abstract:
Causal inference typically assumes centralized access to individuallevel data. Yet, in practice, data are often decentralized across multiple sites, making centralization infeasible due to privacy, logistical, or legal constraints. We address this problem by estimating the Average Treatment Effect (ATE) from decentralized observational data via a Federated Learning (FL) approach, allowing inference through the exchange of aggregate statistics rather than individual-level data. We propose a novel method to estimate propensity scores via a federated weighted average of local scores using Membership Weights (MW), defined as probabilities of site membership conditional on covariates. MW can be flexibly estimated with parametric or non-parametric classification models using standard FL algorithms. The resulting propensity scores are used to construct Federated Inverse Propensity Weighting (Fed-IPW) and Augmented IPW (Fed-AIPW) estimators. In contrast to meta-analysis methods, which fail when any site violates positivity, our approach exploits heterogeneity in treatment assignment across sites to improve overlap. We show that Fed-IPW and Fed-AIPW perform well under site-level heterogeneity in sample sizes, treatment mechanisms, and covariate distributions. Theoretical analysis and experiments on simulated and real-world data demonstrate clear advantages over meta-analysis and related approaches.

Abstract:
Various 3D modalities have been proposed for highprecision imitation learning tasks to compensate for the short-comings of RGB-only policies. Modalities that explicitly represent positions in Cartesian space, such as most point cloud encoder architectures, have an inherent advantage over purely image-based ones, since they allow policies to reason about geometry. Despite the effectiveness of such architectures, a number of hybrid 2D/3D architectures have been proposed in the literature, indicating that this performance can often be task-dependent. We hypothesize that this discrepancy may be due to the spectral bias of neural networks towards learning low frequency functions, which especially affects architectures conditioned on slow-moving Cartesian features. We thus propose to use a parametric projection to map point clouds from Cartesian space into high-dimensional Fourier space when using a point cloud encoder. We experimentally validate the use of these Fourier features on challenging manipulation tasks from the RoboCasa and ManiSkill3 benchmarks, and on a real robot setup. Despite their simplicity, we find that Fourier features provide robust and significant benefits across diverse encoder architectures and tasks. These results indicate that Fourier features let policies leverage geometric details more effectively than Cartesian features, showing their potential as a general-purpose tool for point cloud-based imitation learning. The overview and demos are available on our project page: https://fourier-il.github.io/fourier-il.

Abstract:
Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing VisionLanguage-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) effectively captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture-of-RL-residuals. DyGRO-VLA enables the RL optimizer to exploit task-relevant latent information while strategically mitigating adverse interference on the learned representations throughout the optimization process. We evaluate our approach on LIBERO, RoboTwin2 benchmarks, and further validate it on real world, demonstrating consistent improvements over strong baselines under multi-task training and distribution shift.

Abstract:
Federated Learning (FL) faces significant challenges due to domain heterogeneity, where data from different clients exhibit substantial statistical shifts that hinder the generalization of the global model. Although existing methods attempt to mitigate this by exchanging class prototypes, they fall short by representing an entire class's complex distribution with a single point. This oversimplification disregards the rich structural information within the data, especially across diverse domains. To address this limitation, we propose a paradigm shift from pointbased representation to structure-based knowledge transfer. We introduce Federated Manifold Learning (FML), a novel framework that leverages perceptual manifolds—the intrinsic geometric structures of classes in the feature space—as rich knowledge carriers. In FML, clients transmit compressed manifolds, which are adaptively fused on the server using an attention-based Manifold Mutual Learning (MML) mechanism. This process enables domain-specific structures to learn from each other, creating a unified yet flexible global convergence target. Manifold-guided local training, enforced by a manifold approximation loss and a separation loss, further aligns local models with this global structure. Extensive experiments on the Digits and Office31 benchmarks demonstrate that FML substantially outperforms state-of-the-art methods, achieving accuracy improvements of up to 6.48%.

Abstract:
The inherent sparsity, lack of structure, and rotation sensitivity of point clouds often lead to high computational and parameter cost in robust feature learning. To address these problems, we present QPoint, a lightweight framework that leverages robust quaternion feature learning. QPoint incorporates a QuaternionEnhanced local perception module that uses learnable rotations to stabilize local features against geometric transformations, and a Quaternion global attention mechanism that employs quaternion similarity to capture global geometric context with inherent rotation invariance. Extensive experiments show that QPoint achieves top performance across multiple tasks. It achieves excellent 95.0%, 93.9%, and 92.1% on the challenging ScanObjectNN variants (OBJBG, OBJONLY, PBT50RS), 94.7% overall accuracy on ModelNet40, and 87.0% instance mIoU on ShapeNetParts. Furthermore, QPoint exhibits superior generalization in few-shot learning scenarios. Crucially, it accomplishes this with extremely minimal parameter and computational requirements, establishing a strong and efficient baseline for point cloud processing. Our source code is in the supplementary material and will be released to the public.

Abstract:
While Reinforcement Learning from Human Feedback (RLHF) is the standard paradigm for aligning large language models with human preferences, its effectiveness in pluralistic settings has been called into question. Notably, recent work by Golz et al. (2025) demonstrated that the distortion — defined as the multiplicative gap between the average user utility of the RLHF policy and the optimal average utility — can scale exponentially with the BradleyTerry temperature parameter \beta when users have heterogeneous preferences. In this work, we present a fine-grained analysis of the distortion of RLHF with reward clipping and demonstrate that such exponential degradation is not fundamental property of the algorithm but rather a consequence of distribution mismatch between the distribution generating preference data (\mu) and the KL reference policy (\pi_\mathrmref). We establish tight upper and lower bounds on the distortion of RLHF across multiple regimes of the KL regularization strength. We show that in a representative regime, under the Bradley–Terry model, the distortion is \tilde\Theta(\beta B), where \beta denotes the temperature parameter and B is an upper bound on the log density ratio between \mu and \pi_\mathrmref. As a consequence, when there is no distribution mismatch (i.e., \mu = \pi_\mathrmref), RLHF achieves the optimal distortion of O(\beta) up to a constant.

Abstract:
Generating realistic adversarial examples for tabular data remains challenging due to heterogeneous feature types and asymmetric interfeature dependencies. Existing approaches typically rely on handcrafted constraints or undirected similarity criteria to delimit the feasible attack region, which often fail to capture the structural dependency governing tabular generation. Consequently, standard attacks typically produce perturbations that are statistically likely yet semantically inconsistent and prone to optimization stagnation via gradient masking. To address this, we propose LCSA, a white-box framework that formulates adversarial generation as optimization over structurally admissible perturbations. LCSA leverages an ensemble of heterogeneous neural Structural Causal Models to infer dependencies and introduces a structure-aware ripple mechanism. Unlike attacks that perturb features in isolation, this mechanism propagates updates downstream, acting as a structural preconditioner that conditions gradient flow to mitigate masking effects. Extensive experiments demonstrate that LCSA outperforms state-of-the-art baselines in 45 of 50 evaluated configurations, yielding adversarial examples with superior structural consistency and transferability.

Abstract:
The exponential growth in the parameter scale of Large Language Models (LLMs) has precipitated an urgent demand for efficient compression techniques to facilitate practical deployment. To address this challenge, lowrank decomposition based on Singular Value Decomposition (SVD) offers a principled, hardware-friendly pathway for compressing LLMs without retraining. However, existing training-free approaches predominantly rely on uniform rank allocation, implicitly assuming homogeneous redundancy across the model depth and thereby neglecting the inherent non-uniformity of representational evolution. To bridge this gap, we introduce CGSVD, a \ulineCascaded \ulineGranular \ulineSingular \ulineValue \ulineDecomposition framework that leverages a dual-level non-uniform allocation strategy to maximize semantic preservation. Specifically, we quantify inter-layer significance via angular distance and assess intra-layer compressibility through spectral entropy, enabling precise identification of critical architectural components. Furthermore, we propose an Iterative Residual Filling (IRF) mechanism to bridge the parameter gap caused by integer-rank truncation and ensure strict adherence to global compression targets. Extensive experiments on representative LLM families ranging from 3B to 13B parameters verify the superiority of our approach. Notably, under a 30% compression ratio on the LLaMA3.1-8B model, CGSVD achieves a remarkable average zero-shot accuracy boost of 6.08% and reduces perplexity by 33.39 compared to the baseline. We release the code\footnoteThe code is available at: \urlhttps://anonymous.4open.science/r/CGSVD-BD6E. to facilitate future research.

Abstract:
Although Multimodal Large Language Models have made remarkable progress, they still struggle with longvideo understanding due to the massive memory footprint of KV Caches. Exsiting methods often resort to disjoint retrieval or attention-based static reduction to achieve compression. However, these methods disrupt temporal continuity and ignore the varying information density across network layers. In this work, we reveal that memory allocation should mirror layer-wise semantic density, rather than adhering to a uniform budget. To this end, we introduce EAKV, a training-free entropy-driven adaptive KV compression framework that leverages attention entropy to adaptively allocate compression budgets, selectively preserving critical tokens while distilling redundant contexts into compact contextual anchors, thereby achieving granular memory allocation proportional to semantic density. Extensive experiments on four benchmarks demonstrate that EAKV surpasses existing methods across varying model scales with improvements ranging from 1.5% to 4.8%.

Abstract:
Generative AI is reshaping healthcare, yet most existing advances rely on hospitalgrade devices, which limits their accessibility and potential for health management outside clinical settings. With the proliferation of portable devices and telemedicine, healthcare is shifting toward home-based Diagnosis-It-Yourself (DIY) care. Despite this promise, several distinctive challenges remain: (i) home-collected data are heterogeneous, exacerbated by the absence of standardized large-scale datasets; (ii) models require adaptation to variable task demands and evolving individual conditions; (iii) the broad spectrum of home care tasks lacks a unified benchmark for systematic evaluation. In this paper, we presentDIYHealth Suite, a comprehensive framework designed to address these challenges through a tailored dataset, model, and benchmark. We first curateDIYHealth-900K, a large-scale multimodal dataset capturing diverse real-world home care scenarios. Building on this, we proposeDIYHealthGPT, an adaptive foundation model for home-based health management, powered by the novel Hybrid Hyper Low-Rank Adaptation technique. Finally, we establishDIYHealthBench, the first benchmark to evaluate foundation models on home care tasks. Extensive experiments demonstrate that DIYHealthGPT delivers state-of-the-art performance over both general-purpose and medical-specific baselines on 11 home care tasks in both open-QA and closed-QA settings, laying the groundwork for the next generation of personalized health management at home.

Abstract:
Learning from demonstration can be an effective way to teach robots taskoriented policies. However, in an interactive setting when demonstrations are limited by time or other budgetary constraints, it is challenging to find those that fix the learner's (remaining) errors. This is especially difficult for novice teachers: they may provide task-valid trajectories, often these fail to meaningfully improve the policy due to their lack of knowledge of learning mechanisms internal to the robot. This paper introduces CLASP (Collaborative Learning with Anchored State-space Partitions), which summarises the teaching process as a compact map of behavioural regions anchored in the teacher's own demonstrations. The map connects task failure to actionable changes to demonstrations by indicating what is going wrong in an intuitive way. It also enables difficulty-aware training that emphasises regions where learning is failing. Across diverse benchmarks, CLASP improves success by up to 20% over offline and interactive baselines under the same demonstration budget, improves robustness under distribution shift by 14–20%, and preserves behavioural diversity.

Abstract:
Hyperspectral image change detection (HSICD) suffers from severe annotation scarcity and complex change patterns, which fundamentally limit the effectiveness of directly fine-tuning pre-trained foundation models. Although semi-supervised learning provides a promising direction, existing approaches mainly rely on confidence-based pseudo-label selection, leading to limited data diversity or severe error propagation. In this paper, we propose Influence-Aware Semi-supervised Fine-tuning (IA-SFT), a novel framework that evaluates the influence of pseudo-labels on model decision behavior to identify truly valuable supervision signals. Instead of confidence-based selection, IA-SFT evaluates each low-confidence pseudo-label by measuring its impact on labeled data, enabling reliable filtering of high-value pseudo-labels with minimal noise. To further adapt foundation models to HSI-CD, we design an Adaptive Fusion Change Decoder (AFCD) that jointly models global semantic consistency and local change details. Extensive experiments on three benchmark datasets demonstrate that IA-SFT consistently improves pseudo-label quality and detection performance, achieving superior accuracy compared to state-of-the-art methods. Additional analyses validate the transferability of IA-SFT when integrated into different frameworks in a plug-and-play manner. Code will be released.

Abstract:
Despite significant progress in multimodal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid- and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low, mid, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.

Abstract:
Scene text editing aims to modify text in a target region of an image while preserving its background style and texture. Existing methods rely solely on image background information while neglecting the visual details of target regions, which discards stylistic features in the original text and essentially degrades the task to text rendering. Moreover, the conditions imposed by pretrained glyph encoder limit the scope of editable text. To address these issues, this paper proposes a self-prompting scene text editing method, which constructs style and glyph prompts directly from the original image without additional style or glyph encoders. We employ a two-stage training strategy, where the diffusion transformer is first trained on large-scale self-supervised datasets and subsequently refined with a small set of paired images. By leveraging the in-context learning capability of FLUX-Fill, it achieves open-vocabulary and style-consistent text editing. Experimental results on various languages demonstrate that our method achieves the state-of-the-art performance in both text accuracy and style consistency.

Abstract:
Active 3D reconstruction relies on active view selection to maximize reconstruction fidelity under limited capture budgets. However, most existing methods rely on surrogate signals—such as parameter uncertainty or geometric heuristics—which are often misaligned with the ultimate goal: the fidelity of rendered predictions. We propose GOPRE, a goal-oriented next-best-view selection framework that explicitly targets information gain in the prediction space. Specifically, we formulate the objective as maximizing the reduction of the average marginal predictive entropy over a user-specified target view manifold. GO-PRE supports interactive goal specification and yields an efficient acquisition rule that enables real-time computation of information gain. Extensive experiments across benchmarks demonstrate that GO-PRE consistently improves active reconstruction performance and provides more reliable uncertainty quantification compared to state-of-the-art methods.

Abstract:
Multiview subspace clustering (MvSC) has shown remarkable potential in exploring underlying structures of highdimensional data. However, existing MvSC methods still suffer from two shortcomings: (1) the commonly use of convex low-rank approximations inadequately capture high-order correlations across views, while sensitivity to noise and outliers degrades clustering performance, and (2) they lack the ability to preserve global correlations and local geometric patterns simultaneously. To address these issues, we propose a novel nonconvex regularized MvSC model with deep prior, which not only accurately characterizes the intrinsic low-rank structure and suppresses the effect of outliers, but also preserves local structural properties through deep networks. By mathematically analyzing the optimal solution of the optimization problem in our proposed model, we develop an efficient ADMM-based algorithm with provable convergence guarantees to solve it. Extensive experiments on various datasets demonstrate the superiority of the proposed model.

Abstract:
Even after decades of advances in neural network training, the inherent robustness challenge remains open. While the sensitivity to adversarial perturbations is understandable given their intentional learning, the most surprising fact is the vulnerability to natural corruptions. Moreover, the reason for this inherent vulnerability remains unknown, and it is not limited to traditional CNNs but also applies to current models, including transformers and large foundation models. For the first time, through this work, we observe that natural corruptions often collapse the network's internal feature space into a highentropy state, causing predictions to rely on a small subset of fragile features. Inspired by this, we propose a simple yet effective entropy-guided fine-tuning framework, Dem-HEC, that strengthens corruption robustness while maintaining clean accuracy. Our method generates high-entropy samples within a bounded perturbation region. It applies it to both clean and high-entropy samples, combined with knowledge distillation from a teacher snapshot, ensuring stable predictions. The proposed Dem-HEC is effective across datasets ranging from small to large-resolution, from pure CNNs to transformers, and to large foundation models, including DinoV3. The proposed approach outperforms the state-of-the-art (SOTA) models not only in improving robustness but also in retaining or boosting clean accuracy.

Abstract:
Realworld datasets across image and text domains are often characterized by skewed class distributions and noisy annotations, which jointly degrade model performance, particularly on minority classes. Among existing solutions, active learning offers an effective and efficient paradigm by selectively querying the most informative and balanced samples for annotation. We propose an innovative active learning framework that mitigates class imbalance and selects the most informative samples to annotate. Leveraging foundation model priors, our algorithm enables imbalance-aware co-decisions between foundation model and small model to tackle noisy and imbalanced labels across various domains. We introduce the first study to systematically explore active learning under the dual challenges of label noise and class imbalance across image and text domains. Extensive experiments on imbalanced datasets demonstrate that our method achieves substantial annotation savings—over 50% compared to the best active learning baseline—while preserving performance and robustness to label noise.

Abstract:
Reinforcement learning with verifiable rewards can improve LLM reasoning, but learning is sampleinefficient under sparse terminal rewards. Prior work mitigates this by adding natural language critiques, yet it typically treats critique generation as fixed or auxiliary, so correct-sounding feedback may not translate into higher verified reward. We argue that natural language actor-critic for reasoning is inherently bilevel: the usefulness of the critique is defined by its downstream effect on the actor after adaptation. We formalize this coupling as a Stackelberg bilevel program and derive Bilevel Natural Language Actor-Critic (Bi-NAC), which jointly trains a critic to generate reward-improving feedback and an actor to exploit it. Across reasoning benchmarks, Bi-NAC improves sample and parameter efficiency over RL baselines and fixed-critic feedback methods. We perform experiments on MATH-500, MBPP, and GPQA demonstrating that Bi-NAC significantly enhances parameter and sample efficiency, enabling smaller models to outperform larger baselines. Specifically, our 2B model consistently outperforms the larger 3B GRPO baseline across all tasks (e.g., 46.6% vs. 41.4% on MATH-500), while our 6B model surpasses the 7B GRPO baseline (e.g., 49.3% vs. 43.6% on GPQA). These results show that aligning actor and critic via bilevel formulation provides a robust and efficient alternative for solving complex reasoning tasks.

Abstract:
Selfsupervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.

Abstract:
While large language models (LLMs) have greatly advanced the functional correctness of automated code translation systems, the runtime efficiency of translated programs has received comparatively little attention. With the waning of Moore’s law, runtime efficiency has become increasingly important for program quality, alongside functional correctness. Our preliminary study reveals that LLMtranslated programs often run slower than human-written ones, and this issue cannot be remedied through prompt engineering alone. Therefore, our work proposes SwiftTrans, a code translation framework comprising two key stages: (1) Multi-Perspective Exploration, where MpTranslator leverages parallel in-context learning (ICL) to generate diverse translation candidates; and (2) Difference-Aware Selection, where DiffSelector identifies the optimal candidate by explicitly comparing differences between translations. We further introduce Hierarchical Guidance for MpTranslator and Ordinal Guidance for DiffSelector, enabling LLMs to better adapt to these two core components. To support the evaluation of runtime efficiency in translated programs, we extend existing benchmarks, CodeNet and F2SBench, and introduce a new benchmark, SwiftBench. Experimental results across all three benchmarks show that SwiftTrans achieves consistent improvements in both correctness and runtime efficiency.

Abstract:
Mamba demonstrates strong efficiency in modeling long visual sequences. However, when token reduction is applied to structurally enhanced Mamba variants, these models exhibit a severe performance collapse. We attribute this degradation to the spatially agnostic nature of existing reduction methods, which violate the twodimensional structural premise required by the selective scanning mechanism. In this work, we propose STORM, a spatial-aware token reduction framework designed to maintain structural integrity throughout the compression process. STORM reformulates reduction into a structured operation on spatial units, enforcing localized constraints to maintain both grid topology and neighborhood coherence. As a plug-and-play module, STORM equips existing reduction pipelines with explicit spatial awareness without any training. Empirical results demonstrate that STORM achieves state-of-the-art pruning accuracy across diverse vision Mamba backbones under training-free settings. Notably, STORM delivers a substantial accuracy recovery on VMamba, outperforming prior methods by up to 63.3% in top-1 accuracy. Meanwhile, STORM incurs only a 1.0% accuracy drop on PlainMamba, achieving performance comparable to ViT.

Abstract:
Shampoo is one of the leading approximate secondorder optimizers: a variant of it has won the MLCommons AlgoPerf competition, and it has been shown to produce models with lower activation outliers that are easier to compress. Yet, applying Shampoo currently comes at the cost of significant computational slowdown, due to its expensive internal operations. In this paper, we take a significant step to address this shortcoming by proposing \method (for Distributed Accelerated SHampoo), a faster implementation of Distributed Shampoo based on two main new techniques: First, we show that preconditioner blocks can be stacked into 3D tensors to significantly improve GPU utilization; second, we introduce the Newton-DB iteration and the Chebyshev polynomial approximations as novel and faster approaches for computing the inverse matrix roots required by Shampoo. Along with these algorithmic contributions, we provide a first in-depth analysis of how matrix scaling critically affects Shampoo convergence. On the practical side, our GPU-aware implementation achieves up to 4.83× faster optimizer steps compared to the well-optimized Distributed Shampoo, while Newton-DB attains the lowest validation perplexity per iteration among all tested methods.

Abstract:
We study ridge regression and ridgeregularized empirical risk minimization (ERM) under (\varepsilon,\delta)-differential privacy via output perturbation. In classical private ERM, the ridge parameter simultaneously controls statistical regularization and the estimator’s global sensitivity. Larger regularization reduces the DP noise scale but increases bias. So choosing the tuning parameter becomes a privacy--accuracy bottleneck. We propose a framework that makes these two roles explicit by decoupling regularization into (i) a statistical penalty \alpha, defining the target ridge/ERM solution, and (ii) a privacy-stabilization parameter c, used only to enforce a curvature floor and hence a tight sensitivity bound. We apply this framework to ridge regression, where c is used to boost the minimum eigenvalue of the empirical Gram matrix. We derive an explicit bias--variance--DP-variance risk decomposition and characterize optimal (\alpha,c) in several regimes, yielding sharp tuning guidance and improved accuracy relative to single-parameter regularization. Finally, we extend the same decoupling principle to general ridge-regularized ERM. We support the theory with simulations.

Abstract:
Modern neural classifiers can achieve remarkable predictive performance, yet often suffer frommiscalibration. In this paper, we introduce a unified calibration framework applicable to arbitrary distributionbased classifiers. The proposed calibration objective guarantees amonotone Probably Approximately Individually Calibrated (mPAIC)predictor, which theoretically implies the properties of aProbably Approximately Calibrated Classifier (PACC)with explicit error bounds. To enable stable and effective optimization, we further devise aDecoupled Dual-Stream Optimization (DDSO)strategy with gradient detachment to reconcile discriminative representation learning and continuous calibration. Notably, our framework bridges calibration paradigms, supporting flexible deployment either as an end-to-endpre-calibrationobjective or as a lightweightpost-calibrationadapter. Extensive experiments across nine real-world datasets demonstrate that our approach consistently outperforms strong baselines, achieving superior performance on bothaccuracyand multi-levelcalibration.

Abstract:
Diffusion Language Models (DLMs) have emerged as a flexible alternative to autoregressive (AR) models. They can decode tokens in any order, but the generation quality critically depends on the decoding strategy. Existing approaches predominantly rely on local heuristics, such as confidence or entropy, which may fail to capture sequencelevel dependencies and the semantics in the context. To solve this problem, we propose Latent-aware Unmasking Guidance Search (\LUGS), a novel decoding framework that leverages the model's internal hidden states to guide the unmasking process. By incorporating latent-aware scores to compensate for the limitations of local heuristics such as confidence or entropy, \LUGS improves the model's performance. Extensive experiments on various downstream tasks demonstrate that our approach consistently outperforms existing baselines on LLaDA-8B-instruct and LLaDA-1.5 models. In Science and Reason tasks, \LUGS improved performance by more than 1% on both base models. And \LUGS obtains an average improvement of 3.5% in code generation. Remarkably, \LUGS outperforms the beam search baseline by more than 5% on average using LLaDA-8B-Instruct on code tasks. These results highlight the potential of latent-aware guidance for advancing controllable and high-quality generation.

Abstract:
We introduce SeisMark, a largescale open benchmark designed to bridge the gap between verifiable ground truth and realistic texture in 3D seismic fault detection. Using a novel pipeline merging procedural geology with diffusion-based synthesis, we produce domain-realistic (survey-specific) textured volumes that expose significant brittleness in existing models masked by simplified physics data. Experiments demonstrate that SeisMark acts as a rigorous discriminator, distinguishing robust modern architecture from legacy model that suffers performance collapse under realistic domain shifts. We release this benchmark to the community to serve as a verifiable standard for developing trustworthy, deployment-ready AI for safety-critical subsurface applications.

Abstract:
Characterizing the Hamiltonians of continuousvariable (CV) quantum systems remains a fundamental challenge due to the infinite-dimensional Hilbert space and the presence of unbounded operators. Existing learning protocols are often restricted to low-order Hamiltonian structures and can be sensitive to experimental noise, leaving generic multi-mode settings largely unresolved. In this work, we introduce the Displacement-Random Unitary Transformation (D-RUT), an experimentally accessible protocol for learning the coefficients of generic multi-mode bosonic Hamiltonians of arbitrary finite order. We prove that D-RUT achieves Heisenberg-limited scaling while remaining robust to state preparation and measurement (SPAM) errors. To extend the method efficiently to multi-mode systems, we develop a hierarchical coefficient recovery strategy that yields superior statistical efficiency compared to existing simultaneous estimation schemes. Importantly, we further show that our framework applies naturally to Hamiltonian coefficient learning in the first-quantized formulations, substantially broadening its scope beyond prior CV approaches. Numerical experiments validate the predicted Heisenberg scaling our approach in both single- and multi-mode nonlinear systems.

Abstract:
Despite rapid progress in 3D generative models, producing productiongrade 3D face assets from a single image remains challenging. To reconstruct facial micro-structures and fine-grained multiview-consistent textures, this work presents a two-stage framework named SAMT for monocular 3D avatar generation and texture synthesis. Specifically, a latent 3D diffusion model for facial mesh generation is pretrained and then further adapted to generate high-quality facial geometry through large-scale domain-specific fine-tuning on 35K curated 3D avatar models. Subsequently, a multiview-aware texturing strategy is proposed to texture the generated facial mesh. Its core idea lies in incorporating a multi-view facial prior, along with mesh geometry, to guide a 2D texturing diffusion for cross-view consistent and mesh-aligned texture synthesis. Extensive experiments demonstrate that SAMT outperforms existing approaches by producing more structured and detailed facial geometry, along with improved fine-grained appearance coherence.

Abstract:
Estimating Heterogeneous Treatment Effects (HTE) from observational data is essential in fields such as healthcare and policymaking, where randomized experiments are often impractical. While representation learning-based methods have shown promise, recent studies suggest that tree-based approaches may offer superior performance on tabular data, particularly in the presence of uninformative features. We introduceGRAdient-basedCausal treeEnsembles (GRACE), a novel tree-based architecture for HTE estimation that incorporates multi-way, oblique, and soft splits, enabling end-to-end training via backpropagation. GRACE can be seamlessly integrated into existing models as a replacement for fully-connected neural network layers. Across diverse benchmarks involving binary and non-binary treatment settings, GRACE consistently surpasses neural network and tree-based baselines, often by a substantial margin. We further analyze GRACE as an extension of fully-connected neural network layers and conduct ablation studies to isolate and quantify the contribution of each architectural component to the improvement in performance. These results position GRACE as a powerful new foundation for flexible, robust, and accurate HTE estimation.

Abstract:
\em De novo peptide sequencing is a pivotal technique that directly reconstructs amino acid sequences from tandem mass spectrometry (MS/MS) data; it enables the identification of novel proteins and variants absent from reference databases. Previous methods are typically based on autoregressive (AR) decoding or oneshot generation. The AR-based methods conflict with the bidirectional and globally constrained nature of MS/MS evidence and inevitably accumulate errors, while one-shot generation does not explicitly enforce physical constraints, failing to produce chemically valid and reliable peptides in a single pass. Accurate sequencing necessitates reasoning over the entire peptide simultaneously, enabling iterative self-correction under global constraints. To this end, we introduce PhysNovo, a novel paradigm that harnesses discrete diffusion to enable simultaneous global reasoning and iterative refinement. Specifically, PhysNovo reformulates sequencing as a physically mass-constrained reasoning process by embedding a knapsack-based feasibility kernel to enforce exact precursor mass consistency. By conditioning the diffusion process on global spectral context, PhysNovo supports abductive reasoning where bidirectional evidence is exploited to iteratively resolve local inconsistencies and ensure physically valid predictions. PhysNovo achieves state-of-the-art performance, exceeding baselines by over 2% in precision, with larger gains on out-of-distribution data.

Abstract:
Centralized multimodal learning commonly compresses language, acoustic, and visual signals into a single fused representation for prediction. While effective, this paradigm suffers from two limitations: modality dominance, where optimization gravitates towards the path of least resistance, ignoring weaker but informative modalities, and spurious modality coupling, where models overfit to incidental crossmodal correlations. To address these, we propose Group Cognition Learning (GCL), a governed collaboration paradigm that applies a two-stage protocol after modality-specific encoding. In Stage 1 (Selective Interaction), a Routing Agent proposes directed interaction routes, and an Auditing Agent assigns sample-wise gates to emphasize exchanges that yield positive marginal predictive gain while suppressing redundant coupling. In Stage 2 (Consensus Formation), a Public-Factor Agent maintains an explicit shared factor, and an Aggregation Agent produces the final prediction through contribution-aware weighting while keeping each modality representation as a specialization channel. Extensive experiments on CMU-MOSI, CMU-MOSEI, and MIntRec demonstrate that GCL mitigates dominance and coupling, establishing state-of-the-art results across both regression and classification benchmarks. Analysis experiments further demonstrate the effectiveness of the design.

Abstract:
Supervised FineTuning (SFT) is a critical step for adapting Large Language Models (LLMs) to specialized domains, often serving as an initialization for subsequent reinforcement learning (RL). However, SFT can overfit a small set of expert data, harming generalization and eroding prior knowledge. This can limit downstream RL, which benefits from a strong, generalizable initialization for exploration. Here, we demonstrate that prior knowledge degradation primarily results from tokens in the expert data to which the base model assigns low probability. Specifically, these low-probability tokens represent a significant deviation from the model’s current prior knowledge. Due to the nature of the log-likelihood objective, they produce larger gradient magnitudes, which speed up adaptation to the new data but degrade generalization. In this paper, we study the token-wise clipping strategy, a commonly used trust-region method for bounding per-token updates. We find that it reshapes token-level learning priorities, promoting more progressive adaptation that fits the new data while preserving general abilities. Compared with standard SFT, clipping low-probability tokens reduces out-of-distribution forgetting by 11.54% and improves final RL performance by 7.09% across the agentic benchmarks. Moreover, latent-space analysis shows smaller representational drift under clipping, indicating that it provides a generalizable initialization.

Abstract:
Human episodic memory supports the retrieval of experiences that unfold over extended timescales, yet the computational mechanisms underlying this ability remain debated due to the difficulty of mechanistic accessibility in longterm memory experiments in humans. Long-context LLMs may offer promising ways to reveal plausible computational mechanisms that drive this type of retrieval. Here, we investigate whether and, if so, how LLMs capture core behavioral signatures of humans of a central aspect of episodic memory via a temporal order memory task. Using a new dataset of human behavior based on a full-length novel, we show substantial similarities between the human and model performances on the temporal order memory task. We next perform long-context mechanistic interpretability analyses to reveal the underlying mechanisms in the model, and find that model performance relies on a one-dimensional temporal code that is reinstated during retrieval by a single time-reinstatement attention head. These findings support temporal context reinstatement as an important mechanism for episodic-like temporal-order memory in LLMs, offering new insights into how temporal aspects of long-term episodic memory may be instantiated in both artificial and biological systems.

Abstract:
Conventional practice assumes that online reinforcement learning for flowmatching models requires sampling full denoising trajectories to compute rewards. This assumption underlies methods such as Group Relative Policy Optimization (GRPO), where the policy must traverse the entire reverse process before receiving a delayed, trajectory-level reward. We observe, however, that while such terminal rewards provide feedback, they are neither necessary nor optimal for effective learning. In this work, we introduce iGRPO (Instant-reward GRPO), which replaces GRPO's full-trajectory rollouts with a single-step mapping that assigns rewards instantly at each denoising step. Because the flow matching model behaves differently across timesteps, our step-local instant rewards which are inherently time-dependent, overcome prior approaches that rely on a single, time-independent terminal reward. By evaluating each action locally rather than relying on a final terminal score, iGRPO eliminates the need for multi-step SDE rollouts and offers more precise credit assignment. Across standard benchmarks, iGRPO converges 10.2× faster than FlowGRPO while achieving higher final alignment quality. We hope this work motivates more efficient and scalable online RL methods for flow-matching generative models.

Abstract:
Partial identification provides informative causal guarantees when point identification is impossible, but existing approaches based on optimal transport (OT) become computationally and statistically intractable in highdimensional settings. This limitation is particularly severe when both potential outcomes and confounders are high-dimensional, where classical OT-based bounds suffer from the curse of dimensionality and unfavorable convergence rates. To address this challenge, we propose a novel estimator that decomposes the transport problem into a low-dimensional signal subspace and a high-dimensional residual subspace. Unlike existing projection-based methods that discard residual information, we recover the residual transport energy using the Sliced Wasserstein distance, which is computationally efficient and robust to high dimensions. We establish interpretable conditions controlling the approximation gap based on residual structure and provide a data-driven rule for signal dimension selection. Empirical results show that our estimator consistently outperforms projection-only baselines by recovering lost transport energy, yielding more informative causal bounds while remaining computationally tractable in high dimensions.

Abstract:
Pixellevel semantic segmentation of 4D light field (LF) data remains a considerable challenge, primarily due to the conflict between modeling complex spatial-angular dependencies and maintaining linear computational efficiency. Current linear models like VRWKV offer scalability but often fail to capture intrinsic geometric structures, leading to the structural collapse of Epipolar Plane Image (EPI) cues. To overcome these limitations, we propose E²I-VRWKV, an EPI-Enhanced and Interaction-aware network that generates high-quality segmentation maps by embedding explicit geometric priors into a linear-complexity backbone. Specifically, we introduce the Light Field Epipolar-Aware Cross-Modal Attention (LF-ECMA) block. The key innovation lies in the integration of an EPI Geometric Prior Generator, which explicitly extracts disparity-sensitive biases to enforce geometric consistency, and a Geometric-Context Gating (GC-Gate) mechanism. This mechanism functions as a geometrically modulated aperture to dynamically calibrate the fusion of spatial and angular manifolds. Experiments on the UrbanLF benchmark demonstrate that our method outperforms other state-of-the-art (SOTA) methods, achieving 86.55% mIoU on UrbanLF-Real while maintaining a superior balance between accuracy and linear efficiency.

Abstract:
Modeling decisionmaking outside of controlled environments requires accounting for asynchronous, exogenous signals, such as notifications or algorithmic feeds, that dynamically alter user response times. Standard Drift-Diffusion Models (DDM) become analytically intractable when drift rates vary continuously with time. In this paper, we derive a closed-form analytical approximation for the first-passage time distribution of a single-boundary DDM with time-dependent drift, valid in the high-threshold regime. The main result allows us to analytically study the optimal timing of external signals to maximize the probability of a user response within our approximation framework. To evaluate our response time model, we conduct an extensive empirical comparison with state-of-the-art methods for user watch-time prediction and evaluation in simulated environments.

Abstract:
While Conformal Prediction (CP) offers a principled framework for producing prediction sets with statistical guarantees, prior methods suffer from critical limitations in both calibration validity and score discriminability, resulting in violated coverage guarantees and excessively large prediction sets. To address these pitfalls, we propose Conformal Path Reasoning (CPR), a trustworthy KGQA framework with two key innovations. First, we perform querylevel conformal calibration over path-level scores, preserving the exchangeability while generating path prediction sets. Second, we introduce the Residual Conformal Value Network (RCVNet), a lightweight module trained via PUCT-guided exploration to learn discriminative path-level nonconformity scores. Experiments on benchmarks show that CPR significantly improves the Empirical Coverage Rate by 34% while reducing average prediction set size by 40% compared to conformal baselines. These results validate the efficacy of CPR in satisfying coverage guarantees with substantially more compact answer sets.

Abstract:
Multimodal spatio-temporal forecasting underpins many real-world applications but remains challenging due to the complex and evolving interactions across modalities and time steps. Moreover, the lack of interpretability in existing models limits their reliability in safety-critical scenarios. In this paper, we present CausalX, a unified and causally interpretable plug-and-play model for multi-modal spatio-temporal forecasting. CausalX achieves interpretability by learning a dynamic causal graph across modalities and time, whose edge weights quantify causal attribution strength, and are further refined by a diffusion-based generative process guided by structural priors. To overcome the absence of ground-truth causal structures, CausalX aggregates multi-source constraints from causal analysis techniques and a variational autoencoder, spanning predictive, temporal, interventional, and generative aspects to jointly learn a more comprehensive causal graph. Extensive experiments on real-world forecasting tasks, including pedestrian trajectory prediction and tropical cyclone forecasting, demonstrate that CausalX achieves superior accuracy while producing interpretable causal graphs. CausalX is modular, architecture-agnostic, and generalizable, offering a new perspective on bridging causal inference and spatio-temporal forecasting.

Abstract:
Vision Language Models (VLMs) achieve strong performance across multimodal tasks but remain vulnerable to universal adversarial perturbations (UAPs). Existing UAP methods mainly operate on the visual modality, overlooking structured textual semantics and cross-modal interactions, which limits their ability to disrupt alignment and generalize across tasks and model architectures. To address these limits, we proposeMulti-modal Attention Distraction Adversarial Attack (MADA-Attack)framework. We begin by conducting several insight experiments and discover that modality attention distributes differently over layers and early phase of optimization is decisive. Building on these observations, we introduce Semantic Token Manipulation (STM) to steer text-guided attention, and Fused Embedding Training (FET) to jointly optimize textual and visual embedding losses for coordinated misalignment. We further incorporate an Adaptive Data Augmentation (ADA) strategy that dynamically balances attack strength, transferability, and training efficiency. Extensive experiments demonstrate thatMADA-Attackconsistently achieves state-of-the-art performance and strong transferability while remaining computationally lightweight, with an average ASR of 82.60% and 73.42% in zero-shot classification and image captioning tasks. For the visual question answering (VQA) and I-T Retrieval task, our method exceeds the SOTA baseline by 10%. Our code will be available soon.

Abstract:
We propose a unified framework to enhance the power of online multiple hypothesis testing procedures based on evalues. While e-value-based methods offer robust online False Discovery Rate (FDR) control under minimal assumptions, they often suffer from power loss by discarding evidence that exceeds the rejection threshold. We address this inefficiency via the Sequential Control with Overshoot Refund for E-values (SCORE) framework, which leverages the inequality \mathbbI(y \ge 1) \le y - (y-1)_+ to reclaim this otherwise ``wasted'' evidence. This simple yet powerful insight yields a unified principle for improving a broad class of online testing algorithms. Building on this framework, we develop SCORE-enhanced versions of several state-of-the-art procedures, including SCORE-LOND, SCORE-LORD, and SCORE-SAFFRON, all of which strictly dominate their original counterparts while preserving valid finite-sample FDR control. Furthermore, under mild assumptions, SCORE permits retroactive updates of alpha‑wealth by using the latest decision twice: first to determine its reward or loss, and then to refresh past wealth. Such a mechanism enables more aggressive testing strategies while maintaining valid FDR control, thereby further improving statistical power. The effectiveness of the proposed methods is validated through extensive simulation and real-data experiments.

Abstract:
Visionlanguage-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak real-world generalization. We present an efficient video augmentation framework that converts simulated VLA videos into realistic training videos while preserving task semantics and action trajectories. Our pipeline extracts structured conditions from simulation via video semantic segmentation and video captioning, rewrites captions to diversify environments, and uses a conditional video transfer model to synthesize realistic videos. To make augmentation practical at scale, we introduce a diffusion feature-reuse mechanism that reuses video tokens across adjacent timesteps to accelerate generation, and a coreset sampling strategy that identifies a compact, non-redundant subset for augmentation under limited computation. Extensive experiments on RobotWin 2.0, LIBERO, LIBERO-Plus, and a real robotic platform demonstrate consistent improvements in both task performance and sim-to-real generalization. For example, our method improves RDT-1B by 8% on RobotWin 2.0, and boosts \pi_0 by 5.1% on the more challenging LIBERO-Plus benchmark. Code is released in supplementary material.

Abstract:
We develop novel "empirical Bernstein" inequalities for the variance of bounded random variables. Our inequalities hold under constant conditional variance and mean, without further assumptions like independence or identical distribution of the random variables, making them suitable for sequential decision making contexts. The results are instantiated for both the batch setting (where the sample size is fixed) and the sequential setting (where the sample size is a stopping time). Our bounds are asymptotically "sharp": when the data are iid, our CI adapts optimally to both unknown mean \mu and unknown \mathbbV[(X\mu)^2], meaning that the first order term of our CI exactly matches that of the oracle Bernstein inequality which knows those quantities. We compare our results to a widely used (non-sharp) concentration inequality for the variance based on self-bounding random variables, showing both the theoretical gains and improved empirical performance of our approach. We finally extend our methods to work in any separable Hilbert space.

Abstract:
In this paper, we present the convergence analysis of the proximal Alternating Direction Method of Multipliers (ADMM) for problems with block antiupper triangular constraints. While the linear constraints can be treated separately, most analyses of ADMM and its variants predominantly regard the linear constraints as one. Hence, it relies on assumptions related to the entire constraint matrix, such as the full column rank. However, some problems with block anti-upper triangular constraints that can be solved by ADMM do not satisfy these assumptions. To fill this gap, a new assumption is proposed and used to guarantee the global convergence of the proximal ADMM for nonconvex problems. In the strongly convex setting, we also prove the global convergence of the proximal ADMM and establish the linear convergence under four different scenarios. This work extends the theoretical understanding of the multi-block ADMM to more general cases with block anti-upper triangular constraints.

Abstract:
Learning permutationinvariant functions over sets of N elements is fundamental to many deep learning applications. While sum-decomposable architectures like DeepSets theoretically offer universal approximation capabilities for such functions, existing constructive bounds suggest that the latent dimension must grow linearly with the set size, i.e., \mathcalO(N). This linear scaling poses a significant bottleneck for scalability. In this paper, we break this theoretical barrier by proving that a latent dimension of \mathcalO(\log N) suffices for the universal approximation of Wasserstein-stable permutation-invariant functions. We establish this by first showing that the covering number of the Wasserstein space scales linearly with N. Then, we show that random embeddings, specifically Random Fourier Features, with a logarithmic latent dimension to the covering number can preserve the geometry with high probability, thereby guaranteeing the existence of deterministic embeddings of the same width. This result advances the understanding of the expressivity of set-based neural network architectures.

Abstract:
Vision–language models such as CLIP embed images and text into a shared space, but still suffer from a modality gap, where image and text features cluster separately and nearest neighbors are dominated by samemodality rather than true cross-modal matches. Existing works alleviate the modality gap by strengthening cross-modal losses, post-processing embeddings or similarities, or imposing geometric regularization, but they primarily enforce global alignment and can distort local geometry, limiting gains in local ranking and zero-shot accuracy. We propose Fractal Spectral Alignment (FSAlign), which reduces the modality gap by shaping and matching the multi-scale geometry of image and text embeddings. By enforcing Ahlfors-regularity and sub-Gaussian heat kernel bounds, FSAlign constructs a shared fractal multi-scale structure for multiple modalities. This structure captures geometry across scales, from local neighborhoods to global structure, and ensures shared fractal spectral geometry across modalities. Based on this structure, we introduce a fractal spectral zeta score derived from multi-scale heat kernels and minimize the discrepancy between pairwise image–text samples to align their multi-scale neighborhoods. We theoretically demonstrate that FSAlign can guarantee the alignment of local spectral measures and global fractional Dirichlet energies.

Abstract:
Laboratory automation driven by scientific embodied agents represents a critical frontier in modern laboratories. Unlike conventional robotic domains, laboratory environments impose zerotolerance constraints on manipulation precision and collision, as minor deviations can lead to irreversible chemical hazards or equipment damage. This naturally makes the automated laboratory an ideal testbed for advancing embodied safety. However, existing benchmarks predominantly feature high-tolerance manipulation tasks where intermediate failures are largely reversible. More critically, current Vision-Language-Action (VLA) models trained via static imitation learning cannot satisfy these strict constraints. Because they merely mimic successful demonstrations, they lack the ability to recover from execution drift, leading to catastrophic compounding errors in precision-critical domains. Overcoming this limitation requires transitioning from static datasets to interactive environments that support Reinforcement Learning (RL) for dynamic error recovery. To this end, we introduce SafeLab, a generative simulation benchmark designed for the full lifecycle of safe robot learning. Grounded in a high-fidelity chemistry lab, our framework integrates an LLM engine for procedural task synthesis, an automated expert for scalable demonstration collection, and an interactive environment for continuous RL refinement. Leveraging this infrastructure, we release a dataset of 6,000+ complex trajectories to evaluate state-of-the-art VLA models. Experiments reveal that current embodied agents fail significantly under these safety constraints. In contrast, our RL post-training pipeline enables agents to learn active error correction, mitigating hazardous failures and improving success rates by 37%, thereby establishing SafeLab as a critical platform for developing reliable and safe generalist agents.

Abstract:
Natural policy gradients improve optimization by accounting for the geometry of distribution space, but their practical use is limited by the cost of estimating and inverting the Fisher matrix. We present Randomized Advantage Transformation (RAT), a method for estimating Tikhonovregularized natural policy gradients via direct backpropagation. By applying the Woodbury formula, we reformulate the regularized natural gradient as vanilla policy gradients with a transformed advantage. RAT computes this transformation efficiently via randomized block Kaczmarz iterations on on-policy mini-batches, avoiding explicit Fisher construction, conjugate-gradient solvers, and architecture-specific approximations. We provide convergence guarantees for RAT and demonstrate empirically that it matches or exceeds established natural-gradient methods across continuous and visual control benchmarks, while remaining simple to implement and compatible with various architectures.

Abstract:
Crossdomain recommendation (CDR) serves as a pivotal solution to data sparsity and cold-start problems by transferring knowledge across distinct domains. However, existing approaches predominately rely on Euclidean embedding spaces, which suffer from a fundamental geometry-distribution mismatch: real-world user-item interactions typically exhibit power-law distributions and latent hierarchical structures that flat Euclidean spaces cannot accurately represent without significant distortion. This geometric limitation not only compromises representation quality but, more critically, hinders the effective disentanglement of domain-invariant user preferences from domain-specific interests, limiting transferability in low-overlap scenarios. To bridge this gap, we introduce the Mixed-Curvature Hyperbolic Variational Auto-Encoder (HVAE), a principled framework that unifies knowledge extraction and transfer within a hyperbolic manifold. By leveraging the exponential expansion capacity of hyperbolic geometry, HVAE naturally accommodates hierarchical data structures, enabling precise disentanglement of user intents without the need for strict domain overlap constraints. Furthermore, we propose a rigorous hyperbolic Wasserstein barycenter mechanism to align invariant distributions across heterogeneous domains. Extensive experiments on large-scale industrial and public datasets demonstrate that HVAE achieves superior performance, particularly in challenging scenarios with long-tail distributions and minimal domain overlap.

Abstract:
Audio agents are increasingly deployed to execute tools from spoken requests, yet audio tool use poses challenges beyond textonly function calling: perception errors (e.g., homophones, noise, disfluencies) can corrupt entities and arguments, and natural interactions often require clarification that changes the tool-calling protocol. We introduce MFCL-Audio, a large-scale benchmark for audio function calling with 6.2K expert-verified tasks across two suites that mirror common deployments: MFCL Text Audio \(pipelined ASR\rightarrowLLM\rightarrowtools via transcripts) and MFCL True Audio \(end-to-end audio-in\rightarrowtool calls). MFCL-Audio includes controlled speech and acoustic perturbations (accent and speaking-rate variation, content disfluencies, and background noise) generated through a controllable audio synthesis/augmentation pipeline. We provide automatic grading for both function names and argument values using AST-based matching for single-turn calls and response/state-based metrics for multi-turn interactions, enabling scalable evaluation without LLM judges. Across a broad set of models, we propose a failure-mode taxonomy and analyze which speech and noise factors most strongly impact tool-calling accuracy. We release the benchmark, evaluation harness, and audio pipeline to support research on reliable speech-based agents.

Abstract:
Rapid expansion of urban areas and population growth is causing an immense increase in waste production, which demands the need for efficient and automated waste management. In this scenario, automated waste recycling (AWR) that utilizes deep learning methods to separate the recyclable waste objects may emerge as a savior to humanity. Recent deep learning approaches for AWR provide promising waste segmentation performance, however, these methods rely on large backbone networks that are inefficient for AWR systems and suffer from performance deterioration in cluttered scenes. To this end, an optimal waste segmentation network is introduced which effectively utilizes the spatial domain to capture localized structural dependencies and the spectral domain to efficiently extract global contextual relationships. This cascaded design allows the network to progressively leverage both local and global representations across complementary domains to highlight the semantic information necessary for effective segmentation of various waste objects. Furthermore, auxiliary feature enhancement module (AFEM) is introduced to enhance the target objects' boundaries and blob amplification for better segmentation in cluttered scenarios. Extensive experimentation on ZeroWasteaug, ZeroWaste-f and SpectralWaste reveals the merits of the proposed method.

Abstract:
Multivariate time series anomaly detection remains challenging as it requires the joint modeling of variable relationships and temporal dependencies. Existing methods often struggle to balance channel relationship modeling and overlook the relative importance of different variables within multivariate time series. To address this, we propose TeamWork, an asymmetric roleaware channel modeling framework that decouples variables into dominant and auxiliary roles according to their contributions to uncertainty reduction. Dominant variables drive system evolution and their deviations more strongly disrupt normal patterns, while auxiliary variables provide complementary cues. These variables with different roles are integrated through a role-aware gated interaction module. Moreover, point and subsequence anomalies can exist in multiple periodic systems, and the same anomaly type may behave differently across short- and long-period series. To capture such variations, we introduce a period-aware masked modeling mechanism. It employs multiple specialized masking mechanisms spanning short to long periods to facilitate comprehensive temporal dependency learning. Extensive experiments on multiple real-world datasets demonstrate that TeamWork achieves superior performance compared with state-of-the-art methods.

Abstract:
Patientlevel sepsis prediction in the ICU requires models that track how a patient’s condition evolves over time and integrate heterogeneous structured evidence from electronic health records. We present PathwayLLM, a trajectory-based framework that grounds prediction on temporal signals together with graph-structured and pathway-level clinical information derived from statistical dependency discovery. PathwayLLM follows a three-stage design. First, each observation window is encoded from multiple structured views, including physiological measurements, temporal dynamics, a heterogeneous patient–diagnosis–medication graph, and pathway signals constructed from discovered conditional independence structures among clinical variables. Second, these representations are provided to a pre-trained language model as auxiliary contextual embeddings so that risk prediction and evidence-conditioned text explanations can be learned jointly. Third, a Clinical Trajectory LSTM with Deterioration Attention aggregates window-level representations to highlight critical deterioration points and produce a patient-level risk score. On MIMIC-IV (15,410 ICU patients; 8.45% sepsis prevalence), PathwayLLM achieves AUROC 0.891 and AUPRC 0.724, outperforming strong time-series and pre-trained baselines. Ablation studies indicate that trajectory aggregation and structured clinical signals are key contributors, and clinician review suggests that the generated explanations are coherent, interpretable, and clinically relevant.

Abstract:
We study consumer utility maximization in an online randomorder model where strategic agents arrive sequentially. To circumvent strong impossibility results for utility maximization, we turn to the framework of learning-augmented mechanism design. Crucially, we show that the types of predictions commonly used in learning-augmented mechanism design (such as predictions of agent values or the optimal value) are not useful for utility maximization, where payments are directly at odds with the objective. Instead, we identify that a qualitatively different kind of prediction suffices: the identity of the highest-valued agent. First, we provide a deterministic truthful mechanism for our online setting by adapting offline randomized techniques. Then, we augment our mechanism with predictions. When the predictions are correct, we achieve a constant approximation to the optimal solution under full information (consistency), and even when predictions are arbitrarily bad, we guarantee a constant approximation to the best implementable solution (robustness).

Abstract:
Existing machine learning models for network simulation excel at predicting average performance but, due to their reliance on mean squared error, systematically fail to capture the critical taillatency and jitter that define modern network stability. This 'tail-blindness' renders them unreliable for latency-sensitive systems. We bridge this gap by introducing Accurate Neural Architecture Search (ANAS), a paradigm that automates the discovery of architectures for high-precision, distribution-aware network simulation. ANAS corrects the evaluation inaccuracies of weight-sharing NAS via a similarity-constrained search, employs a hybrid search space to model complex traffic, and uses a Wasserstein loss to optimize for the entire delay distribution, not just its mean. Empirically, the ANAS-discovered architecture is holistically superior: it reduces overall validation loss by 25.8% compared to DeepQueueNet, demonstrating strong average-case performance, while simultaneously excelling at tail-sensitive metrics by lowering the normalized Wasserstein distance (W_n) by up to 69.8%. This confirms its ability to faithfully model a comprehensive performance spectrum, encompassing both average and critical tail behaviors. The ANAS framework provides a practical methodology for automatically creating high-fidelity model of network devices, enabling more reliable validation of next-generation network protocols and algorithms.

Abstract:
Negative sampling is an important yet challenging component in selfsupervised graph representation learning, particularly for recommendation systems where user-item interactions are modeled as bipartite graphs. Existing methods often rely on heuristics or human-specified principles to design negative sampling distributions. This potentially overlooks the usage of an underlying ``true'' negative distribution, which we might be able to access as an oracle despite not knowing its exact form. In this work, we shift the focus from manually designing negative sampling distributions to a more principled method that approximates and leverages the underlying true distribution from the ground up. We expand this idea in the analysis of two scenarios: (1) when the observed graph is an unbiased sample from the true distribution, and (2) when the observed graph is biased with partially observable positive edges. The analysis result is the derivation of a sampling strategy as the numerical approximation of a well-established learning objective. Our theoretical findings are also empirically validated, and our new sampling methods achieve state-of-the-art performance on real-world datasets.

Abstract:
Efficient exploration remains a key challenge in deep reinforcement learning, especially for embodied agents operating in realistic environments with highdimensional observations and complex dynamics. Recent latent exploration methods define bonuses in a learned latent space, but often struggle in these settings where (i) representations can be noisy or policy-dependent, and (ii) common strategies such as randomized latent objectives or fixed directional spanning are brittle and fail to improve global coverage. We propose Covariance Volume Maximization (CVM), a coverage-driven latent exploration framework with two key components. First, we learn a behavioral state encoder using apolicy-mixtureobjective to reduce representation drift under rapidly changing exploration policies, yielding stable and behaviorally meaningful latent displacements. Second, CVM rewards each transition by its exact increase in the log-determinant of the covariance of recent latent displacements, explicitly expanding the explored region and prioritizing under-covered directions. This objective coincides with the classicalD-optimaldesign criterion, providing an information-efficiency justification. Extensive experiments on embodied navigation and manipulation tasks demonstrate that CVM substantially improves exploration efficiency and robustness, and scales effectively to different environments.

Abstract:
While VisionLanguage Models (VLMs) excel on static visual benchmarks, they consistently underperform in game-based reasoning environments. Existing evaluations conflate failures in perception, rule comprehension, and reasoning. We propose a two-stage diagnostic framework that decomposes VLM performance into testable components: controlled perception tests isolating visual encoding, and a 2×2 diagnostic matrix with a six-level rule complexity ladder evaluated in both explicit verification and predictive simulation modes. Experimenting with six state-of-the-art VLMs reveals three systematic failure patterns: (1) coordinated spatial drift, where off-by-one localization errors among adjacent pieces share the same shift direction at 1.5-1.9× the rate expected under spatial independence; (2) perception-reasoning dissociation, where models correctly verify board states but fail to apply rules—at complex constraint levels, perception remains relatively stable while reasoning accuracy plummets, with even the best-performing model capped at 75% and others ranging from 37% to 64%; and (3) a simulation gap, with performance dropping by up to 27 points when predicting future states versus verifying observed outcomes. These limitations persist across model scales, suggesting persistent limitations in bridging visual encoding and logical simulation.

Abstract:
Foundation models for intracranial neural recordings aim to learn generalizable representations from largescale unlabeled data. However, existing approaches rely on suboptimal tokenization schemes -- treating individual electrode channels as independent tokens or aggregating them into a single brain-wide representation -- which fail to capture the brain’s inherent functional modularity. We introduce NeuroCLUS, a foundation model that learns to represent neural activity through data-driven functional clusters. NeuroCLUS is built on a novel two-stage pre-training framework. First, a spatial-temporal model learns a functional context graph between channels via a functional context prediction task. Second, this graph guides a soft clustering of channels into a set of learnable prototype tokens, enabling the transformer backbone to process coherent functional units rather than raw channels. Evaluated across a diverse range of decoding paradigms -- including speech perception, speech production, and seizure detection -- NeuroCLUS consistently achieves state-of-the-art performance. The discovered functional clusters align with established neurophysiology and offer enhanced interpretability. Our work demonstrates that explicitly modeling functional neural groupings significantly improves the efficiency, generalization, and interpretability of foundation models for intracranial decoding.

Abstract:
As roleplaying Large Language Models (LLMs) become central to personalized AI, they face a fundamental challenge: balancing character authenticity with user satisfaction. Prior dual-process and dual-perspective approaches address this through prompt-level conditioning, auxiliary modules, or inference-time reflection---realizing duality externally rather than within the core attention mechanism. We introduce the KnowSelf-KnowOther Transformer (KSKT), which embeds dual-perspective reasoning directly into the generation process via axial attention that processes self-understanding and other-understanding through separate streams. This intrinsic integration enables token-level dynamic balance rather than post-hoc reconciliation. On CharacterBench, KSKT achieves 6.2% overall improvement over Qwen3-4B-Thinking. On SOTOPIA, KSKT improves Relationship by 19.3% over the base model---the dimension requiring explicit self-other coordination. These results establish intrinsic dual-perspective reasoning as an effective architectural principle for role-playing systems.

Abstract:
While recent autoregressive models have achieved textto-image generation performance comparable to diffusion models, they significantly struggle with fine-grained semantic alignment. To rigorously evaluate this limitation, we introduce DeltaBench, a benchmark featuring paired prompts with subtle fine-grained differences, which reveals that existing models fail to achieve precise control over visual tokens. To bridge this gap, we propose FineFocus, a comprehensive framework that enhances alignment by learning from subtle differences in similar text-image pairs. Specifically, we construct FineFocus-Data, a large-scale dataset of paired samples derived from image editing tasks to capture localized semantic shifts. Furthermore, we introduce Pair-GRPO, an improved reinforcement learning algorithm that extends GRPO to paired samples. Extensive experiments demonstrate that our approach outperforms most prior prominent methods on both DeltaBench and existing benchmarks.

Abstract:
Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on finallayer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings, even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix (W_O) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.

Abstract:
Reinforcement learning with verifiable rewards (RLVR) is a cornerstone of posttraining for large reasoning models, yet widely used algorithms such as Group Relative Policy Optimization (GRPO) often exhibit diversity collapse. We provide a geometric diagnosis by formalizing GRPO as a dynamical flow on the probability simplex. Under a mode-based coarse-graining of rollouts, we show that GRPO induces a collision field over correct modes, monotonically pushing towards simplex vertices and thus yielding a winner-take-all regime. To address this systematically, we introduce G^2RPO (Geometric GRPO), which reshapes RLVR via principled vector-field editing. Concretely, we intervene at the advantage level by adding granularity bonuses inversely proportional to mode probabilities, encouraging underrepresented correct modes. The bonus has a natural geometric interpretation, and its potential performance side effects can be mitigated, thereby avoiding the usual accuracy--diversity trade-off. In experiments with 7B and 14B models trained on a math reasoning task and evaluated on AIME 2024/2025, GRPO loses up to 57% of active correct modes. In contrast, G^2RPO increases active correct-mode coverage by 172%--205%, reduces concentration on any single correct mode, prevents the late-stage \emphentropy crash, and improves \textttpass@1 by +1.4 to +7.9 points relative to GRPO. Overall, diversity is not merely a regularizer but a geometric property to be controlled to improve the model without trapping it in a single dominant strategy.

Abstract:
Spiking Neural Networks (SNNs) offer a promising alternative to traditional artificial neural networks by leveraging sparse, eventdriven computation that closely mimics biological neurons. When deployed on neuromorphic hardware, SNNs enable substantial energy savings due to their temporal and asynchronous processing. However, training SNNs remains fundamentally difficult because the non-differentiable nature of spike generation breaks the bidirectional gradient flow required in modern self-supervised learning (SSL) frameworks. In this work, we introduce the first fully SSL framework for SNNs that scales to large-scale visual tasks without requiring labeled fine-tuning. Our method leverages intrinsic spike-time dynamics by aligning representations across time steps and augmented views. To address gradient mismatch during surrogate training, we propose the MixedLIF neuron model, which combines a spiking path with an antiderivative-based surrogate path during training to stabilize optimization, while retaining a fully spiking and energy-efficient architecture at inference. We also introduce two temporal objectives, Cross Temporal Loss and Boundary Temporal Loss, that align multi-time-step outputs to improve learning efficiency. Our approach achieves competitive results across ResNet and Vision Transformer-based SNNs on CIFAR-10, CIFAR10-DVS, and ImageNet-1K. Our approach further generalizes through transfer learning from ImageNet-1K to downstream tasks. Notably, our self-supervised SNNs match or exceed the performance of some non-spiking SSL models, demonstrating both representational strength and energy efficiency.

Abstract:
Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text when predicting the next word, while representing semantic notions independently of surface form. Yet, which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic contextfree grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks; or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we(i)introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled;(ii)provide a learning mechanism---an inference algorithm inspired by the structure of deep convolutional networks---that links learnability and sample complexity to specific language statistics; and(iii)validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.

Abstract:
Adapting Detection Transformers to Incremental Object Detection (IOD) poses a systemic challenge, as setbased optimization is inherently destabilized by sequential learning. In this work, we identify Gradient Dilution as the root cause of performance degradation, wherein optimization signals required to preserve old knowledge are progressively weakened. This phenomenon manifests as a cascading erosion driven by three tightly coupled factors: Signal Dispersion, where foreground gradients are overwhelmed by background noise; Assignment Drift, where stochastic query–target matching induces inconsistent gradient trajectories; and Support Attrition, where gradients from retained samples insufficiently cover the old-class feature space, weakening decision boundaries under interference from new classes. To counteract this, we propose FAS, a unified framework that \underlineFocuses, \underlineAligns, and \underlineSustains gradient flow throughout incremental learning. Specifically, we introduce prior-injected queries to focus discriminative signals by filtering background interference at the source. We further propose deterministic anchor distillation to align query–target assignments, bypassing unstable bipartite matching and enforcing semantic consistency across stages. Finally, we devise manifold-support replay to sustain distributional support of old classes, counteracting representational erosion induced by continual updates. Extensive experiments show that FAS restores robust optimization dynamics and outperforms state-of-the-art methods, achieving over 5.0 AP improvement in the challenging 40+10×4 incremental setting.

Abstract:
We study the sparse spiked Wigner model, where the goal is to recover an ssparse unit vector \symbfitu \in \mathbbR^d from a noisy observation \symbfitY = \beta \symbfitu \symbfitu^\top + \symbfitW. While the information-theoretic threshold is \beta = \widetilde\Omega(\sqrts), existing polynomial-time algorithms require \beta = \widetilde\Omega(s), yielding a substantial computational-statistical gap. We propose a column thresholding method that attains the \widetilde\Omega(\sqrts) scaling for both estimation and support recovery under the non-uniformity condition || \symbfitu ||_\infty = \Omega(1). This condition is not merely technical: it explicitly rules out uniform spikes, for which planted-clique-based hardness results apply, and identifies a concrete class of non-uniform spikes where the gap can be closed. Building on this initializer, we further develop a truncated power method that iteratively refines the estimate with provable linear convergence.

Abstract:
Reconstructing unified continuous dynamics from sparse, noncontiguous, and unpaired point cloud snapshots remains a fundamental challenge in spatiotemporal analysis for computer vision and developmental biology. Existing methods, including scene flow and Optimal Transport-based approaches, are limited either by explicit reliance on point-wise correspondences or by cumulative errors from frame-to-frame propagation and temporal discontinuity, and limited ability to model multi-attribute dynamics such as gene expression and population changes. We propose FlowCloud, a variational Neural Ordinary Differential Equation (Neural ODE) generative framework. FlowCloud aggregates information from all observed time points into a joint latent representation that initializes a Neural ODE z(t) enabling continuous spatiotemporal evolution modeling while mitigating propagation-induced errors and preserving temporal consistency. Training is performed without predefined correspondences using a multi-faceted objective with complementary roles: Sinkhorn distance for global distribution alignment, Chamfer distance for local geometric consistency, trajectory regularization to encourage smooth and physically plausible dynamics, and supervised losses for multi-attribute prediction. Experiments on human motion and developmental biology datasets demonstrate improved interpolation accuracy and promising short-term extrapolation performance. By unifying geometry, attributes, and population dynamics within a continuous latent framework, FlowCloud offers a novel and robust solution for continuous dynamic reconstruction from unstructured spatiotemporal observations.

Abstract:
Quantizationaware training is widely used for language model quantization in sub-4-bit precision, by training full-precision weights with gradients computed on the quantized model. The main bottleneck for this training approach is its slow convergence and plateauing of test performance, which gets worse in lower bit-widths. While observed in prior work, its precise cause has not been carefully studied. In this paper, we analyze the convergence by computing the Hessian spectrum of the model loss throughout quantization-aware training. We find the key reason is that the model weights converge to flat surfaces near saddle points, with a large fraction of Hessian eigenvalues concentrated around zero, and the magnitude of both positive and negative eigenvalues decreases over training. Additionally, the convergence speed is slower in lower bit-widths with significantly smaller Hessian eigenvalue magnitude. Motivated by these findings, we propose an approach to accelerate quantized training with minimal overhead named WinQ. This approach periodically performs linear weight interpolation between the full-precision and quantized weights and computes gradients on noise-injected weights. Both techniques effectively regularize the Hessian and accelerate training, resulting in an algorithm broadly applicable to quantization methods. Extensive experiments show that WinQ accelerates various quantized training methods by up to 4×. Under the same training budget, WinQ improves state-of-the-art sub-4-bit quantization performance by up to 8.8% relatively. Additionally, WinQ remains consistently effective across 16 settings of different language models, quantization methods, and bit-widths.

Abstract:
State abstraction plays a central role in scaling reinforcement learning to complex but structured systems. In studying such systems, a wide range of behavioral structures have been proposed in reinforcement learning, including value functions, invariants, bisimulation relations, and behavioral metrics. However, a general principle for determining what structures are provably preserved under state abstraction is still lacking. In this paper, we present a unified framework for defining and analyzing behavioral structures in reinforcement learning. Our framework provides a compositional way to specify behavioral semantics and metrics based on local, onestep descriptions of system dynamics. Using this framework, we establish results showing how behavioral structures can be safely transferred between abstract and concrete systems. We further show how to construct quantitative metrics from logical behavioral semantics with soundness guarantees. Together, these results provide a principled foundation for reasoning about behaviors under state abstraction in reinforcement learning and offer reusable definition and proof principles for a broad class of behavioral structures in reinforcement learning.

Abstract:
Grounding Large Language Models (LLMs) in Knowledge Graphs (KGs) has shown significant promise for complex Question Answering (QA) tasks. Since LLMs' limited context window cannot accommodate the sheer volume of largescale KGs, existing work usually utilizes agents to reason on real-world KGs, which follows reasoning paths derived from the queries step by step. However, the mismatch between query-derived paths and the KG's structure, stemming from users' lack of schema knowledge, usually leads the agents into dead ends. To address this problem, in this paper, we propose Backjump-on-Graph (BoG), a novel framework that empowers LLMs to retrospectively explore alternative reasoning paths at dead ends. We first propose to formalize each reasoning step with four atomic operations to create a structural scaffold that allows LLMs to revert to historical status. Next, we fine-tune the LLM with synthetic data containing the above atomic operations to instill basic backjump abilities. Finally, we leverage reinforcement learning and propose a hybrid reward function, which penalizes redundant transitions and promotes correct answers, to optimize the timing and landing nodes of backjumping. Extensive experiments on several KGQA benchmark datasets demonstrate the effectiveness of our BoG method.

Abstract:
Flow Matching models have demonstrated strong performance across a wide range of generative tasks. However, their reliance on ODEbased iterative sampling incurs substantial computational overhead, which limits their applicability in real-time scenes. While distillation is a promising solution, existing approaches largely borrow from diffusion-based score matching, often failing to exploit the intrinsic geometric structure of flows and suffering from training instability, high variance, and degraded generation quality. In this paper, we propose Mean Flow Distillation (MFD), a novel distillation framework tailored for flow matching models. We theoretically demonstrate that MFD acts as a temporal low-pass filter, effectively suppressing the high-frequency optimization noise inherent in variational score distillation (VSD) while ensuring global trajectory consistency. We further prove the Mean Flow Matching Theorem, establishing that matching expected average velocities is sufficient for strict distribution alignment. Empirically, on challenging high-dimensional manifolds including 4D occupancy forecasting and text-to-image generation, MFD achieves state-of-the-art performance, enabling high-fidelity single-step generation.

Abstract:
Highfidelity vehicle aerodynamics analysis is bottlenecked by costly CFD simulations. Neural surrogates accelerate prediction but lack inverse design capabilities, while existing generative optimization methods suffer from unstable convergence and frequent engineering constraint violations. We present UniDrag, a unified framework bridging multi-field aerodynamic prediction with robust differentiable shape optimization. Given a vehicle geometry, UniDrag predicts surface pressure, volume flow fields, drag coefficient C_d, and a streamwise build-up profile localizing drag contributions. Our architecture combines enhanced Physics-Sliced Attention (ePSA), Gated Expert Routing, and Modality-Protected Learning to prevent negative transfer across output modalities. At deployment, the frozen surrogate enables gradient-based optimization via Free-Form Deformation with engineering constraints. We introduce Expectation-over-Transformation to prevent adversarial exploitation of surrogate fragility. We curate a large-scale dataset of 15,000 vehicle geometries spanning four body types with GPU-accelerated LBM simulations. On this benchmark, UniDrag achieves C_d prediction R^2 of 0.937 (+7.6% over baselines) and 13.7% mean CFD-verified drag reduction with 100% success rate and only 21.3~mm average displacement (0.46% vehicle length).

Abstract:
Implicit Neural Representation (INR) has emerged as a powerful paradigm for continuous MRI reconstruction. However, standard unsupervised INR requires timeconsuming optimization from scratch for each scan, hindering clinical deployment. This work presents IPOD, a Reference-Free Meta-Learning framework designed to learn generalized parameter initializations for INR directly from undersampled data. Distinct from conventional meta-learning that relies on fully-sampled ground truth, IPOD operates in an inverse-problem-driven manner, leveraging diverse reconstruction tasks with varying sampling patterns to capture a robust prior. Furthermore, we introduce an adaptive meta-update strategy modulated by task-specific performance to ensure optimal parameter distribution for diverse anatomical structures. Extensive experiments demonstrate that IPOD provides a superior initialization that enables rapid adaptation and achieves high-fidelity reconstruction across various imaging protocols, significantly outperforming existing INR baselines. By eliminating the dependence on reference images, IPOD offers a scalable and efficient solution for a wide range of imaging inverse problems. Code and data available at: https://anonymous.4open.science/r/iPod-2C60

Abstract:
We study the geometry of feasible value functions in infinitehorizon partially observable Markov decision processes (POMDPs) under memoryless stochastic policies. Our main contribution is a characterization of the feasible set of value functions as a semi-algebraic set, defined by explicit polynomial inequalities determined by the transition dynamics, observation kernel, and reward structure of the POMDP. This result extends prior work for fully observable Markov decision processes, where the feasible set is known to be a polytope, to the substantially more intricate partially observable setting. In contrast to the polyhedral structure arising in MDPs, partial observability induces fundamentally nonlinear constraints, leading to a richer and more complex geometric structure. Our geometric characterization provides new insight into the landscape of policy optimization in both MDPs and POMDPs, and reveals qualitative phenomena unique to partial observability, including the emergence of isolated local maximizers of the long-term reward and their dependence on the initial state distribution.

Abstract:
Direct Preference Optimization (DPO) has become the dominant approach for aligning large language models with human preferences. However, standard DPO treats all preference pairs uniformly, overlooking the heterogeneous nature of the learning problem: some samples demand sophisticated semantic understanding of the prompt, while others require nuanced discrimination between similar responses. We argue that these two objectives should be disentangled during training. Through gradient analysis, we identify a layerwise localization phenomenon where semantic complexity predominantly drives lower-layer updates while preference uncertainty modulates upper layers. Building on this insight, we propose Gradient-Guided Disentangled DPO (GDO-DPO), a curriculum framework that independently regulates learning pace along each dimension based on layer-specific gradient stability. Experiments on UltraFeedback and HH-RLHF demonstrate consistent improvements, with GDO-DPO outperforming DPO by 4.1% on AlpacaEval 2.0 and showing particularly strong gains on reasoning-intensive tasks.

Abstract:
Autoregressive (AR) image generation has recently gained momentum as a scalable alternative to diffusion models, benefiting from unified nexttoken prediction paradigm and strong instruction following ability. However, AR visual generation must decode excessively long sequences of visual tokens, making inference heavily bottlenecked by the memory footprint and latency of the self-attention KV cache. While KV cache compression is well studied in Large Language Model, its counterparts in AR image generation remain underexplored. The reason is fundamental: visual tokens are highly redundant, and their spatial information density is highly non-uniform. In this work, we introduce SparseAR, a training-free, entropy-aware sparse attention method that is specifically designed for AR image generation and editing. Our key insight is that information-rich regions exhibit higher entropy and require broader attention, while redundant regions show lower entropy and allow aggressive sparsification. Based on this insight, we dynamically identify information-rich regions during decoding and adaptively adjust attention sparsity to reduce KV-cache overhead. SparseAR is plug-and-play and can be readily applied to mainstream AR models. Extensive experiments on four representative AR models across multiple benchmarks demonstrate that SparseAR significantly improves inference efficiency while maintaining, and often even improving, generation and editing quality.

Abstract:
As language models grow more capable, accurate capability evaluation becomes essential for safety decisions. If models can deliberately underperform on dangerous capability evaluationsa behavior known as \emphsandbagging---they may evade safety measures designed for their true capability level. We introduce Cross-Context Consistency (C³), a general framework for unsupervised black-box sandbagging detection that exploits a fundamental asymmetry: when a model truly lacks capability, its confusion manifests consistently across paraphrased questions, but when a capable model feigns incompetence, its strategic choices about \emphhow to appear weak create detectable inconsistencies. The framework is agnostic to the specific consistency metric and aggregation method; we present a simple instantiation using embedding cosine distance and mean aggregation that requires no training data or model-specific adaptation. We evaluate C³ across prompted and fine-tuned sandbagging scenarios, across instructed and more naturalistic setups, maintaining a classification signal where other black-box methods fail. Our findings show the limitations of existing sandbagging detection methods, and reveal the efficacy of consistency-checking as a detection mechanism for dangerous capabilities.

Abstract:
Aligning Multimodal Large Language Models (MLLMs) with human preferences remains a fundamental challenge. While Generative Reward Models (GRMs) offer a promising reasoningbased alternative to scalar models, they are often hindered by severe position bias and prohibitively high computational overhead. To address these limitations, we propose GenAlign, a unified framework that synergizes robust generative reward modeling with efficient MLLM alignment. First, we introduce a rubric-based GRM that explicitly models the preference judgment process. By employing reinforcement learning with verifiable rewards and an online position debiasing mechanism, our model produces interpretable reasoning critiques and robust preference predictions. Second, we propose a policy optimization strategy utilizing advantage-smoothed dynamic reference anchoring. This approach reduces computational complexity while mitigating gradient instability caused by variance collapse. Extensive experiments demonstrate that GenAlign achieves state-of-the-art preference prediction accuracy on multimodal reward modeling benchmarks. Moreover, it consistently improves the performance of three MLLMs across seven diverse evaluation benchmarks, particularly making significant progress in safety and hallucination.

Abstract:
Vision–language models (VLMs) now support both direct Instruct and explicitreasoning Thinking modes, but practitioners lack principled ways to decide when reasoning helps or how much computation to allocate at test time. We investigate whether VLMs encode meta-cognitive signals for adaptive inference. We derive oracle labels for two properties: (1) reasoning helpfulness—whether explicit reasoning improves accuracy, and (2) optimal generation length—the minimal token budget for correctness. Probing final-layer representations in InternVL and Qwen3-VL models, we find Thinking models encode these signals more linearly than Instruct models, suggesting reasoning-oriented training enhances meta-cognitive structure. Head-wise attribution reveals two circuits: length-control heads in lower layers, and reasoning/difficulty heads in higher layers. Causal interventions confirm these roles: scaling length heads controls output length with little accuracy loss, while scaling reasoning heads enables a perception–reasoning trade-off, improving accuracy by up to 5.3%. These effects generalize across benchmarks. Our results show reasoning-tuned VLMs develop localized, manipulable circuits for meta-cognitive control, enabling test-time steering of computation and reasoning without retraining.

Abstract:
Masked diffusion language models decode by iteratively unmasking tokens, where the unmasking order defines an ``order of thought'' that strongly influences generation quality yet is typically chosen heuristically. We derive a tractable upper bound on the sequential decoding mismatch, measured by the Kullback–Leibler divergence and expressed in terms of the model’s pathwise loglikelihood, with tightness under sufficient model expressivity. This bound induces a dense self-aware reward for a target sequence x and unmasking order \sigma, over ordered paths, casting order selection as a principled policy optimization problem with a frozen denoiser. We instantiate this idea as Self-Aware Scheduling (SAS), which learns a lightweight order policy using Group Relative Policy Optimization and applies seamlessly to both sequential and semi-autoregressive decoding. On Sudoku with 1B MDM, SAS improves puzzle accuracy from 82.0% (best heuristic schedule) to 91.8%, and reaches 97.9% with second-stage fine-tuning along learned trajectories. On LLaDA-8B, SAS improves pass@1 on GSM8K from 64% to 76% (full diffusion) and on MBPP from 39.5% to 41%, while consistently matching or exceeding heuristic schedules across generation lengths and block sizes.

Abstract:
Highthroughput gene perturbation experiments can test several genetic interventions in parallel, yet experimental budgets remain limited. A central goal is hit discovery: identifying as many perturbations as possible whose phenotypic effect exceeds a predefined threshold. Pure exploration strategies are statistically inefficient, wasting budget on low-value regions. Bayesian optimization methods offer a principled alternative but target a single global optimum, over-exploiting dominant modes while neglecting other high-value regions. We formalize hit discovery as a sequential experimental design problem and propose Probability-of-Hit, an acquisition function that directly targets threshold exceedance by ranking candidates according to their posterior probability of being a hit. We prove asymptotic optimality of this approach and demonstrate strong empirical performance on both synthetic benchmarks and real biological immunology datasets, including upto 6.4% improvement over baselines on the Schmidt IL-2 dataset.

Abstract:
Codewriting large language models (CodeLLMs) generate executable code policies for embodied agents by translating natural language goals and environmental constraints into structured control programs. However, policy generation in open-domain embodied environments suffers from two fundamental limitations: (i) delayed decoding caused by repetitive prefill computation over long prompts, and (ii) limited robustness due to fully generative decoding, which often produces API mismatches, missing safety guards, and unstable control logic. To address these limitations, we present FCGraft, a Functional Cache Grafting framework. FCGraft maintains a library of function-level validated code skeletons and their associated prompt-level Transformer key–value (KV) caches, and synthesizes new policies by retrieving relevant functions and grafting their KV caches when a new task is provided. Given retrieved function caches, FCGraft performs cache grafting via stitching, which composes cached function segments into a composite policy, and patching, which locally adapts only the necessary code regions to satisfy task-specific parameters and constraints with minimal additional decoding. By eliminating redundant prefill computation, this approach reduces generation latency, while reusing validated control structures improves robustness over prompt-level caching methods RAGCache, achieving 18.31% higher task success rate and 2.3× faster policy synthesis.

Abstract:
3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence that demands the coarseto-fine grained and multi-hierarchy understanding of 3D scenes. 3DHS tasks can be addressed by multi-label learning, but facing two issues: I) learning multiple labels for each point with a shared model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) the class imbalance issue is inevitable across multiple hierarchies of 3D scenes, making the model easily be dominated by major classes. To address these issues, we propose a novel multi-label learning with contrastive cluster self-supervision framework for 3DHS. Specifically, we propose a late-decoupled multi-label learning 3DHS network which employs decoupled decoders with the coarse-to-fine hierarchical consistency guidance. This late-decoupled model architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and also constrain the class imbalance problem within each individual hierarchy. Moreover, we introduce a 3DHS-oriented contrastive cluster self-supervision learning method, which learns cluster-wise point cloud features with contrastive loss and produces self-supervised information to enhance the class-imbalance segmentation. Extensive experiments on multiple datasets and backbones demonstrate that our approach promotes the multi-hierarchy balance and mitigates the class imbalance issue in 3DHS tasks.

Abstract:
Large vision language models (LVLMs) have made rapid advancements and are deployed across various applications, yet hallucinations remain a major challenge. Activation steering is appealing due to its minimal training overhead and controllability at inference time. However we found that during autoregressive decoding, visual conditioning affects token prediction sparsely and locally across decoding steps, and many existing methods that average imageversus-no-image differences over the entire sequence dilute these critical signals, yielding low signal-to-noise ratio steering directions. Additionally, many existing methods apply a fixed steering strength, which misallocates the intervention budget, over-perturbs non-critical tokens, and can cause instability. To address these limitations, we propose Token-Level Visual-Sensitivity Steering (TLVS) for hallucination mitigation. Our approach first extract token-level steering vectors and refines them, and then applies fine-grained, visual-sensitivity–adaptive steering only where it matters. This lightweight, plug-and-play mechanism requires only minimal training for calibration and can be applied across diverse vision-language models. It modulates the steering strength at each decoding step, selectively suppressing hallucination-prone spans while preserving evidence-grounded content. We evaluate TLVS on several benchmarks, including POPE, AMBER, CHAIR (COCO), MMHal and HallusionBench, demonstrating consistent improvements over previous steering methods.

Abstract:
As policy catches up with the capabilities of generative AI, watermarking is central to content provenance efforts. Inferencetime watermarks for autoregressive models are unfit for continuous modalities due to discretization inconsistencies. Existing methods overcome this by finetuning the modality tokenizers, nullifying the watermark's training-free advantage. In this work, motivated by the vocabulary redundancy of discretization, we propose an elegant solution for powerful and robust watermarking of synthetic audio. We theoretically analyze the impact of token errors on watermark detection, and effectively mitigate them using a reduced vocabulary obtained via community detection. Thorough experiments showcase that our gradient-free method can boost detectability by several orders of magnitude, while also achieving built-in robustness to audio modifications. Broadly, we discover a new state-of-the-art for token-level watermarks in multimedia, which simply arises from the nature of discrete representation learning.

Abstract:
Reliable estimation of Amari \alphadivergences underpins variational inference, yet unconstrained neural critics are notoriously prone to instability. We propose a scalable estimator by constraining the critic to a Reproducing Kernel Hilbert Space (RKHS) ball and approximating the kernel via band-limited Random Fourier Features (RFF). This formulation yields a linear-time objective amenable to mini-batch stochastic optimization while avoiding the cubic complexity of Gram-matrix methods. We present a unified analysis based on a four-term error decomposition—comprising RKHS approximation, feature discretization, statistical deviation, and optimization residual. Under a spectral source condition, we derive non-asymptotic bounds establishing that the RKHS approximation bias scales as \mathcalO(R^-\gamma), the RFF discretization error as \mathcalO(R D^-1/2), and the statistical error as \mathcalO(R n^-1/2). We further show that statistical non-degeneracy induces intrinsic local curvature, enabling our proposed Armijo-SGD to achieve local linear convergence. Empirical evaluations demonstrate that the RFF-RKHS estimator outperforms varying-representation baselines in stability, and applying this spectral regularization to GAN critics significantly enhances the capture of high-frequency data components.

Abstract:
Operating room (OR) scene graph generation (SGG) enables holistic modeling of OR domains by encoding interactions among medical staff, tools, and equipment as tripletbased structured scene graphs. Although existing OR SGG methods demonstrate satisfactory overall performance, they exhibit substantially lower accuracy on long-tail categories compared to head categories in OR data. We introduce SGG-ICL, a novel framework that represents the first attempt to address the long-tail problem in OR SGG by leveraging in-context learning (ICL). SGG-ICL first identifies long-tail samples via an Adaptive Router module and selectively applies ICL only to these samples. This selective routing strategy enhances performance on long-tail categories without degrading head-category accuracy. Subsequently, SGG-ICL constructs a candidate pool through multimodal retrieval and then employs a trained MLLM Reranker to re-rank the candidates, selecting the most similar examples to the test sample for ICL. The reranker is supervised by IoU scores derived from annotated SGG triplets and exploits rich multimodal information to estimate pairwise sample similarity. Experimental results show that SGG-ICL improves accuracy on long-tail categories by 6.9%, while also achieving a 2.6% improvement in overall accuracy.

Abstract:
Designing biological sequences such as proteins and DNA for desired properties is challenging due to vast search spaces and limited wet lab evaluation budgets. Current evolutionary approaches ignore sequential dependencies and rely on random mutations, which scale poorly for long sequences. In contrast, reinforcement learning (RL) and generative models explicitly model sequence structure but require large datasets to guide generation toward the target properties. These limitations suggest the need for a method that combines the sample efficiency of evolutionary approaches with the ability to exploit sequential structure. In this work, we propose a novel evolutionary approach, \textttIDEAS, in which mutations are guided by an explainable model. The model identifies critical motifs in highfitness sequences and uses them to mutate non-critical positions. Across six continuous-property datasets, seven baselines, and three evaluation budgets, \textttIDEAS achieves a 19% acceleration in design while maintaining a favorable position on the Pareto curve balancing acceleration, diversity, and novelty.

Abstract:
Dynamic graphs are pervasive in realworld systems, but their tightly entangled spatiotemporal evolution causes significant modeling challenges. Existing Dynamic Graph Neural Networks (DGNNs) lack a principled framework for systematically decoupling this multi-domain entanglement, raising two key problems: (i) representation drift caused by structural incompleteness, and (ii) signal distortion amplified by noise perturbation. These problems can accumulate over time, forming temporal redundancy that weakens robustness of DGNNs. In view of these, we propose DeR-Mamba(Decoupling for Robust Mamba), a multi-domain decoupling framework for robust DGNNs. To address (i), we develop the Multi-Particle Kernel Kalman observation field (MP-K^2alman), which achieves spatial decoupling by sampling latent evolution paths in kernel subspaces and performing Kalman-style updates to estimate structural states. To address (ii), we design the Adversarial-aware Frequency Decoupling Module (AFDM), which performs frequency-domain decoupling and dynamic cross-frequency modulation to purify spectral signals. Finally, a self-consistent dynamic graph state-space system performs temporal decoupling to control redundancy, suppressing residual disturbances through discretized cross-time modeling and selective snapshot scanning. Extensive experiments on benchmark datasets with adversarial attacks validate its superior robustness.

Abstract:
Contrastive learning has advanced the representation learning across domains, yet its success relies on data augmentations that preserve semantic contents while providing the view diversities. Multivariate time series, however, are inherently noisy, nonstationary, and lack such intuitive semantic cues. Consequently, standard heuristic augmentations that ignore semantic parts may risk destroying critical temporal dependencies. Though some recent approaches attempt to isolate informative components, they typically rely on an implicit neural mechanism to infer semantics, thus limiting the interpretability and controllability. To address this, we propose ProSAR, an information-theoretic framework that leverages the explicit prototype alignment to guide semantic augmentations, and establish a feedback loop between the augmentation, contrastive learning, and prototype updates. Specifically, grounded in our proposed Prototype-Conditioned Information Bottleneck principle, we leverage the time-domain prototypes as explicit anchors to localize semantic segments, and develop a time–frequency augmentation strategy that retains prototype-consistent information while discarding noise. To promote semantically consistent prototypes for a reliable view generation, we design a dual-prototype loop where the augmented views are encoded into representations and then the learned representations are clustered to update latent prototypes, whose decoded feedback refines the time-domain prototypes for the next round of augmentation. Experiments on diverse time-series benchmarks demonstrate that ProSAR outperforms the other contrastive learning methods on downstream forecasting and classification tasks.

Abstract:
Parametric CAD is widely used in mechanical and product engineering, but current generative models still have difficulty producing assemblies that are both editable at the parameter level and consistent with interpart constraints. Methods that generate meshes or history-free B-rep can represent multi-part shape, but they often lack the program structure and constraint logic needed for reliable downstream edits; in contrast, code-based CAD generation offers direct parametric control, yet most published settings and evaluations focus on single-part solids rather than constrained assemblies. We introduce SPADA (Self-testing Parametric Assembly Design Agent), a test-driven agent that synthesizes assembly code together with deterministic verification tests, and uses these tests as an executable contract for controllable generation. SPADA runs an iterative compile-test-repair loop with multimodal feedback, checking both specification logic and physical feasibility through programmatic constraints. To support evaluation, we release SPADA-Bench-Verified, a human-validated benchmark of real-world code-centric assemblies paired with deterministic tests and engineering-style constraints. Experiments show that SPADA could produces complex assemblies while maintaining geometric fidelity, supporting test-driven agents as a concrete path toward reliable, controllable CAD generation.

Abstract:
MultiAgent Reinforcement Learning (MARL) has been widely applied to automated aircraft conflict resolution due to its strong capability for cooperative control and distributed decision-making. However, existing approaches typically assume a fixed number of aircraft and neglect the unique characteristics of air traffic control instructions. This structural misalignment between model architectures and domain requirements leads to severe deficiencies in perception scalability and action stability across scenarios of varying scales. To address these challenges, we propose Structural-Aligned Reinforcement Learning (SARL), which aims to bridge the gap between perception and action. First, the Physically Encoded Relational Graph (PERG) effectively resolves the fixed input dimensionality issue by incorporating physical inductive biases into a graph attention mechanism. Second, we design the Sparse Cognitive Mixture-of-Experts (SC-MoE) to enhance decision stability. In addition, we introduce a Kinematic Kafety Shield (KSS) based on aviation rules, which not only improves inference-time safety but also effectively guides the model to generate semantically meaningful actions that comply with aviation standards. Simulation experiment results demonstrate that SARL significantly outperforms existing reinforcement learning baselines across diverse scenarios in terms of both success rate and operational efficiency.

Abstract:
Textguided stylized image generation has yielded promising advances by leveraging the powerful capabilities of text-to-image diffusion models. However, the inherent coupling of style and content information within the reference image presents a significant challenge. To address this, we propose StyleDistillation, a novel approach grounded in two key observations about the CLIP embedding space from a style perspective. By leveraging a lightweight StyleDistiller module, combined with carefully designed optimization objectives based on geometric and semantic priors, we can extract fine-grained style representation from the reference image. Additionally, we introduce a Prompt Alignment Enhancement mechanism during inference, which significantly improves the control that text prompts exert over the generated images. Extensive experiments demonstrate that our method achieves outstanding performance in both style reproduction and prompt alignment. Furthermore, StyleDistillation supports various personalized operations, including style editing and style fusion, highlighting its substantial potential for diverse applications.

Abstract:
Group Emotion Recognition (GER) is crucial for understanding social dynamics, ranging from interpreting intimate conversations to evaluating crowd behavior in largescale surveillance scenarios. While current AI models can analyze these scenes, they often act as black boxes that take shortcuts. Instead of focusing on how people are actually behaving, these models often get distracted by the background environment, leading to inaccurate results. To bridge this gap, we introduce VIBE (Variational Inference for Behavioral Emotion), a kinematics-aware framework that integrates audio, video, and text modalities through causal structuring. Unlike standard models that simply mix data together, VIBE utilizes mathematical constraints to filter out background noise and isolate the genuine emotions of the people involved. This purified representation enables our model to focus exclusively on the sociological mechanics of the crowd, dynamically modulating neural attention based on raw physical synchrony. Simultaneously, we align visual dynamics with human interpretability by projecting latent representations into a semantically structured space informed by textual descriptions. Comprehensive experiments demonstrate that VIBE consistently outperforms state-of-the-art methods. Code will be made publicly available upon acceptance.

Abstract:
Reinforcement learning (RL) has become a widely adopted technique for improving large language models (LLMs) on complex tasks. Despite this progress, existing RL methods still face challenges in training agents with longerhorizon interactions. One major bottleneck is distinguishing the contribution of different actions in long-horizon interaction, leading to high optimization variance. To address this, we introduce a novel policy gradient method, Hindsight Policy Optimization (HPO), that projects both the current policy distribution and the hindsight distribution into an intent space and extracts low-variance learning signals from the Wasserstein distance between them. We theoretically and empirically show that aggregating semantically similar states and actions in the intent space yields a bounded-variance estimator and improves policy performance stably. Our code is available online.

Abstract:
Diffusionbased large language models (DLLMs) have shown promise for non-autoregressive text generation, but their deployment is constrained by large model sizes and heavy computational costs. Post-training quantization (PTQ), a widely used method for compressing and accelerating Large Language Models (LLMs), suffers from severe accuracy degradation and reduced generalization performance when directly applied to DLLMs (e.g., AWQ suffers a 16% accuracy drop on LLADA under W4A4). This paper explores how the unique mechanisms of Dynamic Language Models (DLLMs) conflict with quantization, identifying three core issues: 1) During the iterative generation process of DLLMs, dynamic masking ratios are inherently involved, leading to notable differences in token distributions across decoding steps. Unfortunately, these distinct distributions are not sufficiently captured by current PTQ calibration approaches; 2) Quantization errors propogate and accumalte progressively during iterations in DLLMs, leading to a gradual decline in the performance of quantized models as decoding steps advance; 3) The stability of unmasked tokens, combined with the probabilistic nature of masked tokens, gives rise to an overall feature distribution that is uncoordinated and unsuitable for PTQ. To address these issues, we propose DLLMQuant, a PTQ framework tailored for DLLMs, which incorporates three novel techniques: 1) Temporal-Mask Adaptive Sampling (TMAS), a calibration method that accounts for both time and mask factors, with the capacity to capture distributions across timesteps. 2) Interaction-Aware Activation Quantization (IA-AQ), which utilizes bidirectional attention scores to identify important tokens, and prioritizes these tokens when minimizing quantization error. 3) Certainty-Guided Quantization (CGQ) incorporates mask status and token scores as core weighting criteria for error compensation, enabling PTQ to better align with the unique weight distribution of DLLMs. Experiments show that DLLMQuant achieves significant performance gains (e.g., over 10-point accuracy improvement on GSM8K for LLADA under 4-bit quantization) while enhancing efficiency.

Abstract:
Nowcasting forms the first line of defense against rapidly evolving weather hazards, where even minutes of delay can lead to severe societal impacts. However, existing systems predominantly extrapolate 2D radar reflectivity, which struggles under rapid intensification regimes. We introduce \N, a multiscale modeling framework that enables coherent reconstruction of the three-dimensional evolution of convective systems while explicitly conditioning on the ambient environment. \N\ integrates multi-source observations—including radar, satellite, and station—with reanalysis fields through two components: (i) \texttt\ComponentA that explicitly disentangles convective system state form vertical thermodynamic coupling and large-scale environmental forcing; (ii) \texttt\ComponentB that predicts future echos by adaptively aggregating cross-layer interactions conditioned on evolving environmental conditions. To support comprehensive evaluation, we build a new benchmark \textttStormBench that integrates observational and reanalysis data across regions. On this benchmark,~\N\ consistently achieves the best performance, reducing MAE by 12.4% and improving the mCSI by 34.0%. Dataset and code will be released after the review process.

Abstract:
We study lastiterate convergence for stochastic smooth and monotone variational inequalities (VIs), a framework that captures convex-concave saddle points and Nash equilibrium computation in monotone games with noisy payoff feedback. In contrast to the well-understood average-iterate guarantees, anytime last-iterate guarantees in stochastic settings remain limited, despite their relevance for uncoupled learning dynamics that output a single current strategy. We analyze two single-call regularized methods, the \emphregularized gradient (RG) and the \emphregularized optimistic gradient (ROG) methods, and establish anytime last-iterate convergence rates in terms of the squared gap function. For monotone VIs, RG attains O(t^-2/5) while ROG achieves the variance-adaptive rate O(\sigma^4/5 t^-2/5 + t^-1), where \sigma^2 is the noise variance. For \lambda-strongly monotone VIs, ROG yields O(\sigma^2 / (\lambda^2 t) + t^-c) for any constant c \ge 2. These results give anytime last-iterate guarantees without knowing the horizon and show that optimism improves convergence in the low-noise regime.

Abstract:
Dynamic graph learning, which focuses on modeling the merging, vanishing, and reconnection of nodes and edges, is crucial for realworld applications. In dynamic graphs, node neighborhoods often exhibit diverse and time-evolving topologies, including hierarchical, grid-like, and cyclic patterns. Existing methods typically embed graphs into a single curvature space, which limits the quality of node representations when the embedding geometry is not aligned well with the local graph topology. In this paper, we proposeDyGMoCE, aDynamicGraph Transformer with aMixtureofCurvature-awareExperts, which efficiently embeds each node at every timestamp into an adaptive curvature space. Specifically, DyGMoCE incorporates a mixture-of-experts framework to both the attention and feed-forward modules, where each expert operates on a Riemannian manifold with a distinct curvature. Then, motivated by the geometric continuity across the experts, we introduce a routing mechanism with a ranking constraint. To improve efficiency, we design a mathematically equivalent fast Riemannian attention module, achieving an average speedup of 26.3% and memory reduction of 52.0% for DyGMoCE. Notably, the fast Riemannian attention module is broadly applicable to Transformer models with sequence inputs. Extensive experimental results show that DyGMoCE significantly outperforms other state-of-the-art methods.

Abstract:
Schema Linking serves as the foundational perception layer in Textto-SQL, tasked with grounding natural language queries into relevant schema elements. However, existing retrieval-based approaches suffer from a criticalstructural blindness: by prioritizing elements with high textual similarity, they inadvertently prune semantically-thin but topologically-critical bridge tables, thereby severing relational pathways necessary for multi-hop joins. To bridge this gap, we propose Graph-Link, a novel framework that reformulates schema linking from an independent retrieval task into a constrained subgraph induction problem. We argue that generating executable SQL necessitates a connected subgraph that satisfies both semantic relevance and structural constraints. Accordingly, Graph-Link employs a hierarchical schema graph to model the search space across multiple granularities, and then applies a Steiner-tree-based optimization for subgraph induction that guarantees the topological connectivity while maximizing the signal-to-noise ratio for downstream LLMs. Extensive experiments on BIRD and Spider 2.0 demonstrate that Graph-Link achieves state-of-the-art schema linking performance, improving recall and hit rates by up to 7.0% over competitive baselines, and boosts downstream SQL generation accuracy on complex queries by 13.8%.

Abstract:
Recent advances in image and video generation raise hopes that these models possess world modeling capabilities—the ability to generate realistic, physically plausible videos. This could revolutionize applications in robotics, autonomous driving, and scientific simulation. However, before treating these models as world models, we must ask: Do they adhere to physical laws? Current evaluation methods rely on subjective judgments or trajectory matching, limiting their usage for physical reasoning estimation, where many generations could be physically plausible. Thus, we introduceMorpheus, one of the first physicsinformed evaluation frameworks for measuring the ability of video generation models to comprehend Newtonian dynamics.Morpheusfeatures 130 real-world videos capturing physical phenomena, guided by conservation laws. Using those as conditioning for video generation, we assess physical plausibility leveraging interpretable metrics evaluated with respect to infallible conservation laws known per physical setting, leveraging advances in physics-informed neural networks and vision-language foundation models. Importantly,Morpheustargets controlled Newtonian rigid-body settings to enable quantitative checks. Our findings reveal that even with advanced prompting and video conditioning, contemporary models struggle to encode physical principles despite generating aesthetically pleasing videos.

Abstract:
Modern visual generative models acquire rich visual knowledge through largescale training, yet existing visual representations (such as pixels, latents, or tokens) remain external to the model and cannot directly exploit this knowledge for compact storage or reuse. In this work, we introduce a new visual representation framework that encodes a signal as a function, which is parametrized by low-rank adaptations attached to a frozen visual generative model. Such implicit representations are learned by compressing the visual signal, e.g., an 81-frame video, into a single compact vector, achieving strong perceptual video compression at extremely low bitrates. Beyond basic compression, the functional nature of this representation enables inference-time scaling and control, allowing additional refinement on the compression performance. More broadly, as the implicit representations directly act as a function of the generation process, this suggests a unified framework bridging visual compression and generation.

Abstract:
LowRank Adaptation (LoRA) merging can efficiently combine diverse generative capabilities from multiple trained LoRAs for a diffusion model. However, existing LoRA merging techniques often suffer from severe parameter interference, causing destructive collisions in the shared parameter space. To address this, we propose Subspace Signal Routing (SSR), which resolves interference by routing internal signals instead of performing parameter-space merge. Specifically, SSR first constructs a unified subspace by concatenating candidate LoRAs along the rank dimension. Next, SSR employs an inverse correlation matrix to decorrelate mixed signals within this space. Finally, a directional guide matrix steers these purified signals into their respective task-specific subspaces. We provide a rigorous theoretical analysis proving that SSR aligns with the Ordinary Least Squares (OLS) solution, thereby ensuring mathematical optimality. We utilize the additivity of sufficient statistics to design a streaming algorithm. This enables on-the-fly updates that significantly reduce memory overhead and computation time. Extensive experiments validate that SSR significantly outperforms state-of-the-art methods while maintaining comparable efficiency. The source code will be made publicly available.

Abstract:
Symmetry is central to modern machine learning and physics: invariances and equivariances improve sample efficiency, robustness, and outof-distribution generalization, while symmetry principles guide scientific modeling. Yet for stochastic dynamical systems, the relevant continuous symmetries are rarely known, and symmetry discovery for SDEs has remained essentially unexplored. We introduceLieStoNet, an end-to-end,prior-freeframework for discovering Lie-point symmetries of SDEs directly from spatiotemporal trajectories, without prespecifying symmetry groups, templates, or canonical coordinates. Building on the seminal SDE Lie-symmetry theory of Gaeta and Quintero (1999), which formalizes Lie-point SDE symmetries and their relation to Fokker-Planck symmetries, LieStoNet learns neural surrogates for drift and diffusion from increments, then learns projectable generators by enforcing the SDE determining equations, separately regularizing for closure under Lie brackets, adherence to the Lie algebra axioms (bilinearity, antisymmetry, Jacobi), and a non-redundant independent basis. The surrogate also defines an associated Fokker-Planck equation, enabling optional discovery of its Lie-point symmetries in parallel. Across multiple canonical SDEs with known analytic symmetries, LieStoNet recovers generators consistent with the ground-truth symmetry algebra, providing interpretable symmetry discovery for noisy dynamics.

Abstract:
Symbolic regression discovers interpretable mathematical expressions from data and is central to scientific modeling. Recent neural approaches typically linearize expression trees into token sequences for sequential generation, but this representation weakens access to the underlying hierarchy and makes it difficult to enforce structuredependent constraints. Hybrid neural--evolutionary frameworks further combine neural generators with genetic programming (GP), yet training can be unstable due to distribution mismatch between neural samples and GP-refined elites. We propose GCN-SR, a graph-based symbolic regression framework that generates expressions directly in an explicit tree form. GCN-SR introduces Symbolic Perfect Binary Trees (SPBTs), a fixed-topology scaffold that enables batched tree generation and supports an autoregressive generator based on a Graph Convolutional Network (GCN) while preserving hierarchical structure. To leverage GP refinement without unstable target matching, we further introduce Similarity-Weighted Policy Gradient (SWPG), which uses GP only to construct similarity-weighted reward signals. Experiments on standard symbolic regression benchmarks, together with extensive ablations, show that GCN-SR consistently outperforms strong neural and hybrid baselines.

Abstract:
Spiking neural networks have attracted increasing attention for their energy efficiency, multiplicationfree computation, and sparse event-based processing. In parallel, state space models have emerged as a scalable alternative to transformers for long-range sequence modelling by avoiding quadratic dependence on sequence length. We propose here a spiking heterogeneous harmonic resonate-and-fire state space model (SH^2RFSSM), a second-order spiking SSM for classification and regression on ultra-long sequences. SH^2RFSSM outperforms transformers and first-order SSMs on average while eliminating matrix multiplications, making it highly suitable for resource-constrained applications. Furthermore, we introduce a kernel-based spiking regressor that enables accurate modelling of dependencies in sequences of up to 50k steps. We also observe a reduction in spiking operations and improved performance with heterogeneity and discretisation in harmonic resonate-and-fire neuronal layers. Overall, we evaluate Harmonic Resonate-and Fire layers across 17 diverse datasets, spanning sensors, time series, and classification to long-term forecasting. Our results demonstrate that SH^2RFSSM achieves superior long-range modelling capability with energy efficiency, positioning it as a strong candidate for signal processing on resource-constrained devices for human activity recognition, time series classification, and regression.

Abstract:
Reconstructionbased multimodal anomaly detection is fundamentally challenged by the one-to-many crossmodal mapping problem, where a single 3D feature may correspond to multiple valid RGB appearances, often leading to collapsed reconstructions and degraded detection performance. We proposeCrossmodal Feature Replacer (CFR), a self-supervised ensemble framework that resolves ambiguous crossmodal reconstructions. CFR first learns cyclic mappings between RGB and 3D features while constructing modality-specific and paired memory banks. It then employs an attention-based retrieval network to identify reliable crossmodal candidates. During inference, unreliable reconstructed features are selectively replaced with high-confidence retrieved features, yielding an unambiguous representation for anomaly detection. We evaluate CFR on MVTec 3D-AD and Eyecandies under few-shot settings. Extensive experiments show that CFR consistently outperforms state-of-the-art methods, achieving 92.3 (82.7) 30% AUPRO and 74.2 (75.9) I-AUROC in the 1-shot setting, demonstrating its effectiveness in addressing crossmodal reconstruction ambiguity.

Abstract:
Largescale molecular dynamics simulations are essential in understanding chemical and biological processes, necessitating the accurate and efficient modeling of interatomic interactions. Existing learning-based methods generally are based on message passing mechanisms; they are either not scalable or too coarse to offer accurate modeling. We propose a new message passing framework that can effectively and efficiently model interatomic interactions for simulating large-scale molecular dynamics at full atomic resolution. Specifically, our framework is stacked with a sequence of message passing neural network layers, each realizing the message passing over a distinct and dilated star-structured path. These star-structured paths are constructed progressively along dilated regions to capture the distance-dependent interactions. The crux of our framework is that it resolves the problem of dense interatomic interactions of large-scale atomic systems with sparser and region-based message passing graphs. We evaluate the framework on four benchmarks: MD22 (molecules with 42–370 atoms), Chignolin (a 166-atom protein featuring diverse conformations), the AdK dataset (a protein trajectory with up to 3,000 atoms), and the MISATO dataset (over 10,000 heterogeneous protein-ligand complexes with systems up to 40,000 atoms). Comprehensive evaluations demonstrate that our approach delivers state-of-the-art performance overall across various benchmarks.

Abstract:
3D visual grounding encompassing 3D referring expression comprehension (3DREC) and segmentation (3DRES) requires robust crossmodal representation to achieve fine-grained semantic alignment and precise geometric reasoning. However, most methods employ unimodal pre-trained encoders that transfer visual and linguistic knowledge independently, inducing domain shift and poor cross-modal alignment. Meanwhile, spatial modeling with handcrafted priors limits cross-modal geometric representation, struggling to capture complex object relations due to spectral bias. To address these challenges, we propose Text-aware Feature Tuning (TFT) and Decomposed Spectral Geometry (DSG) to enhance cross-modal semantic representation. Specifically, TFT injects linguistic context into the visual hierarchy to mitigate domain shift and facilitate early cross-modal alignment. DSG employs a learnable Fourier basis and explicitly decomposes pairwise relations into symmetric and antisymmetric spectral components, allowing the model to capture high-frequency geometric details and direction-aware relations for precise spatial reasoning. Extensive experiments on ScanRefer, Nr3D and Sr3D validate the effectiveness of our method, demonstrating state-of-the-art performance with improvements of 2.05% Acc@0.25 for 3DREC and 1.09% mIoU for 3DRES on ScanRefer.

Abstract:
We study causal discovery from soft interventions in the presence of latent confounding. Beyond withinenvironment conditional independences, soft interventions induce cross-environment invariances that can be encoded using an augmented graph with intervention indicator nodes (\mathcalI-AUG). Taking its maximal ancestral graph (MAG) yields the \mathcalI-MAG, which characterizes the interventional Markov equivalence class. Building on this framework, we show that the FCI-inspired learner (\mathcalI-FCI) by Kocaoglu et al. (2019) is sound but not complete: it may output circle endpoints that are nevertheless compelled by the interventional equivalence class. To exploit intervention-node semantics, we propose two complementary methods. First, we introduce an enumeration-based completion procedure that is sound and theoretically complete, but whose worst-case cost depends on the number of MAGs compatible with the partial graph learned by \mathcalI-FCI. Second, we derive a set of additional local orientation rules that provably tighten \mathcalI-FCI without increasing asymptotic complexity. Both methods refine prior outputs in the controlled soft-intervention setting with latent variables.

Abstract:
With the rapid advancement of highfidelity video generation models, robust AI-generated video (AIGV) detection has become increasingly needed. While most AIGV detection methods operate in the decoded pixel domain, we observe that detection in the pixel domain inevitably entangles task-irrelevant semantic information, leading to substantial semantic redundancy and extensive redundant computation, while overlooking free-to-use signals in compressed bitstreams. In particular, motion vectors and residuals directly encode temporal and spatial generative artifacts but remain largely underexplored. To address these issues, we propose a unified framework forSpatio-TemporalREsidual andArtifactMining, namelySTREAM, which enables AIGV detection directly from compressed bitstreams.STREAMleverages I-frames, motion vectors, and residual errors to capture spatiotemporal artifacts that are typically smoothed out by decompression filters. In particular, we design a lightweight network with a motion-guided alignment module and a gated fusion mechanism, enabling adaptive fusion of spatial artifacts and nonlinear temporal dynamics. Extensive experimental results demonstrate thatSTREAMachieves SOTA performance with an mAP of 0.965, with 2.5× faster inference than previous SOTA baselines.

Abstract:
We present IQASpider, the first image quality assessment (IQA) framework that unifies reasoning, grounding, and referring within a LMM-based system for multi-granularity quality understanding. Existing LMM-based IQA methods typically support only partial perception dimensions, \egno, quality description and question answering~(\ieno, reasoning) or pixel-level grounding, largely due to the absence of (i) a unified task-and-data formulation and (ii) effective optimization paradigms for multi-granularity learning. To address these limitations, we formulate a rigorous four-task paradigm covering global and local quality description, pixel-level grounding, and region-level referring. Based on this formulation, we construct a corresponding IQA dataset with a scalable and automatic annotation pipeline, thereby providing a solid foundation for unified multi-granularity learning. To further enable unified perception, we adopts a conflict-free two-stage design that progressively extends textual multi-granularity understanding to pixel-level grounding: (i) the first stage equips the model with fine-grained textual reasoning across multiple IQA tasks, and (ii) the second stage introduces a training-free text-to-point grounding paradigm, which bridges textual semantics and pixel-level perception by mapping token logits to spatial coordinates. Based on these efforts, we achieve IQA-Spider with unified multi-granularity explainable image quality assessment. Extensive experiments across multiple benchmarks demonstrate strong performance, validating the effectiveness and versatility of the proposed formulation and framework. Code and datasets will be released upon acceptance.

Abstract:
VisionLanguage Models (VLMs) inherit social biases from large-scale pretraining data, and these biases can amplify in downstream tasks, leading to systematic performance disparities across sensitive groups. Due to the high training cost and the risk of catastrophic forgetting, recent research has focused more on lightweight \emphtest-time debiasing, aiming to obtain an ideal fair embedding for each query. However, such point-based corrections are often unstable and become notably weaker in multi-class settings, where group structure cannot be adequately captured by a single point. Therefore, we propose W4D, a distributional debiasing framework that reframes fairness as aligning query embedding distributions to group reference distributions under the Wasserstein distance, which provides a geometry-aware notion of discrepancy beyond mean shifts. To make this alignment practical at test-time, W4D introduces probabilistic prompts that induce controlled distributional perturbations and optimizes a Wasserstein-based objective to reduce cross-group disparity while preserving task-relevant semantics. This distributional perspective improves robustness in multi-class debiasing and yields a stronger fairness--utility trade-off across diverse VLM downstream evaluations. Our code is available at https://anonymous.4open.science/r/W4D/.

Abstract:
Multimodal Federated Learning (MMFL) addresses collaborative training across clients with heterogeneous modality configurations, where effective client selection becomes critical under the compounded challenges of modality, distribution, and quantity heterogeneity. Existing selection methods operate within a reactive paradigm, responding to current observations without anticipating how decisions influence future optimization trajectories. This myopic approach leads to suboptimal convergence when training dynamics shift rapidly under severe heterogeneity. We propose FedSSM, which reconceptualizes client selection as a proactive decisionmaking process by predicting training dynamics through decision-aware state space models. The prediction error yields a \emphsurprise signal that quantifies uncertainty and governs adaptive participation budgets and exploration-exploitation trade-offs via counterfactual reasoning over candidate actions. For aggregation, we introduce trust-weighted fusion with modality-specific routing, where surprise calibrates sensitivity to client anomalies. Experiments on four multimodal benchmarks demonstrate that FedSSM achieves 2.5--4.5% accuracy improvements over state-of-the-art methods while reducing communication rounds by over 30%.

Abstract:
Recent classincremental learning (CIL) methods built on large pre-trained vision models have shown that strong performance can be retained even under strict data access constraints. This raises a fundamental question: which properties of pre-trained representations make such recovery possible in the class-incremental setting? In this work, we show that class-level feature statistics play a central role in enabling effective CIL under strong pre-training. When the visual backbone is frozen, maintaining simple class-wise statistics—such as prototypes and low-order distributional information—can recover a substantial fraction of the performance achieved by static joint training across diverse benchmarks. We make this observation explicit through deliberately minimal reference points built on frozen CLIP representations. In particular, we demonstrate that competitive performance can be obtained even without continual training, by performing inference directly from accumulated class-level statistics. Our findings suggest that class-level statistics constitute an important and previously underemphasized component of recent CIL approaches based on pre-trained models, offering a complementary perspective for understanding their strong empirical performance.

Abstract:
Generative Flow Networks (GFlowNets) have emerged as a powerful framework for molecular generation, sampling diverse candidates proportionally to a reward function. However, the vast chemical space necessitates truncating trajectory length, forcing models to treat incomplete molecular fragments as terminal states alongside valid molecules. This conflation distorts the learned distribution by allocating probability mass to chemically meaningless states. We propose LeakGFN, a dualhead architecture that decomposes flow into two components: a chemical head modeling flow over the full chemical space, and a valid head estimating the fraction of flow reaching valid molecules within the truncation boundary. Through this decomposition, the valid head implicitly learns molecular reachability without explicit supervision. We prove that LeakGFN recovers the correct distribution over accessible molecules under mild assumptions. Experiments on five molecular optimization tasks demonstrate consistent improvements over flow matching baselines, achieving state-of-the-art performance on four out of five tasks. Our module integrates as a plug-and-play enhancement into existing frameworks, improving performance on both pocket-conditioned and multi-objective generation tasks.

Abstract:
Fewshot time series forecasting is fundamentally challenged by the scarcity of high-quality training data and the risk of severe overfitting. To address this issue, we propose ReAugment, a reinforcement learning (RL) framework that explicitly learns where and how to augment time series data. ReAugment maintains a zoo of forecasting models and measures prediction diversity across them to identify training samples that are most prone to overfitting. These samples serve as anchor points and are used as inputs to the data augmentation process. We then employ an RL approach to learn transformation policies, using a model zoo-guided reward function to bias the transformed data to overfit-prone regions of the training distribution that are most beneficial for generalization. A key advantage of the RL formulation is that it avoids backpropagating gradients through the forecasting models, thereby mitigating gradient vanishing. Experiments across diverse forecasting architectures demonstrate the effectiveness of ReAugment in both few-shot and standard time series forecasting.

Abstract:
Functional brain network analysis plays an important role in understanding and diagnosing psychiatric disorders. However, current methods struggle with subject variations, impairing the model’s generalization ability to the test set. To address this issue, we propose the Subject Invarianceaware Inverse Graph Contrastive Learning (SI-IGCL) model, which adopts a two-stage paradigm with self-supervised subject-invariant pre-training followed by supervised fine-tuning for identification. During the pre-training phase, we construct an inverse contrastive objective that reshapes the embedding space by repelling intra-subject and attracting inter-subject embeddings to learn subject-invariant representations, with an auxiliary correction term to avoid early optimization plateaus. Meanwhile, we incorporate a structure-preserving reconstruction constraint to preserve discriminative information. Moreover, a Hierarchical Topology Enhanced Transformer (HTET) module is designed to enable multi-level modeling of subject-invariant functional patterns. During the fine-tuning phase, a supervised classifier is integrated to perform psychiatric disorder classification. Extensive experiments demonstrate that our method outperforms all state-of-the-art methods. The code is available at https://anonymous.4open.science/r/SI-IGCL.

Abstract:
Amortized causal inference via Priordata Fitted Networks (PFNs) has emerged as a promising paradigm, enabling zero-shot estimation of causal effects without the need for dataset-specific model tuning. However, the principled effectiveness of unified pre-training across general interventional regimes remains an underexplored question. In this paper, we investigate interventions on subsets of variables within Structural Causal Models (SCMs) and identify a fundamental theoretical limitation of current pre-training approaches. Theoretically, we prove that a single observational SCM induces an exponentially large space of interventional distributions, resulting in a phenomenon we term prior uncoverage. Consequently, this uncoverage yields a mismatch between the learned meta-prior and the true grounding prior, leading to unavoidable posterior inconsistency and estimation bias. To address this, we posit that fine-tuning is a fundamental necessity and propose a target-specific strategy named Point-Wise Interventional Fine-tuning (PWF), enabling the local generalization property. We further scale this approach via Meta-Sampling Fine-tuning (MSF) from a budgeted active learning perspective, thereby achieving uniform generalization on any interventional distribution.

Abstract:
While speculative decoding improves inference throughput for multibatch long-context Large Language Models (LLMs), its efficiency is often limited by a verification bottleneck where Key-Value (KV) cache loading dominates latency. Existing compression methods fail in this regime: static eviction incurs accuracy loss due to saliency shift, while dynamic selection introduces prohibitive computational overhead during the verification path. We propose Dustin, a sparse verification framework designed for long-context speculative decoding. Dustin integrates lookahead signals from the draft model with historical attention from the target model to identify critical tokens with high fidelity across multi-step verification windows. To reduce recomputation latency, this approach further employs a sparse estimation scheme that restricts importance scoring to a minimal subset of attention heads. Evaluations on PG-19 and LongBench with Qwen2.5-72B demonstrate that Dustin achieves a 27.85× speedup in self-attention and a 9.17× end-to-end decoding speedup at a 32k sequence length, all with negligible accuracy degradation.

Abstract:
While iTransformer pioneered general intervariate dependency (IVD) modeling in Transformers for multivariate time series forecasting (MTSF), subsequent research on such universal paradigms has been surprisingly scarce. Through comprehensive analysis, we identify a critical structural inconsistency in Variate Transformers (exemplified by iTransformer): typically capturing inter-variate dependencies via shallow self-attention layers while neglecting the critical requirement for deep-layer IVD modeling, which causes dependency information loss and difficulties in model optimization. To address these limitations, we propose CGTFra, as a general Graph Transformer framework. Specifically, we reconsider existing timestamp-based modeling and introduce a frequency-domain masking and resampling method for periodicity preservation, which serves as a general strategy for input feature enhancement and a substitute for timestamp embeddings. Additionally, CGTFra promotes consistent IVD modeling from two perspectives. Initially, a dynamic graph learning framework is integrated into Transformers to explicitly model IVD in deep network layer. Furthermore, grounded in the Information Bottleneck principle, we further propose a consistency-constrained alignment to learn more robust IVD and temporal feature representations. These three core design philosophies of CGTFra can be integrated into any existing Variate Transformer-based framework, and CGTFra achieves superior predictive performance across 13 long- and short-term datasets with high computational efficiency and desirable interpretability. Code is available at https://anonymous.4open.science/r/CGTFra.

Abstract:
VisionLanguage Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from visual perception or reasoning stages. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while decoupled training provides independent MI-based rewards for visual perception optimization, resolving reward blindness. Experiments on six vision-language reasoning benchmarks demonstrate that MIRL achieves 70.22% average accuracy and successfully surpasses the performance of sampling 16 complete trajectories using only 10 pre-samples with top-6 selection (25% fewer complete trajectories). Our code is available at: https://anonymous.4open.science/r/mirl-main/.

Abstract:
Many methods aim to enhance the performance of visionlanguage models (VLMs) at test time. Among them, transduction has emerged as a promising paradigm due to its strong compatibility and efficiency. However, realistic evaluations often involve highly imbalanced class distributions, which cause performance degradation or even collapse. In this work, we systematically revisit transduction from the perspective of penalized likelihood estimation (PLE), showing that PLE with a KL-divergence anchor term naturally yields an adaptive shrinkage behavior between prior anchors and empirical estimates. From this viewpoint, the brittleness of transductive methods can be attributed to the absence of anchoring mechanism and static modeling of the shrinkage strength. Therefore, we propose Mixture of Von Mises-Fisher Models with Dynamic Shrinkage (MOON). MOON is based on a mixture of von Mises-Fisher distributions to model feature representations on the unit hypersphere. To handle imbalance, MOON dynamically adjusts the shrinkage strength using zero-shot priors at both instance and class levels. Thus, it suppresses unreliable assignments and prevents harmful updates from outlier classes, thereby mitigating negative transfer. MOON is model-agnostic, training-free, and requires no task-specific hyperparameter tuning. Extensive experiments further validate the advantage of MOON in both performance and efficiency.

Abstract:
While the "Scaling Laws" have driven massive success in Computer Vision and NLP through largescale training with massive batch sizes, Reinforcement Learning (RL) has largely failed to benefit from this paradigm. In RL, increasing batch sizes beyond a modest threshold often leads to diminishing returns or performance degradation due to the inherent non-stationarity of the data distribution. In this paper, we challenge the prevailing static view of batch sizes in RL by observing that the degree of non-stationarity is not constant: early training involves rapid behavioral shifts requiring small batches for plasticity, whereas late training approaches a quasi-stationary regime where large batches are essential for high-precision convergence. To leverage this insight, we propose Adaptive Batch Scaling (ABS), a simple yet effective framework that dynamically adjusts the effective batch size based on the stability of the learning process. We introduce Behavioral Divergence, a novel metric that quantifies non-stationarity by measuring action-level shifts between policy updates, and use it to scale the batch size inversely to the policy's volatility. By integrating ABS with the Parallelised Q-Network (PQN) algorithm, we demonstrate on the ALE benchmark that our method synergizes early-stage model plasticity with late-stage accurate and stable convergence. Our empirical results show that ABS not only yields substantial performance improvements over static baselines but also successfully scales to larger network architectures, offering a foundational step toward bridging the scaling gap between RL and supervised learning.

Abstract:
Inferring molecular structures from multimodal spectroscopic measurements requires integrating complementary yet highly heterogeneous signals. However, the common paradigm of directly concatenating multispectral sequences can exhibit anomalous performance degradation, primarily due to pronounced heterogeneity and the resulting multimodal imbalance across modalities. As a remedy, we propose MMSpectrum, a sparse Mixture-of-Experts framework tailored for multimodal multispectral spectra-to-structure elucidation. To better match the information characteristics under multispectral imbalance, MM-Spectrum introduces an explicit modality-aware routing mechanism that exposes spectral identity to the router in addition to token content representations. Moreover, it incorporates shared and interaction experts, together with heterogeneous expert capacities, to extract multispectral modality-unique and cross-modal synergistic information while suppressing noise-induced interference. Across full-modality, bimodal, and missing-modality settings on molecular structural elucidation, MM-Spectrum achieves consistent and substantial improvements, supported by ablation studies and interpretability analyses. Code is available at https://anonymous.4open.science/r/MM-Spectrum-1727.

Abstract:
Unsupervised Camouflaged Object Detection (UCOD) aims to identify objects concealed in their surroundings without relying on pixellevel labels. Existing methods rely solely on simple post-processing of DINO high-dimensional features to generate pseudo labels for training. However, these methods suffer from two major limitations: 1) pseudo labels they easily generate contain excessive noise, causing the model to learn substantial incorrect information. 2) Although pseudo-label supervision allows the model to understand the task, it remains insufficient for generating fine-grained segmentation of the camouflaged objects. To address these issues, we propose DualUCOD, a novel UCOD method based on dual-branch contrastive learning that effectively detects camouflaged objects without pixel-level labels. Specifically, we propose the Dual-Eigenvector Spectral Pseudo-Labeling (DESPL) strategy, which fuses semantic and color cues into an affinity matrix. We then compute the eigenvectors of its normalized graph Laplacian and generate high-quality pseudo-labels using these eigenvectors. Furthermore, we introduce a Boundary-Guided Foreground-Background Refinement (BGFBR) module that explicitly incorporates boundary information to improve segmentation accuracy. Finally, we introduce a Dual-Branch Contrastive Learning (DBCL) module that constructs positive and negative pairs from the original and augmented images, aligning positive representations while contrasting them against negatives to enhance camouflaged object understanding. Extensive experiments demonstrate that DualUCOD outperforms state-of-the-art methods on different datasets in the unsupervised setting.

Abstract:
Diffusion models (DMs) are a powerful framework for image generation and restoration. However, existing DMs are primarily trained in a supervised manner by using a large corpus of clean images. This poses fundamental challenges in many realworld scenarios, where acquiring noise-free data is hard or infeasible. While some methods are capable of training DMs using noisy data, they are effective only when the amount of noise is very mild or when additional noise-free data is available. In addition, existing methods for training DMs from incomplete measurements require access to multiple complementary acquisition processes, a significant practical limitation. Here we introduce the first approach for learning DMs for image restoration using only noisy measurement data from a single operator. First, we show that DMs, and more broadly minimum mean squared error denoisers, exhibit a weak form of scale equivariance linking rescaling in signal amplitude to changes in noise intensity. We then leverage this theoretical insight to develop a denoising score-matching strategy that generalizes robustly to noise levels below the training data, thereby enabling the learning of DMs from noisy measurements. For problems involving measurements both noisy and incomplete, we integrate our method with equivariant imaging, a complementary self-supervised learning framework that exploits the inherent invariants of imaging problems. This allows training DMs for image restoration from single-operator noisy measurements. We validate the effectiveness of our approach through extensive experiments on image denoising, demosaicing, inpainting, and MRI reconstruction along with comparisons with the state of the art.

Abstract:
Scanning tunneling microscopy (STM) enables precise manipulation of individual atoms and molecules, offering a pathway to constructing nanoscale assemblies with rich quantum mechanical behavior. Despite its potential, STMbased fabrication remains limited by the inherent complexity of manipulation procedures and the extensive manual effort required. In this work, we take a substantial step toward autonomous manufacturing with STMs by introducing a novel AI-based planning framework for molecular assembly and a high-fidelity simulation environment. Our framework computes collision-free assembly plans that minimize the total distance traveled by molecules. Given an assignment of molecules to target positions, satisfiability solving is used to compute execution schedules in which each molecule has an empty corridor available when it is scheduled to move. Reinforcement learning (RL) agents then execute sequences of STM actions to manipulate molecules to their targets. We further introduce NanoAssemblyGym, a high-fidelity simulation environment for molecular manipulation built on the Gymnasium API, allowing seamless integration with existing RL libraries and workflows. Using NanoAssemblyGym, we demonstrate autonomous assembly of structures containing up to 420 molecules.

Abstract:
We study unsupervised domain adaptation (UDA) where measuring crossdomain discrepancy is critical. Most UDA approaches fix a single f-divergence a priori, which can be suboptimal across heterogeneous shifts. We propose a framework that (i) tightens the variational lower bound of an f-divergence by inserting a learnable, monotone L-Lipschitz transform \tau (Tighter-VR), and (ii) selects the divergence family adaptively from data via a likelihood-based criterion. The resulting estimator yields more informative and statistically efficient discrepancy estimates while recovering prior fixed-divergence methods as special cases. Theoretically, we derive a target-risk bound whose three components are a transformed source risk, a Tighter-VR discrepancy between domains, and an ideal-hypothesis residual; we further provide finite-sample guarantees using standard complexity measures. Empirically, on Office-31, Office-Home, Digits, and VisDA-2017, our method consistently improves accuracy over strong baselines, demonstrating that coupling Tighter-VR with adaptive divergence selection yields tangible gains in UDA.

Abstract:
Threedimensional (3D) reasoning segmentation aims to segment target objects based on text instructions and 3D spatial cues. Recent efforts in 3D reasoning leverage Multimodal Large Language Models (MLLMs) to bridge the gap between text and 3D data. However, since MLLMs are primarily trained on text-image pairs, directly adapting them to unstructured 3D point clouds often fails to capture implicit semantic intent and reliably localize objects. This paper introduces TVDRNet to address these challenges. Inspired by Active Vision theory, where humans selectively choose optimal viewpoints to better observe targets, TVDRNet employs a differentiable renderer to simulate this active process in 3D perception. By using text instructions as supervision to optimize intrinsic and extrinsic rendering parameters, the TVDRNet identifies the optimal viewpoints for observing the 3D scene, and therefore learning 'where to look' based on what the text instruction 'asked to find'. This process generates informative, task-relevant 2D images that are compatible with MLLMs. TVDRNet comprises: (1) the AVPL module, establishing a learnable mapping from semantics to optimal rendering parameters; and (2) the MGL module, fusing multi-modalities via semantic grouping to guide mask generation. Experiments show TVDRNet achieves SOTA performance in 3D reasoning segmentation (Reason3D, Instruct3D) and 3D visual grounding (ScanRefer).

Abstract:
Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck: transmitting highdimensional Key-Value (KV) caches often dominates time-to-first-token (TTFT). Moreover, reusing caches across heterogeneous models (e.g., base and fine-tuned variants) causes semantic misalignment that accumulates over layers, degrading generation quality. We propose Semantic Cache Distillation (SCD), a loss-constrained framework that replaces raw KV transmission with compact semantic codes. SCD addresses these challenges via two mechanisms: (1) \textscReuse, which reconstructs most layers from low-rank subspaces to minimize transfer cost, and (2) \textscPatch, which predicts normalized inputs at sparse transition layers to truncate error propagation. Empirically, SCD reduces data transfer by up to 2.65× and outperforms quantization and selective recomputation baselines in bandwidth-constrained regimes, maintaining generation quality within 5% of the oracle.

Abstract:
Multitarget domain adaptation (MTDA) trains a model using a labeled source domain and several unlabeled target domains, aiming to enhance performance across all targets. However, existing methods lack a principled causal formulation and often rely on empirical domain-invariance enforcement, which can bias adaptation across targets. To fill this gap, we propose the Unbiased, Unconfounding, and Unified Causal Framework (U^3CF) for MTDA. To unify align multiple domains, we propose a prototype-driven alignment strategy that progressively updates prototypes by high-confidence target predictions, while the contrastive optimization objective jointly aligns target samples to semantic prototypes and preserves class discrimination. By formulating a structural causal model, we reveal that domain-invariant causal factors and domain-specific factors shape representations and labels, while the latter induces spurious label correlations across targets. Accordingly, U^3CF achieves unbiased prediction by disentangling representations into invariant causal components and domain-specific confounders and applying conditional intervention to block confounding effects while preserving invariant semantics. To ensure precise disentanglement, we leverage mutual information theory to derive a principled criterion for feature separation. Extensive experiments on four benchmarks demonstrate that U^3CF consistently outperforms leading methods.

Abstract:
Visual Contrastive Decoding (VCD) mitigates hallucinations in Multimodal Large Language Models (MLLMs) by penalizing the output shift from noiseperturbed images, assuming this shift captures the hallucination direction. We prove this assumption flawed: noise-induced drift in Language-Image Pretrained (LIP) encoders is a \emphcoupled vector entangling (i) structural degradation from corrupted visual information with (ii) hallucination induction from linguistic prior activation. VCD's indiscriminate penalty inevitably suppresses valid visual semantics. Our key insight is that Self-Supervised Learning (SSL) encoders exhibit \emphonly structural degradation under noise—geometrically orthogonal to hallucination paths—enabling principled disentanglement via LIP--SSL differential response. We propose Disentangled Visual Rectification (DVR), a training-free dual-stream framework performing visual-layer rectification and decoding-layer contrast on purified representations. DVR achieves approximately 5× theoretical error reduction over VCD and establishes SOTA performance on POPE, MME, LLaVA-Bench and CHAIR benchmarks.

Abstract:
Large language models (LLMs) and visionlanguage models (VLMs) have emerged as efficient annotators for tasks such as generation and classification. While these models offer significant cost and speed advantages over human annotation, a critical challenge remains: existing self-evaluation methods, such as LLM-as-judge, often lack reliable calibration signals for error detection. We address this limitation by introducing SAGE (Semantic-Anchored JudGmEnt), a method that leverages semantically similar samples retrieved via k-nearest-neighbor as references for annotation verification. We provide a theoretical framework that derives a closed-form expression for the error detection AUROC, which can be decomposed into three factors: intrinsic separability, reference-induced mean shift, and noise reduction through averaging. This decomposition reveals when semantic neighbors help (when references are both semantically matched and correct) and why (by providing calibration signals that raise scores for correct annotations and lower scores for incorrect ones). Experiments on LLM generation, VLM captioning, and classification tasks validate our theoretical framework: SAGE improves error detection when semantic neighbors provide reliable calibration signals, and our decomposition offers insights into when direct scoring or alternative strategies may be preferred.

Abstract:
Literature reviews are essential to reflect the landscape of research fields. Large language models, especially deep research agents, have recently shown strong capabilities in automated literature review generation. However, it remains a challenging task to rigorously evaluate the scientific value of the generated reviews, since human expert annotations are difficult to scale up and LLMas-a-judge approaches lack of a convincing criteria. To address this gap, we introduce LitReview Arena, a battle-style evaluation platform with a structured protocol tailored to literature review quality. Our protocol imitates academic peer review by recruiting domain experts with research paper-writing experience, and we match each query to reviewers within their expertise. Each battle is judged with dimension-wise outcomes over five literature-review-specific criteria, enabling reproducible and diagnostic comparisons across systems. We collect a large-scale human preference dataset of expert votes (4984 votes×5 dimensions) and systematically measure how far current models are from human drafts. Results show that the most advanced models win only 23.0% of decisive matches against humans on overall utility, leaving substantial room for improvement. Meanwhile, agentic LLMs, such as Sonar Deep Research, substantially outperform base language models by over 60%. We also find that existing LLM-as-a-judge evaluation methods are severely misaligned with human experts (Spearman's \rho \approx 0.467). Based on the collected preference data, we provide an expert-calibrated evaluator, \emphLitJudge, improving alignment to \rho \approx 0.78, comparable to inter-expert consistency. Codes and datasets are publicly available at https://anonymous.4open.science/r/LitReview-Arena-3B82/.

Abstract:
Diffusionbased text-to-image (T2I) models have demonstrated remarkable advancements in generating high-quality images. However, while real-world applications like product packaging and logo design necessitate synthesis within irregular geometries, existing methods struggle to handle such constraints. Therefore, generating complete pictures that conform to arbitrary-shaped canvas constraints while maintaining semantic integrity remains a significant challenge. To address this, we introduce AnyCanvas, a training-free framework that leverages a Mask-to-Potential Field paradigm to convert binary masks into a differentiable potential field, which guides content to naturally converge within target regions. Extensive experiments demonstrate that AnyCanvas achieves 4.23% higher spatial adherence to user-specified constraints while maintaining 99.45% of the semantic fidelity measured by CLIP score, leading to a superior harmonic mean of spatial and semantic metrics. AnyCanvas also exhibits robust generalizability across different model backbones and versatile spatial control objectives.

Abstract:
Reinforcement learning with diffusion models has shown strong potential, but existing approaches such as variants of Direct Preference Optimization (DPO) often rely on an inaccurate simplification: they equate trajectory likelihoods with finalstate probabilities. This mismatch leads to suboptimal alignment. We address this limitation with a principled framework that leverages the optimal value function as the return for short trajectory segments. Our approach follows a two-stage procedure: (i) learning a value-distribution function to estimate segment-level returns, and (ii) applying our VRPO to refine the score function. We prove that, under sufficient model capacity, the resulting model is equivalent to training a diffusion process on the tilted distribution proportional to p(x)\exp(\eta r(x)). Experiments on large-scale diffusion models validate our analysis and show stable and consistent improvements over prior methods.

Abstract:
Despite their remarkable general capabilities, Large Language Models (LLMs) struggle with the precise grounding required for embodied task planning. To bridge this gap, neurosymbolic approaches have emerged, leveraging action languages like BC+ for their formal expressiveness and reasoning flexibility. However, prior methods that naively couple LLMs with BC+ typically depend on one-shot program generation, which is brittle in dynamic environments and prone to sequential omission and causal inconsistency. To address these limitations, we propose DecoVer, a Decompose-and-Verify neuro-symbolic framework that systematically adapts BC+ to embodied task planning. Specifically, DecoVer employs a cascading decomposition strategy to partition complex knowledge into hierarchical subspaces and integrates a dual verification mechanism for syntactic and executable correctness. Extensive experiments demonstrate that DecoVer consistently outperforms LLM-based baselines across the majority of evaluation metrics, achieving a 12.9% success rate gain over the highly capable Gemini-3-Pro-Preview and a 60.9% improvement over GPT-5.1 on logically complex test cases.

Abstract:
Sampling from highdimensional, non-log-concave distributions with unnormalized densities constitutes a fundamental challenge in machine learning, particularly when gradient information is inaccessible or computationally prohibitive. While Langevin dynamics provides a robust mechanism for gradient-based sampling, its extension to the derivative-free setting is frequently compromised by high variance and a lack of rigorous convergence guarantees in non-convex landscapes. In this work, we propose a principled variance-reduced zeroth-order Langevin dynamics framework that addresses these limitations for both general non-log-concave black-box sampling and inverse problems utilizing pre-trained score-based generative priors. We introduce a novel gradient estimator that significantly mitigates the variance inherent in traditional zeroth-order methods, enabling stable navigation of complex, multimodal posterior distributions. Theoretically, we establish the first non-asymptotic complexity bounds for this class of algorithms, proving convergence to the target distribution in terms of \varepsilon-relative Fisher information, and, under a Poincare inequality, squared total variation distance, specifically for non-log-concave densities. We empirically validate our framework, demonstrating superior mixing and sampling accuracy on standard black-box benchmarks and establishing state-of-the-art performance for derivative-free linear and nonlinear inverse problems.

Abstract:
The transition from singleturn models to Multi-Agent Systems (MAS) promises enhanced problem-solving capabilities, yet the centralized orchestration topology remains a critical point of fragility. To analyze this, we propose a Mean-Field Entropy Dynamics framework, modeling the orchestration process as a system governed by the competing forces of task resolution and cumulative context loading. To facilitate high-resolution validation, we introduce Inverse Workflow Generation (IWG), a multi-agent pipeline that synthesizes process-verifiable, high-complexity benchmarks with dense intermediate checkpoints. We demonstrate that our entropy dynamics model fits empirical trajectories, providing physically interpretable parameters that quantify system stability and performance collapse. Crucially, our analysis uncovers a ``Reasoning Trap": while reasoning-heavy models excel in isolated tasks, they frequently fail as orchestrators due to context squeezing. By elucidating the physical mechanisms underlying the Orchestrator and quantifying systemic uncertainty, our findings offer insights for the architectural design development of Multi-Agent Systems in prospective research.

Abstract:
Graph neural networks (GNNs) excel at aggregating neighbor information for classification, yet their performance is hindered by graph structural entanglement, where spurious correlations from semantically irrelevant neighbors contaminate node embeddings. This challenge is most acute for nodes near class boundaries in the embedding space, where amplified structural noise blurs decision boundaries and destabilizes predictions. Existing robust GNN methods largely treat all nodes uniformly, ignoring boundary vulnerabilities. In this paper, to improve classification performance, we tackle graph structural disentanglement by identifying boundaryregion entanglement as the primary bottleneck and propose Boundary Embedding Shaping (BES), an adaptive contrastive learning GNN plug-in module that selectively suppresses spurious structural noise at decision boundaries with minimal model parameter perturbation. Extensive experiments demonstrate that BES consistently improves boundary discrimination and outperforms existing leading methods. Notably, BES boosts GCN performance by an average of 3.3% in node classification (up to 5.0% on WikiCS) and achieves superior accuracy in link prediction with gains most pronounced for boundary nodes.

Abstract:
The accelerating realism of AIgenerated content has amplified the spread of deceptive information and eroded public trust. Prior works typically split the problem into two tracks, media authenticity, which concerns whether content is real or AI-generated, and content veracity, which concerns semantic and factual correctness, thereby missing their joint effects in practice. We present FakeWorld 1.0, which deeply fuses these two orthogonal axes into a unified omni-modal benchmark. Along the media axis, FakeWorld spans text, audio, image, and video synthesis; along the content axis, it instantiates cross-modal semantic inconsistencies and factual errors. These axes are jointly instantiated within realistic web-based and streaming-style presentation scenarios, reflecting how multimodal deception is composed and delivered in real-world settings. FakeWorld provides explainable annotations in the form of per-instance rationales, enabling transparent, evidence-based diagnosis. Under a unified protocol, our evaluation of open- and closed-source MLLMs exposes capacity limits and highlights FakeWorld’s effectiveness at surfacing mixed-source, high-fidelity deception. Beyond the benchmark, we introduce OmniCheck, a unified omni-modal agentic workflow that performs explainable detection across both axes and outputs evidence-backed reports. We aim for FakeWorld 1.0 to serve as a realistic stress test and a practical foundation for building future systems that enable scalable, explainable detection of fake multimodal content.

Abstract:
We derive algorithms for smooth nonconvex nonconcave minimax optimization and establish linear convergence rates for problems that satisfy the twosided Polyak-Lojasiewicz (PL) inequality. At the core of our approach is the observation that Lyapunov functions can be used not only to certify convergence a posteriori, but also to design algorithms. By replacing an idealized, intractable Lyapunov function with a computable surrogate based on gradient information, we derive TALDA (Tri-Action Lyapunov Descent Ascent), a single-loop algorithm that enforces Lyapunov descent by construction. TALDA guarantees linear convergence under the two-sided PL condition, with a rate that depends explicitly on the cross-smoothness constant. This recovers existing worst-case guarantees while yielding sharper convergence rates in weakly coupled min–max problems.

Abstract:
Iterative selftraining (self-distillation) repeatedly refits a model on pseudo-labels generated by its own predictions. We study this procedure in overparameterized linear regression: an initial estimator is trained on noisy labels, and each subsequent iterate is trained on fresh covariates with noiseless pseudo-labels from the previous model. In the high-dimensional regime, we derive deterministic-equivalent recursions for the prediction risk and effective noise across iterations, and prove that the empirical quantities concentrate sharply around these limits. The recursion separates two competing forces: a systematic component that grows with iteration due to progressive \emphsignal forgetting, and a stochastic component that decays due to \emphdenoising via repeated data-dependent projections. Their interaction yields a U-shaped test-risk curve and an optimal early-stopping time. In spiked covariance models, iteration further acts as an iteration-dependent spectral filter that preserves strong eigendirections while suppressing weaker ones, inducing an implicit form of soft feature selection distinct from ridge regression. Finally, we propose an \emphiterated generalized cross-validation criterion and prove its uniform consistency for estimating the risk along the self-training trajectory, enabling fully data-driven selection of the stopping time and regularization. Experiments on synthetic covariances validate the theory and illustrate the predicted denoising–forgetting trade-off.

Abstract:
Adversarial images pose a severe security threat to multimodal large language models through prompt injection. Existing defenses largely lack a principled understanding of the underlying mechanisms and struggle to balance efficiency and fidelity. In this work, we show that successful adversarial attacks do not rely on the entire image uniformly but instead depend on a small subset of critical image tokens. Based on this insight, we propose a defense that first localizes these critical tokens via gradient analysis and then neutralizes them through masking. We show that attribution based on output probabilities fails when adversarial attacks preserve the predicted token. To overcome this limitation, we introduce the HiddenState Gradient Norm score for adversarial behavior attribution and prove that its ranking is consistent with that of the full adversarial loss gradient, providing a theoretical guarantee for accurate localization. GTM requires only a single forward–backward pass to identify and zero out a small number of high-scoring tokens, effectively disrupting the adversarial attack path. Extensive experiments on prompt injection and multimodal jailbreak attacks demonstrate that our approach reduces attack success rates (ASR) to near zero while preserving model utility with negligible computational overhead.

Abstract:
Evaluating longcontext language models on natural language conflates architectural capacity to capture dependencies with semantic knowledge and vocabulary statistics. When models fail at long contexts, we cannot determine whether failures stem from fundamental architectural limitations or insufficient domain knowledge, preventing clean diagnosis of efficient architectures before expensive training on real data. We introduceL-CUBE(Long-Context Utilization Benchmark), a synthetic benchmark that isolates dependency-capturing capacity from semantic knowledge through hierarchical Gaussian sequences with controllable bipartite mutual information scaling. The generator provides exact ground-truth conditionals that scale efficiently to arbitrarily long sequences, enabling unconfounded evaluation via conditional KL divergence rather than perplexity alone. We definelong-context utilizationto measure the amount of available predictive information that models extract as context grows. Experiments across transformers, state space models, and efficient alternatives validate L²M capacity theory predictions and uncover new phenomena. L-CUBE enables practitioners to test whether a particular design will maintain long-context capability at target sequence lengths before committing to real-data training.

Abstract:
Magnitudebased stability proxies such as parameter drift are widely used in narrow-task fine-tuning, yet they do not reliably indicate degradation of broad capabilities. We identify trajectory lock-in: under fixed training conditions for narrow adaptation, the joint evolution of task loss and broad generalization collapses onto a shared low-dimensional degradation curve, so many stabilizers primarily change the rate of progress along this curve rather than altering the curve itself. This yields a drift paradox, in which comparable Euclidean displacement can still correspond to divergent generalization outcomes. To diagnose the underlying structure, we introduce objective-agnostic geometric probes that track the effective update subspace, together with an online harm signal that reflects curvature-dominated channeling toward directions associated with broad degradation. Finally, we show that escaping lock-in requires a spectral bifurcation, namely a qualitative reorientation of the update subspace toward softer curvature modes, thereby improving broad generalization while maintaining matched training performance. We validate these findings across model scales and modalities in narrow-task settings, and report practical deployment procedures and overhead measurements.

Abstract:
Precise, trainingfree editing of text-to-image diffusion models requires balancing alignment (faithful attribute manifestation), consistency (preserving non-target content), and quality (artifact-free textures). Sparse autoencoder (SAE) steering offers interpretable, smooth ``slider-like'' control by manipulating SAE feature activations derived from the text encoder; however, existing approaches rely on heuristic feature selection and manual tuning of the steering strength, leading to suboptimal trade-offs among the three objectives. We propose Sparse Relaxed-Lasso Steering (SRLS), which casts steering-vector discovery as a convex sparse recovery problem. Exploiting the affine structure of the SAE decoder, SRLS automatically identifies sparse, generalizable support sets via a Lasso objective, and then debiases the coefficients using support-restricted ridge regression. We further select the optimal steering strength using Bayesian optimization. Experiments across diverse attributes and subjects show that SRLS generally improves over other methods, yielding a better balance among alignment, consistency, and quality.

Abstract:
Large VisionLanguage Models (LVLMs) exhibit remarkable general capabilities but struggle significantly with spatial reasoning tasks. In this paper, we uncover a critical representation-output misalignment via linear probing: LVLMs correctly encode spatial features internally, but generate incorrect results in the final text. To address this, we pioneer the Inference-time Geometric Manifold Adaptation paradigm and proposeGRASP(GeometricRectification forActiveSpatialPerception), a training-free framework to awaken these latent capabilities. GRASP employs Manifold Differential Search to identify optimal geometric counterfactuals, which then drive a dual-level rectification mechanism: Implicit Trajectory Correction to rectify attenuated intrinsic geometric features in intermediate decoder layers, and Explicit Distribution Alignment to break the dominance of language priors at the output layer. Extensive experiments spanning diverse architectures (LLaVA, Qwen 2.5/3-VL) and positional encoding paradigms (1D APE, 2D/3D RoPE) across image and video benchmarks (WhatsUp, VSR, VSI-Bench) demonstrate that GRASP significantly mitigates spatial hallucinations without parameter updates, achieving accuracy gains of up to 26.1% on image benchmarks and 9.7% on video reasoning tasks, consistently outperforming baseline methods.

Abstract:
Understanding what generative models retain from training data remains challenging, with implications for copyright and privacy. This question becomes particularly relevant as Rectified Flows power increasingly deployed systems. We analyze the interpolation path X_\lambda = (1\lambda)X_0 + \lambda X_1 that defines Rectified Flow training. We show that train-test distinguishability follows a bell-shaped curve over \lambda, with a maximum whose location we derive in closed form under Gaussian assumptions. This signal accumulates during training while validation metrics remain stable. We validate these predictions on both audio and images, and show that the bell-shaped structure is universal while the peak prediction holds when our assumptions are satisfied. As a proof of concept, we implement a membership inference attack achieving 0.91 AUC by exploiting this \lambda-resolved structure.

Abstract:
Testtime scaling methods such as majority vote aggregation and iterative refinement (e.g., self-reflection or multi-agent inference) improve reasoning performance by leveraging multiple solution samples. However, their efficacy depends not only on raw performance, but critically on the distribution of errors across samples. When errors concentrate, (a) aggregation accuracy degrades, as the majority vote may select a shared mistake, and (b) confidence in common mistakes may suppress exploration in iterative refinement. We argue that improving correctness alone is not sufficient to mitigate these issues; to this end, we propose to explicitly shape error distributions to improve aggregation. First, we introduce a theoretically grounded diverse failure reward that incentivizes calibrated disagreement within model errors. We prove that this reward directly optimizes majority-vote accuracy: policies achieving higher reward attain higher expected majority-vote performance, and vice versa. We further show that this theoretical property generalizes to iterative refinement. Second, we introduce anti-votes, in which the model predicts the most common mistake alongside its solution, allowing probability mass on dominant errors to be explicitly reweighted. We identify conditions under which anti-votes are guaranteed to improve majority-vote accuracy. Empirically, across three model families of varying sizes and four benchmarks, we show that both approaches substantially improve majority vote and iterative refinement performance without degrading single-sample accuracy.

Abstract:
Custom diffusion models (CDMs) have demonstrated impressive success in visual personalization tasks by enabling the generation of userspecific concepts. However, existing CDMs typically assume that personalized concepts are static and rely on costly model merging or sequential updates that are prone to catastrophic forgetting as new concepts are introduced. To address these limitations, we propose a Submodular Concept Neuron Selection method (SCNS), to solve CDMs with continual personalized concepts, which formulates continual personalization as a constrained submodular optimization problem to select a minimal yet sufficient set of concept-specific neurons under diminishing returns. SCNS combines a Facility Location-based coverage objective to suppress semantic redundancy, a Fisher-weighted risk proxy to protect previously learned concepts, and a cost-aware greedy rule to balance stability and plasticity with extreme sparsity. Extensive experiments demonstrate that SCNS achieves state-of-the-art performance in image alignment and anti-forgetting, while enabling fusion-free continual personalization by modifying only 0.41% of the total parameters for each concept on average. Our implementation is available at SCNS.

Abstract:
LLM applications increasingly execute as endto-end inference pipelines that couple generation with retrieval, stateful memory, context refinement, and tool use under strict tail-latency and SLO constraints. Today, these stages are often stitched together as RPC-connected services, obscuring cross-stage queueing and interference and limiting pipeline-level compilation and resource sharing. We present SAGE (Streaming-Augmented Generative Execution), a full-stack system that treats inference pipelines as first-class compilation targets. SAGE exposes pipelines as declarative dataflows and compiles them into distributed execution plans with bounded-queue backpressure. It integrates vector search, streaming semantic state, structured memory, and refinement as operators with explicit resource/state contracts, enabling operator-level diagnosis of tail behavior. SAGE integrates pluggable generation and embedding backends and provides a unified control plane for engine management, batching, and admission under mixed workloads. On a 16-node cluster, SAGE sustains 16 requests/s at >700 tokens/request with 1 ms median scheduling overhead, and achieves near-linear scale-out to 16 nodes (11.4× throughput at 16 nodes), and reduces p99 latency by 57% under multi-pipeline contention versus simultaneous admission.

Abstract:
Federated Adversarial Learning (FAL) enhances model robustness by integrating adversarial training into the federated learning framework. Despite recent advances proposing efficient FAL algorithms, existing work has mainly focused on convergence properties, with limited understanding of their generalization capabilities. To address this, we propose the first unified theoretical analysis of FAL generalization through the lens of algorithmic stability. We first analyze general FAL algorithms based on stochastic gradient descent and derive perturbationdependent generalization bounds, which reveal that stronger adversarial attacks can lead to degraded generalization. To mitigate the impact of adversarial perturbations, we further leverage Moreau envelope optimization, deriving a perturbation-independent bound that enhances the robustness and generalization of the federated model. Finally, we extend our analysis to the practical black-box setting, demonstrating that zeroth-order optimization techniques can effectively maintain both robustness and generalization even without local gradient access.

Abstract:
Large language models (LLMs) demonstrate strong reasoning abilities via Chainof-Thought (CoT), but their token-level generation encourages local decisions and lacks global planning, often leading to redundant or inaccurate reasoning. Existing methods, such as tree-based search and reinforcement learning (RL), attempt to address this issue but incur high computational costs and still struggle to produce reliable reasoning trajectories. To address these challenges, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to jointly improve high-level planning and fine-grained CoT reasoning. Specifically, in the first stage, a given LLM is responsible for summarizing CoT reasoning into compact high-level guidance, which is then leveraged for supervised fine-tuning. Then, we introduce a guidance-aware reinforcement learning method that jointly optimizes the final output and the quality of guidance, enhancing reasoning effectiveness. We evaluate PTA-GRPO on ten reasoning benchmarks across mathematics and natural sciences, using five diverse base models spanning multiple data modalities. The results show that PTA-GRPO consistently delivers stable and significant improvements across models and tasks, demonstrating strong effectiveness and generalization.

Abstract:
Large language models (LLMs) are increasingly used as planners for cooperative embodied agents, but multiagent settings amplify inconsistency under partial observability and make explicit communication costly or even unavailable. Many existing approaches rely on online message passing; when communication is removed, agents often fall back to independent local planning that suffers from tacit miscoordination. We introduce Contextual Tacit Communication, a training-free protocol that aligns decentralized decisions with a joint LLM value score without explicit message actions. Our method measures context-conditioned value rectifications via residual banding to pinpoint miscoordination actions and amortizes the resulting coordination signals into a retrieval-augmented Tacit Rule Memory that provides prompt-level cooperation rules at execution time. Experiments on VIKI, C-WAH, and TDW-MAT show that our approach improves cooperation performance over baselines while reducing runtime overhead compared with communication-based methods.

Abstract:
Saliency maps are widely used to interpret image classification models and build trust in their predictions; however, their reliability remains a central concern, as randomized networks can produce saliency maps that closely resemble those of trained models. We identify a previously underappreciated architectural contributor to this phenomenon: acenterfocused saliency biasinduced by common convolutional design choices. Through controlled ablations, we show that this bias arises from architectural components such as zero padding and receptive field growth, and persists even in randomly initialized convolutional neural networks (CNNs) and under randomized inputs. In contrast, this behavior is largely absent in non-convolutional architectures such as Vision Transformers (ViTs) and multilayer perceptrons (MLPs). To investigate the interaction between architectural priors and learning, we introduce a corner-shift benchmark and a Center-Shift Index that quantify how saliency redistributes under object relocation. We show that training can partially shift saliency toward object regions, while randomized models remain dominated by architectural priors, helping explain the previously observed similarity between trained and random saliency maps and clarify how architectural priors can confound standard saliency evaluations.

Abstract:
How should the sequence of hidden states produced during autoregressive generation be compressed into a representation that reflects the model’s internal state? We study representations derived from generated tokens and compare them to grounded embeddings across several domains. We find that pooling embeddings across tokens produces more informative representations than any individual token. This observation is consistent with semantic information being distributed across generated tokens rather than localized to a single position. In this setting, alignment provides a way to study how a model’s internal representations evolve and pooling offers a more reliable summary of the model's state across generation.

Abstract:
Recent textto-video diffusion models can synthesize visually compelling clips from natural language prompts. However, practical applications increasingly demand long-form videos with evolving narratives and persistent identity. A common solution is autoregressive generation, where the video is produced clip by clip over long horizons, yet coherence often degrades as errors compound. In this work, we study long-video generation under an autoregressive setting, where videos are synthesized clip by clip over long horizons. Despite strong short-clip quality, existing approaches often suffer from semantic drift, motion decay, and appearance instability as the sequence grows. We present DynaMem, a unified framework that improves long-horizon coherence via three components: Semantic-Adaptive Hierarchical Memory for long-range semantic preservation, Dynamics-Prioritized Optimization for motion-coherent learning, and Reference-Anchored Perceptual Alignment for stabilizing appearance. Extensive experiments show that DynaMem produces more consistent semantics, stronger temporal dynamics, and more stable appearance on long videos compared to competitive baselines.

Abstract:
Large language models (LLMs) are increasingly used as decision modules, yet existing benchmarks provide limited coverage of longhorizon, adversarial interaction while faithfully acting on human instructions. We introduce a long-horizon Red Alert RTS benchmark with a hierarchical interface in which LLMs output budgeted, low-frequency macro/tactical intents that are executed deterministically for standardized comparison. The benchmark evaluates (i) robustness to ``rules-as-variable'' perturbations via rule-style shifts , (ii) competitive strength via Elo-style ratings from head-to-head matches, and (iii) human steerability via standardized language interventions. Beyond win/loss, we log economy growth/spending, combat loss ratio, and visibility coverage to diagnose long-horizon failure modes. Overall, the benchmark provides a reproducible and diagnostic testbed for robustness and controllability in long-horizon adversarial decision making.

Abstract:
Autoregressive decoding of large language models is frequently memorytraffic bound, so ultra-low-bit weight-only PTQ helps only if dequantization avoids irregular codebook or LUT access in the inner loop. Under the GPU-friendly bitshift trellis, existing 2-bit trellis-coded quantization (TCQ) pipelines either reintroduce micro-LUTs or suffer overlap-amplified artifacts because incoherence improves global Gaussianity but does not guarantee overlap-local joint geometry. We introduce Proteus a strictly lookup-free TCQ framework whose computed generator MUL-BAL uses cheap integer mixing plus a per-layer affine Gaussianizer to produce overlap-robust, near-Gaussian code values with zero runtime table loads. Proteus instantiates each layer by selecting from a tiny, pre-vetted candidate pool and then applies lightweight channel compensation and optional few-shot distillation that tune only per-layer affine statistics while keeping packed indices and the bitshift-trellis decoder fixed. On Llama 2 (7B–70B) at 2-bit PTQ, Proteus improves perplexity and zero-shot accuracy over strong TCQ/PTQ baselines and reduces end-to-end decode bandwidth at comparable throughput (e.g., 740 vs. 1020 GB/s on 70B).

Abstract:
While algorithmic stability is a central tool for understanding generalization of learning algorithms, existing highprobability guarantees typically rely on uniform boundedness or sub-Gaussian/sub-Weibull tail assumptions, which can be overly restrictive for modern settings with heavy-tailed or unbounded losses. We develop a stability-based framework that requires only a finite L_p moment condition. Our first contribution is sharp concentration inequalities for functions of independent random variables under L_p constraints, extending McDiarmid's bounded-differences techniques beyond the classical regime. Leveraging these results, we derive sharp high-probability generalization bounds across a range of learning paradigms, including empirical risk minimization, transductive regression, and meta-learning. These guarantees show that L_p stability suffices for robust generalization even when boundedness fails, substantially weakening the standard assumptions in the stability literature.

Abstract:
Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context inference. A practical remedy is to evict less important KV entries; however, existing eviction policies are largely heuristic and struggle to capture the rich, inputdependent distribution of token importance. In this work, we introduce alearnable indexerthat predicts KV importance, enabling more accurate retention of critical tokens. Meanwhile, naively evicting tokens permanently discards their information, leading to irreversible forgetting and degraded retrieval over long ranges. To address this, we propose a lightweightlatent memory modulethat compresses evicted tokens into a compact, online-updated state and provides residual readouts to compensate for the attention contributions lost through KV eviction. Collectively, our method enables accurate long-context inference under a bounded KV budget, delivering consistent improvements on RULER (4K/16K) across Qwen, Mistral, and Llama models (up to 25 points under aggressive eviction), markedly more stable Needle-in-a-Haystack retrieval, and superior LongBench scores and compression curves compared to existing eviction policies.

Abstract:
Long video question answering requires locating sparse, timescattered visual evidence within highly redundant content. Although current MLLMs perform well on short videos, long videos introduce long-horizon search and verification, which often necessitates multi-turn, agentic interaction. We show that existing LVU agents can exhibit evidence misalignment: they produce correct answers that are not supported by the retrieved or inspected evidence. To characterize this failure, we introduce two diagnostics (temporal groundedness and semantic groundedness) and use them to reveal two pressures that amplify misalignment: prompt pressure from shared-context saturation at inference time and reward pressure from outcome-only optimization during training. These findings point to a structural root cause: the coupled agent paradigm conflates long-horizon planning with answer authority. We therefore propose the decoupled planner--inspector framework, which separates planning from answer authority and gates final answering on pixel-level verification. Across four long-video benchmarks, our framework improves both answer accuracy and evidence alignment, achieving 55.1% on LVBench and 62.0% on LongVideoBench while producing interpretable search trajectories. Moreover, the decoupled architecture scales consistently with increased search budgets and supports plug-and-play upgrades of the MLLM backbone without retraining the planner. Code and models are available at \urlhttps://anonymous.4open.science/r/VideoSeal-8C8D.

Abstract:
Instruction tuning for speech language models (SLMs) is substantially more challenging than for textbased large language models (LLMs), as it requires learning a new modality and a wide range of speech-specific instructions in addition to those supported by text LLMs. Existing SLM training approaches largely replicate the text LLM training paradigm by synthesizing large-scale speech pre-training and instruction-tuning datasets. However, this strategy is difficult to scale, since speech sequences are significantly longer than text sequences. In this paper, we propose SpeechCombine, an instruction-following speech language model trainedwithout any instruction tuning, using only a single round of speech pre-training on as little as 30k hours of speech data. Starting from a text LLM base model, we perform continuous pre-training on speech utterances to obtain a speech-adapted model, and then directly combine its weights with the weight difference between the instruction-tuned and base versions of the text LLM. Our results show that this simple combination strategy not only preserves the knowledge and capabilities of the original text LLM, but also effectively transfers them to the speech domain. These findings suggest a new direction for SLM training that avoids reliance on massive volumes of speech data.

Abstract:
Time series prototype learning is fundamentally challenged by observational ambiguity. Discrete architectures fail to resolve this, as they lack the capacity to decouple stochastic noise from continuous dynamics. Furthermore, rigid closedset assumptions fail to capture unseen diversity. To address these limitations, we propose a hierarchical ordinary differential equation clustering network, which utilizes neural ordinary differential equation to model latent state evolution as a continuous integral curve. This formulation enforces temporal continuity to effectively disentangle smooth feature trends from stochastic noise, while our adaptive hierarchical mechanism autonomously identifies the optimal number of prototypes without rigid prior constraints. Validated on the early link failure detection task with irregularly sampled time series, the proposed method effectively extracts underlying physical prototypes, thereby enabling robust failure detection.

Abstract:
Accurately localizing and segmenting small targets in low signalto-noise ratio (SNR) infrared sequences remains a challenging task. Since targets are often indistinguishable from the background in individual frames, existing methods, even when equipped with advanced foundation model and powerful inter-frame association mechanisms, still fail to detect them. Motivated by the observation that targets tend to emerge gradually from the background over time and become distinguishable, we propose Temporal-Emerged Prompting for Segment Anything Model (TEP-SAM), a principled framework designed to explicitly exploit such temporal-emerged cues to modulate and prompt SAM. TEP-SAM operates by jointly modeling global motion patterns and local motion deviations to locate potential targets. It further enhances target region features by leveraging motion discrepancy, thereby generating temporal-emerged cues for SAM and enabling non-interactive segmentation. By bridging large-scale semantic pretraining with task-specific temporal modeling, TEP-SAM effectively adapts SAM to the challenging multiframe infrared small target detection task. Extensive experiments demonstrate the effectiveness of our approach, particularly under severely low-SNR conditions and in complex dynamic background.

Abstract:
Gradientbased optimization is fundamental to deep learning, yet standard backpropagation (BP) is inherently limited by the requirement of differentiability, rendering it brittle when encountering piecewise-constant objectives with vanishing gradients (e.g., hard 0-1 loss) or black-box feedback. While likelihood ratio (LR) methods offer a theoretical alternative, their high variance in high-dimensional spaces often undermines training stability and scalability. We propose OVLR (Output-Level Variance-Reduced Likelihood Ratio), a simple yet powerful framework that circumvents this fundamental trade-off by providing a unified solution for efficient, scalable, and robust gradient estimation. OVLR achieves dramatic variance reduction by performing perturbations and antithetic sampling in the low-dimensional output space. Crucially, the method maintains high computational efficiency: it requires only a single deterministic forward pass through the neural network, with additional costs restricted to evaluating the loss function across multiple samples. Designed as a drop-in replacement, OVLR integrates seamlessly into automatic differentiation frameworks via vector-Jacobian products, enabling the direct optimization of objectives with vanishing or pathological gradients, such as the 0-1 loss for noise-tolerant classification and truncated losses for outlier-resistant regression, where BP fails to provide reliable learning signals. Extensive empirical results across image classification, generative modeling, language modeling, and robot imitation learning demonstrate that OVLR not only matches BP performance on problems with informative gradients, but also provides a decisive advantage on problems with vanishing or inaccessible gradients.

Abstract:
The deployment of large language models (LLMs) in realworld applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing—a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token’s immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit’s output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments show that informed routing consistently achieves state-of-the-art performance across static and dynamic pruning approaches. We further present two practical inference pipelines: a pure-PyTorch implementation and a Triton-based custom operator, that translate these gains into real-world speedups, achieving practical acceleration and consistent improvement across various batch sizes.

Abstract:
Prompt optimization is a key way to steer large language models when finetuning is impractical. However, instruction optimization (IO) and in-context learning (ICL) demonstration selection are often optimized separately and combined post hoc, implicitly assuming that a "best'' instruction and a "best" demonstration set compose well. In practice, their interactions are strong, making such decoupled pipelines brittle. We propose SMILE, an efficient method thatjointlyselects instructions and demonstrations. Our key observation is that the ICL performance exhibits consistent diminishing returns across diverse instructions. Leveraging this structure, SMILE learns an instruction-conditioned surrogate aligned with LLM feedback and instantiates it as an Extended Deep Submodular Function that captures sample--sample coverage, sample--query relevance, and sample--instruction compatibility. SMILE then performs greedy, query-adaptive selection of the instruction--demonstration pair. Experiments on six datasets and multiple LLM backbones show that SMILE consistently outperforms IO-only, ICL-only, and existing joint baselines, supporting a context engineering view of prompting: jointly optimizing interacting components rather than tuning them in isolation.

Abstract:
Given that simulation can never exhaustively enumerate reality, generalization is the determining factor for whether VisionLanguage-Action (VLA) models can translate benchmark success into real-world functionality. However, current evaluation protocols often incentivize mechanical memorization rather than robust policy learning, leading to a paradoxical duality of failure: high-scoring models exhibitspurious invarianceto semantic changes while simultaneously displayingextreme brittlenessto trivial environmental perturbations. To address this, we introduceLIBERO-Gen, a diagnostic benchmark systematically designed to shift evaluation from intuition-driven heuristics to explicit distributional assumptions. Through a hierarchical protocol spanningIn-distribution,Compositional, andDomain Generalization, LIBERO-Gen reveals performance stratifications previously masked by standard metrics. Our analysis identifies Pi0.5 as the top performer (64.0% in Spatial-CG; 21.2% in Task-CG). By identifying perceptual instability and action binding collapse as primary failure modes while validating the efficacy of structured ``Stair” sampling, LIBERO-Gen establishes a rigorous baseline for deployment reliability.

Abstract:
The factors driving the performance of incontext learning (ICL) in large language models (LLMs) remain poorly understood despite ICL's surprising effectiveness, enabling models to adapt to new tasks from only a handful of examples. To clarify and improve these capabilities, we characterize how the statistical properties of the pretraining distribution (e.g., tail behavior, coverage) shape ICL. We develop a theoretical framework that encompasses generalization and task selection and show how distributional properties govern sample efficiency, task retrieval, and robustness. To this end, we generalize existing concentration results to heavy-tailed priors and dependent sequences, better reflecting the structure of LLM pretraining data. Our framework reveals a fundamental design trade-off: heavy-tailed pretraining distributions facilitate robust task selection under distribution shifts but are detrimental to generalization, especially in low-data regimes. We then empirically evaluate our predictions by studying how ICL performance varies with the pretraining distribution on challenging tasks such as stochastic differential equations and stochastic processes with memory. Together, these findings suggest that controlling key statistical properties of the pretraining distribution is essential for building ICL-capable and reliable LLMs.

Abstract:
Training Large Language Models (LLMs) on heterogeneous datasets requires optimizing domain representations to balance convergence speed and domain coverage. While recent methods reduce computational overhead by selecting highquality data subsets, they typically perform selection independently per domain or rely on computationally expensive proxy models to determine continuous domain weights. In this paper, we propose a joint sample selection framework that learns better domain representations in the batch directly and formulate the selection objective as joint domain utility maximization subject to partition matroid constraints. This approach enforces domain-specific budgets while maximizing a validation-guided gradient matching utility across all domains simultaneously. Theoretically, we establish that this objective is weakly submodular, allowing us to employ a computationally efficient orthogonal matching pursuit algorithm with provable guarantees. Empirically, we demonstrate that our method significantly outperforms baselines that favor independent domain selection on mathematical reasoning and molecular generation benchmarks. Furthermore, our analysis show that our approach reduces the number of conflicting training gradient pairs significantly as compared to independent selection across domains or domain agnostic selection. Applied to Qwen2.5 and Llama-3 and trained on MethaMathQA and Mol-Instructions, our approach yields robust gains under multiple subset fractions, with improvements on nine math-reasoning and four molecule-generation benchmarks, highlighting cross-domain benefits of joint subset selection.

Abstract:
Forwardmode automatic differentiation (FmAD) and zero-order (ZO) optimization are increasingly proposed as memory-efficient, backpropagation-free alternatives for large language model (LLM) fine-tuning, yet their benefits are typically evaluated only against standard backpropagation (BP), omitting memory-efficient variants such as activation checkpointing. We present a unified theoretical and empirical comparison of BP, checkpointed BP, FmAD, and ZO for LLM and vision-language model training, showing that while FmAD and ZO reduce activation memory, they trade memory for higher computational cost and longer wall-clock time to convergence, resulting in lower accuracy and slower training, especially under constrained perturbation budgets. Across models, BP with checkpointing outperforms FmAD and ZO variants, including variance-reduced methods, achieving up to 31.1% higher accuracy, 34.8% faster convergence, and 3.8× fewer computations at comparable memory usage, while also revealing instability-related failure modes in FmAD and ZO. Overall, our results correct a one-sided benchmarking narrative by showing that memory-efficient methods entail fundamentally different trade-offs, and that ignoring these distinctions has led to misleading conclusions about LLM optimization in prior work.

Abstract:
Complex tables with multilevel headers, merged cells and heterogeneous layouts pose persistent challenges for large language models (LLMs) in both understanding and reasoning. Existing approaches typically rely on table linearization or normalized grid modeling. However, these representations struggle to explicitly capture hierarchical structures and cross-dimensional dependencies, which can lead to misalignment between structural semantics and textual representations for non-standard tables.To address this issue, we propose an Orthogonal Hierarchical Decomposition (OHD) framework that constructs structure-preserving input representations of complex tables for LLMs. OHD introduces an Orthogonal Tree Induction (OTI) method based on spatial--semantic co-constraints, which decomposes irregular tables into a column tree and a row tree to capture vertical and horizontal hierarchical dependencies, respectively. Building on this representation, we design a dual-pathway association protocol to symmetrically reconstruct the semantic lineage of each cell, and incorporate an LLM as a semantic arbitrator to align multi-level semantic information. We evaluate OHD framework on two complex table question answering benchmarks, AITQA and HiTab. Experimental results show that OHD consistently outperforms existing representation paradigms across multiple evaluation metrics.

Abstract:
Lifelong multimodal knowledge editing allows vision language models to continuously adapt to dynamic updates to avoid catastrophic forgetting. To mitigate interference between sequential updates, recent paradigms have shifted towards modular parameter isolation. However, this strategy faces a critical scalability bottleneck: accumulating dense parameter blocks can lead to excessive memory growth, and managing these independent modules often uses decoupled routing mechanisms, resulting in architectural redundancy. To address this issue, we proposeTIME(TensorFactorizedIntrinsicMixture-of-Experts), a unified framework harmonizing parameter efficiency with structural self-routing. TIME parameterizes each knowledge edit as a compact CP-decomposed tensor, significantly reducing complexity compared to low-rank matrices. Furthermore, departing from auxiliary semantic retrievers, we introduce an intrinsic routing mechanism that utilizes the tensor's input factors to directly define the active subspace, effectively enabling expert parameters to serve simultaneously as the routing logic. Extensive experiments demonstrate that TIME achieves state-of-the-art performance on lifelong editing benchmarks while successfully reducing memory usage and inference latency.

Abstract:
Learning and planning in imagination using world models provides an effective paradigm for training agents for decisionmaking. However, existing approaches often rely on high-dimensional latent spaces or generic visual embeddings that retain many factors irrelevant to control, limiting efficiency and generalization across tasks. To this end, we study how agents can learn world models with representations that are task-specific, minimal, and sufficient for decision making. We achieve this via a closed-loop synergy between the agent and the world model, in which structured world-model learning distills task-sufficient representations from informative interaction data. On the agent side, agents perform active probing of the environment to collect informative trajectories that expose task-relevant latent factors, guided by an adaptive curriculum. On the world-model side, we learn structured representations over observations to distill compact, task-sufficient latent states from the collected interaction data. This synergy enables the recovery of task-sufficient latent representations that capture all control-relevant factors empirically. Leveraging these representations, the resulting policies achieve improved sample efficiency and systematic generalization, including generalization across skills, object–skill compositions, and previously unseen tasks on standard continuous control and robotic manipulation benchmarks.

Abstract:
Deploying multimodal systems in realworld environments often entails handling modality-missing scenarios, where one or more modalities are unavailable. While recent studies address this challenge for the general Multimodal Transformer (MT) architecture via prompt tuning, we identify a fundamental limitation in these methods: the Implicit Modality-Reduction bottleneck. By conditioning prompts solely on the observed modalities, they inadvertently restrict the reasoning scope of MTs to the modality-reduced subspace, cutting off access to the latent information sources of the missing modalities. To overcome this limitation, we propose AOEPT, which pioneers a novel modal-contextualized prompting fashion. Specifically, we introduce lightweight Modal-Contextualized Prompts (MCPs) that distill global modality-wise priors from training data, serving as latent repositories of the information sources for missing modalities. Conditioned on the remaining modalities, these MCPs are instantiated into instance-aware prompts that selectively augment missing-modality information for each sample, thereby restoring the reasoning scope of MTs beyond the observed-modality-only subspace. Experiments across various benchmarks and MT architectures confirm the strong performance of AOEPT, with minimal computational overhead.

Abstract:
Bottleneck states, which connect distinct regions of the state space, provide a principled and interpretable basis for constructing temporal abstractions in Hierarchical Reinforcement Learning (HRL). However, existing bottleneck identification methods primarily rely on topological analysis of the statetransition graph, limiting their scalability to high-dimensional or continuous domains. To address this challenge, we introduce Value Power Strength (VPS), a value function-based metric inspired by the analogy between the Bellman equation and Kirchhoff’s current law, to quantify bottleneck property via the diffusion of reward in Markov Decision Processes (MDPs). VPS is estimated efficiently using value functions learned from random reward signals and captures reward diffusion bottlenecks in both discrete and continuous state spaces. Leveraging VPS, we design options that guide agents toward or away from bottleneck regions. Experimental results on classic tabular domains, MiniGrid, and Atari 2600 games demonstrate that the VPS-based framework discovers semantically meaningful subgoals and substantially improves exploration efficiency.

Abstract:
Mathematical search problems present a unique challenge for Reinforcement Learning (RL) due to vast search spaces and sparse rewards. In previous works, the AndrewsCurtis (AC) conjecture was established as an illustrative example of such problems. In this work, we identify a critical structural barrier in the AC landscape: a "Two Hump" distribution, where problem instances are either trivially solvable or effectively impossible, with a scarcity of intermediate "hard-but-solvable" instances required for effective learning. We tackle this challenge through two primary avenues: novel data generation techniques to populate the difficulty gap, and significant algorithmic enhancements including the introduction of supermoves and Transformer-based architectures. We demonstrate substantial performance improvements over previous baselines, and release new comprehensive benchmark datasets includingAC-19(125,192 AC-trivial presentations of varying difficulty with length at most 19) andAC-1M(1,136,154 hard AC-trivial presentations of length at most 30), the first large-scale, publicly available datasets of this kind.

Abstract:
As large language models (LLMs) are increasingly deployed as interactive agents, openended human-AI interactions can involve deceptive behaviors with serious real-world consequences, yet existing evaluations remain largely scenario-specific and model-centric. We introduceOpenDeception, a lightweight framework for jointly evaluating deception risk from both sides of human-AI dialogue. It consists of a scenario benchmark with 50 real-world deception cases, anIntentNetthat infers deceptive intent from agent reasoning, and aTrustNetthat estimates user susceptibility. To address data scarcity, we synthesize high-risk dialogues via LLM-based role-and-goal simulation, and train the User Trust Scorer using contrastive learning on controlled response pairs, avoiding unreliable scalar labels. Experiments on 11 LLMs and three large reasoning models show that over 90% of goal-driven interactions in most models exhibit deceptive intent, with stronger models displaying higher risk. A real-world case study adapted from a documented AI-induced suicide incident further demonstrates that our joint evaluation can proactively trigger warnings before critical trust thresholds are reached.

Abstract:
Decentralized stochastic bilevel optimization has been actively studied in recent years. However, existing studies assume that the lower-level loss function is strongly convex, which limits their applicability to many machine learning models. To address this limitation, in this paper, we propose a novel decentralized stochastic first-order optimization algorithm, which does not require second-order Hessian or Jacobian matrices, for the setting where the lower-level loss function is nonconvex but satisfies the Polyak–Łojasiewicz (PL) condition. Additionally, unlike existing single-agent methods that introduce a regularization term to the lower-level loss function to artificially enforce strong convexity, our algorithm does not require such modification. Moreover, our algorithm employs a constant single-timescale learning rate for updating variables, which is different from the time-dependent and two-timescale learning rate schedules used in prior work. To establish the convergence rate, we develop a new convergence analysis framework for the pure PL condition, rather than relying on the artificial strong convexity introduced through regularization in existing single-agent methods. To the best of our knowledge, this is the first algorithm for nonconvex decentralized bi-level optimization that offers theoretical convergence guarantees under mild conditions. Finally, our extensive experimental results on hyperparameter optimization and model pruning applications validate the efficacy of the proposed algorithm.

Abstract:
Tensor Product Representations provide the structural fidelity required for symbolic reasoning in models but suffer from exponential dimensionality growth when encoding deep recursive structures. Conversely, Vector Symbolic Architectures maintain constant dimensionality but sacrifice capacity and fidelity due to noisy compression via superposition. In this work, we propose Orthogonal Subspace Carving (OSC), a memory architecture that binds fillers to roles by projecting onto the null space of the role basis before aggregating into a fixed orderp tensor. OSC uses projections to enforce geometric orthogonality between bound structures within a \em static memory trace. We show that this mechanism decouples the tensor order from the structural depth, enabling deep recursive binding within a constant memory footprint. This construction allows for component vectors that are orders of magnitude smaller than the memory tensor, giving excellent memory efficiency in settings involving high superposition. We also show that TPR is a special case of binding in Clifford algebra, and give a Clifford formulation of OSC.

Abstract:
Large Language Models (LLMs) have emerged as powerful tools for semantic reasoning, enabling the formalization of tasks that traditionally relied on manual human intuition. This capability extends to environment design in Reinforcement Learning (RL). While prior research predominantly focuses on reward design, the design of observation and action spaces remains relatively underexplored. We propose LOAM, a framework leveraging LLMs to construct refined agent spaces from raw environments. To mitigate the computational burden of identifying the best candidate model from stochastic LLM outputs, LOAM incorporates a continuous racing mechanism that dynamically allocates resources to prioritize the most promising configurations without additional training overhead. Empirical evaluations on HumanoidBench and Isaac Lab demonstrate that LOAM consistently outperforms handcrafted baselines in both learning speed and asymptotic performance.

Abstract:
SharpnessAware Minimization (SAM) has established itself as a powerful and widely adopted optimizer for training machine learning models. By explicitly minimizing the sharpness of the loss landscape, SAM often improves generalization while delivering strong empirical performance. However, SAM and its variants, like most training algorithms, are sensitive to the choice of learning rate, which is typically tuned by trial and error or via schedulers. In this work, motivated by recent advances on the effectiveness of stochastic Polyak step sizes for Stochastic Gradient Descent (SGD), we derive Polyak schedulers tailored to SAM-style updates, yielding novel adaptive algorithms in both deterministic and stochastic settings. In smooth setting, for the proposed methods, we prove linear convergence for strongly convex objectives and an O(1/T) rate (up to a neighborhood in the stochastic setting) for convex objectives. Numerical experiments demonstrate that the proposed Polyak schedulers match or surpass tuned SAM baselines while substantially reducing the need for learning-rate tuning.

Abstract:
Parameter sharing is a widely used technique in MultiAgent Reinforcement Learning (MARL) that enhances sample efficiency by equipping agents with a unified policy. While effective in homogeneous settings, it often struggles in heterogeneous environments where agents possess diverse capabilities. Conversely, learning customized policies for agents can resolve knowledge conflicts but significantly hinders knowledge transfer, thereby reducing learning efficiency. Existing approaches attempt to balance this trade-off using clustering or agent-specific masks, but they typically rely on strong environment-specific priors and struggle in settings where the team exhibits multi-modal policies. To address these limitations, we propose Dspic, an efficient shared-policy algorithm grounded in the maximum entropy framework. Specifically, Dspic employs self-supervised learning to extract discriminative role embeddings for each agent. These embeddings guide a complete division of the observation space, providing a theoretical guarantee for the optimality of parameter sharing. Furthermore, to handle the increased observation complexity and diversity resulting from this division, Dspic incorporates a diffusion policy, enhancing the capacity to model complex action distributions while enabling efficient learning. Extensive experiments on MaMuJoCo, SMAC, SMACv2, and LBF demonstrate that Dspic achieves superior sample efficiency while maintaining asymptotic optimality.

Abstract:
Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing selfsupervised approaches leverage synthetic catalogues of 3D poses; however, this leads to poor generalization in real-world scenarios due to distribution shifts. We therefore introduce DisPOSE, a self-supervised framework that approximates the inherently discrete multi-view person-assignment problem as a generative diffusion process over the space of polystochastic tensors. By employing differentiable Sinkhorn projections during denoising, our model learns to guide solutions toward valid and feasible assignments based on 2D image priors. The complete 3D skeletons of localized individuals are then regressed using a Hypergraph-Convolutional Decoder that explicitly models relational structures and articulated joints across multiple views. The proposed approach outperforms current state-of-the-art self-supervised methods on standard datasets and demonstrates strong performance on a newly proposed benchmark featuring highly occluded scenes from surgical operating rooms. Our diffusion-based localization demonstrates high label efficiency, retaining 99% of its performance with only 10% of the pseudo-labels. Notably, disentangling the assignment and root regression components while maintaining differentiability makes DisPOSE nearly agnostic to different camera arrangements.

Abstract:
Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and wellseparated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm (InftyThink) and show that it simulates Turing machines more efficiently, with model size scaling logarithmically in a space bound rather than a time bound. We empirically demonstrate this advantage by training small transformers to solve Sudokus almost perfectly with summarized CoT, even on instances requiring more than a million reasoning tokens.

Abstract:
We study the problem of monitoring model performance in dynamic environments where labeled data are limited. To this end, we propose predictionpowered risk monitoring (PPRM), a semi-supervised risk-monitoring approach based on prediction-powered inference (PPI). PPRM constructs anytime-valid lower bounds on the running risk by combining synthetic labels with a small set of true labels. Harmful shifts are detected via a threshold-based comparison with an upper bound on the nominal risk, satisfying assumption-free finite-sample guarantees in the probability of false alarm. We demonstrate the effectiveness of PPRM through extensive experiments on image classification, large language model (LLM), and telecommunications monitoring tasks.

Abstract:
Model merging has become a practical posttraining strategy for building a single multi-task large language model (LLM) by combining multiple task-specialized models, avoiding costly joint training. However, most existing approaches rely on post-hoc merging, in which task-specific models are merged only once after training. This one-shot aggregation often suffers from task interference, leading toinformation erasureacross individual tasks. In this work, we show that replacing post-hoc merging with an iterativemany-shot mergingprotocol is effective in improving multi-task performance. Building on this insight, we proposeMETIS,MitigatingErasure fromTaskInterference forStable many-shot merging. METIS is a loss-aware many-shot merging method that stabilizes iterative integration through task-wise loss-gap weighting and consensus-based masking. Notably, METIS exhibits significant performance improvement on the worst-performing task, effectively mitigating information erasure.

Abstract:
LLMto-KG systems frequently fail on exclusion-rich questions because natural-language negation is both scope-sensitive and evidence-dependent: it may constrain only one subgoal/branch and only certain supporting paths, yet such attachment is rarely explicit in text. We propose the Executable Exchange Contract (EEC) to bridge this gap, specifying scope-bound exclusions as executable control metadata exchanged between a specifier and an executor. Our executor, MatLogic, compiles exclusions into scope-local masks applied during multi-hop propagation and executes requests under a unified P\rightarrowN\rightarrowC/D schedule, ensuring exclusions are enforced before witness loss and branch entanglement. The system can also return compact witness pointers to keep support types distinguishable when needed. We evaluate on both structured complex queries and end-to-end natural-language KGQA, and introduce contract-aligned diagnostics that isolate errors from specification versus execution and verify the necessity of scoped enforcement.

Abstract:
Taskoptimized neural networks are the leading in-silico models of sensory cortex, yet the field lacks a unified understanding of which modeling choices drive improved brain alignment. Prior NeuroAI work is fragmented across datasets and modalities, making it difficult to determine robust scaling trends. Here, we systematically investigate the scaling laws of model-to-brain alignment across 8 neural datasets (spanning electrophysiology, fMRI, EEG, and MEG) and over 600 models with diverse architectures and pretraining configurations. We report three key scaling trends: (1)Pretraining saturation: Alignment improves with pretraining compute and data scale but saturates across all recording modalities. (2)Complementary fine-tuning: Hybrid task & neural data optimization yields consistent improvements in alignment that generalize across datasets and modalities. (3)Mapping scaling: Increasing the number of neural samples to fit model-to-brain mappings yields log-linear gains with the largest impact on alignment. Finally, we propose a novel subject-shared cross-attention mapping which drastically reduces parameter count and improves alignment. Taken together, these results establish multimodal scaling laws that guide resource allocation for next-generation brain models.

Abstract:
Time Series Foundation Models (TSFMs) have achieved strong performance in univariate time series forecasting. However, most TSFMs rely on channelindependent pre-training that models each variable separately, limiting their ability to leverage inter-channel information that is crucial in real-world multivariate systems. Motivated by this limitation, we propose Chada, a lightweight plug-and-play channel adapter that allows frozen TSFMs to leverage multivariate correlations in a zero-shot setting. Chada first builds a budgeted pre-training dataset to cover diverse heterogeneous inter-channel dependency patterns. It then uses data-derived domain descriptors to learn a dataset-conditioned inter-channel similarity measure that reduces cross-domain metric distortion. Finally, it injects sparse inter-channel information via gated refinement, leveraging multivariate information without degrading intra-channel temporal dynamics. Extensive experiments on nine benchmarks validate the effectiveness of Chada, demonstrating consistent zero-shot improvements over four best-performing TSFMs while maintaining scalable deployment. Code is available at https://anonymous.4open.science/r/CHADA-A6.

Abstract:
Predictive code completion greatly accelerates how quickly developers work. In spreadsheets, despite being much more common, such autocompletion features are virtually non-existent. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges are (1) the absence of edit histories in public spreadsheet corpora and (2) the complex space of spreadsheet actions (spatial, temporal, composite). To address (1) we symbolically generate action sequences using parametrized heuristics and refine them (LLM + human) to create 58 sequences consisting of 13K actions from publicly available spreadsheets. To address (2) we propose an online evaluation that expects a prediction after each user action, accepts or rejects that prediction, updates the future actions upon acceptance, and repeats this until the target spreadsheet is obtained. We use LMs as baseline predictive systems and analyze different properties that our benchmark teaches us, including but not limited to: properties of saved actions and false positives, efficiency, effect of user profiles, effect of triggers, effect of context, and effect of prediction length.

Abstract:
Memory is a core capability of autonomous agents, yet existing benchmarks evaluate it primarily in constrained settings such as short dialogues or synthetic tasks, failing to reflect realistic agent deployments. We present MemoriaBench, a benchmark for evaluating agent memory grounded in complete, chronologically ordered interaction trajectories that may span millions of tokens. Guided by principles of realism, domain and agent diversity, and explicit exposure of memory-centric challenges, all tasks are formulated as anti-summarization question answering, requiring fine-grained, temporally grounded memory retrieval rather than high-level abstraction. Memoria-Bench covers deep research, coding, and Science & Development agents across seven domain categories and instantiates three task families: temporal aggregation, multi-hop memory reasoning, and long-range state tracking. Experiments on state-of-the-art long-context models and memory-augmented-based methods reveal substantial performance degradation in long, noisy trajectories, exposing a critical memory bottleneck beyond context length scaling.

Abstract:
Differentiable approaches to causal discovery have shown promise in learning DAG structures via continuous optimization, but their theoretical guarantees are largely restricted to models with homoscedastic noise or known noise distribution. In particular, existing methods based on mean squared error fail to identify the true DAG when noise distributions are nonGaussian and vary in scale. In this paper, we address this gap in nonlinear additive noise models (ANMs) with arbitrary noise. Our approach extends NOTIME (Berrevoets et al. 2025) which minimizes an independence criterion among the residuals. We show that the global minimizer of the independence criterion corresponds to the true underlying DAG up to additional constant edges in general ANMs. To recover the exact structure, we introduce an adaptive group lasso penalty that regularizes entire columns of the first-layer weight matrix of an MLP, enabling the selective pruning of constant edges in a functionally meaningful way. Empirically, our method exhibits effective and stable performance across diverse noise types and variances, outperforming prior methods that lack identifiability guarantees in this setting.

Abstract:
ModelBased Reinforcement Learning (MBRL) leverages latent imagination for sample efficiency, yet remains constrained by Historical Tethering: imagination is typically initialized from observed states. This creates a learning asymmetry, where the world model’s manifold discovery outpaces the policy's sparse-reward optimization. We propose Mind Dreamer (MD), a framework that operationalizes Active Counterfactual Reasoning (ACR) to transcend Markovian continuity. MD reformulates discovery as the minimization of a global Relay Manifold Expected Free Energy (R-EFE); by invoking a latent-space do-operator, MD utilizes an adversarial generator to synthesize non-continuous latent jumps to epistemic blind spots that are physically plausible yet cognitively challenging. To resolve the credit assignment paradox across these spatial ruptures, we derive the Relay Value Function (RVF) and Relay Uncertainty Function (RUF). These potentials treat synthesized anchors as latent bridges, propagating pragmatic and epistemic value through a principled Bellman-style formulation. Notably, we prove that the Uncertainty propagation across discontinuities necessitates a quadratic discount \gamma^2, establishing a formal epistemic horizon. Theoretically, MD acts as an optimal importance sampler that expands the manifold's spectral gap, reducing the hitting time to critical bottleneck states. Empirically, MD achieves a 1.67× average speedup over DreamerV3 on DeepMind Control Suite, reaching 8.8× in sparse-reward tasks.

Abstract:
We introduce a new mechanism within the UtilityOptimized Local Differential Privacy (ULDP) framework that enables censoring with plausible deniability when collecting and analyzing sensitive data. Our approach addresses scenarios where certain values, such as large numerical responses, are more privacy-sensitive than others, while accompanying categorical information may not be private on its own but could still be identifying. The mechanism selectively withholds identifying details when a response might indicate sensitive content, offering asymmetric privacy protection. Unlike previous methods, it avoids the need to predefine which values are sensitive, making it more adaptable and practical. Although the mechanism is designed for ULDP, it can also be applied under symmetric LDP settings, where it still benefits from censoring and reduced privacy cost. We provide theoretical guarantees, including uniform consistency and pointwise weak convergence results. Numerical experiments on both synthetic data and real-world data were conducted demonstrate the validity of developed methodologies.

Abstract:
Spiking Neural Networks (SNNs) offer a notable energysaving advantage compared to Artificial Neural Networks (ANNs) when deployed on neuromorphic hardware. While recent SNNs achieve superior performance using larger and deeper backbones, this comes at a cost of diminishing their energy-saving benefits. In this paper, we propose UniSparse, a unified sparsification framework for enhancing the energy efficiency of SNNs. We demonstrate that the affine parameters in batch normalization also serve as the learnable threshold of its subsequent spiking neurons. Based on this, we propose a novel spike sparsification method that reduces firing rate by constraining the affine parameters. As a complement to spike sparsification, we propose a weight pruning method based on the same energy constraint, which can be naturally integrated with spike sparsification. Experimental results demonstrate that UniSparse achieves a state-of-the-art trade-off between accuracy and energy efficiency across models and datasets. The sparsified ResNet-18 model requires only 7.04M SOPs for inference to achieve 92.38% accuracy on the CIFAR-10 dataset. Our work highlights the great potential of deep SNNs in improving energy efficiency.

Abstract:
Datadriven cybersecurity research is fundamentally constrained by the scarcity of labeled datasets, yet acquiring authentic, large-scale malware traffic remains bottlenecked by obsolescent public datasets, unscalable manual construction, and inflexible sandboxes that fail to satisfy the sample-specific dependencies required for malware to exhibit malicious behavior. Threat intelligence documents these dependencies, and LLM agents offer a path to extract them for environment construction, yet directly applying such agents faces two challenges: input-side ambiguity and output-side fragility. In this paper, we propose Threat2Traffic, a multi-agent framework that extracts sample-specific dependencies from threat intelligence, reconstructs tailored environments, and captures malware traffic. To address input-side ambiguity, it formulates dependency extraction as structured multi-agent deliberation over an evidence graph. To overcome output-side fragility, it incorporates invariant-guided synthesis with dual-layer validation under syntactic and semantic constraints. Evaluated on 1,077 samples across eight malware families, Threat2Traffic achieves 83.1% reproduction success, highlighting its effectiveness for scalable and realistic malware traffic generation. We release the core source code and traffic dataset at https://github.com/apos3637/Threat2Traffic

Abstract:
Vision–language pretraining (VLP) holds great promise for general-purpose medical AI by leveraging radiology reports as rich textual supervision, yet existing methods struggle with 3D CT imaging due to inefficient visual backbones and coarse semantic alignment. To address these issues, we propose a tailored VLP framework featuring three key components: (1) a CNN–ViT hybrid encoder that replaces ViT’s patch embedding with a 3D CNN backbone to efficiently capture local anatomical details while preserving global attention and compatibility with pre-trained cross-modal priors; (2) a disease-level contrastive learning mechanism using learnable query tokens to dynamically extract disease-specific semantics from full reports and align them with corresponding visual features, thereby disentangling distinct diseases within the same anatomical region; and (3) a diagnosis-aware prompt strategy that employs real clinical phrases and aggregated disease prototypes to bridge the pre-training–inference gap and enhance zero-shot diagnostic reliability. Our model achieves state-of-the-art performance on CT-RATE (84.4% AUC, +5.1%) and Rad-ChestCT (75.4% AUC, +5.4%), with even larger gains (+9.8% AUC) on a challenging 60-disease benchmark, and demonstrates strong transferability to radiology report generation, underscoring the generality and clinical utility of our approach.

Abstract:
Large Language Model agents achieve strong performance on multi‑step reasoning and tool‑use tasks, but their impressive capabilities typically rely on extremely large backbones. Existing distillation approaches train smaller students to imitate full teacher trajectories, yet reasoning and knowledge gaps between the teacher and student can cause compounding errors. We propose SCoRe, a studentcentered framework in which the student generates training trajectories and the teacher corrects only the earliest error, producing training data matched to the student's ability and exposing specific weaknesses. The student is first fine-tuned on corrected trajectories. Subsequently, short-horizon reinforcement learning starts from the verified prefix preceding the earliest error, with target rewards assigned at that step. This design enables the student to solve problems through unconstrained RL exploration rather than teacher imitation, while the short‑horizon setup improves training stability. On 12 challenging benchmarks, a 7B-parameter student distilled with SCoRe closes the agentic performance gap with a 72B-parameter teacher.

Abstract:
Federated Learning (FL) enables decentralized clients to collaboratively train a global model without sharing raw data. However, most existing FL frameworks assume that clients train on static local datasets collected in advance or that the data follows a fixed underlying distribution, which limits their applicability in dynamic environments where data evolves over time. A parallel line of research, online FL, removes all assumptions and adopts an adversarial perspective, but this approach is often overly pessimistic and neglects the structured, partially predictable nature of realworld data dynamics. To bridge this gap, we propose SFedPO, a streaming federated learning framework that incorporates a prediction oracle to capture the temporal evolution of client-side data distributions. We theoretically analyze the convergence bounds of SFedPO and develop two practical sampling strategies: a Distribution-guided Data Sampling (DDS) strategy that dynamically selects training data under limited storage by balancing historical reuse and distribution adaptation, and a Shift-aware Aggregation Weights (SAW) mechanism that modulates global aggregation based on client-specific sampling behaviors. We further establish robustness guarantees under prediction errors. Extensive experiments demonstrate that SFedPO effectively adapts to streaming scenarios with distribution shifts and significantly outperforms existing methods.

Abstract:
Highdimensional continuous control remains challenging in deep reinforcement learning, where algorithms like TD3 and SAC often collapse. We propose a unifying Lipschitz Pathway framework that decomposes instability into four amplification stages, namely action parameterization (L_1), dynamics sensitivity (L_2), Q-network curvature (L_3), and temporal-difference (TD) target stability (L_4), where errors compound multiplicatively along the learning pipeline. Our analysis identifies a discrete-continuous mismatch as the root cause: value functions trained from sparse point samples must generalize over continuous manifolds, leading to multiplicative error amplification along the pathway. To address this, we introduce Action Manifold Smoothing (AMS), which replaces point-wise TD targets with orthogonally-sampled neighborhood averages, jointly regularizing L_3 (via implicit Laplacian smoothing) and L_4 (via local manifold supervision). We further characterize when Lipschitz-constrained Q-networks and geometric action priors are beneficial based on task structure. Empirically, AMS enables both TD3 and SAC to achieve over 400 reward on the 38-D Dog Run task within 1M steps, where baselines fail. These results validate the Lipschitz pathway as a principled framework for diagnosing and solving stability bottlenecks in high-dimensional control.

Abstract:
Recent progress in large language models (LLMs) has enabled understanding of both speech and nonspeech audio, but has also exposed new safety risks arising from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech–Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech–audio composition to enable effective black-box attacks. SACRED-Bench adopts three composition mechanisms: (a) speech overlap, (b) multi-speaker dialogue, and (c) mixtures of speech and non-speech audio. These mechanisms focus on evaluating safety in settings where benign and harmful intents co-occur within a single auditory scene. Moreover, questions in SACRED-Bench are designed to implicitly refer to content in the audio, such that no explicit harmful information appears in the text prompt alone. Experiments demonstrate that even Gemini 2.5 Pro, a state-of-the-art proprietary LLM with safety guardrails fully enabled, still exhibits a 66% attack success rate. To bridge this gap, we propose SALMONN-Guard, the first guard model that jointly inspects speech, audio, and text for safety judgments, reducing the attack success rate to 20%. Our results highlight the need for audio-aware defenses to ensure the safety of multimodal LLMs.

Abstract:
VisionLanguage-Action (VLA) models are bottlenecked by the scarcity of expert demonstrations—expensive triplets of observations, language instructions, and actions. We propose that learning ''how to move'' can be decoupled from learning ''what to do,'' and that the former requires no task labels at all. Our two-stage framework,Task-Agnostic Pretraining (TAP)first pre-trains on abundant, cheaptask-agnosticdata (discarded off-task trajectories or autonomous robot play) using an Inverse Dynamics objective that predicts actions from consecutive observations. This self-supervised phase instills physical affordances—grasping, contact dynamics, end-effector control—without human annotation. A lightweight second stage then aligns these physical priors with language instructions using minimal expert data. On the SIMPLER benchmark, our approach matches models trained on 1M+ expert trajectories while using orders of magnitude less labeled data, achieving a 10% absolute gain over standard behavior cloning. In real-world WidowX experiments, it surpasses internet-scale baselines under visual distribution shifts (e.g., 25% vs. 0% under camera perturbations), demonstrating that task-agnostic pretraining yields robust, transferable physical representations for Embodied AI.

Abstract:
Diagnosing complex diseases is inherently a sequential and iterative medical investigation process, in which a clinician strategically requests multiple rounds of diagnostic tests to differentiate among similar diseases until reaching a definitive diagnosis. Although large language models show great potential as clinical assistants, they often struggle to navigate this complex interactive process, suffering from premature diagnostic closure. Furthermore, optimizing LLMs for such multiround environments is frequently hindered by the challenge of reward sparsity and hacking. In this paper, we introduce CompDiag-Bench, a benchmark that formalizes diagnosis as a sequential decision-making process where a clinician must strategically request diagnostic tests from a dynamic environment in order to reach a definitive diagnosis. To address this task, we propose \textttSalus, a multi-agent framework that decouples diagnostic reasoning into three specialized functional roles: a Differential Reasoner, a Strategic Controller, and a Workup Proposer. \textttSalus is optimized via multi-agent reinforcement learning employing structured rewards to calibrate strategic diagnostic behavior. Specifically, we leverage an LLM-as-a-Judge reward mechanism to provide dense, semantically-grounded feedback, designed to penalize premature closure and incentivize accurate differential diagnoses. Experimental results show that our model, \textttSalus-7B, attains state-of-the-art Top-1 accuracy of 83.64% on complex cases, outperforming DeepSeek-V3.2 (71.38%) and achieving performance on par with GPT-5.2 (80.30%).

Abstract:
Multitask reinforcement learning (MTRL) aims to improve sample efficiency by sharing knowledge across related tasks, but it often suffers from asynchronous learning progress caused by inherent differences in task difficulty. This imbalance places substantial representational strain on the shared critic network, which emerges as a primary performance bottleneck. To address this issue, we propose Hybrid Multi-Task Reinforcement Learning (HyMTRL), a framework that alleviates critic overload through a phased policy evolution strategy. HyMTRL divides task learning into a reinforcement exploration phase and an imitation refinement phase. By transitioning mastered tasks from reinforcement learning–based policy optimization to imitation learning–based behavior consolidation, these tasks are removed from the critic’s optimization objective, effectively reducing representational strain. In addition, a critic reset mechanism restores network capacity while preserving learned policies and historical experience. HyMTRL is a general framework that can be easily integrated with a wide range of existing MTRL methods. Empirical evaluations on the MetaWorld benchmark demonstrate that combining HyMTRL with representative baselines leads to significant improvements in both learning efficiency and final performance.

Abstract:
We study online change point detection for multivariate inhomogeneous Poisson point process time series. This setting arises commonly in applications such as earthquake seismology, climate monitoring, and epidemic surveillance, yet remains underexplored in the machine learning and statistics literature. We propose a method that uses lowrank matrices to represent the multivariate Poisson intensity functions, resulting in an adaptive nonparametric detection procedure. Our algorithm is single-pass and requires only constant computational cost per new observation, independent of the elapsed length of the time series. We provide theoretical guarantees to control the overall false alarm probability and characterize the detection delay under temporal dependence. We also develop a new Matrix Bernstein inequality for temporally dependent Poisson point process time series, which may be of independent interest. Numerical experiments demonstrate that our method is both statistically robust and computationally efficient.

Abstract:
Reinforcement learning from demonstrations (RLfD) offers a promising method for robotic manipulation with sparse rewards. However, limited demonstrations often cause agents to encounter outof-distribution states where world models produce poor predictions. In multi-stage tasks, jointly optimizing a learned reward function and policy introduces a moving target problem, and the resulting non-stationarity intensifies the impact of uncertainty on policy learning. In this work, we propose QUEST, a model-based RL framework that adaptively switches between exploration and exploitation guided by uncertainty to achieve stable and efficient learning. Specifically, our approach employs intrinsic rewards to capture environmental stochasticity, leverages ensemble dynamics for uncertainty-guided planning, and introduces a hybrid sampling strategy to prioritize rare successful stage transitions. We evaluate QUEST on challenging sparse-reward manipulation tasks with limited expert demonstrations. Results show that QUEST outperforms state-of-the-art methods by 17% on average, with gains increasing to 60% on difficult tasks. We further demonstrate successful zero-shot sim-to-real transfer on three real-world tasks.

Abstract:
Transformerbased tabular foundation models (TFMs) dominate small to medium tabular predictive benchmark tasks, yet their inference mechanisms remain largely unexplored. We present the first large-scale mechanistic study of layerwise dynamics in 6 state-of-the-art tabular in-context learning models. We explore how predictions emerge across depth, identify distinct stages of inference and reveal latent-space dynamics that differ from those of language models. Our findings indicate substantial depthwise redundancy across multiple models, suggesting iterative refinement with overlapping computations during inference stages. Guided by these insights, we design a proof-of-concept, looped single-layer model that uses only 20% of the original model’s parameters while achieving comparable performance.

Abstract:
Designing heuristics for combinatorial optimization problems (COPs) is a fundamental yet challenging task that traditionally requires extensive domain expertise. Recently, Large Language Model (LLM)based Automated Heuristic Design (AHD) has shown promise in autonomously generating heuristic components with minimal human intervention. However, most existing LLM-based AHD methods enforce fixed algorithmic templates to ensure executability, which confines the search to component-level tuning and limits system-level algorithmic expressiveness. To enable open-ended solver synthesis beyond rigid templates, we propose Automated Algorithm Design via Evolutionary Program Trees (A_2DEPT), which treats LLMs as system-level algorithm architects. A_2DEPT explores the vast program space via a tree-structured evolutionary search with hybrid selection and hierarchical operators, enabling iterative refinement of complete algorithms. To make open-ended generation practical, we enforce executability with a lightweight program-maintenance loop that performs feedback-driven repair. In experiments, A_2DEPT consistently outperforms state-of-the-art baselines across standard and highly constrained benchmarks, reducing the optimality gap by an average of 9.8%. Our work implies that system-level algorithm synthesis is a viable and scalable paradigm for LLM-driven optimization.

Abstract:
Large Language Models (LLMs) that continue improving at testtime budgets far beyond their training budgets can solve harder problems by leveraging additional inference compute: we refer to this property as extrapolation. Standard on-policy RL operates on fixed problem distributions and training budgets, giving rise to a distribution shift between train and test that limits the resulting model's extrapolation capabilities. To address this, we introduce RC, an iterative decoding algorithm replacing standard autoregressive decoding that enables models to extrapolate to lengths an order of magnitude longer than those seen during training. RC exploits the asymmetry between summarization and generation capabilities present in LLMs to construct a decoding process that improves consistently over iterations. Its effectiveness can be further increased through training, which amplifies the model’s ability to perform summary-conditioned reasoning while avoiding the challenges of long-horizon RL. Empirically, training a 4B instruction-following model with RC using a 16k-token training budget improves performance on HMMT 2025 from 40% to 70% when evaluated with a 512k-token test budget, substantially surpassing comparably sized LLMs.

Abstract:
Antibodies originate from germline immunoglobulin gene templates and are diversified by somatic hypermutation during affinity maturation. This process produces sequences in which conserved germline residues provide structural scaffolding, complemented by rare nongermline substitutions that refine antigen binding and functional properties. Despite this important biological organization, current antibody language models (ALMs) treat all residues equivalently, blurring the distinction between germline-encoded conservation and adaptive non-germline variation. When applied to germline-dominated repertoires—reflecting shared germline starting points from which diversity arises—this induces a germline bias in which rare but functionally important non-germline (NGL) mutations are systematically down-weighted as statistical noise. Here, we introduce PRISM, a germline-aware antibody language model that explicitly represents germline and non-germline residues as distinct token types. This biologically informed representation achieves state-of-the-art pseudo-perplexity in hypervariable complementarity-determining regions (CDRs) and enables controllable sequence generation for precise paratope engineering while preserving germline-mediated framework stability. PRISM further exhibits improved zero-shot prediction of binding affinity relative to existing ALMs. Together, these results demonstrate that incorporating biologically grounded sequence representations substantially improves antibody language modeling.

Abstract:
Elucidating molecular structures from spectra is a foundational problem in chemical and materials characterization, yet remains challenging due to spectral ambiguity and the vast molecular space. Although recent diffusionbased generators show strong promise for spectra-conditioned elucidation, existing methods struggle to learn robust spectra-structure relationships from limited paired data when relying solely on global spectral representation. Moreover, the repeated full sampling inference strategy incurs substantial computation overhead. To address these limitations, we propose MAST, a Motif-Augmented diffusion framework with Search Tree, for joint 2D-3D spectroscopic molecular structure elucidation. MAST introduces explicit, interpretable motif priors as intermediate evidences throughout denoising, reducing conditional ambiguity and facilitating spectra-conditioned optimization. We further cast diffusion sampling as reward-guided tree search to prioritize high-reward denoising trajectories, yielding a compact set of spectra-consistent candidates under limited budgets. On the QM9S multi-spectra benchmark, MAST achieves 94.89% exact recovery and improves 3D fidelity, while preserving high chemical validity and stability.

Abstract:
Multimodal large language models (MLLMs) have demonstrated remarkable reasoning capabilities over internalized knowledge. However, current research overlooks contextual reasoning, the ability to reason based on the relevant information present in the context. To investigate this issue, we construct the Visual Contextual Reasoning Benchmark (ContextReasonVBench), and our analysis reveals two predominant failure modes: context neglect, where models rely on pre-trained knowledge instead of contextual information, and superficial pattern matching, where models exploit shallow correlations rather than genuine patterns. To address these limitations, we propose a two-stage approach that progressively establishes and reinforces contextual pattern acquisition. The first stage establishes an "analyze-then-solve" reasoning paradigm through supervised fine-tuning (SFT). We then employ a context-aware reinforcement learning (RL) framework that integrates context-aware reward modeling with hierarchical advantage estimation to encourage the model to identify genuine contextual patterns. This approach yields Context-Aware Reasoner (CAR), a model that achieves 38.14% accuracy on ContextReasonV-Bench, improving the base model by 22.09%. Notably, CAR exhibits strong generalization to tasks not seen during training, confirming that our approach enhances genuine contextual reasoning capability.

Abstract:
Decoding fMRI into natural language is challenging because strong, pretrained language priors can dominate autoregressive generation, obscuring whether a model truly utilizes neural evidence. We introduce BIT-LLM, which exposes fMRI-derived tokens as a persistent key–value memory through interleaved cross-attention adapters, enabling repeated neural access throughout decoding. BIT-LLM is trained with a three-stage pipeline: (i) multimodal contrastive learning to obtain semantically aligned fMRI representations, (ii) supervised fine-tuning to learn the brain-LLM interface while freezing the encoder and backbone LLM, and (iii) reward-based finetuning to optimize sequence-level caption quality directly. On the NSD subject-heldout benchmark (S1-7 train, S8 test), BIT-LLM yields substantially improved captioning quality over prior baselines under greedy decoding. In addition to standard captioning metrics, we perform several complementary evaluations to assess the robustness of brain–language grounding. Specifically, we conduct perturbation-based sanity checks by zeroing fMRI inputs or shuffling voxel values, and examine whether internal representations and generated outputs change accordingly. BIT-LLM exhibits clear sensitivity to these perturbations, indicating effective utilization of voxel values and their spatial correspondence.

Abstract:
Large Language Models (LLMs) have found widespread application and research in scenarios such as roleplaying and sociological simulations. Despite the growing use of LLM-based agents to simulate human activities, the extent to which their behaviors resemble human behavior remains underexplored. As diverse LLMs proliferate, the traditional Turing test is ineffective for scalable evaluation and prone to bias from human-crafted challenges, leading to unfair assessments. In this work, we propose a novel distribution-based framework that comprehensively evaluates human-likeness and believability of AI behaviors by leveraging large-scale open-ended human behavior data from web. For better evaluation, we design generic metrics to cover three principles: rationality, consistency, and diversity. Implemented across online shopping, open-topic Q&A, and urban mobility, our framework reveals that even the currently best LLM still exhibits a significant gap from real user behavior, underscoring the necessity of comprehensive research and evaluation of AI’s human-like capabilities.

Abstract:
PrintedCircuit-board (PCB) footprint geometry labeling of integrated circuits (IC) is essential in defining the physical interface between components and the PCB layout, requiring precise visual perception. However, the unstructured nature of footprint drawings and abstract diagram annotations prevents direct IC footprint parsing and automated package geometry labeling methods from developing. Existing Large Multimodal Models (LMMs) struggle with inaccurate geometric perception, limiting their effectiveness in this task. To address these challenges, we propose LMM4-IC4K, a novel framework that treats IC mechanical drawings as images and leverages LMMs for structured geometric interpretation. To support such a framework, we introduce ICGeo8K, a multi-modal dataset with 8,608 labeled samples, including 4138 real-world IC footprint samples and 4470 synthetically generated samples. We further present a two-stage training framework to fine-tune LMMs for IC footprint labeling. Extensive experiments demonstrate that our model outperforms state-of-the-art LMMs on the proposed benchmark. The accurate translation of footprint diagrams enabled by LMM4-IC4K contributes to advancing automation and standardization within the PCB industry.

Abstract:
Grokking, the phenomenon where models suddenly generalize long after overfitting training data, remains a puzzling challenge in neural network dynamics. Through mechanistic analysis, we find that this transition is fundamentally driven by a structural reorganization of token embeddings, with the onset of grokking entailing a shift toward a welldefined geometry, and reveal the model’s distinct understanding of data’s dual characteristics. Building on these geometric insights, we propose R2G (Repel to Grokking) Loss, an active intervention that reshapes the embedding manifold by enforcing structural repulsion. The versatility of R2G is empirically validated in both algorithmic and linguistic tasks, while our theoretical analysis and ablation studies jointly demonstrate that angular reorganization is the primary driver of grokking. Our work offers a novel mechanistic perspective on the evolution of grokking and provides a useful tool for enhancing model efficiency and reliability.

Abstract:
Large Language Model (LLM) agents have shown significant potential in automated optimization modeling for mathematical problems. However, realworld problems are still challenging due to their knowledge-intensive nature. Existing methods, constrained by static parametric knowledge, often lack the domain expertise required to comprehend complex scenarios and apply appropriate mathematical techniques, leading to errors. To address this challenge, we propose the Opt-Miner framework, where the agent learns to identify missing knowledge, retrieve technical documents on the web, and ground its mathematical models for improved modeling performance. The core of Opt-Miner is a novel tree-guided data synthesis pipeline coupled with a retrieval-based group relative policy optimization (R-GRPO) algorithm, designed to foster the agent’s information-seeking capabilities. Specifically, we first formulate each problem into a tree structure, with its scenario contexts and mathematical techniques embedded in subtrees. We then employ subtree union, transfer, and knowledge fogging to synthesize complex, multi-domain problems that incorporate knowledge gaps, thereby necessitating active information seeking to solve these problems. Based on synthesized data, we propose R-GRPO for agent reinforcement learning. Experiments demonstrate that Opt-Miner-Qwen3-8B achieves performance comparable to 32B state-of-the-art specialized agents and commercial reasoning models.

Abstract:
Reconstructing highfidelity RAW sensor data from processed RGB images is a fundamental yet ill-posed problem, plagued by irreversible information loss and complex non-linear ISP transformations. While generative models offer high-quality reconstruction, they suffer from prohibitive computational costs. Conversely, dominant regression-based methods are fast but susceptible to incoherent observational deviations, often yielding over-smoothed predictions that drift from the authentic signal manifold. To reconcile this trade-off, we propose SHADE, a Single-step Homogeneous Aligned DEterministic flow framework. We validate that, unlike point-to-point regression, the single-step deterministic flow captures global transport trends and enables intrinsic robustness against input perturbations. Furthermore, we introduce Homogeneous Aligned Guidance to maximize fidelity. By leveraging a homogeneously initialized student-teacher DINO pair, this mechanism enforces alignment within a shared feature space, significantly amplifying the representational capacity. Extensive experiments demonstrate that SHADE achieves state-of-the-art performance on multiple benchmarks, establishing a new paradigm for accurate and efficient sensor data reconstruction.

Abstract:
Crossdomain offline reinforcement learning leverages a source dataset to improve policy learning in a data-scarce target domain, but dynamics mismatch makes many source transitions kinematically infeasible and can cause negative transfer. Recent non-parametric geometric methods (e.g., standard optimal transport and k-nearest neighbors) avoid overfitting yet often yield only relative rankings under an implicit matching or retrieval budget, making performance sensitive to hand-tuned thresholds when the true cross-domain overlap is unknown. We formulate availability estimation as soft subset selection by learning a source reweighting that geometrically aligns with the target. We proposeRobustOffline unbalancedOptimalTransport (ROOT): (i) a robust ambiguity set for uncertainty under limited target samples, and (ii) an unbalanced transport objective that penalizes mass deviation, enabling a principled transport-or-discard mechanism. ROOT thus down-weights or discards high-cost source samples rather than forcing them onto the target support. Moreover, the induced weights decay exponentially with transport cost, guaranteeing outlier suppression. On D4RL dynamics-shift benchmarks, ROOT improves downstream offline RL and outperforms strong baselines on most tasks without task-specific threshold tuning.

Abstract:
Diffusion Large Language Models (dLLMs) enable parallel token generation, and their blockwise variants have attracted significant attention. However, existing dLLMs usually exhibit an accuracy–parallelism trade-off, where raising tokens per forward (TPF) via aggressive parallel decoding often degrades task accuracy. To address this, we suggest developing a post-training approach to directly optimize the speed–quality frontier of pre-trained dLLMs. Conceptually, we do not require the model to decode aggressively along all sampling trajectories, but rather to find several highly parallelizable ones that can yield correct results. To this end, we resort to a reinforcement learning paradigm, i.e., LightningRL, to optimize rewards regarding both the final accuracy and inference parallelism. LightningRL follows the Group Relative Policy Optimization (GRPO) framework, with further improvements for dLLMs: 1) stabilized training via per-reward decoupled normalization, 2) token-level negative log-likelihood (NLL) loss on correct trajectories for regularization, and 3) improved training efficiency through dynamic sampling with TPF-aware filtering. Across maths and code tasks, LightningRL consistently advances the Pareto frontier, maintaining competitive accuracy while increasing parallelism to an average TPF of 7.3 (up to 11.10 on MBPP).

Abstract:
Probabilistic forecasting with exogenous variables is vital for decisionmaking but remains underexplored compared to deterministic methods. We propose KITE, a knowledge-guided probabilistic modeling framework designed to bridge this gap by addressing two key bottlenecks: (1) topological disparity in sampling initialization and (2) spurious covariate correlations during the iterative conditional generation process. KITE introduces aHistory-Conditional Manifoldto construct an informative source distribution from historical dynamics, effectively anchoring the starting point closer to the target space. Additionally, aKnowledge-Guided Conditioningmodule is developed to regularize variable interactions using statistical priors, suppressing spurious correlations and enhancing the robustness of covariate conditioning. Extensive experiments demonstrate that KITE outperforms state-of-the-art methods in both deterministic and probabilistic forecasting.

Abstract:
The trajectory prediction of Nbody systems is of great significance and remains challenging with broad applications across various fields such as physics, chemistry and biology. Recent advances in generative models including flow matching and diffusion models have emerged as effective solutions to this problem, owing to their capacity to model the stochasticity and underlying distributions of complex system trajectories. However, existing approaches typically adopt trivial prior distributions that neglect the temporal correlations and spatial symmetries of N-body trajectories, which not only complicates the generation process but also limits model performance. To address these limitations, we propose GP-EquiFlow, an SE(3)-equivariant flow matching model incorporating vector-valued Gaussian processes. Based on observed trajectories, we employ vector-valued Gaussian processes to construct SE(3)-equivariant prior distributions, which exhibit enhanced consistency with the target data distribution in both spatial and temporal dynamics. Extensive experiments on N-body simulations and molecular dynamics demonstrate that the proposed GP-EquiFlow delivers more accurate predictions while requiring fewer sampling steps, underscoring the effectiveness of integrating Gaussian process-based SE(3)-equivariant prior distributions in geometric trajectory prediction.

Abstract:
The advancement of Large Language Models (LLMs) into higherlevel reasoning domains has rendered traditional heuristic evaluators insufficient for long-form open-ended responses, precipitating the widespread adoption of rubric-based benchmarks. While these frameworks utilize expert-curated criteria and LLM-as-a-judge to assess open-ended generation, the intrinsic robustness of these evaluation harnesses to fundamental validity assessments remains critically under-investigated. To bridge this gap, we introduce RubricRobustness, a systematic sensitivity analysis framework that subjects these benchmarks to three common sense perturbations: semantic negation, stochastic deletion and irrelevant addition. We investigate the extent to which manipulating the semantic veracity of a model’s response impacts its resulting score by applying the robustness framework to two of the most popular rubrics-based benchmarks: HealthBench and WildBench. Our findings reveal systematic vulnerabilities: while both benchmarks respond sharply to semantic negation (e.g., degradation slopes of approximately -0.38 on HealthBench and -0.55 on WildBench), they are substantially less responsive to irrelevant addition, often requiring over 35% of sentences to be perturbed before inducing even a 25% score drop. We argue that perturbation-based sensitivity analyses of this form are a necessary prerequisite for validating rubric coverage, ensuring that automated evaluation frameworks reliably penalize basic semantic failures. We plan to release our framework as an open-source tool to facilitate the development of more resilient benchmarks.

Abstract:
In incomplete multiview clustering, unbalanced missingness is prevalent, where different views exhibit significantly varying missing rates, causing severe observation bias. This imbalance poses two core challenges: models develop serious learning biases by over-relying on low-missing-rate views while neglecting high-missing-rate ones, and cross-view data recovery becomes extremely difficult due to sparse training samples in highly missing views, leading traditional generation methods into a "data starvation" dilemma. Existing methods either naively assume low-missing-rate views as high-quality or lack effective debiasing mechanisms, showing limited performance under imbalance. To address this, we propose the Causal-Invariant Meta-Learning Network (CIMLN). It employs a meta-learning paradigm to transfer knowledge across views, using complete samples as support sets to guide generation for highly missing views. Meanwhile, it incorporates a causal inference framework with counterfactual reasoning and adversarial intervention strategies to eliminate spurious dependencies on observation patterns, learning causally invariant clustering structures. These modules synergistically optimize to ensure generated representations possess both feature fidelity and clustering discriminability. Extensive experiments on benchmarks demonstrate the effectiveness of CIMLN.

Abstract:
VisionLanguage-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs, obscuring the trade-offs and opportunities offered by heterogeneous edge accelerators (GPUs/XPUs/NPUs). We present a systematic framework for low-cost VLA deployment via model--hardware co-characterization. First, we build a cross-accelerator leaderboard and evaluate model--hardware pairs under CET (Cost, Energy, Time), showing that ``right-sized'' edge devices can be more cost-/energy-efficient than flagship GPUs while meeting control-rate constraints. Second, using fine-grained SM tracing and Roofline analysis, we uncover a consistent two-phase inference pattern: a compute-bound VLM backbone followed by a memory-bound Action Expert, which induces phase-dependent underutilization and hardware inefficiency. Finally, guided by these insights, we propose DP-Cache and V-AEFusion to reduce diffusion redundancy and enable asynchronous pipeline parallelism, achieving up to (2.9×) speedup on GPUs and (3.3×) on edge NPUs with only marginal success degradation. The code will be publicly available once the acceptance of the paper. The example leaderboard website is: \urlhttps://vla-leaderboard-01.vercel.app/.

Abstract:
Multimodal fusion object detection faces a substantial modality gap at the same backbone stage. This makes predefined stagealigned fusion insufficient for cross-stage interactions. We propose MFH-NAS, a hybrid neural architecture search framework that automatically discovers fusion architectures to better leverage cross-modal complementarity. MFH-NAS searches both local fusion primitives and stage-level fusion connectivity. It targets fusion operator design and fusion stage selection. It couples differentiable search with evolutionary search. Differentiable search learns architecture parameters for local fusion primitives. Evolutionary search explores global fusion topologies, including stage selection and cross-stage connection patterns. The joint search balances exploitation and exploration and mitigates premature convergence. It yields fusion structures that strengthen cross-stage interactions.We evaluate MFH-NAS on three public benchmarks, LLVIP, RGBT-Tiny, and M3FD. MFH-NAS consistently outperforms handcrafted fusion-stage designs and prior stage-searching NAS baselines, improving mAP@0.5 from 85.3% to 88.2% over strong fixed-stage fusion methods and delivering gains across all benchmarks.

Abstract:
Mixture of Experts (MoE) has emerged as a mainstream architecture for Large Language Models (LLMs), balancing computational efficiency with model scalability. While prior work has explored increasing tensorlevel sparsity via finer-grained expert configurations during pre-training, we identify significant unexploited sparsity at both the tensor and neuron levels during post-training and inference. To leverage this, we propose complete expert partition for post-training and threshold-based token-expert dropping for inference. These techniques improve the Mixtral-8×7B model's average accuracy by 1% across nine downstream benchmarks (notably 4% on GSM8K). To further optimize the accuracy-efficiency trade-off for inference, we introduce dual-threshold token-expert dropping with partial expert partition and reconstruction. Our approach yields a 1.19× MoE speedup and a 0.5% accuracy gain on Mixtral-8×7B when combining post-training and inference optimizations. For inference-only optimization on OLMoE-Instruct and DeepSeek-V2-Lite-Chat, we achieve up to 1.41× MoE speedup with a negligible accuracy loss (<0.5%).

Abstract:
Point cloud completion and generation are important across many 3D tasks, where both fidelity and sampling efficiency matter. Prevailing highfidelity approaches rely on long sampling schedules, which incur substantial inference latency. Few-step alternatives typically use rectification or distillation, leading to multi-stage training pipelines and potential quality trade-offs. We present 3D MeanFlow (3DMF), a distillation-free model that performs one-step average-velocity transport for point cloud completion and generation. We optimize an instantaneous-average consistency objective and impose a shape-level constraint to stabilize training. Additionally, we introduce PointPlug, integrating completion into 3D object detectors and evaluating its impact. PointPlug uses adaptive selection that balances benefit and latency. Across standard benchmarks, 3DMF achieves one-step sampling with an order-of-magnitude speedup while maintaining competitive fidelity. On nuScenes and KITTI, inserting PointPlug improves all evaluated detectors under comparable settings.

Abstract:
Reward design remains a central challenge in reinforcement learning (RL). Handcrafted rewards are often difficult to specify and may lead to suboptimal policies, while learned rewards from preferences can suffer from inefficiency and unstable training. Inspired by the dual nature of human learning explored in cognitive science, we decompose rewards into two complementary components: Formal Rewards (FR), explicitly designed based on task knowledge, and Residual Rewards (RR), learned from observations to capture implicit and nuanced preferences. Based on this decomposition, we propose CoRe, a hybrid framework that integrates FR and RR with vision-language models (VLMs) feedback to achieve preference-aligned policies without human involvement. Our contributions are twofold: (1) We propose a Formal Reward Module (FRM) that leverages VLMs to iteratively design and optimize FR based on task knowledge and preference feedback, enabling the continual improvement of policy during training; (2) We introduce a Residual Reward Module (RRM) that learns RR from video-level preference by employing VLMs to generate preference labels and capturing nuanced rewards that complement FR, ensuring alignment with human intent. Through the synergy of FRM and RRM, CoRe enables the automatic construction of reliable rewards that are efficient and preference-aligned. Extensive experiments demonstrate that CoRe outperforms existing approaches in terms of policy learning effectiveness and efficiency on ten robotic manipulation tasks in simulation and five real-worlds.

Abstract:
Speculative Jacobi Decoding (SJD) is an important approach for accelerating autoregressive image generation. Although SJD has shown superior performance, recent studies point out that it usually suffers from a token ambiguity issue during token verification but its reason can not be well explained. To figure out this reason, in this paper, we conduct a visualization analysis on vision token and find that different from text tokens, vision tokens generally corresponds to some local, small, and unclear vision details, which means only using single token is difficult to accurately express a certain semantic, thereby causing token ambiguity issue. To this end, we propose a novel Speculative Jacobi Decoding with Semantics Verification (called SJDSV), for accelerating autoregressive image generation. The key idea is that leveraging the strong correction characters between tokens to recognize semantic-aware token subsequence and then instead of perform token-by-token verification, turning to perform verification on semantic-aware token subsequence level for accelerating image generation. In particular, our method is plug-in, which can be directly integrated into existing SJD and its variants. Extensive experiments on various datasets show that existing SJD methods achieve significant performance improvement after integrating our SJD-SV method.

Abstract:
Reassembling N fragments in ndimensional space is a shape reconstruction task that is invariant to global rigid motions. Training directly on \mathcalM=\mathrmSE(n)^N can be ill-posed: standard losses penalize solutions that differ only by a global transform. Existing methods often address this with ad-hoc anchoring which breaks permutation invariance across fragments and can introduce biases that must be mitigated with extensive and costly data augmentation. We propose a geometric framework that enforces invariance by construction. First, a Global Gauge Fixing (GGF) strategy deterministically aligns configurations using an intrinsic generalized-inertia rule. Second, we introduce a quotient-invariant Flow Matching objective that operates via orthogonal projection onto the horizontal tangent bundle. This construction factors out global pose at each timestep, enabling the model to learn only shape-changing dynamics on the quotient space \mathcalM/\mathrmSE(n). Our unified \mathrmSE(n)-invariant framework admits efficient closed-form 2D/3D instantiations and improves accuracy on polygonal jigsaw puzzles and 3D fracture reassembly benchmarks.

Abstract:
Textto-image generation is widely used, but many applications require strict instance-level layout alignment. Masked Autoregressive (MAR) models on continuous latents are efficient and high-fidelity, yet flattening 2D latents into 1D sequences weakens spatial topology and hinders precise control. We propose Structure-Aware RoPE-MAR (StructMAR) to turn layout alignment from soft correlation into explicit structural alignment. StructMAR integrates 2D Rotary Positional Embeddings with a Layout-Guided Attention Bias to mechanistically enforce token-to-instance correspondence. We further apply Group Relative Policy Optimization (GRPO) to better align training objectives with layout-centric evaluation. On COCO-Position, StructMAR achieves state-of-the-art alignment (57.2 AP, 79.4 mIoU) while maintaining image quality comparable to strong diffusion baselines. On COCO-MIG, it improves robustness in dense settings (ISR 61.9, mIoU 57.4) and achieves a 4.05× inference speedup. These results highlight the importance of explicit structural inductive biases for robust, efficient controllable autoregressive generation; code is available at https://anonymous.4open.science/r/StructMAR-FE92/.

Abstract:
Imitation Learning (IL) has demonstrated strong capabilities in learning highquality policies from expert demonstrations for sequential decision-making tasks. Nonetheless, its effectiveness is significantly constrained in low-expert-data regimes. To mitigate this issue, previous works introduce ``offline IL with supplementary data" which augments expert demonstrations with additional, low-cost data generated by suboptimal policies. A prominent framework for this setting is Distribution Correction Estimation (DICE), which estimates the optimal density ratio by solving the dual of a divergence minimization problem between the learned policy and the expert visitation distribution. Despite their theoretical appeal, existing DICE-based methods often require adding a dataset regularizer to the divergence objective, or rely on a strict coverage assumption. These weaknesses limit the capability of DICE-based methods, causing them to be inefficient in some contexts. In this paper, we introduce ReDICE, a new method to address these limitations. ReDICE is derived by formulating an objective under a mixture distribution from the KL divergence between expert and learned policy occupancies. We formally prove that the dual of this formulation is mathematically equivalent to a stable Gumbel regression objective. Furthermore, we introduce a novel policy extraction mechanism that significantly improves performance in practice. Experiments across diverse benchmarks show that ReDICE achieves state-of-the-art results relative to prior offline IL baselines.

Abstract:
Realworld degradations, such as noise, blur, and low resolution, significantly impair the performance of Multi-modal Large Language Models (MLLMs) in document understanding tasks. Despite recent advancements, progress in this field remains stifled by two critical bottlenecks: the scarcity of large-scale, aligned training data necessary for learning robustness, and the lack of transferable restoration solutions across diverse MLLM architectures. To bridge the data gap, we first present DocRobust-VQA, a large-scale dataset explicitly constructed to support robustness training. Comprising 189K aligned low/high-quality document image pairs and 417K QA pairs, it provides the first substantial corpus for fine-tuning MLLMs to handle varying degradation conditions. Leveraging this data, we propose Uni-DocRobust, a universal plug-and-play framework that decouples restoration capabilities from specific visual encoders. Our method employs a frozen Universal Restoration Core pre-trained in a canonical feature space via multi-teacher distillation, which can be seamlessly integrated into target MLLMs (e.g., Qwen-VL, InternVL) through lightweight Feature Adapters. Extensive experiments demonstrate that Uni-DocRobust significantly enhances robust performance on MLLMs and enables a cost-effective ``pre-train once, deploy everywhere'' paradigm for robust MLLM deployment.

Abstract:
The utility of VisionLanguage Models (VLMs) in reasoning and auditing tasks hinges on their ability to exhaustively describe visual scenes. However, current models exhibit a pathology we term the Likelihood Trap: standard alignment objectives, specifically MLE and KL-regularization, drive generation toward generic, high-probability templates, systematically suppressing fine-grained details. To overcome this, we introduce Geo-RL, a framework that shifts the objective from probabilistic likelihood to geometric coverage. Geo-RL reformulates caption generation as maximizing the volume of a parallelotope in semantic space. By leveraging Determinantal Point Processes (DPPs), we enforce orthogonality among sampled descriptions, ensuring that they span the image's full semantic support. Crucially, we derive a closed-form leave-one-out marginal reward, enabling stable policy optimization. Empirically, Geo-RL escapes the trap, achieving a significant improvement in semantic richness and detail coverage without compromising visual grounding.

Abstract:
Reinforcement learningbased Multimodal Large Language Models (MLLMs) provide new perspectives for visual grounding, yet face significant challenges in Camouflaged Object Detection (COD) where objects blend seamlessly with backgrounds. This stems primarily from: difficulties in multi-object matching, the detrimental effects of low-quality samples, and erroneously localizing visual distractors with similar textures to true objects. We propose Progressive Matching and Semantic-aware Policy Optimization (PMSPO), a curriculum learning-based framework that employs Sinkhorn multi-object matching IoU reward during training for multi-object alignment, utilizes Positive Learning Gain Filtering (PLGF) to curate high-quality samples, and transforms deep visual features into semantic contrastive reward rules to calibrate target background semantics. Experiments on COD benchmarks demonstrate that PMSPO achieves state-of-the-art (SOTA) performance among reinforcement learning methods across all evaluation metrics.

Abstract:
Predicting 3D geometric trajectory requires capturing complex spatiotemporal dependencies while preserving physical symmetries. While flow matching offers a powerful generative paradigm, extending it to SE(3)equivariant dynamics is challenging due to the inherent gap between deterministic history and stochastic evolving flows. To address this, we introduce GSE-Flow, an SE(3)-equivariant flow matching framework. We first propose a Coherent Sequence Encoding and Time-Modulated Embedding strategy that unifies historical and evolving streams, incorporating velocity and flow time via equivariant affine transformations to guide continuous evolution. We further design a Geometry-Feature Tensorization mechanism that projects node states into a tensor product space, enabling Context-Flow Fusion to guide trajectory evolution with historical context. GSE-Flow guarantees theoretical SE(3)-equivariance and achieves SOTA accuracy on MD17, MD22, and CMU MoCap benchmarks for geometric trajectory prediction, while demonstrating generality by enhancing deterministic baselines.

Abstract:
Reliable uncertainty quantification (UQ) is crucial for deploying graph neural networks (GNNs) in safetycritical settings, yet dominant solutions either rely on costly multi-pass sampling or require retraining—often usingblack-box auxiliarymodels—to obtain evidential semantics. We proposeX-EviProbe, a simple and parameter-freepost-hocframework that turns afrozenGNN into an evidential predictor with a decomposable view of epistemic vs. aleatoric uncertainty. X-EviProbe constructs class-wise Dirichlet evidence by probing the frozen latent space and the model’s native outputs, and incorporates graph structure via lightweight evidence-strength propagation. This yields a transparent evidential representation without retraining or additional neural components. Extensive experiments on seven benchmarks show that X-EviProbe consistently ranks among the top methods for both OOD detection and misclassification detection, improving AUROC by up to33.4%and8.7%over the strongest baselines.

Abstract:
Agentic memory—conditioning large language and vision–language models on past cases, external knowledge, or meta‑experiences—has become a key mechanism for improving inference‑time reasoning. However, existing approaches largely rely on heuristic retrieval or expensive LLM‑based reranking, and do not explicitly learn how to compose memory for a given query. To address these limitations, we propose MemDecoder, a learned framework for adaptive agentic memory selection. MemDecoder formulates memory composition as an autoregressive index decoding problem over a retrieved candidate set, using a lightweight Transformer encoder–decoder to generate an ordered sequence of memory elements. This design enables efficient, task‑aware few‑shot reasoning without generating textual demonstrations. MemDecoder can be trained via supervised fine‑tuning and reinforcement learning with verifiable rewards. We further introduce a ranking‑aware variant of Group Relative Policy Optimization that exploits pairwise comparisons within response groups to provide richer learning signals. Experiments across visual question answering, mathematical reasoning, and scientific question answering benchmarks show that MemDecoder consistently outperforms prior agentic memory selection methods, demonstrating the benefits of the architectural design and learning algorithm of MemDecoder.

Abstract:
Combinatorial generalization remains a central challenge in deep reinforcement learning (DRL). Classical planning provides a simple yet challenging setting to study this problem through explicit relational descriptions, without requiring learning from perception. In sparsereward domains, standard RL exploration via real-time search is ineffective, and learning-based planning methods often rely on expert demonstrations, hindsight relabeling, or random walks from the goal state. In contrast, planners rely on best-first search methods such as \mathrmA^\star to solve problems from scratch. We propose a self-improving \mathrmA^\star learning framework in combination with a value heuristic represented by a Relational Graph Neural Network: the heuristic guides search, and the resulting search data updates the heuristic via Q-Learning. This loop yields heuristics that can function as general policies and solve new instances even without search, where DRL otherwise fails, as we show on puzzles such as Sokoban, PushWorld, The Witness, and the International Planning Competition 2023 benchmarks. Notably, we demonstrate strong zero-shot generalization: heuristics trained on Blocksworld instances with fewer than 30 blocks successfully solve instances with 488 blocks.

Abstract:
Rank estimation under label noise poses a fundamental challenge, as ordinal annotations often exhibit structured uncertainty rather than simple label corruption. In this paper, we reformulate rank estimation with noisy ordinal labels as a stochastic ordering problem, in which each instance is inherently associated with multiple plausible ranks instead of a single deterministic label. Based on this view, we propose stochastic order learning (SOL), a learning framework that captures ordinal label uncertainty and learns an embedding space through two complementary objectives: a discriminative loss that structures instancecentroid interactions and a stochastic order loss that enforces probabilistic ordering relations between instances. Extensive experiments across diverse datasets demonstrate that SOL enables reliable rank estimation under various types and levels of label noise.

Abstract:
Constraintbased causal discovery relies on conditional independence (CI) tests whose reliability degrades as conditioning sets grow, particularly in hub-dominated graphs. Existing methods constrain adjacency or global structure, but leave conditioning-set dimensionality uncontrolled. In this paper, we propose Ordering-Constrained Markov Blanket discovery (OCMB), a paradigm that treats conditioning-set dimensionality as a first-class constraint. OCMB decouples discovery into two stages: lightweight global ordering estimation providing directional priors, followed by local Markov blanket validation within small, ordering-constrained candidate sets. By enforcing directional constraints before any CI test, OCMB ensures bounded conditioning sets even with hub nodes. We show that OCMB recovers correct parent sets provided a high-recall ordering assumption holds, without requiring the ordering to be globally correct. Experiments demonstrate that OCMB significantly improves precision and robustness over constraint-based and hybrid methods in high-dimensional regimes where conventional CI-based approaches fail.

Abstract:
Transfer across diverse task compositions and unseen behaviors remains a significant challenge for visionlanguage action (VLA) models. Skills are repeatable and atomic components for various tasks, and similarities shared with different skills provide evidence for transferability across behaviors. However, existing skill-centric methods have two problems. First, skills are often loosely organized, lacking a hierarchy that can capture similarities and differences across skills. Second, they lack a mechanism which has the capacity to express transferable skill attributes in a structured parametric space. To this end, we propose SkillNet, which models skill attributes in a hierarchical way and regulates compositional model structure with transferable skill attributes. SkillNet exploits motion code and VerbNet Framework to explicitly model similarities of skills on mechanical properties and semantic roles, and organizes skills in a hierarchical way. Based on this hierarchy, SkillNet leverages the scalability of the mixture-of-experts (MoE) mechanism and develops skill embeddings as soft constraints to enable compositional generalization via similar expert activations on similar skills. On zero-shot and few-shot transfer experiments in simulators and real-world environments, SkillNet achieves an improvement of performance by 16.0% and 23.9%. Meanwhile, SkillNet achieves state-of-the-art performance on in-domain settings.

Abstract:
Accelerating diffusion models via feature caching has evolved from static reuse to polynomial extrapolation, yet current "cachethen-forecast" strategies remain limited by rigid, hand-crafted approximation families (e.g., Taylor or Hermite bases) that often misalign with the complex, layer-specific non-stationarity of generative feature dynamics. This paper introduces EigenCache, a theoretically grounded framework that re-frames acceleration as a problem of covariance learning and experimental design. By modeling feature trajectories as time-indexed stochastic processes governed by learnable temporal kernels, we demonstrate that the statistically optimal feature predictor (Minimum Mean Squared Error) is the Gaussian Process posterior mean (Kriging), which strictly generalizes and outperforms previous fixed-basis expansions. Crucially, this probabilistic formulation couples prediction with uncertainty quantification via a closed-form variance certificate. Leveraging this, we derive an information-theoretic scheduling algorithm that selects computation anchors by maximizing the log-determinant of the posterior covariance—a submodular objective that admits a provably near-optimal greedy solution. EigenCache thus provides a unified, training-free foundation for efficient inference, offering not only superior reconstruction accuracy but also a rigorous mechanism for robust, uncertainty-aware compute allocation.

Abstract:
This paper studies the critical problem of targeted adversarial attacks against closedsource MLLMs, which aim to generate highly transferable adversarial samples with open-source MLLMs. Previous approaches typically focus on maximizing the similarity of latent representations between adversarial samples and target samples. However, these approaches could overfit specific target samples with severely limited generalization ability to closed-source MLLMs. Towards this end, we propose a novel approach named Distribution-aware Adaptive Relational Transfer (DART) for adversarial attacks against closed-source MLLMs. The core of our DART is to adopt a statistical lens to characterize the intrinsic semantics of images for more generalized and robust alignment. In particular, each augmented image is considered an example from the intrinsic distribution of the original image. Then, we utilize non-parametric Energy Distance to measure the distribution divergence, which is naturally adopted for the semantic alignment in the hidden space. To further enhance transferability to specific target models, we learn a graph neural network (GNN) to explore the complex relations between source and target MLLMs on transferability and adaptively select surrogate models to maximize transferability across diverse targets. Extensive experiments on benchmark datasets validate the superior robustness and effectiveness of the proposed DART in comparison to various competing baselines.

Abstract:
While realworld applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserve more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent’s internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose PolicyGuard, a test-time step-level backdoor defense which leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time step. Besides, we also provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.

Abstract:
The Mamba model has gained significant attention for its computational advantages over Transformerbased models, while achieving comparable performance across a wide range of language tasks. Like Transformers, Mamba exhibits in-context learning (ICL) capabilities, i.e., making predictions for new tasks based on a prompt containing input-label pairs and a query, without requiring fine-tuning. Despite its empirical success, the theoretical understanding of Mamba remains limited, largely due to the nonlinearity introduced by its gating mechanism. To the best of our knowledge, this paper presents the first theoretical analysis of the training dynamics of a one-layer Mamba model, which consists of a linear attention component followed by a nonlinear gating layer, and its ICL generalization on unseen binary classification tasks, even when the prompt includes additive outliers. Our analysis shows that Mamba leverages the linear attention layer to select informative context examples and uses the nonlinear gating layer to suppress the influence of outliers. By establishing and comparing to the analysis of linear Transformers under the same setting, we show that although Mamba may require more training iterations to converge, it maintains accurate predictions even when the proportion of outliers exceeds the threshold that a linear Transformer can tolerate. These theoretical findings are supported by empirical experiments.

Abstract:
The rapid scaling of large language models (LLMs) has made distributed inference indispensable, yet endto-end latency is increasingly dominated by communication, forming a critical bandwidth wall that fundamentally limits the practical gains of existing quantization techniques. Existing approaches typically treat communication and computation in isolation, failing to exploit their coupled nature and introducing limited system-level acceleration and accuracy degradation. To address this, we propose CoCoQuant, a co-designed framework that jointly optimizes communication and computation as a unified end-to-end design space. CoCoQuant introduces a precision-aligned graph-rewriting that enables zero-overhead fusion between low-precision communication and computation. CoCoQuant formulates a hardware-aware mixed-precision allocation problem that integrates roofline-based cost modeling with relative sensitivity calibration, solved via global integer linear programming. Extensive experiments on LLMs of varing scales demonstrate that CoCoQuant achieves Pareto-optimal accuracy-latency trade-offs, delivering up to 2.92 end-to-end speedup with a negligible increase in perplexity (0.22).

Abstract:
Predictive memory has recently emerged as a powerful mechanism for online video models, enabling temporal reasoning beyond static memory banks. However, we observe a paradoxical phenomenon in which predictive memory often exacerbates drift under occlusion or rapid motion, as inaccurate predictions contaminate the internal state and lead to irreversible identity loss. We identify this failure as a reliability mismatch: generative predictive dynamics are applied uniformly, even when their uncertainty is high and observational evidence is weak. To address this issue, we reinterpret video memory as a dynamic latent process rather than a static buffer. Building on this insight, we introduce Reliabilityguided Predictive Memory (RPM), a framework that explicitly regulates when and how predictive dynamics should influence online video perception. RPM integrates a latent world model based on state-space latent dynamics to generate predictive priors, while employing a reliability-aware fusion policy that adaptively suppresses unreliable predictions during challenging scenarios such as occlusion and re-acquisition. We instantiate RPM on a SAM2-based foundation video model and evaluate it on challenging visual object tracking benchmarks, a representative instance of online video perception. Experimental results demonstrate that our method significantly reduces drift after occlusion, consistently outperforming strong baselines that rely on either static memory or unconditional predictive modeling. These findings establish that predictive memory is beneficial only when its reliability is explicitly modeled, and define a general principle for robust online video perception.

Abstract:
Most existing CLIPstyle medical vision--language pretraining methods rely on global or local alignment with substantial paired data. However, global alignment is easily dominated by non-diagnostic information, while local alignment fails to integrate key diagnostic evidence. As a result, learning reliable diagnostic representations becomes difficult, which limits their applicability in medical scenarios with limited paired data. To address this issue, we propose an LLM-Guided Diagnostic Evidence Alignment method (LGDEA), which shifts the pretraining objective toward evidence-level alignment that is more consistent with the medical diagnostic process. Specifically, we leverage LLMs to extract key diagnostic evidence from radiology reports and construct a shared diagnostic evidence space, enabling evidence-aware cross-modal alignment and allowing LGDEA to effectively exploit abundant unpaired medical images and reports, thereby substantially alleviating the reliance on paired data. Extensive experimental results demonstrate that our method achieves consistent and significant improvements on phrase grounding, image--text retrieval, and zero-shot classification, and even rivals pretraining methods that rely on substantial paired data.

Abstract:
Multiturn jailbreak attacks have emerged as a powerful threat to LLM safety, leveraging feedback from auxiliary judge models to iteratively refine harmful queries. Existing defenses mainly focus on detecting or blocking harmful content at the final turn, leaving the judge-driven refinement loop intact and allowing attackers to extract informative feedback from intermediate interactions. We introduceDecoy for theJudge (DJ), a semantics-preserving output rewriting approach that intervenes directly in this feedback loop by modifying the victim LLM’s responses before they are evaluated by the attacker’s judge. By misaligning the judge’s feedback signal, DJ perturbs the judge-driven refinement loop and degrades the attacker’s optimization process while preserving the semantic content of the original response. To enable robust and transferable rewriting, we construct a dataset that captures fine-grained distinctions among semantically similar responses with differing harmfulness signals. Leveraging this dataset, DJ is trained using a combination of supervised fine-tuning (SFT) and direct preference optimization (DPO), allowing it to reliably manipulate judge feedback across diverse judge LLMs. Experiments on HarmBench show that DJ significantly reduces the success rate of state-of-the-art multi-turn jailbreaks, while preserving performance on benign multi-turn benchmarks.

Abstract:
With diffusion and flow matching models achieving stateof-the-art generating performance, the interest of the community now turned to reducing the inference time without sacrificing sample quality. Consistency Models (CMs), which are trained to be consistent on diffusion or probability flow ordinary differential equation (PF-ODE) trajectories, enable one or two-step flow or diffusion sampling. However, CMs typically require prolonged training with large batch sizes to obtain competitive sample quality. In this paper, we examine the training dynamics of CMs near convergence and discover that CM trajectory tangents -- CM output update directions -- are quite oscillatory, in the sense that they move parallel to the data manifold, not towards the manifold. To mitigate oscillatory trajectory tangents, we propose a new loss function, called the \em manifold feature distance (MFD), which provides manifold-aligned trajectory tangents that point toward the data manifold. Consequently, our method -- dubbed \em Align Your Trajectory Tangent (AYT) -- can accelerate CM training by orders of magnitude and even out-perform the learned perceptual image patch similarity metric (LPIPS). Furthermore, we find that our loss enables training with extremely small batch sizes without compromising sample quality.

Abstract:
Electroencephalography (EEG) foundation models (EFMs) have achieved strong performance across a wide range of downstream EEG tasks via pretraining and finetuning. Through empirical analysis, we observe that (i) no single EFM consistently dominates all tasks, yet identifying the task-specific optimal model by fine-tuning all EFMs introduces substantial computational overhead; and (ii) models with inferior task-level performance still exhibit strengths at the sample level as distinct architectures induce diverse inductive biases. These observations motivate EmBrace, a representation-centric framework for sample-aware knowledge fusion that avoids the constraints of parameter-level or output-level alignment. EmBrace synchronizes discriminative intermediate representations into a unified manifold and adaptively weights multiple EFMs at the sample level while selecting the most compatible model as the carrier. Extensive experiments across multiple EEG benchmarks demonstrate that EmBrace consistently improves over SOTA EFMs and generalizes effectively under cross-task settings.

Abstract:
Reinforcement finetuning (RFT) has shown promise for training LLM agents to perform multi-turn decision-making based on environment feedback. However, most existing evaluations remain largely in-domain—training and testing are conducted in the same environment or even on the same tasks. In real-worlddeployment, agents may operate in unseen environments with different background knowledge, observation spaces, and action interfaces. To characterize the generalization profile of RFT under such shifts, we conduct a systematic study along three axes: (1) within-environment generalization across task difficulty, (2) cross-environment transfer to unseen environments, and (3) sequential multi-environment training to quantify transfer and forgetting. Our results show that RFT generalizes well across task difficulty within an environment, but exhibits unstable transfer to unseen environments, which correlates with shifts in both semantic priors and observation/action interfaces. In contrast, sequential training yields promising downstream gains with minimal upstream forgetting, and mixture training across environments improves the overall balance. We further provide detailed analyses and deeper insights, and hope our work helps the community develop and deploy generalizable LLM agents.

Abstract:
Monocular 3D human pose estimation is fundamentally challenged by severe occlusion and inherent depth ambiguity. To address this, we propose Glimpse, a framework that learns robust 3D poses by explicitly modeling anatomical geometry from a single image. We recast the problem as geometry learning of multiscale structural priors, realized through two synergistic components. First, structured sampling captures the body's geometric continuity through dual-level feature extraction, acquiring both local joint appearance and continuous features along skeletal limbs via deformable sampling. By propagating limb-level geometric cues to their connected joints, this design bridges information gaps caused by occlusion. Second, geometric correction ensures global 3D consistency by lifting coherent 2D features into a canonical 3D reference space, where a shared 3D anchor guides a distance-aware fusion mechanism. Extensive experiments conducted on Human3.6M and MPI-INF-3DHP demonstrate that Glimpse achieves state-of-the-art performance, with superior robustness under severe occlusion and complex articulation.

Abstract:
Textto-video diffusion transformers encode semantic information unevenly across model depth, which constrains effective concept erasure. We identify a representational bottleneck, termed concept–layer topological alignment, under which target concepts exhibit higher separability at certain representational depths. Outside these depths, concept and non-target signals remain strongly entangled, limiting the effectiveness of depth-specific erasure. This observation reframes concept erasure as the problem of identifying representational depths where concept–non-target separation naturally emerges. Motivated by this structural constraint, we introduce CLEAR, a separability-driven optimization framework for concept erasure that explicitly enforces concept–layer alignment. CLEAR operationalizes this principle by formulating layer selection as an optimization problem over concept–non-target separability, rather than relying on layer-agnostic or heuristic choices. To enable this, we introduce a separability-aware objective that favors layers exhibiting stronger concept–non-target separation. Experiments on large-scale text-to-video diffusion models demonstrate that enforcing concept--layer alignment leads to more precise concept suppression while preserving overall generative quality.

Abstract:
Modern deep learning is increasingly characterized by the use of openweight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. Model merging has recently emerged as an effective way to leverage these existing resources, enabling the composition of capabilities from different model checkpoints. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then multiple checkpoints are merged to obtain a more capable model. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model merging. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance across vision and language modalities, multiple model scales, and both fully fine-tuned and LoRA-adapted models. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps. This causes negative parameter interference and encodes knowledge that is forgotten during merging. Finally, we demonstrate that task-dependent aggressive early stopping strategies can significantly improve model merging performance.

Abstract:
Most federated recommender systems represent each user with a single embedding learned from local interaction data, implicitly assuming that user preferences are fixed and precisely identifiable. In federated settings, however, each client observes only a limited and fragmentary view of user behavior, rendering such point estimates inherently brittle. To address this mismatch, we model user preferences as distributions rather than points, allowing multiple compatible preference representations to coexist. Rather than collapsing evidence into a single embedding, our approach preserves uncertainty and diversity in user representations, providing a richer basis for preference modeling. We instantiate this idea with a diffusionbased generative framework that produces diverse user embeddings and derives recommendation scores by aggregating predictions across them. This distributional formulation yields more stable ranking behavior and improved robustness under ambiguous feedback. Extensive experiments on federated recommendation benchmark datasets demonstrate consistent and significant improvements over baselines. Our code is available.

Abstract:
Recent advances in large language models have accelerated neural theorem proving (NTP). Isabelle is a mature and important formal theorem prover that has been widely used in software and hardware verification. However, progress in the Isabelle setting remains limited. Existing approaches either optimize search strategies or train on highly imbalanced raw proof corpora. At the same time, the specialized structure of Isabelle proofs limits the effectiveness of generalpurpose data selection methods. To address these challenges, we adopt a data-centric framework for neural theorem proving in Isabelle. We characterize high-quality formal proof data along three complementary dimensions—proof complexity, semantic coverage, and reasoning diversity (PSR)—and propose a PSR-guided data selection pipeline to construct a compact, high-quality training subset. In addition, we leverage verifier feedback as a dynamic data signal during inference, introducing a dynamic feedback-based prompt optimization that iteratively incorporates Isabelle verifier feedback to guide proof generation. We construct and release a 4k high-quality Isabelle dataset based on the PSR criterion. On the miniF2F-test, fine-tuning solely on PSR-selected data achieves 84.8% Pass@64. When further combined with dynamic feedback–based prompt optimization, the full framework improves performance to 90.6% Pass@64, establishing a new state of the art for neural theorem proving in Isabelle.

Abstract:
Autoregressive Large Language Models (LLMs) often fail in complex reasoning because earlystage errors remain uncorrectable in subsequent steps—a limitation fundamentally rooted in the inherent irreversibility of the Transformer architecture. In this paper, we propose HEdit, a lightweight reasoning enhancement paradigm that equips models with a "hindsight-like" capability for dynamic error correction during generation. Our core insight involves deconstructing reasoning failures into two pivotal stages: latent representational biases emerging at logical anchors, and the subsequent eruption of explicit cognitive dissonance at trigger points. Based on these observations, the HEdit framework detects internal inconsistency signals at trigger points in real-time, actively backtracks to critical anchors, and utilizes a lightweight trainable editor to precisely refine their Key-Value (KV) caches. This mechanism effectively breaks the unidirectional constraints of autoregressive inference. Empirical results demonstrate that HEdit significantly enhances the performance of various models on mathematical reasoning tasks—with average accuracy improvements ranging from 2.2% to 10.8%—while maintaining extremely low overhead (add parameters <0.5%). HEdit provides a dynamic, pluggable and lightweight solution, making it particularly beneficial for users in low-resource environments. Our code can be found at anonymous github: https://anonymous.4open.science/r/ac3d2-51CF/

Abstract:
Fewshot classification aims to adapt a pretrained model to novel classes with limited examples. While current methods often heuristically combine pretrained knowledge and few-shot evidence, we seek a more principled understanding of their relationship. In this paper, we propose a Bayesian-inspired optimal integration framework(BOIF) that interprets pretrained models as priors and few-shot evidence as likelihoods. Under conditional independence approximation, we show that the optimal log-posterior decomposes into the sum of prior logits and likelihood logits. This leads to a simple yet effective design principle: decouple the prior and likelihood pathways and combine their logits additively. Guided by this principle, we implement BOIF using CLIP with two novel enhancements: (1) a multi-level feature adapter to enrich visual representations, and (2) a simplified cache module for likelihood estimation. Extensive experiments on 11 benchmarks show BOIF achieves state-of-the-art performance (e.g., 80.61% average accuracy at 16-shot) and strong out-of-distribution robustness. Our work provides both a principled perspective and an effective instantiation for few-shot adaptation.

Abstract:
Visual data is typically a prerequisite for training existing video anomaly detection (VAD) methods. However, obtaining sufficient annotated anomaly data for training is challenging and not scalable due to the rarity of anomaly data and the wide variety of abnormal events. In this work, we advocate that the effectiveness of treating texts as video sequences for the VAD model and propose a novel TextDriven Video Anomaly Detection (TD-VAD) approach to break visual dependence. In contrast to the anomaly video data, text descriptions of abnormal events are easy to collect, and their class labels can be directly derived. Specifically, our method utilizes video-like text descriptions with temporal characteristics generated by LLM to train a VAD model, without any reliance on target-domain anomaly data. To capture the long and short-range temporal logic of events, we design the event evolution causal attention module to model contextual dependencies across time. During inference, considering the domain gap between the texts and video sequences, we use the frozen CLIP encoder to extract embeddings of video frames to align the text modality while retaining crucial visual information. Comprehensive experiments on two large-scale VAD datasets, XD-Violence and UCF-Crime, demonstrate that our method outperforms prior one-class and unsupervised VAD methods by a large margin.

Abstract:
Naturallanguage explanations are widely used to interpret machine learning models, yet many prioritize human plausibility over accurately reflecting or predicting model behavior. Prior approaches often rely on human-written rationales, producing post-hoc explanations that neither align with the model’s decision function nor generalize. We introduce OPEX , a natural-language explanation model that directly optimizes for behavioral faithfulness: the ability of an explanation to reflect and predict a model’s observable input–output behavior. OPEX is trained using reinforcement learning with Group Relative Policy Optimization (GRPO), optimizing two complementary metrics: recoverability, which measures whether explanations recover model predictions on seen examples, and simulatability, which measures prediction of model behavior on unseen inputs. Across structured and text-based tasks, OPEX achieves high simulatability (∼0.85) and recoverability (∼0.99), outperforming GPT-4o, LLaMA-3.3-70B, and human-written explanations; despite having a 8B-parameter backbone. Human user studies show a 15% improvement in classification accuracy over competent baselines

Abstract:
Recent Spiking Transformer models have explored a variety of attention mechanisms beyond standard dotproduct formulations. However, many existing similarity-based spiking attention formulations remain inherently sensitive to firing density, causing neurons with high spike rates to dominate attention scores regardless of semantic relevance. This density bias is particularly problematic in event-driven spiking representations, where sparse spike patterns often carry critical information. To address this limitation, we rethink spiking attention from a set-theoretic perspective. We propose DiceFormer, a novel Spiking Transformer architecture driven by Spike Dice Attention (SDA). Unlike traditional approaches, SDA replaces density-sensitive measures with a set similarity function derived from the Dice coefficient. By explicitly normalizing for firing density, SDA focuses on spike co-occurrence rather than high firing rates. We primarily evaluate DiceFormer on the challenging audio domain, where spike sparsity varies substantially across inputs. On AudioSet-20k, DiceFormer achieves a SOTA mAP of 0.161 with 54.3M parameters, outperforming prior SNN-based approaches and substantially narrowing the performance gap with ANN-based models. We also introduce Lin-SDA, a linearized version for computation efficiency, while achieving performance comparable to SDA. Beyond audio, we evaluate the effectiveness of SDA on CIFAR-100 to verify its applicability to the vision domain.

Abstract:
Examplebased guidance is widely used to improve mathematical reasoning at inference time, yet its effectiveness is highly unstable across problems and models—even when the guidance is correct and problem-relevant. We show that this instability arises from a previously underexplored gap between strategy usage—whether a reasoning strategy appears in successful solutions—and strategy executability—whether the strategy remains effective when instantiated as guidance for a target model. Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, leading to complementary strengths and consistent source-dependent reversals under guidance. Building on this diagnosis, we propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi-route, source-aware signals. Across multiple mathematical reasoning benchmarks, SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance, improving accuracy by up to +13 points on AIME25 and +5 points on Apex for compact reasoning models.

Abstract:
While edge GPUs are increasingly used for latencycritical DNN tasks, limited resources often fail to meet strict real-time (RT) requirements under concurrent workloads. Existing preemption and early-exit mechanisms often underutilize GPU resources through single-task queuing and sacrifice excessive accuracy during task bursts. To address this, we propose RTInfer, a novel system that enables concurrent RT task execution while balancing throughput and accuracy. RTInfer integrates an accuracy-calibrated lightweight variant co-optimization to generate efficient models, a memory-layout-aware scheduler to mitigate fragmentation during preemption, and an on-demand loading strategy to minimize host-to-GPU latency. Extensive evaluations demonstrate that RTInfer outperforms state-of-the-art methods by up to 98.2% in deadline miss rate (DMR) and 58.0% in accuracy.

Abstract:
Many visual reinforcement learning (RL) algorithms learn representations by matching latent distances to a behavioral distance induced by reward and transition similarity. In practice, the choice of the latent distance can strongly affect performance: using a fixed, prespecified global norms (e.g., \ell_p norms or other hand-designed metrics) may be overly restrictive to capture the behavioral distance. In contrast, unconstrained pairwise distances may admit degenerate solutions that drive the metric loss down without improving the representation. To address this gap, we introduce PAMD: Pairwise Adaptive Mahalanobis Distance, which parameterizes a positive-definite, pair-conditioned metric for measuring latent state similarity. PAMD is a simple plug-in for existing bisimulation-based methods, offering a more expressive yet structured alternative to fixed, pre-specified latent distances. We empirically validate our method on visual MuJoCo continuous-control tasks, where final performance of several recent bisimulation-based RL algorithms is substantially improved when equipped with the distance we propose.

Abstract:
We propose a constrained latent optimization method for rewardguided generation that preserves white Gaussian noise characteristics with negligible overhead. Test-time latent optimization can unlock substantially better reward-guided generations from pretrained generative models, but it is prone to reward hacking that degrades quality and also too slow for practical use. In this work, we make test-time optimization both efficient and reliable by replacing soft regularization with hard white Gaussian noise constraints enforced via projected gradient ascent. Our method applies a closed-form projection after each update to keep the latent vector explicitly noise-like throughout optimization, preventing the drift that leads to unrealistic artifacts. This enforcement adds minimal cost: the projection matches the O(N \log N) complexity of standard algorithms such as sorting or FFT and does not practically increase wall-clock time. In experiments, our approach reaches a comparable Aesthetic Score using only 30% of the wall-clock time required by the SOTA regularization-based method, while preventing reward hacking.

Abstract:
Generative models for 3D molecular conformations must respect Euclidean symmetries and concentrate probability mass on thermodynamically favorable, mechanically stable structures. However, E(3)equivariant diffusion models often reproduce biases from semi-empirical training data rather than capturing the equilibrium distribution of a high-fidelity Hamiltonian. While physics-based guidance can correct this, it faces two computational bottlenecks: expensive quantum-chemical evaluations (e.g., DFT) and the need to repeat such queries at every sampling step. We present Elign, a post-training framework that amortizes both costs. First, we replace expensive DFT evaluations with a faster, pretrained foundational machine-learning force field (MLFF) to provide physical signals. Second, we eliminate repeated run-time queries by shifting physical steering to the training phase. To achieve the second amortization, we formulate reverse diffusion as a reinforcement learning problem and introduce Force--Energy Disentangled Group Relative Policy Optimization (FED-GRPO) to fine-tune the denoising policy. FED-GRPO includes a potential-based energy reward and a force-based stability reward, which are optimized and group-normalized independently. Experiments show that Elign generates conformations with lower gold-standard DFT energies and forces, while improving stability. Crucially, inference remains as fast as unguided sampling, since no energy evaluations are required during generation.

Abstract:
Messages embedded in diffusion generation noise suffer from severe attenuation due to denoising and VAE decoding, creating a persistent capacity–robustness tradeoff. Identifying that extraction accuracy strictly correlates with the distance between candidate hypothesis images, we propose ASIR, a training-free and provably secure steganography framework for both pixel and latent diffusion models. ASIR introduces two key innovations: (i) Antipodal Sampling, which maximizes signal separation in probability space to enhance distinguishability, and (ii) Iterative Recovery, a paradigm shift that treats extraction as a gradient-based optimization problem to reverse non-linear distortions. Extensive experiments demonstrate that ASIR achieves state-of-the-art performance, embedding up to 65,536 bits (pixel-space) and 16,384 bits (latent-space) with 99% accuracy, while remaining statistically undetectable to deep steganalyzers.

Abstract:
We introduce MahalanobisPruned Mixture-of-Experts (MP-MoE), a novel routing framework that approaches expert selection from the perspective of ensemble pruning. Existing Mixture-of-Experts (MoE) routing strategies often suffer from representation collapse due to greedy top-k selection mechanisms or rely on complex auxiliary regularization terms that may compromise model performance. To address these issues, we formulate routing as a diversity-aware subset selection problem and optimize a Mahalanobis-distance-based objective that explicitly enhances expert diversity. Specifically, we demonstrate that the expert co-occurrence matrix effectively captures inter-expert correlations, allowing us to efficiently model the covariance structure required for distance computation without accessing expert parameters. Furthermore, we devise a greedy strategy for the routing mechanism, backed by theoretical approximation guarantees, rendering it a plug-and-play module with negligible overhead. MP-MoE increases wall-clock training time by approximately 3%, while incurring no additional latency at inference time. Extensive experiments demonstrate that during the pre-training of the large language model, our method consistently outperforms the baseline by 1-3 percentage points across a broad range of benchmarks.

Abstract:
The “thinking with images” paradigm has led multimodal large language models to generate intermediate visual steps—such as cropping, annotation, spatial localization, and sketches—to enhance highresolution perception and complex reasoning. However, existing multimodal Process Reward Models (PRMs) evaluate only textual reasoning and cannot judge the correctness of these visual steps, creating a key gap when visual reasoning is essential for solving tasks. We propose Discriminative Visual Process Reward Model (DiscPRM), a multimodal PRM that jointly evaluates textual and visual intermediate steps by modeling visual reasoning trajectories, image operations, and text-image consistency. To support this, we build VTReward-100K, a dataset of step-by-step visual reasoning sequences with supervision. Experiments show that using DiscPRM for Best-of-N process supervision substantially improves multimodal reasoning performance on tasks requiring visual intermediate steps, achieving over 5% gains across benchmarks. We further introduce VABench, the first benchmark for evaluating PRMs on visual reasoning error detection. We hope this work can provide foundational support for advancing the emerging direction of visual–textual process reward.

Abstract:
Large language models (LLMs) increasingly perform multistep reasoning, where intermediate claims form implicit directed acyclic graphs whose node correctness is structurally conditioned on their ancestors. This makes factuality uncertainty structural, rather than a trivial accumulation of node-wise errors, and necessitates inference-time uncertainty quantification over the reasoning structure. While conformal prediction (CP) offers flexible user-specified factuality control, existing work remains post-hoc and cannot intervene during generation. To fill the gap between CP’s flexibility and its post-hoc limitation, we propose anInference-Time Conformal Reasoning (ITCR)framework that integrates CP directly into reasoning graph generation. ITCR learns a structure-level factuality uncertainty function that aggregates claim-level factuality signals over reasoning graphs without complex modeling assumptions. We then design the non-conformity score based on graph-level factuality uncertainty and calibrate the conformal threshold to decide when to stop generation. We theoretically show such generation is nested, yielding valid coverage guarantees for factuality control. Experiments over multiple datasets and coverage objectives demonstrate empirically valid coverage. In downstream reasoning tasks, inference-time calibrated graphs yield more accurate generation than post-hoc pruned graphs.

Abstract:
This paper presents OvisOCR, a lightweight and strictly endto-end Multimodal Language Model (MLLM) tailored for document parsing. Unlike current methods that rely on complex "Crop-OCR-Merge" cascades to handle high-resolution inputs, OvisOCR directly maps full-page visual signals to structured Markdown without localized slicing or layout detection dependencies. Through extensive evaluations on the OmniDoc benchmark, OvisOCR achieves SOTA performance, demonstrating that a compact E2E model can effectively ``digest'' the capabilities of intricate pipelines and surpass specialized and general methods. Technically, OvisOCR establishes a holistic paradigm that synergizes specialized perception with general reasoning, distilling fine-grained recognition from OCR engines and semantic correction from LLMs into a unified model. To balance the performance across diverse document constituents, we design category-specific reward mechanisms for distinct element types, such as dense text, complex tables, and formulas, and ensure the model enhances its formatting strengths for each modality concurrently. This approach effectively resolves the optimization conflict, guaranteeing that improvements in structural layout parsing do not come at the expense of omitting fine-grained textual details. Empirical results confirm that OvisOCR eliminates the error propagation inherent in split-and-merge architectures, offering a streamlined path for next-generation document intelligence.

Abstract:
Timeseries foundation models (TSFMs) deliver strong cross-domain generalization, but their scale makes deployment costly. Knowledge distillation is a natural compression route, yet prior TSFM distillation typically imitates teacher outputs, features, or pairwise relations, and therefore remains tightly coupled to teacher-specific training trajectories while underutilizing two empirical properties: (i) high-level representations across model scales tend to converge toward a shared, approximately low-rank geometry, and (ii) layer-wise utility follows a long-tail pattern. We propose consensus subspace distillation, which reframes distillation as aligning a student to a model-agnostic geometric object: a scale-invariant low-rank consensus subspace together with its center statistics. Offline, we screen high-contribution layers via drop-layer marginal loss, estimate a shrinkage-stabilized covariance from their embeddings, and derive a truncated eigensubspace that defines a consensus projector. Online, we project student embeddings into this subspace and match the teacher’s projected mean and covariance using a lightweight mean--covariance objective, enabling stable optimization without rigid pointwise feature binding. To mitigate subset-induced bias, we further introduce a frequency-domain uncertainty injection mechanism that inflates spectral density based on characteristic-function discrepancies and injects dispersion only within the consensus directions. Across forecasting and imputation, the distilled student matches or slightly improves upon the teacher, while exhibiting a predictable trade-off under strict zero-shot classification. With MOMENT-Large as teacher, we achieve about 90% parameter reduction and substantial distillation-time savings while retaining comparable performance across multiple time-series tasks. Code and compressed weights are available at anonymous.4open.science/r/CSD-13C3/.

Abstract:
Tertiary structurebased RNA design aims to generate RNA sequences that can fold into desired 3D structures, but remains a challenging problem due to the scarcity of annotated data, structural noise, and the intrinsic complexity of RNA topology. Existing structure-to-sequence frameworks largely rely on static k-nearest neighbor graphs and rigid message passing schemes, which fail to capture the flexible and heterogeneous nature of RNA geometry. To address these issues, we propose a unified framework, ATL-FGD, that integrates Adaptive Topology Learning (ATL) and FiLM-Guided Distillation (FGD) for robust RNA design. ATL introduces a differentiable edge gating mechanism to jointly learn topology and representation, enabling the model to construct data-driven, layer-adaptive graphs that better reflect structural dynamics and biochemical consistency. On top of this, FGD bridges structural and sequence representations via feature-wise linear modulation, softly transferring the semantic knowledge from RNA foundation models without relying on them during inference. Extensive experiments on tertiary structure-based RNA design benchmarks demonstrate that our approach achieves significant improvements in both sequence recovery and structural fidelity.

Abstract:
The rapid growth of deep neural networks (DNNs) has intensified the demand for efficient hardware acceleration under quantization. While prior research has successfully reduced weight and activation precision, partial sums generated during accumulation often retain high precision, resulting in significant energy overhead. In this work, we analyze psum distributions in tiled architectures and reveal that withintile outliers are input-dependent. We propose PsumQuant, a post-training, input-aware quantization that predicts psum scales on-the-fly. By leveraging the crest factor of input activations, our learnable scale predictor effectively bounds the psum bit-width while handling the extreme outliers in DNNs. Experimental results on a 128 × 128 systolic array demonstrate that PsumQuant compresses psum precision down to 8-bit within only a 1% accuracy drop on ResNet-18 and a marginal 0.04 perplexity increase on Llama-3.1. Furthermore, bit-width reduction with PsumQuant results in a 45% reduction in total energy with minimal accuracy loss, demonstrating that PsumQuant provides a highly efficient solution for actual NPU architectures.

Abstract:
Although reinforcement learning (RL) enhances the reasoning capabilities of large language models (LLMs), it is primarily learned from the model's selfgenerated distribution, limiting its ability to acquire reasoning skills beyond its initial knowledge. To overcome this, we propose a Difficulty-Aware Learning Strategy Allocation (DALSA) framework, which adaptively assigns appropriate learning strategies to samples based on their difficulty signals. DALSA is built on the key insight that samples beyond models' knowledge scope are better addressed through supervised fine-tuning (SFT), while those within the boundary but insufficiently mastered benefit more from RL, and well-learned samples are discarded to avoid redundant updates. To realize this principle, we extract a series of difficulty-aware training characteristics and employ a learnable strategy allocator to dynamically determine the optimal learning strategy for each sample based on its training dynamics. The allocator and the LLM are alternately optimized, enabling adaptive strategy allocation. Furthermore, two regularization techniques, anti-curriculum weighting and adversarial label smoothing, are integrated to alleviate the inherent limitations of RL and SFT, backed by comprehensive theoretical analyses. Extensive experiments on ten LLMs ranging from 1.5B to 70B across various tasks indicate that DALSA consistently outperforms baselines under both full and parameter-efficient fine-tuning settings.

Abstract:
Textattributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent efforts to integrate Graph Neural Networks (GNNs) and Large Language Models (LLMs) have shown promise for learning on TAGs, yet achieving well-aligned representations remains challenging. Prior studies largely rely on heuristics that perform coarse-grained matching. They lack sufficient constraints and ignore distributional alignment, leading to representation drift and limited generalization. Building on Energy-based Models (EBMs), we propose anEnergy-basedRepresentationAlignment (ERAlign) framework that projects GNN-encoded graph structure and LLM-derived text embeddings in a shared latent space to achieve distribution consistency. Concretely, layer-wise alignment is quantified by a distance metric and optimized via an EBM objective. By decreasing energy values, our framework yields well-aligned representations for downstream tasks. During training, we introduce Energy Discrepancy (ED) to avoid high sampling costs associated with intractable normalization. ED also carries theoretical guarantees of higher training efficiency and reduced energy landscape distortion. Empirical evaluations on eight TAG datasets demonstrate that ERAlign obtains state-of-the-art performance across varying levels of supervision and cross-task transfer scenario.

Abstract:
Conformal Prediction (CP) provides rigorous finitesample coverage guarantees, yet its statistical efficiency hinges critically on the size of the calibration set. In data-scarce regimes, CP often suffers from volatile quantile estimation, leading to overly conservative and wide prediction intervals. To address this, we propose Random Score Alignment-Conformal Prediction (RSA-CP), a simple framework designed to improve sample efficiency in small-sample CP. Instead of requiring the computationally intensive generation of full synthetic datasets, RSA-CP enhances calibration by directly aligning real scores with a high-resolution reference score distribution. By employing an optimal transport mapping, our framework refines "step-like" quantile increments through a globally optimal use of reference information. We provide theoretical guarantees establishing that RSA-CP maintains robust coverage without any distributional assumptions on the reference scores. Empirical evaluations demonstrate that RSA-CP consistently produces shorter and more precise prediction intervals while maintaining finite-sample coverage guarantees. Overall, RSA-CP offers a computationally efficient and theoretically grounded solution for robust uncertainty quantification under limited data.

Abstract:
Diffusion models have achieved remarkable success in generative modeling, yet their application to time series forecasting remains suboptimal. Existing approaches apply uniform Gaussian noise across all time steps, assuming all frequency components should be corrupted at the same rate. However, energy distribution across frequencies in time series is highly nonuniform: when uniform noise is added, high-frequency components are disproportionately overwhelmed while low-frequency trends remain inadequately diffused. We propose EADiff, an energy-adaptive diffusion framework operating in the wavelet domain to address this frequency-energy imbalance. Our key insight is that high-energy components require stronger perturbation while low-energy details need gentler corruption to preserve informative structures. We introduce a learnable modulation mechanism that automatically adjusts noise levels for each frequency band on a per-instance basis. Built upon this adaptive scheduler, we design a conditional diffusion framework where low-frequency trends serve as generation conditions, and noise-level-aware loss weighting naturally emphasizes different frequency components according to their signal characteristics. This cohesive design enables the model to respect the intrinsic multi-scale structure throughout both forward and reverse processes. Extensive experiments demonstrate that EADiff consistently outperforms existing diffusion-based and state-of-the-art deterministic methods.

Abstract:
Mixtureof-Experts (MoE) architectures provide strong scalability for large language models, but their large expert parameter footprint poses challenges for efficient deployment. Expert pruning is widely used to reduce model size and inference cost; however, existing approaches are token-agnostic, treating all tokens equally when estimating expert importance. This uniform treatment dilutes the contributions of informative tokens and leads to suboptimal pruning decisions. To address this fundamental limitation, we propose Step (Selective Token-guided Expert Pruning), a token-aware framework that rethinks expert pruning from the perspective of selective token guidance. By incorporating loss-aware expert evaluation and a lightweight knowledge-preserving mechanism, Step reduces information loss while removing redundant experts. Extensive experiments across different MoE architectures and model scales demonstrate the effectiveness of Step. On the 30B Qwen3 MoE model with 50% expert sparsity, Step achieves nearly a 50% reduction in memory usage with minimal performance degradation, delivers a 1.5× throughput improvement, and completes the entire pruning process within 10 minutes.

Abstract:
Offline metareinforcement learning requires agents to generalize to unseen tasks from fixed datasets, yet existing sequence-based and MoE-based methods rely on implicit or token-level routing signals that fail to capture task-level structure. We propose theTask-Guided Router (TGR), a structured expert-routing framework that explicitly models inter-task relationships via multi-view task representations that combine semantic descriptors, behavioral summaries, and latent dynamics features. Using structure-guided routing, TGR assigns experts based on global task compatibility rather than local trajectory fragments, enabling stable specialization and effective knowledge transfer across tasks.Extensive experiments on continuous-control benchmarks demonstrate that TGR consistently outperforms state-of-the-art offline meta-RL methods in few-shot generalization, particularly under sparse data and heterogeneous dynamics. Our results highlight the importance of task-level priors for robust offline meta-reinforcement learning.

Abstract:
ZeroShot Object Navigation (ZSON) agents rely on continuously updated internal states to support long-horizon planning and decision-making. However, existing methods heavily depend on the observational outputs of vision-language models (VLMs) during state updates and lack explicit validation of perceptual authenticity. This structural vulnerability allows injected adversarial information to transmute into long-term memory, persistently disrupting subsequent planning behaviors. Exploiting this, we propose the Visual-Instruction Recurrent Update Subversion (VIRUS) framework, the first training-free backdoor attack scheme specifically targeting the state update stage of ZSON agents. Upon dual-trigger activation, VIRUS generates velocity-modulated, geometrically consistent adversarial potential fields on the navigable manifold to entrap agents. Crucially, it employs an irreversible state update operator to permanently lock this corruption into memory, effectively subverting intrinsic self-healing mechanisms. Extensive experiments demonstrate that VIRUS achieves extremely high attack success rates across diverse ZSON agents and advanced VLMs. The framework exhibits robust generalization to visual and textual variations and successfully penetrates safety-aligned large model defense mechanisms.

Abstract:
Current audiovisual speech separation (AVSS) models typically rely on implicit multimodal fusion, but the absence of explicit modality alignment and reliability modeling often causes semantic misalignment and contaminates speech representations. The brain addresses this with a hierarchy: top-down auditory selection uses visual priors to maintain target-consistent acoustics, while bottom-up cross-modal compensation integrates temporally aligned articulatory cues to reconstruct and stabilize speech. Guided by this principle, we present Neuro-SCNet, an AVSS architecture that makes selection and compensation explicit and reliability-aware. The Auditory Selection Mechanism applies top-down, visually guided gain along the audio pathway to isolate target time-frequency units and suppress distractors. The module preserves the auditory trace with an identity bypass and adds controlled visual refinements via a residual path. A synchrony-driven gate reduces the influence of low-confidence visual cues. Additionally, a lightweight pre-alignment for visual feature pre-processing estimates and corrects small temporal offsets, and a compact magnitude-phase encoder is used to preserve fine acoustic detail to stabilize reconstruction. Evaluations on LRS2, LRS3, and VoxCeleb2 show state-of-the-art separation with improved efficiency, supporting the value of explicit selection and reliability-aware compensation.

Abstract:
Constrained clustering incorporates prior knowledge in the form of pairwise constraints to guide data partitioning. While effective, existing Bayesian approaches are often limited in scalability to large datasets and provide weak interpretability due to the lack of explicit feature relevance modeling. We propose BASIL, a scalable Bayesian semisupervised clustering framework that leverages stochastic variational inference to jointly infer cluster assignments and feature importance weights. This joint formulation enables the identification of discriminative features consistent with the imposed constraints. To robustly handle noisy or inconsistent supervision, BASIL introduces an adaptive constraint-weighting mechanism that down-weights unreliable constraints. Experiments on synthetic and real-world benchmarks demonstrate that our approach achieves competitive clustering performance while improving scalability and interpretability over existing baselines. We further demonstrate applicability to large-scale health data, including medical imaging and electronic health records.

Abstract:
Deep learning surrogates for 3D Partial Differential Equations (PDEs) often fail to generalize across geometric transformations because they depend heavily on specific coordinate systems. While equivariant networks offer a solution, they typically rely on local operations in the spatial domain, making the global receptive field—essential for PDE dynamics—computationally expensive. Conversely, Fourier Neural Operators (FNOs) efficiently capture global interactions, yet establishing 3D equivariance within them remains impractical due to the prohibitive cost of spectral group convolutions. To bridge this gap, we introduce EqGINO, a geometrically robust framework that enforces isotropy in the spectral domain. By design, EqGINO guarantees exact equivariance to the discrete symmetries inherent to the discretized computational domain. Beyond this discrete guarantee, our structural prior enables effective generalization to arbitrary continuous orientations even with a limited number of SE(3)transformed training samples. Consequently, our method robustly models coordinate-invariant physical laws on complex irregular 3D geometries. Our code is available at https://anonymous.4open.science/r/EqGINO/

Abstract:
While textto-image diffusion models achieve remarkable generation quality, they inadvertently memorize sensitive content, necessitating machine unlearning to prevent undesired outputs. However, existing unlearning methods rely on suboptimal surrogate objectives rather than directly optimizing the unlearning goal, leading to fundamental objective mismatch. Moreover, these methods preserve model utility via surface-level constraints on model parameters or outputs, yet fail to capture the intrinsic generative dynamics of diffusion models, consequently triggering catastrophic forgetting. To address these challenges, we propose Preference-calibrated Optimization with Score-level Distribution Alignment (POSDA), a unified unlearning framework that harmonizes effective erasure with fine-grained structural preservation. Specifically, we reframe unlearning as a preference optimization problem by constructing a reward that explicitly quantifies the unlearning objective. Additionally, we introduce score-level distribution alignment to ensure the invariance of the underlying manifold topology of the unlearned model, thereby preventing distributional drift. Extensive experiments across object, style, and NSFW unlearning tasks demonstrate that POSDA achieves state-of-the-art erasure efficacy while maintaining superior model utility compared to existing methods.

Abstract:
Blackbox prompt tuning (BBPT) aims to optimize input prompts for large models where internal parameters and gradients are inaccessible. However, existing methods fail to simultaneously address the dual challenges of prompt interpretability and query efficiency. To address these challenges, we propose CRL-BPT, a curriculum reinforcement learning framework that utilizes a large language model as an agent to generate human-readable prompts. Specifically, CRL-BPT implements a dynamic curriculum schedule on two auxiliary objectives: an imitation loss and an innovation loss. By dynamically weighting these objectives, CRL-BPT regularizes the RL process, guiding the agent from mimicking reference prompts to discovering novel patterns. Additionally, we introduce tailored stabilization mechanisms comprising historical loss normalization and relative reward calibration to ensure robust training. Extensive experiments demonstrate that CRL-BPT establishes new state-of-the-art performance and generates highly interpretable prompts under a strict budget of API calls. Code is available at https://anonymous.4open.science/r/CRL-BPT.

Abstract:
The maximal update parameterization (\mu P) has been influential in supervised and unsupervised learning conditions, with fixed data distributions, owing to its ability to maintain feature learning across larger parameter scales. This parameterization facilitates more consistent learning dynamics and learned features across model sizes. Moreover, optimal hyperparameters such as learning rate approximately transfer from small to larger models, minimizing the computational overhead of hyperparameter sweeps. However, it remains elusive if these benefits readily transfer to the reinforcement learning framework, where the model's learning dynamics are coupled to the shifting data distribution. Reinforcement learning agents must continually adapt to nonstationary data distribution shifts throughout training. We empirically study how two regimes, the ''rich'' CompleteP and ''lazy'' Neural Tangent Kernel (NTK) parameterizations affect hyperparameter transfer, feature and policy consistency as we scale reinforcement learning agents. Ultimately, we show that agents trained using CompleteP consequentially improves compute and reward efficiency compared to the NTK parameterization over 16 continuous control tasks and variants e.g. normalization and sparse rewards. Hence, we argue that adopting the CompleteP parameterization minimizes learning inconsistencies across model sizes to improve compute efficiency when scaling up.

Abstract:
NonMarkovian dynamics are commonly found in real-world environments due to long-range dependencies, partial observability, and memory effects. The Bellman equation that is the central pillar of Reinforcement learning (RL) becomes only approximately valid under Non-Markovian. Existing work often focus on practical algorithm designs and offer limited theoretical treatment to address key questions, such as what dynamics are indeed capturable by the Bellman framework and how to inspire new algorithm classes with optimal approximations. In this paper, we present a novel topological viewpoint on temporal-difference (TD) based RL. We show that TD errors can be viewed as 1-cochain in the topological space of state transitions, while Markov dynamics are then interpreted as topological integrability. This novel view enables us to obtain a Hodge-type decomposition of TD errors into an integrable component and a topological residual, through a Bellman–de Rham projection. We further propose HodgeFlow Policy Search (HFPS) by fitting a potential network to minimize the non-integrable projection residual in RL, achieving stability/sensitivity guarantees. In numerical evaluations, HFPS is shown to significantly improve RL performance under non-Markovian.

Abstract:
Parametric partial differential equations (PDEs) serve as fundamental models across science and engineering, yet constructing fast and accurate surrogate models from sparse, irregularly sampled observations with reliable uncertainty quantification remains challenging. Existing approaches struggle to simultaneously handle variable observation patterns, preserve physics consistency, and provide wellcalibrated predictive uncertainty. We introduce Bias-Spectrum Neural Processes (BSNP), a unified meta-learning framework that systematically integrates weak structural priors (translation equivariance, locality) with strong physical priors (governing equations and boundary conditions). BSNP addresses two critical obstacles: discretization overfitting through stochastic collocation that resamples residual evaluation points, and uncertainty collapse through mean-field enforcement that applies PDE constraints only to predictive means while preserving learned uncertainty. Comprehensive experiments on nonlinear Poisson equations, Burgers dynamics, and Navier-Stokes flows demonstrate that BSNP achieves superior accuracy and well-calibrated uncertainty quantification in sparse-data regimes.

Abstract:
Crosssubject emotion recognition aims to improve a model's generalization to previously unseen subjects. Existing methods are mainly built upon domain generalization or data augmentation, but suffer from two major limitations: 1) heavy dependence on modality-specific feature designs—almost exclusively tailored to EEG signals—resulting in limited generalizability; and 2) the widespread assumption of independently and identically distributed data, which restricts the diversity of generated samples. To address these challenges, we systematically analyze the heterogeneous distribution characteristics of emotion data and proposeMixEmo, a framework that integrates heterogeneous distribution augmentation and collaborative learning. Specifically, a well-trained backbone is used to extract representations and partition them into multiple single-distribution subsets as distribution prototypes. These prototypes are randomly combined to synthesize unseen distributions, thereby enhancing distributional diversity. Finally, heterogeneous distribution collaborative learning jointly optimizes the model across subsets. Extensive experiments demonstrate thatMixEmosubstantially improves generalization performance in cross-subject emotion recognition.

Abstract:
Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO) shows great promise for enhancing LLM reasoning, but remains challenged by sparse and unstable rewards in longhorizon tasks. Existing approaches to reward shaping struggle to balance semantic expressiveness, reliability, and computational efficiency: heuristic rules lack flexibility, while LLM-as-a-Judge incurs high computational cost and suffer from inconsistent and misaligned scoring signals in long-context settings. To address these challenges, we introduce GLARE, a neuro-symbolic reward framework that decouples semantic abstraction from credit assignment. Specifically, to leverage semantic understanding while preserving symbolic determinism, we first extract and symbolize trajectory events into a discrete representation. These events are then translated into Linear Temporal Logic (LTL) formulas, which are compiled into deterministic automata that track the agent's progress via state transitions. This mechanism yields dense and consistent reward signals, avoiding unstable direct scoring while significantly reducing computational cost. Empirical results on ALFWorld show that GLARE outperforms GRPO by 12.1% in success rate, while achieving an 8.1% improvement over conventional LLM-based judges using only 15% of their computational cost.

Abstract:
The proliferation of Large Language Models (LLMs) has shifted serving systems from processing isolated requests to orchestrating highconcurrency, multi-tenant agentic workflows. However, existing solutions typically prioritize intra-workflow optimization, largely neglecting the significant potential for inter-workflow optimization. In this paper, we propose HeraSys, an LLM serving system designed to optimize the end-to-end performance of concurrent workflows. Through fine-grained orchestration, HeraSys eliminates cross-workflow computational redundancy via structural node merging and reuse. Furthermore, HeraSys introduces a load-aware joint scheduling policy that dynamically manages execution order by evaluating both inter- and intra-query priorities. By integrating a resource skewing mechanism with adaptive batching and pipeline decomposition, HeraSys effectively mitigates tail latency while maintaining low average latency, thereby substantially improving system throughput. Extensive experiments demonstrate that HeraSys reduces P99 latency by up to 2.17× and increases serving throughput by up to 1.85× under strict latency guarantees.

Abstract:
Controlling the false discovery rate (FDR) under complex sparsity structures remains a fundamental challenge in large language model (LLM) analysis. Motivated by multiple comparison problems in LLMs, we consider a setting in which sparsity arises at the group level after a linear transformation of model parameters. We propose Split Group Knockoffs (SGKs), a general framework for groupwise variable selection under grouped transformational sparsity that extends the Split Knockoff procedure to grouped transformed variables. We establish theoretical guarantees for group-level FDR control and support recovery consistency, addressing challenges induced by group-wise penalties in transformed spaces. Applying SGK to LLM behavior auditing experiment reveals that model disagreement is not uniform across subjects, but instead concentrates in domains with greater semantic and reasoning complexity, where SGK effectively distinguishes genuine behavioral deviations from surface-level performance variation.

Abstract:
Symbolic expressions can effectively characterize and predict circuit behavior, but deriving them directly from circuit schematics is challenging. This process requires accurate visualto-symbolic construction of circuit structure from images and correct multi-step symbolic derivation, both of which impose strict correctness requirements. This work proposes AutoVSR, an automated framework for visual-to-symbolic generation of circuit expressions using Vision Language Models (VLMs). By reconstructing circuit diagrams into an executable intermediate representation (Executable IR) and leveraging a symbolic solver for reasoning, AutoVSR significantly improves the accuracy of symbolic expression generation. AutoVSR introduces two key innovations: an IR construction method guided by component rule retrieval and verification-based feedback, and a symbolic solver implemented as a planning agent equipped with a symbolic tool library for reliable multi-step derivation. Compared to end-to-end VLM approaches, AutoVSR achieves accuracy improvements of 30.01--59.45% in circuit analysis and outperforms specialized methods by 41.96--51.84%. Moreover, AutoVSR surpasses closed-source state-of-the-art VLMs in inference cost and computational efficiency. Code is available at https://anonymous.4open.science/r/icml34410/.

Abstract:
Data science agents promise to accelerate discovery and insightgeneration by turning data into executable analyses and findings. Yet existing data science benchmarks fall short due to fragmented evaluation interfaces that make cross-benchmark comparison difficult, narrow task coverage and a lack of rigorous data grounding. In particular, we show that a substantial portion of tasks in current benchmarks can be solved without using the actual data. To address these limitations, we introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live, extensible testbed. We curate DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and shortcut solvability filtering. We further expand coverage with (1) DSBio: expert-derived bioinformatics tasks grounded in literature and (2) DSPredict: challenging prediction tasks spanning domains such as computer vision, molecular prediction, and single-cell perturbation. Beyond evaluation, DSGym enables agent training via execution-verified data synthesis pipeline. As a case study, we build a 2,000-example training set in DSGym that substantially improves a 4B mode on standardized analysis benchmarks. Overall, DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific context.

Abstract:
Dataadaptive two-sample testing assesses if two samples come from the same distribution, using a discrepancy learned from the data (e.g., via kernel-based feature representations). Such methods typically rely on data splitting to decouple learning from testing and control type I error. However, this paradigm is ill-suited to few-shot settings with severe sample-size imbalance: abundant reference samples are available, while only a handful of query samples arrive. In this paper, we show how this imbalance can be leveraged constructively. Using abundant reference data, we learn reference-dependent representations that summarize salient structure of the reference distribution and provide informative signals for detecting departures. We incorporate a collection of representation families that capture both global and local structure, and adaptively weight them using only reference samples via an uncertainty-guided principle. Theoretically, we establish permutation-based type I error control and show consistency of the aggregated test: as the sample sizes grow, the test power converges to one whenever the representation set contains at least one consistent representation. Empirically, our aggregation achieves strong performance across a range of benchmarks while retaining type I error control.

Abstract:
Large language models (LLMs) increasingly support human decisionmaking, rendering human-interpretable confidence essential. However, it remains unclear whether verbalized confidence calibration generalizes across heterogeneous tasks without degrading accuracy. We show that universal confidence calibration fails. Across diverse benchmarks, we identify two incompatible task families with distinct confidence semantics. In reasoning-centric tasks, confidence supervision transfers within the family, often improving calibration while preserving or even improving accuracy, and induces emergent behaviors such as confidence-dependent reasoning length and self-verification. Retrieval- and copy-oriented tasks also exhibit within-family transfer, but fail to generalize to reasoning tasks, with cross-family supervision degrading both calibration and accuracy. Motivated by this finding, we disentangle confidence into reasoning uncertainty and evidence localization uncertainty. This simple decomposition restores cross-family generalization using supervised fine-tuning alone, suggesting that effective confidence alignment requires task-aware semantics rather than a universal scalar notion.

Abstract:
Active vision, also known as active perception, refers to actively selecting where and how to look in order to gather taskrelevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. With the rise of Multimodal Large Language Models (MLLMs) as central planners in robotic systems, the lack of methods for equipping MLLMs with active perception has become a key gap. We first provide a systematic definition of MLLM-based active perception tasks and show that GPT-o3's zoom-in strategy can be viewed as a special case, though it suffers from low efficiency and inaccurate region selection. To address these issues, we propose Active-o3, a reinforcement learning framework built on GRPO that equips MLLMs with active perception capabilities. Leveraging a modular sensing-action design and a dual-form reward, Active-o3 autonomously learns efficient and stable region selection strategies without explicit supervision. We further establish a comprehensive benchmark covering both open-world tasks (small/dense-object grounding) and domain-specific scenarios (remote sensing, autonomous driving, interactive segmentation). Experimental results demonstrate that Active-o3 significantly enhances active perception capabilities compared to Qwen2.5-VL-CoT. Moreover, we show that our RL framework not only preserves the model’s general understanding ability but can also serve as a proxy task for leveraging perception data, further improving performance on benchmarks such as RealWorldQA. We hope that our work can provide a simple codebase and unified evaluation protocol to facilitate future research on active perception with MLLMs.

Abstract:
Existing MetaBlack-Box Optimization (MetaBBO) methods focus on how to search when controlling optimizers, but largely overlook where to search. We propose MetaSG-SAEA, a bi-level MetaBBO framework for expensive constrained multi-objective optimization problems (ECMOPs), in which a meta-policy provides search guidance to the low-level Surrogate-Assisted Evolutionary Algorithm (SAEA). To achieve this, we introduce Max–Min Constraint-Calibrated Inequality (MM-CCI), a compact, problem-agnostic region abstraction that maps heterogeneous constraint evaluations to an ordered scalar level; we further provide a theoretical analysis of its fundamental properties. Building on this region abstraction, we adopt diffusion-based population initialization to translate the meta-policy’s region-level guidance into solution-level priors for the SAEA. To make MetaSG-SAEA scalable, we construct an attention-based state representation across varying problem dimensions, population sizes, and numbers of objectives and constraints. Experimental results demonstrate that MetaSG-SAEA outperforms state-of-the-art baselines across diverse benchmarks and exhibits the ability to generalize across problem distributions.

Abstract:
Model Predictive Control (MPC) is widely used to operate safetycritical infrastructure by predicting future trajectories and optimizing control actions. However, nonlinear dynamics, hard safety constraints, and numerical optimization often render individual control moves opaque to human operators, undermining trust and hindering deployment. This paper presents Hierarchical Causal Abduction (HCA), which combines (i) physics-informed reasoning via domain knowledge graphs, (ii) optimization evidence from Karush--Kuhn--Tucker (KKT) multipliers, and (iii) temporal causal discovery via the PCMCI algorithm to generate faithful, human-interpretable explanations for control actions computed by nonlinear MPC. Across three diverse control applications (greenhouse climate, building HVAC, chemical process engineering) with expert validation, HCA improves explanation accuracy by 53% over LIME (0.478 vs. 0.311) using a single set of cross-domain parameters without per-domain tuning; domain-specific KKT-threshold calibration over 2--3 days further increases accuracy to 0.88. Ablation studies confirm that each evidence source is essential, with 32--37% accuracy degradation when any component is removed, and HCA's ranking-and-validation methodology generalizes beyond MPC to other prediction-based decision systems, including learning-based control and trajectory planning.

Abstract:
Multiview clustering effectively exploits rich information from multiple views, yet real-world applications are frequently challenged by missing views and cross-view sample misalignment, hindering cross-view modeling and resulting inferior clustering performance. To address these challenges, this paper presents a novel method,OPtimalTransport–guIded flOw matchiNg for incomplete and unaligned multi-view clustering (OPTION). Specifically, OPTION employs conditional flow matching to learn deterministic transport paths for missing-view imputation, enabling stable manifold-preserving recovery and more discriminative representations. To achieve alignment-free fusion, we introduce a Gromov-Wasserstein loss—a structural relaxation of optimal transport—that aligns intra-view geometric structures in the latent space. Furthermore, an optional contrastive regularization is incorporated to enhance cross-view consistency specifically for aligned settings. Extensive experiments demonstrate that OPTION outperforms state-of-the-art methods across ideal, incomplete, and unaligned scenarios.

Abstract:
Surrogate models for topology optimization (TO) exhibit highly variable outof-distribution (OOD) generalization under distribution shifts such as changing loads or boundary conditions, yet the source of this variability remains unclear. We hypothesize that OOD performance is governed by how much information the conditioning signal preserves about the adjoint sensitivity (reduced gradient) that drives classical TO. Modeling the TO pipeline as a causal Markov chain, the Data Processing Inequality establishes that, under this abstraction, the sensitivity field is an information-theoretically optimal conditioning signal for topology prediction. However, computing exact adjoint sensitivities can be expensive or unavailable in practice; we observe that certain physical fields can approximate sensitivities through monotone transformations. To formalize this, we introduce pseudo-sensitivities to characterize which fields enable generalization versus those that are information-poor. We then show that a sensitivity-conditioned Bernoulli flow-matching generator empirically confirms these predictions: conditioning on sensitivities yields state-of-the-art OOD performance, while increasingly distant physical fields degrade toward raw parameter conditioning. We further benchmark against competitive baselines, and find the same ordering of conditioning signals and the same OOD trends. Results hold across structural TO benchmarks under load shifts and our new CFD-TO dataset under boundary-condition shifts such as multi-outlet configurations.

Abstract:
When designing interventions to promote desired actions, twostage agent heterogeneity -- encompassing both engagement with the intervention and completion of the desired action -- creates significant challenges in identifying optimal intervention policies. While this two-dimensional heterogeneity creates distinct agent response types with varying marginal policy returns, existing literature typically falls short in full identification of all agent types, leading to inefficient intervention allocations. To address the challenge of learning optimal policies that account for two-stage outcomes, we propose a minimax approach within a counterfactual principal strata framework. A value function, accommodating varying policy returns across six potentially non-identifiable principal strata, is designed and partially identified to minimize the worst-case value loss relative to three benchmark policies: never-treat, always-treat, and oracle. We introduce three estimators for optimal policy learning: Principal Outcome Regression (P-OR), Principal Inverse Propensity Scoring (P-IPS), and Principal Doubly Robust (P-DR), providing theoretical guarantees for their unbiasedness, robustness, and regret upper bounds. Extensive numerical experiments demonstrate the effectiveness and superiority of the proposed approach.

Abstract:
Advancing AI toward agentic science is currently bottlenecked by the challenge of ultralong-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-ACE, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE) which is a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI's MLE-Bench under 24-hour budgets, ML-ACE achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.

Abstract:
Taxonomies provide key information about the semantic relationships between concepts and the inherent organization of vision and language. Despite their impressive capabilities, large multimodal models (LMMs) often lack taxonomic knowledge, leading to low hierarchical visual recognition (HVR) consistency. These models typically only rely on language modeling objectives during finetuning and lack explicit taxonomy-aware regularization. To address this, we propose Hierarchical Representation Regularization (HiR^2), a simple plug-and-play regularizer that improves hierarchical consistency in LMMs. Specifically, we introduce a semantic-aware visual tree construction framework that extracts coarse-to-fine visual features from intermediate LLM layers guided by textual cues. The regularizer combines two complementary objectives: a taxonomic entailment loss that enforces hierarchy via hyperbolic entailment cones in the Lorentz model, and a discriminative dispersive loss that promotes angular separation of semantically similar embeddings on the unit sphere without disturbing the radial hierarchical structure. Extensive experiments demonstrate that HiR^2 effectively captures taxonomic structures across diverse LMMs and fine-tuning methods.

Abstract:
Static finetuning paradigms impose rigid structural constraints on foundation models like the Segment Anything Model 2 (SAM2), limiting their adaptability to the varying complexity of specialized downstream tasks. To overcome this limitation, we proposeACO-MoE-LoRA, a dynamic framework that introduces an "Evolving-while-Training" strategy by synergizing Ant Colony Optimization (ACO) with a Latent Space Mixture-of-Experts (MoE) architecture. Central to our method is the ACO-ConvLoRA module, which employs a pheromone-guided routing mechanism to actively govern expert selection and topological evolution. By formulating expert assignment as an evolutionary pathfinding problem, this module effectively mitigates the standard routing collapse issue and enables elastic adjustment of LoRA ranks via weight slicing, bridging discrete structural search with continuous parameter training. Extensive experiments across 16 challenging datasets demonstrate that our framework consistently outperforms leading static adapters, while effectively addressing the local optimality limitations of recent dynamic heuristics. This work presents a self-organizing solution that harmonizes swarm intelligence with gradient optimization for efficiently adapting foundation models to specialized domains.

Abstract:
Federated ClassIncremental Learning (FCIL) aims to continually expand a model’s recognition capacity in a distributed environment, enabling it to learn new classes while retaining knowledge of previously seen ones. Exemplar replay has emerged as a promising strategy owing to its simplicity and effectiveness. Existing methods either select exemplars based on local dynamics or construct global feature spaces to identify representative samples. However, they face inherent challenges in striking a balance between effectiveness and privacy. To address this issue, this paper proposes a Cross-view Lewis weIght Fusion method for exemplar replay in FCIL, termed CLIF, which fuses multi-view importance scores to guide representative sample selection under federated settings. Specifically, CLIF consists of two main modules: 1) the cross-view Lewis weight fusion module computes and integrates Lewis weights from multiple feature perspectives to achieve consistent importance estimation, ensuring that the selected samples better reflect the global data distribution and thus enhancing the representativeness of the replay subset. Building on this, 2) the frequency-based weighted training module adjusts the loss contribution of each sample according to its selection frequency across views, which emphasizes the contribution of critical samples. Moreover, we provide a theoretical analysis to guarantee the soundness and effectiveness of CLIF. Extensive experiments on three datasets demonstrate that our method consistently improves baselines by 1%–6%, supporting the above claims.

Abstract:
Reasoning Video Object Segmentation (RVOS) demands a sophisticated integration of temporal dynamics, spatial details, and linguistic reasoning to achieve precise pixellevel localization. Existing methods are limited to reasoning over fixed initial inputs and lack the capacity to actively acquire further visual evidence, which is often essential for resolving complex references in long or intricate videos. To address this, we propose VideoSEG-O3, the first multi-turn reinforcement learning framework for RVOS that emulates the human ``coarse-to-fine'' cognitive process. It employs a multi-turn temporal-spatial chain-of-thought to capture fine-grained details by iteratively pinpointing critical intervals and keyframes. Additionally, to enable the policy to perceive segmentation quality beyond mere text probability of \texttt[SEG] during the RL stage, we introduce SEG-aware logit calibration, which integrates pixel-wise segmentation feedback directly into the token-level logits. Furthermore, we design a decoupled thinking trace to hierarchically decompose the reasoning process into temporal, spatial, and linguistic dimensions, and construct VTS-CoT, a specialized cold-start dataset featuring comprehensive reasoning trajectories. Extensive experiments demonstrate that VideoSEG-O3 achieves advanced performance across 8 mainstream RVOS benchmarks, particularly excelling in long-horizon and complex reasoning tasks.

Abstract:
Large language models (LLMs) can reveal and amplify societal biases during chainof-thought (CoT) generation. We present COFT (Chain of Fair Thought), a training-free decoding method that applies token-level fairness control at decode time, with distribution-free marginal validity guarantees (under exchangeability) for any frozen causal language model. COFT operates in three stages. First, it creates a masked counterfactual prompt by replacing sensitive spans with neutral tokens. Second, it compares the factual and masked logit distributions through lightweight logit fusion to attenuate attribute-driven biases. Third, it uses dual-branch split-conformal calibration to certify per-step candidate token sets at a user-chosen risk level. We evaluate COFT across six models and multiple bias benchmarks. Our method reduces standard bias metrics by 30–55% (median 38%) while preserving task utility and language quality. Reasoning accuracies remain unchanged within run-to-run noise margins. The computational overhead is modest, equivalent to one additional cached forward pass (<=11%). COFT offers a clear, auditable path to safer CoT generation with significant bias reduction, negligible utility loss, and no requirement for retraining, auxiliary classifiers, or weight access.

Abstract:
Incomplete multiview clustering (IMVC) becomes particularly challenging under heavy missingness and view imbalance, where scarce co-observed pairs make cross-view correspondences unreliable: imputation-first pipelines can trigger cascading reconstruction errors, while purely consistency-based alignment often degrades sharply and offers limited control over semantic convergence across views. We propose MAGIC (Masked multi-pAth contrast with confIdence-Gated semantIc imputation), a unified framework that learns calibrated cluster semantics before performing conservative completion. MAGIC instantiates multiple correlated representation and prediction paths from lightly augmented latent codes and couples them via a masked multi-path contrastive consensus objective with prediction-consistency regularization, yielding stable posteriors even when co-observations are scarce; these posteriors are then aggregated into view-wise soft assignments to reduce overconfidence and alleviate dominance by highly available views. Building on the calibrated semantics, MAGIC conducts similarity-guided semantic transfer in label space with confidence-aware gating and completes missing representations in a geometry-preserving manner, thereby mitigating error propagation under severe missingness. Extensive experiments on four benchmarks across a wide range of missing ratios demonstrate consistent improvements over prior IMVC methods, and ablations validate the complementary roles of masked multi-path consensus learning and confidence-gated semantic imputation.

Abstract:
Agentic systems have recently become the dominant paradigm for formal theorem proving, achieving strong performance by coordinating multiple models and tools. However, existing approaches often rely on taskspecific pipelines and trained formal provers, limiting their flexibility and reproducibility. In this paper, we propose the paradigm that directly uses a general coding agent as a formal math reasoner. This paradigm is motivated by (1) A general coding agent provides a natural interface for diverse reasoning tasks beyond proving, (2) Performance can be improved by simply replacing the underlying base model, without training., and (3) MCP enables flexible extension and autonomous calling of specialized tools, avoiding complex design. Based on this paradigm, we introduce Numina-Lean-Agent, which combines Claude Code with Numina-Lean-MCP to enable autonomous interaction with Lean, retrieval of relevant theorems, informal proving and auxiliary reasoning tools. Using Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all problems in Putnam 2025 (12/12), matching the best closed-source system. Beyond benchmark evaluation, we further demonstrate its generality by interacting with mathematicians to successfully formalize the Brascamp–Lieb theorem.

Abstract:
Bestof-N selection improves reasoning in large language models (LLMs) by allocating additional test-time compute to sample multiple candidate trajectories, but it fundamentally relies on reliable verification. However, widely used proxies based on logit confidence or sample agreement can suffer from calibration collapse, where confidence becomes misaligned with correctness. Instead, we move beyond output-level signals and analyze the model's latent dynamics during inference. Drawing from cognitive neuroscience, we hypothesize that effective reasoning exhibits metastability—a balance between stability and flexibility manifested as structured ``dwell-and-jump'' dynamics. We introduce Latent Velocity Entropy (LVE), a training-free metric that quantifies these dynamics via the entropy of internal representation updates. Extensive experiments on four reasoning benchmarks (AIME, GPQA, MATH, Brumo) demonstrate that the metric mitigates calibration collapse and consistently outperforms leading logit-based baselines. It surpasses the state-of-the-art baseline (UID) by 1.6% and majority voting by 4.0% in average accuracy. Remarkably, our method matches the performance of 10-sample majority voting using only 3 samples—a 70% reduction in inference cost.

Abstract:
Large language models are often evaluated for correctness on isolated questions. But modern deployments also rely on a different property: whether the model stays consistent as it generates, critiques, and revises over multiple steps that rely on the same underlying concepts. In these settings,selfconsistencyseems desirable, since it limits drift as models reuse and assess their own outputs. Yet we lack standard deployment-relevant ways of quantifying it. How do we measure self-consistency, and what can it signal for reliability in deployment? In this paper we propose a new measure:generator–evaluator self-consistency, which assesses whether a model applies the same underlying concept consistently when it is invoked across related prompts. We find that models exhibit substantial variation in self-consistency that is independent of their accuracy on benchmark questions involving those same concepts. Examining this variation in a clinical setting with physician-validated mistakes, we find that higher self-consistency is linked to greater vulnerability to mistakes. Rather than signaling robust understanding, consistency can reflect stable commitment to incomplete or brittle conceptualizations. We interpret this pattern as aconsistency trap: self-consistency is operationally useful, but can also be evidence of systematized errors.

Abstract:
Embedding methods are among the most efficient approaches for learning to reason about relational knowledge. In this paper, we focus on the framework of regionbased embeddings, where relations are encoded as geometric regions. The spatial arrangement of these regions allows such models to capture symbolic rules, enabling them to simulate some forms of symbolic reasoning. A crucial consideration is how the regions are parameterized, as this affects which rule bases can be captured. Most methods use convex regions which are defined in terms of coordinate-wise comparisons. This makes them highly efficient, but the implications of this choice have thus far remained unclear. We present a series of results that shed light on this issue, showing that convex coordinate-wise models indeed have important limitations, while at the same time showing that there is still room for pushing the expressivity of existing coordinate-wise models.

Abstract:
Can we bridge the gap between perceiving camera trajectories and rendering novel views within a single generative framework? Recovering camera parameters from images and rendering scenes from novel viewpoints are considered the forward and inverse problems in the field of computer vision and graphics. Previous approaches treat these problems in isolation, often failing when image coverage is sparse or camera poses are ambiguous. In this work, we propose Rays as Pixels, a specialized Video Diffusion Model (VDM) that learns a joint distribution of videos and camera trajectories. We represent cameras as dense ray pixels (raxels) and simultaneously denoise them alongside video frames using a novel Decoupled SelfCross Attention. This joint formulation enables us to: i) generate a video from multiple input images following a defined camera trajectory, ii) perform novel view synthesis from sparse views (without necessarily requiring camera poses), and iii) predict the camera trajectory from a raw video. We evaluate our model on pose estimation, camera-controlled video generation and validate its self-consistency. Please reference supplementary material for more qualitative results.

Abstract:
Accurately modeling the full distributions of possible states is crucial for understanding statistical properties and enabling reliable predictions in complex, unsteady physical systems. Recently, diffusion models and flow matching have shown promise in these tasks. However, they remain limited in uncovering the general principles of systems from multiple short trajectories across condition space. In addition, they exhibit inferior adaptability to large irregular geometries, particularly in regions with sharp gradients. In this paper, we propose a conditionaware graph flow matching (CGFM) method that combines condition-aware flow matching with a hierarchical graph structure to learn the full distributions of physical systems from incomplete training data. Specifically, CGFM constructs a flow enabling smooth interpolation across physical conditions and parameterizes the graph-conditioned vector field through HieraGraphNet. HieraGraphNet performs message passing across multilevel graphs to capture multi-scale dynamics and facilitate long-range information interactions in physical systems. Moreover, we introduce a topology- and geometry-aware graph coarsening scheme that incorporates topological connectivity and local geometric density to construct reliable coarse graphs. We validate the effectiveness of CGFM on three canonical scenarios across both 2D and 3D dynamics, which demonstrate its superior performance compared with that of state-of-the-art baselines.

Abstract:
Portrait pose transfer (PPT) requires generative models to preserve finegrained identity details while following complex pose and layout modification instructions. Existing methods often struggle with extensive data annotation requirements or employ optimization objectives that are suboptimal for addressing PPT's two key challenges. In this work, we propose PortraitRL, a novel post-training framework that addresses these challenges with a multi-objective reward mechanism. Specifically, we employ LVLM-based reward functions to effectively evaluate PPT's two challenges and apply within-group standardization to eliminate scale differences, allowing these rewards to effectively guide optimization. More importantly, we devise a novel reinforcement learning algorithm, Negative-aware Score Preference Optimization (NaSPO), which automatically identifies positive and negative preference samples through within-group advantages, eliminating annotation requirements while fully leveraging both positive and negative learning signals. Extensive experiments show state-of-the-art performance, with significant improvements in both detail preservation and editing accuracy.

Abstract:
Diffusion Transformers (DiT) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semistructured sparsity—which can nearly halve FLOPs—remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality. Our study, however, shows that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving up to a 1.55× speedup on average in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.

Abstract:
Lowrank approximation has emerged as a cornerstone technique for model compression and parameter-efficient fine-tuning, enabling substantial reductions in computation and memory without altering model architectures. However, existing approaches often overlook the shifts in feature distributions induced by the approximation process, which can lead to error amplification and unstable inference. We propose a distribution-aware whitening framework that dynamically whitens layer inputs based on the evolving feature distributions, ensuring second-order isotropy of input features. This allows that the discarded components in the low-rank approximation are those with minimal impact on model outputs, thereby minimizing cumulative approximation errors across layers. We theoretically analyze how distribution misalignment leads to error propagation and demonstrate that our approach achieves tighter control over layerwise distortion. Extensive experiments across various large language models demonstrate the superiority of our method in post-training compression. Moreover, our method can also serve as an effective initialization for LoRA-style parameter-efficient fine-tuning. Our findings highlight the importance of considering feature distributions in low-rank approximations, paving the way for reliable and effective model compression strategies.

Abstract:
We present SynLaD, a latent diffusion framework for smallmolecule generation that unifies 3D design objectives (what to make) with synthetic accessibility (how to make it). Current models typically optimize one objective at the expense of the other, creating a bottleneck for discovering high-scoring and experimentally testable molecules. SynLaD combines reaction-constrained generation with pharmacophore-conditioned 3D design by learning a latent space that decodes to both 3D structures and synthesis pathways. An encoder maps molecules to a latent representation used by two decoder heads: (i) a geometric head that reconstructs atom types and coordinates and (ii) an autoregressive synthesis head that outputs synthetic routes in a serialized, reaction-based notation. A diffusion transformer generates novel latents in the learned space, conditioned on pharmacophore profiles. Across analogue-generation tasks for bioactive ligands, SynLaD outperforms existing baselines in synthesizable and diverse hit generation, demonstrating that a single model can produce shape-accurate molecules with feasible synthesis plans.

Abstract:
Image forgery is rapidly evolving, rendering forensic traces increasingly subtle and readily attenuated by postprocessing. Although vision--language prompting can inject priors, open-ended LLM-generated prompts are difficult to constrain, and naive language description can introduce semantic perturbations. To address these challenges, we propose Forensic Prompting with Dual-Action policy optimization (FPDA) for vision-language forgery detection and localization, where Forensic Prompting Module (FPM) constructs a structured forensic prompt bank and supports optional text input to provide lightweight stable conditioning via a reliability gate. Moreover, a Dual-Action Policy Optimization (DAPO) is applied to adaptively route prompts and schedule refinement strategies on a per-image basis, stabilizing discriminative cues and improving mask spatial consistency. Extensive experiments are conducted on multiple public datasets of manipulations, diffusion content, face forgeries, and text-enabled settings (e.g., CASIA/NIST16/Coverage, CocoGlide, OpenForensics and SIDSetdescription), which demonstrates superior detection and localization performance over state-of-the-art methods.

Abstract:
Longterm non-uniform motion poses a significant challenge for feed-forward Novel View Synthesis (NVS), as it requires modeling higher-order motion, such as acceleration. Existing methods primarily rely on deformation fields or scene flow, which are limited to first-order approximations. Due to neglecting higher-order motion representations and supervision, these approaches suffer from long-term non-uniform motion scenarios. Inspired by Taylor’s theorem, we propose Taylor-Gaussians-Flow (TGsF) to represent and supervise non-uniform motion through first-order and second-order motion components. TGsF comprises two key modules: Taylor-Gaussians (TGs) and Taylor-Gaussians-Flow (TGs-Flow). TGs represent motion using Gaussian means with a quadratic temporal term and time-dependent opacity. Unlike previous methods, TGs-Flow decouples scene-flow supervision into separate depth and 2D optical-flow constraints. This approach effectively mitigates error propagation from either depth or motion estimation while circumventing the scarcity of labeled scene flow data. Guided by the above analysis, we develop the Feed-Forward Taylor-Gaussians-Flow (FF-TGsF) framework, which sets a new state-of-the-art on four dynamic benchmarks.

Abstract:
Many fundamental biological processes are governed by mechanical forces, with proteins acting as the key molecular mediators. Elucidating how protein unfolding responds to force is critical for understanding the mechanopathologies, such as cardiomyopathy and muscular dystrophy. While the unfolding trajectories measured by Single-Molecule Force Spectroscopy (SMFS) map the instantaneous force response against molecular extension, its broader application is limited by time-consuming data collection and high operational costs. Here, we present the first scalable generative diffusion framework for full unfolding trajectory prediction, which integrates protein encoders for multi-scale conditioning. Beyond establishing the field's first systematic benchmark using existing models, we propose GenUnfold, a novel physics-guided diffusion model that combines global coevolutionary context with a local mechanical representation of the protein. The representation is derived from a novel physics-biased attention mechanism, which steers the generative diffusion process by modeling dynamic residue dependencies as a function of both structural topology and interaction stiffness. The benchmark for this task is built upon the biomolecule stretching database and several representative baseline models. Empirical results demonstrate that GenUnfold achieves state-of-the-art performance, reducing distributional error (FID) by 30% and 54% compared to pretrained Evolutionary Scale Model (ESM)-2 and standard transformer, respectively. Beyond statistical curve similarity, GenUnfold demonstrates superior physical consistency; in downstream mechanical property prediction, it reduces prediction errors for unfolding force and energy distributions by 6% and 36% over the ESM-2 baseline. These results indicate that while existing generative AI approaches can alleviate the need for predicting representative force curves, GenUnfold further improves performance by leveraging the synergy between protein structure and evolutionary information. By enabling proteome-wide screening to identify mechanical candidates before costly physical validation, our approach is promising to accelerate the discovery of force-targeted therapeutics.

Abstract:
Texture recognition is often framed as matching an image to a static trainingset dictionary or codebook. In practice, this assumption is brittle: label-preserving transformations (illumination, scale, compression, blur) can shift test features away from the fixed training dictionary, producing a training-set codebook misalignment that limits accuracy. We propose STD-Former (Simple Texture Dictionary Transformer), a lightweight framework for image-conditioned texture dictionary encoding. Instead of comparing against a static codebook, STD-Former extracts a compact set of Intrinsic Textons (dictionary atoms / codewords) from the input image itself, yielding self-aligned representations at inference. Our design is intentionally simple and uses a decoupled two-stage recipe. In Stage 1, a Texture Dictionary Extractor (TDE) is pre-trained with a self-supervised Texton Coverage Loss that encourages the learned textons to collectively cover the image patch feature manifold. In Stage 2, a classifier is trained on the encoded dictionary representation; optionally, we add a Sparse Topological Loss derived from 0D persistent homology, which is equivalent to supervising only the (B-1) edges of a minimum spanning tree (MST) in each batch, providing efficient structure regularization. Across six standard texture benchmarks, STD-Former and STD-Former+ achieve new state-of-the-art results.

Abstract:
The rapid evolution of GUIenabled agents has rendered traditional CAPTCHAs obsolete. While previous benchmarks like OpenCaptchaWorld established a baseline for evaluating multimodal agents, recent advancements in reasoning-heavy models, such as Gemini3-Pro-High and GPT-5.2-Xhigh have effectively collapsed this security barrier, achieving pass rates as high as 90% on complex logic puzzles like ''Bingo''. In response, we introduce Next-Gen CAPTCHAs, a scalable defense framework designed to secure the next-generation web against the advanced agents. Unlike static datasets, our benchmark is built upon a robust data generation pipeline, allowing for large-scale and easily scalable evaluations, notably, for backend-supported types, our system is capable of generating effectively unbounded CAPTCHA instances. We exploit the persistent human--agent ``Cognitive Gap'' in interactive perception, memory, decision-making, and action. By engineering dynamic tasks that require adaptive intuition rather than granular planning, we re-establish a robust distinction between biological users and artificial agents, offering a scalable and diverse defense mechanism for the agentic era.

Abstract:
As network data has become ubiquitous in the sciences, there has been growing interest in network models whose structure is driven by latent nodelevel variables in a (typically low-dimensional) latent geometric space. These "latent positions" are often estimated via embeddings, whereby the nodes of a network are mapped to points in Euclidean space so that "similar" nodes are mapped to nearby points. Under certain model assumptions, these embeddings are consistent estimates of the latent positions, but most such results require the embedding dimension to be chosen correctly. Methods for choosing the embedding dimension have been studied extensively, but little is known about the behavior of embeddings when the dimension is misspecified. In this work, we provide a theoretical description of the effects of dimension misspecification under the random dot product graph, a class of latent space network models that includes several widely-used network models, most notably the stochastic blockmodel, as special cases. We show that when the dimension is chosen too large, consistent estimation still holds, albeit at a slower rate than when the embedding dimension is chosen correctly. On the other hand, when the dimension is chosen too small, there is a fundamental estimation error lower bound that need not go to zero in the large-network limit. A range of synthetic data experiments support our theoretical results. Our main technical result, which may be of independent interest, is a generalization of earlier work in random matrix theory showing that all non-signal eigenvectors of a low-rank matrix subject to additive noise are delocalized.

Abstract:
Diffusion Transformers (DiTs) are increasingly adopted for longvideo generation, yet inference is dominated by the quadratic cost of 3D full attention. Sparse attention mitigates this bottleneck by exploiting token-pair redundancy and pruning query-key interactions. Nevertheless, its effectiveness on long video generation is often constrained by non-sparse attention heads, making it hard to strike a good balance between inference speed and generation quality. To address this, we identify another pervasive but overlooked redundancy specific to video DiTs: frame-block similarity, where frame-blocks in attention weights exhibit highly similar distributions and can be well approximated by lightweight linear calibration. Motivated by this observation, we propose EchoAttention, which jointly leverages token-pair redundancy (Sparse operator) and frame-block similarity (Echo operator), together with a fine-grained routing policy learned via three-stage distillation. This design enables efficient handling of both sparse and non-sparse heads, overcoming the inherent ceiling of purely sparse attention and yielding a better speed-quality trade-off. Across public video DiTs, EchoAttention consistently improves the speed-quality frontier over SOTA sparse-attention baselines, reducing end-to-end latency up to 2.42× with minimal quality loss.

Abstract:
Why do neural networks generalize well on natural data? Natural data originates from processes subject to specific physical constraints, such as temporal and spatial invariance, that make it easier to learn. We investigate the sufficiency of these properties using 2D cellular automata as a controlled testbed: systems that are perfectly local, symmetric, and deterministic. We find that these conditions alone are not sufficient to predict the kstep evolution of a cellular automaton. We then examine smoothness (average sensitivity) as an additional criterion and find it predictive but still incomplete. Finally, we introduce a circuit complexity perspective, hypothesizing that natural functions are computable by small circuits. Junta coefficients, measuring the concentration of Fourier weight by interaction degree, provide a tighter predictor of learnability and a correspondence to combinatorial complexity. Across architectures (CNNs, transformers, MLPs), learnable functions are predominantly those with spectral weight concentrated at low degrees and therefore low complexity. These results would be consistent with the hypothesis that natural data is learnable because natural dynamics filters out complex, high-degree interactions.

Abstract:
As large language models (LLMs) continue to scale, it becomes increasingly challenging to grow model capacity under fixed computation budgets. We propose PathAligned Decompression Distillation (PADD), a framework for distilling knowledge from dense teachers without explicit routing into mixture-of-experts (MoE) students while learning high-quality routing policies. PADD organizes knowledge distillation into four stages in two phases: an initialization phase (Stage I) that builds diverse functionality in the student's experts through teacher neuron clustering and student-expert warmup, and a training phase (Stages II--IV) that integrates online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing in a single training pipeline.Experiments on mathematical reasoning benchmarks demonstrate that PADD yields substantial gains over strong baselines at the same inference cost and that the MoE student can match or surpass its dense teacher. They also demonstrate effective teacher-to-student knowledge distillation and stable routing behavior.

Abstract:
Many optimization problems in machine learning and data science—from deep neural networks to Bayesian inference and beyond—fall outside the standard Lipschitz smoothness framework that underpins the convergence theory of stochastic gradient descent (SGD). Motivated by this theorypractice disconnect, we examine the almost sure convergence of the trajectories of SGD in non-convex landscapes under a generalized (L_0,L)1)-smoothness condition which allows for gradients with superlinear growth (even exponential). We begin by proposing a taming scheme for SGD that achieves almost sure convergence under a generalized ABC-type condition on the gradient noise. Subsequently, to relax this requirement, we introduce a more flexible, dissipative taming scheme which converges almost surely under less restrictive moment bound conditions for the stochastic gradients entering the process. For both taming schemes, we show that the generated trajectories avoid strict saddle points (and/or manifolds thereof) with probability 1 so, generically, both methods only converge to local minimizers.

Abstract:
Infrared and visible image fusion (IVIF) aims to synergize complementary thermal radiation and textural details for comprehensive scene perception. However, existing unsupervised paradigms often overlook the intrinsic topological consistency shared across modalities. Lacking explicit geometric regularization, encoders frequently succumb to degenerate numerical shortcuts, capturing superficial highfrequency noise rather than domain-invariant semantic structures to satisfy reconstruction objectives. To address this, we propose LaRA-Fusion, a framework achieving Latent-Robust Adaptation via Dual-Loop Manifold Constraints. We construct a strictly constrained latent space where an inner loop ensures geometric reversibility, while an outer loop anchors the generated representations to the intrinsic data manifold. This mechanism effectively mitigates latent space collapse, compelling the model to extract topologically aligned features that remain robust against modality-specific variations. Extensive experiments demonstrate that LaRA-Fusion outperforms state-of-the-art methods with superior robustness and interpretability.

Abstract:
Effective tool use and reasoning are essential capabilities for large reasoning models (LRMs) to address complex realworld problems. Through empirical analysis, we identify a prevalent "Lazy Reasoning" phenomenon, where LRMs frequently engage in repetitive and meaningless reflective reasoning. This occurs primarily due to their inadequate ability to decompose tasks when reasoning in complex tool use scenarios. To address this, we propose a two-stage training framework D-CORE ( Decomposing tasks and Composing Reasoning processes) that first incentivize the LRM’s task decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning (RL) to restore LRM's reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate superiority of our method: D-CORE-8B reaches 77.7% accuracy, surpassing the best-performing 8B model by 5.7%. Meanwhile, D-CORE-14B establishes a new state-of-the-art at 79.3%, outperforming 70B models despite being 5× smaller. The source code and data sample are in the supplementary material.

Abstract:
The limited transferability of adversarial attacks on VisionLanguage Models (VLMs) stems from their failure to navigate model-specific safety alignments, where superficial perturbations exploit surrogate-specific artifacts rather than shared safety-critical features. We reveal through linear probing that safety-related representations are concentrated within specific intermediate neuronal circuits, which act as localized defense bottlenecks that can be disentangled from transferable features. To overcome this barrier, we propose the Safety Circuit Intervention Attack (SCIA), a framework that surgically steers internal representations to bypass these localized safety mechanisms. SCIA employs a dual-objective steering strategy that suppresses the defensive circuit encoding safety features while amplifying the transferable circuit capturing model-agnostic representations, effectively decoupling adversarial patterns from surrogate-specific safety behaviors. Furthermore, we incorporate contrastive semantic steering and spectral smoothness regularization to guide optimization toward compliant semantic regions while producing visually coherent perturbations. Experimental results demonstrate that SCIA significantly outperforms state-of-the-art methods in bypassing unseen black-box VLMs.

Abstract:
Unintended codeswitching, which refers to the phenomenon where LLM unexpectedly switch languages, poses a fundamental challenge in the multilingual capabilities in LLMs. However, we still lack a mechanistic account of how this failure mode is implemented inside the model. For example, what internal components (i.e., circuits) give rise to unintended code-switching, where they emerge across layers, and how we can intervene to mitigate it. In this work, we introduce a scalable circuit discovery framework that causally localizes multilingual neurons and describes their functional patterns, then further groups them into interpretable circuits---without any additional training or manual annotation. Our findings lie in two folds: a) The model's ``speaking-a-language'' circuit decomposes into a language regime (detecting and maintaining language identity) and a semantic regime (retrieving language-agnostic semantics). b) The mechanism of unintended code-switching is a regime shift. Semantic regime suppresses the language regime, and overwhelms the multilingual circuit, leading the model to speak in unintended language. To validate these findings, we further fine-tune the identified language sub-circuit, reducing the code-switching rate by 20.8% with minimal parameter updates (～0.019% of all neurons). This work serves as a preliminary exploration of multilingual generation mechanism, offering actionable insight for targeted training for multilingual LLMs.

Abstract:
Hardwarerelated silent data corruptions during gradient aggregation pose significant challenges to fault-tolerant distributed training, often leading to slow or failed convergence. To address this, we first mathematically formulate these errors as gradient inconsistency and theoretically analyze how they result in accumulated model divergence. Guided by this analysis, we introduce PAFT, a fault-tolerant distributed training system designed with dynamic and asynchronous parameter synchronization. PAFT comprises two core components: PAFT-Sync, which mitigates divergence via periodic synchronization, and PAFT-Dyn, which minimizes overhead through dynamic training overlap and frequency scheduling. Furthermore, the system’s synchronization mechanism is optimized to support standard optimizers, including SGD, SGD momentum, and Adam. We implement PAFT on PyTorch Distributed, and experimental results training ResNet, GPT-2, and LLaMA-2 on 4～32 GPUs demonstrate that it efficiently defends against aggregation errors while maintaining training performance.

Abstract:
VisionLanguage Models such as CLIP exhibit strong zero-shot recognition capability by aligning images with textual concepts, yet they often underperform on multi-label recognition where multiple objects co-exist. A key bottleneck is that the CLS token, as a single global visual representation, is insufficient to faithfully encode diverse targets with varying scales, contexts, and co-occurrence patterns. To address this limitation, we present a new multi-label image recognition framework, termedPIAA, which formulates prediction as Patch-levelInference followed byAdaptiveAggregation. Specifically, we first enhance patch-wise predictions from two complementary perspectives: (i) mitigating semantic entanglement in the visual encoder to obtain more discriminative patch representations, and (ii) learning an unsupervised visual classifier to narrow the vision–language modality gap. We then introduce an adaptive aggregation module that consolidates patch-level scores into the final multi-label prediction. Notably, the entire pipeline is fullyTraining-free, requiring no gradient updates or parameter fine-tuning. Experiments show that our method achieves strong improvements with minimal extra computation, exceeding a 6% mAP gain on the challenging NUS-WIDE benchmark over representative baselines.

Abstract:
Controlling spreading processes on networks such as epidemics, information cascades, product adoption, requires policies that perform on realistic stochastic dynamics, not just tractable approximations. Yet policies trained on standard simplifications (meanfield ODEs, Markovian dynamics) suffer severe performance degradation at deployment. We trace this sim-to-real gap to three theoretical pathologies: Optimism Bias, where deterministic approximations systematically underestimate variance via Jensen's inequality; Hub Blindness, where global state aggregation obscures the super-spreaders driving scale-free networks; and the Valley of Death, where mean-value critics fail to navigate the bimodal nature (extinction vs. viral) of cascade outcomes. We resolve these challenges through two synergistic contributions. First, the Stratified Mean-Field Observer partitions nodes by influence tier, preserving hub dynamics at O(N) cost while producing fixed-dimensional observations that enable zero-shot transfer across network scales and topologies. Second, we demonstrate that Distributional RL via Truncated Quantile Critics is essential for risk-aware control of bimodal cascades. Trained on a GPU-accelerated simulator supporting non-Markovian renewal dynamics, our approach achieves 59× improvement over Markovian baselines and robust zero-shot transfer to real-world social networks (Facebook, Twitter, YouTube), effectively closing the simulation-to-reality gap.

Abstract:
Large generative models raise growing concerns about provenance, misinformation, and impersonation. Digital watermarking offers a principled solution, yet extending it to natural language remains challenging due to text discreteness and sensitivity to semantic perturbations. Existing text watermarking methods either operate at the token level requiring whitebox access and remaining fragile to paraphrasing, or at the sentence level, which supports black-box deployment but suffers from low Watermark Success Rate (WSR). We show that low WSR in sentence-level watermarking primarily stems from low injection success probability caused by a mismatch between posterior embedding distributions and semantic accept regions. Based on this insight, we propose X-Guard, a geometry-aware sentence-level watermarking framework that improves injection success by jointly optimizing embedding distributions and semantic space partitioning. X-Guard learns a more isotropic embedding space and introduces A^2PQ, a centroid-aligned partitioning scheme that approximately equalizes probability mass across regions. Extensive experiments across multiple models, languages, and attack settings demonstrate that X-Guard consistently improves robustness while preserving text fluency and practical deployability.

Abstract:
Thede novogeneration of highaffinity epitopes tailored to specific major histocompatibility complex (MHC) proteins is a pivotal challenge in computational immunotherapy. However, current methods struggle to effectively integrate the MHC context into the generation process, and often fail to guarantee high binding affinity due to the neglect of discriminative signals from non-binders. To bridge these gaps, we presentEpiCoCo, a probabilistic framework forEpitope generation via MHC-contextCo-modeling andContrastive affinity learning. EpiCoCo treats the pMHC complex as a dynamic, co-adaptive system by operating on the joint E(3) graph. In addition, we introduce Contrastive Affinity Guidance (CAG), an inference mechanism that leverages the gradient difference between learned high- and low-affinity distributions. CAG actively drives the generation trajectory towards high-affinity manifolds while utilizing repulsive signals to filter out candidates with poor binding potential. Extensive evaluations demonstrate that EpiCoCo achieves a mean binding free energy of -45.20 REU, a 23% improvement over the state-of-the-art, while maintaining high structural plausibility. The results validate that context co-modeling and negative-informed guidance are essential for generating valid, high-potency immunotherapeutics.

Abstract:
The stochastic minimax optimization problem has widespread applications in machine learning. Recently, numerous distributed minimax optimization algorithms have been developed to handle distributed training data. However, most of these algorithms suffer from high communication costs. To address this issue, we develop a novel communicationefficient decentralized stochastic gradient descent ascent with momentum algorithm based on the error feedback mechanism. Importantly, our algorithm demonstrates how to balance the full-precision update and the compression residual with novel designs for coefficients regarding variables and gradients to guarantee convergence. However, compressing the primal and dual variables (and their gradients) of stochastic minimax optimization problems with the error feedback mechanism presents significant challenges for convergence analysis. In particular, it incurs the circle dependence among consensus errors and compression errors. To overcome this challenge, we propose novel strategies that enable the establishment of the convergence rate for our algorithm. Our theoretical results demonstrate how the compression operator influences the convergence rate. Finally, extensive experimental results confirm the efficacy of our proposed algorithm.

Abstract:
Origin–destination (OD) flow modeling underpins urban planning and mobility analysis, but prevailing graphbased methods often neglect salient geographic attributes, limiting their ability to model long-range and multi-area dependencies. In this paper, we introduce GeoFlow, a novel framework that (i) augments area representations with geospatial attributes, including relative positions, -hop and geodesic distances, (ii) employs a specialized geometric-intrinsic fusion encoder design that combines graph attention for intrinsic area signals with coordinate-aware encoders for global structure, and (iii) adopts an axial-global attention decoder to capture OD-specific competitive dependencies. For OD flow generation, GeoFlow is paired with flow matching models to produce more authentic and diverse mobility samples. Empirically, GeoFlow achieves superior performance in predictive accuracy, while substantially improving generative fidelity and diversity. Ablation and analytical studies confirm the contribution of each component.

Abstract:
As data scaling approaches its natural limits, recent work has explored the potential of data reuse, whose impact on optimization remains poorly understood. This work investigates how data repetition affects learning efficiency in terms of compute. We investigate an inversescaling effect that using less data can lead to faster convergence, which is observed across algorithmic tasks, architectures and optimizers and cannot be explained using prior theory. Instead, we argue that the speedup comes from appropriate layer-wise norm growth, which can be achieved faster when the dataset size is smaller. We provide theoretical justification by analyzing the benefits of sampling biases induced by small datasets, and we present extensive empirical evidence supporting this hypothesis. Together, our results highlight the potential of unlocking more efficiency gain by jointly considering different scaling axes.

Abstract:
Enabling VisionLanguage Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which is difficult for real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by pigeons’ building and exploiting cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a new dynamic cognitive map parameterizing scene layout as object positions and orientations, serving as persistent memory for new observations. Second, we propose a novel Spatial Assertion Codes (SAC), Python expressions programmatically describing spatial relationships. By collaborating with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement finetuning. Experiments on MindCube benchmark demonstrate state-of-the-art performance with 80.5% overall accuracy, surpassing the best method by 53.2% on the challenging ROTATION subset. We will release the code and data soon.

Abstract:
The longitudinal management of blinding fundus diseases constitutes a Partially Observable Markov Decision Process (POMDP) necessitating a critical precisionrisk trade-off between intervention and over-treatment, as true pathology is often obscured in static observations. However, existing paradigms fail to address this complexity. Traditional vision models remain uninterpretable and memoryless, and while Vision-Language Models (VLMs) excel in semantic understanding, they rely on unsafe open-loop text reasoning lacking the anatomical grounding essential for clinical safety. Furthermore, robust learning is hindered by the scarcity of process supervision in sparse clinical records. To bridge this gap, we introduce the Logic-Constrained Abductive Data Engine. Operating on a ``Propose-and-Verify'' paradigm, it validates MLLM-Proposed biomarkers against clinical and temporal logic to reconstruct dense pathological states from sparse outcomes. Building on this foundation, we propose ORBIT, the first ophthalmic Prognostic World Model. Uniquely, ORBIT employs counterfactual visual foresight to imagine anatomical futures under different treatments, anchoring decisions in Closed-Loop Anatomical Verification rather than linguistic probabilities. Experiments demonstrate that ORBIT effectively captures disease evolution and establishes a new paradigm for autonomous diagnosis and reliable decision-making in complex ophthalmic environments.

Abstract:
This work stems from prior complementary observations on the dynamics of Chainof-Thought (CoT): Large Language Models (LLMs) is shown latent planning of subsequent reasoning prior to CoT emergence, thereby diminishing the significance of explicit CoT; whereas CoT remains critical for tasks requiring multi-step reasoning. To deepen the understanding between LLM's internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs, through our probing method, Tele-Lens, applying to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit amyopichorizon, primarily conducting incremental transitions without precise global planning. Leveraging this characteristic, we propose a hypothesis on enhancing uncertainty estimation of CoT, which we validate that a small subset of CoT positions can effectively represent the uncertainty of the entire path. We further underscore the significance of exploiting CoT dynamics, and demonstrate that automatic recognition of CoT bypass can be achieved without performance degradation.

Abstract:
Decentralized class continual learning refers to a paradigm where distributed clients continuously acquire new classes while retaining previously learned information without relying on a central server. With increasing emphasis on privacy preservation, there is a growing need for ondemand unlearning, introducing two key challenges: Historical Class Unlearning and Network-Wide Knowledge Entanglement. In this work, we propose a decentralized continual learning framework with on-demand unlearning (DCU), which is the first attempt at achieving class continual learning and arbitrary-time class unlearning in a distributed setting. Specifically, our proposed DCU comprises three main stages: prototypes extraction, prototype-guided continual learning, and unlearning with disposable prototypes. Firstly, the prototypes extraction mechanism is designed to capture the class-specific concepts as lightweight, disposable embeddings. Then, the synthetic data guided by these prototypes can be combined with real data to achieve incremental learning through distillation. Besides, synthetic samples with noisy label are used to guide the adjustment of the model's decision boundary, effectively erasing the influence of the target class while preserving other classes' knowledge. Extensive experiments conducted on two datasets demonstrate the effectiveness of our DCU in dynamic learning and target class unlearning.

Abstract:
Protein automodeling from cryoEM density maps faces unique challenges in enforcing physicochemical validity and managing conformational heterogeneity. Current solvers are often limited to static predictions or require computationally intensive heuristic searches. We present CryoACE, an end-to-end framework that reconstructs precise atomic graphs for both homogeneous and heterogeneous structures. Our method features two key innovations: an atom-centric reconstruction paradigm, where density features are sampled directly at atomic coordinates and iteratively recycled to refine structures—replacing expensive voxel convolutions for efficient multimodal fusion—and a training-free guidance mechanism that leverages predicted local resolution priors to resolve dynamic ambiguity. Validated on a newly constructed high-quality dataset, CryoACE significantly outperforms existing baselines on static benchmarks and, for the first time, unveils atomic-level dynamic conformations on complex real-world datasets like EMPIAR-10345 without relying on pre-built static structures. We release our code, model weights, and dataset to facilitate future research.

Abstract:
Large language model (LLM)based multi-agent systems commonly rely on natural-language prompts to specify agent behavior, yet optimizing these prompts remains challenging when agent roles and interaction structures are fixed by design. In such systems, behaviors emerge over long, noisy interaction trajectories, making it difficult to determine which prompt components are responsible for success or failure. As a result, outcome-level feedback alone is insufficient, while existing prompt optimization methods typically rely on final task scores or global prompt rewrites, limiting their ability to exploit trajectory evidence or support the localized updates. We propose Trajectory-based Rule Credit Estimation (TRUCE), a framework for prompt optimization in multi-agent systems that explicitly addresses this credit assignment challenge. TRUCE performs trajectory-aware attribution by linking outcome feedback to informative sub-trajectories and translating the resulting credit signals into unit-level edits over prompt-defined behavioral rules. By preserving agent roles and interaction structures, TRUCE enables prompt refinement through localized updates aggregated across tasks. Experiments on multiple benchmarks demonstrate that TRUCE consistently improves task performance and efficiency over competitive baselines.

Abstract:
Agentic reinforcement learning (RL) for web search is prohibitively expensive due to long context lengths and costly environment interactions, and this inefficiency is further exacerbated by GRPObased optimization, which discards learning signals from entire rollout groups with zero reward variance. In this work, we propose EVOLVING ROLLOUTS, an RL framework for web-search agents that moves beyond episodic training and distills collected rollouts into in-context guidance for future policy behavior. By extracting the reward-labeled trajectories into strategic experiences, our method augments standard parameter-space optimization with implicit context-space optimization guided by prior experience. This enables the agent to recover learning signals from zero-variance rollouts, thereby fostering co-evolution between the policy and the experience repository. EVOLVING ROLLOUTS improves sample efficiency and task performance across representative web search benchmarks, enabling Qwen3-4B models to achieve performance comparable to that of the substantially larger Qwen3-30B-A3B model on GAIA, Xbench, and HLE. We open-source our training framework to support reproducibility and future research.

Abstract:
Reasoning Large Language Models (LLMs) enable testtime scaling, with dataset-level accuracy improving as the token budget increases, motivating adaptive reasoning---spending tokens when they improve reliability and stopping early when additional computation is unlikely to help. However, setting the token budget, as well as the threshold for adaptive reasoning, is a practical challenge that entails a fundamental risk-accuracy trade-off. We re-frame the budget setting problem as risk control, limiting the error rate while minimizing compute. Our framework introduces an \emphupper threshold that stops reasoning when the model is confident (risking incorrect output) and a novel parametric lower threshold that preemptively stops unsolvable instances (risking premature stoppage). Given a target risk and a validation set, we use distribution-free risk control to optimally specify these stopping mechanisms. For scenarios with multiple budget controlling criteria, we incorporate an efficiency loss to select the most computationally efficient exiting mechanism. Empirical results across diverse reasoning tasks and models demonstrate the effectiveness of our risk control approach, demonstrating computational efficiency gains from the lower threshold and ensemble stopping mechanisms while adhering to the user-specified risk target.

Abstract:
Panoptic Video Scene Graph Generation (PVSG) aims to identify relations between pixellevel entities in a video, serving as a novel paradigm for structured video parsing. However, this task faces two key challenges. First, the interactions between entities are temporally fragmented and sparse, meaning videos are dominated by irrelevant content with limited salient information. Second, the distribution of relations exhibits a significant long-tailed pattern, making models struggle to perform well on tail categories with insufficient data. To address these issues, we propose SegPVSG, an innovative, temporal-segment-aware PVSG framework consisting of two key components: TempFocusNet (TFN) and Relation-centric Generative Video Augmentation (RGVA) module. TFN is a localization-then-recognition network that improves PVSG performance by explicitly localizing and focusing on salient segments before relation recognition. Meanwhile, RGVA is a novel augmentation module that generates realistic, context-consistent video segments for rare relations and coherently inserts them into original videos. Our method outperforms prior methods by +3.53 mR@20 and +5.9 mR@50, demonstrating its effectiveness. Code will be released.

Abstract:
Singlecell foundation models trained on millions of cells can learn gene expression patterns across diverse contexts. However, for predicting genetic perturbation effects they often underperform simple regression models. We hypothesize two potential limitations: targets defined on dropout-prone absolute expression, and pretraining focused on reconstructing absolute expression within cells, which captures static co-expression patterns but may not encode how genes co-regulate in response to expression changes. We introduce scDEBART, a foundation model pretrained to predict log fold-changes (logFC) conditioned on basal expression, thereby learning how gene sets co-vary across basal states at scale. To obtain reliable estimates of expression change under technical sparsity, we compute logFC from scVI-denoised expression and restrict pretraining to genes with robust detection. Pretrained on 6.28 million expression-change profiles from 66.6 million human cells and fine-tuned on five Perturb-seq datasets, scDEBART achieves mean enrichment factor (EF) of 11.96, 4--7× higher than scGPT and GEARS (mean EF 1.74--2.99), and 42.8% top-1 accuracy for reverse perturbation identification compared to near-zero accuracy for prior models. In cross-modal transfer to drug perturbations (SCIPLEX), the model shows dose-dependent enrichment (EF 2.03--4.31), suggesting partial transfer of learned regulatory patterns across modalities. Overall, these results indicate that large-scale pretraining on scVI-stabilized expression-change profiles provides a useful inductive bias for perturbation prediction.

Abstract:
Computing the differential entropy of distributions known only up to a normalization constant is a longstanding challenge with broad theoretical and practical significance. While variational inference is the most scalable approach for density approximation _from samples_, its potential in settings where _only the unnormalized density_ is available remains largely under-explored. The central difficulty lies in constructing variational distributions that simultaneously (i) exploit the structure of the unnormalized density, (ii) are expressive enough to capture complex target distributions, (iii) remain computationally tractable, and (iv) support efficient sampling. Recently, \citetmessaoud2024s introduced _P-SVGD_, a particle-based variational method that leverages Stein Variational Gradient Descent dynamics, satisfies all of these constraints and demonstrates promising results in low-dimensional setups. We show, however, that _P-SVGD_ does not scale to high dimensions due to _fundamental algorithmic flaws_: (i) misdiagnosed sensitivity to SVGD hyperparameters, (ii) violation of the global invertibility assumption in the entropy derivation, (iii) omission of a critical trace-of-Hessian term, (iv) along with suboptimal heuristics, including a divergence-based sampling check that induces mode collapse and loose informal bounds with no practical value. These issues severely limit both the correctness and the scalability of the approach. We propose _MET-SVGD_, a principled extension of _P-SVGD_ that addresses these flaws by providing a general framework for SVGD hyperparameters selection with global invertibility and convergence guarantees. This enabled more accurate and scalable entropy estimation in high-dimensional settings. Empirically, in entropy estimation benchmarks, _MET-SVGD_ achieves accuracy improvements of up to 12× and 16× over _P-SVGD_ and baselines from the _SVGD_ literature, respectively. On CIFAR-10 Energy-Based image generation, it improves FID by 80.4% compared to _P-SVGD_ and achieves 64× higher training stability. In Maximum-Entropy reinforcement learning, _MET-SVGD_ yields up to 16% better returns than _P-SVGD_. We will make our code publicly available at https://tinyurl.com/2esyfx8j.

Abstract:
Federated learning (FL) enables collaborative training across distributed edge devices, but deploying lightweight models in dynamic edge environments remains challenging. Existing methods typically require retraining whenever device resource constraints change, resulting in excessive computational overhead. We propose DFMPQ, a dynamic federated mixedprecision quantization framework that enables retraining-free deployment at the edge. DFMPQ trains a weight-sharing mixed-precision supernet via FL, which jointly represents diverse bit-width configurations. After training, resource-aware quantized subnets can be derived on demand to satisfy heterogeneous and time-varying resource constraints without additional optimization. Optimizing such a supernet in federated settings is difficult due to optimization interference among heterogeneous bit-widths and the coupling of quantization noise with non-IID data. DFMPQ addresses these issues through semantic-aware training and aggregation mechanisms that stabilize supernet optimization. In addition, a sensitivity-guided greedy search strategy is adopted to efficiently identify suitable quantization configurations under given resource budgets. Extensive experiments on multiple datasets and network architectures demonstrate that DFMPQ achieves competitive accuracy with significantly reduced computational cost, enabling efficient deployment for dynamic edge computing environments.

Abstract:
Class imbalance poses a critical challenge in federated learning (FL), where underrepresented classes suffer from poor predictive performance yet cannot be addressed by standard centralized techniques due to privacy and heterogeneity constraints. We propose FedCGNM (Federated ClassGrouped Normalized Momentum), a client-side optimizer in FL that partitions classes into a small number of groups based on minimum within-group variance, maintains a momentum per group, normalizes each group momentum to unit length, and uses the summation of the normalized group momentums as an update direction. This design both equalizes gradient magnitude across majority and minority groups and mitigates the noise inherent in rare-class gradients. We further provide a theoretical convergence analysis explicitly accounting for time-varying resampling rates. Additionally, to efficiently optimize these rates in small-client regimes, we introduce FedHOO, an X-armed-bandit (XAB) based algorithm that exploits federated parallelism that evaluates many combinations of two candidate rates per client at linear cost. Empirical evaluation on four public long-tailed benchmarks and a proprietary chip-defect dataset demonstrates that FedCGNM consistently outperforms baselines, with FedHOO yielding further gains in small-scale federations.

Abstract:
Deep learning–based watermarking has substantially improved robustness to realworld noise, but its performance degrades as the payload dimension increases. In contrast, coding-based methods such as quantization index modulation (QIM) do not suffer from this curse of dimensionality, although they are less robust to real-world noise. To leverage the strengths of both approaches, we propose OrthoMark, a framework that decouples robust feature extraction from message encoding. OrthoMark first learns a distortion-invariant feature representation using a deep robust feature extractor, and then performs watermark encoding and decoding in this feature domain using coding-based methods. Extensive experiments demonstrate that OrthoMark significantly improves the trade-off among visual quality, robustness, and capacity compared to prior deep watermarking methods, with particularly large gains in the high capacity regime, effectively overcoming the curse of dimensionality.

Abstract:
Recent advances in generative video models have blurred the boundary between real and synthetic content, raising urgent concerns about digital authenticity. Multimodal large language models (MLLMs) are appealing for AIgenerated video (AIGV) forensics due to their broad perceptual and reasoning capabilities; however, existing MLLM-based detectors still suffer from hallucination and unstable reasoning, yielding high false-alarm rates and generic, non-verifiable explanations. To address these issues, we propose Hermes, an evidence-driven agentic framework for trustworthy and explainable AIGV detection. Hermes is realized by three key capabilities: (1) Adaptive Instance-Conditioned Detection Strategy Planning, (2) Evidence-Centric Reasoning and Verification, and (3) Graph-Grounded Evidence Deliberation. Concretely, Hermes employs an instance-conditioned RAG mechanism to analyze each video and retrieve authenticity-verification knowledge for composing a tailored detection strategy. It then performs evidence-centric reasoning by constructing a verifiable Evidence Reasoning Graph (ERG) that maintains focus on authenticity verification and avoids attention drift or superficial reasoning. Finally, a multi-agent deliberation process audits and refines the ERG to reconcile conflicting evidence and enhance reliability. Supported by these capabilities and a rich library of internal and external forensic tools, Hermes achieves structured, verifiable, and interpretable decision-making. Extensive experiments show that Hermes delivers state-of-the-art performance while producing higher-quality, auditable explanations.

Abstract:
In openworld intelligent systems, processing continuous sensory streams disrupted by heterogeneous degradation sources presents a fundamental challenge: reconciling the inherent tension between observational completeness and reconstruction fidelity. Methods that prioritize completeness by bridging long-term occlusions often introduce spurious artifacts, while approaches focused on aggressive noise suppression inevitably disrupt temporal continuity and erase valid structures. To address this challenge, we propose NeuroMamba, a universal plug-and-play module that enhances spatiotemporal consistency in degraded streams. NeuroMamba tackles the dual objectives through two synergistic components. First, we propose a Regional Hybrid Spatiotemporal Rectification (HSR) module, which leverages the linear complexity O(L) of Mamba-based inertial modeling to recover long-range temporal dependencies and infer missing modalities under partial observability. Second, we introduce a Spiking Confidence Gate (SCG) that enforces reconstruction fidelity via physics-guided supervision. Acting as a hard neuromorphic filter governed by integrate-and-fire (LIF) dynamics, SCG distinguishes valid geometric features from sensor noise based on accumulated temporal evidence. Extensive experiments on the nuScenes robustness benchmark demonstrate that NeuroMamba effectively reconciles the completeness-fidelity trade-off, achieving state-of-the-art performance in restoring high-fidelity spatiotemporal features from severely incomplete and degraded observations.

Abstract:
Dataset distillation, a trainingaware data compression technique, has recently attracted increasing attention as an effective tool for mitigating costs of optimization and data storage. However, progress remains largely empirical. Mechanisms underlying the extraction of task-relevant information from the training process and the efficient encoding of such information into synthetic data points remain elusive. In this paper, we theoretically analyze practical algorithms of dataset distillation applied to the gradient-based training of two-layer neural networks with width L. By focusing on a non-linear task structure called multi-index model, we prove that the low-dimensional structure of the problem is efficiently encoded into the resulting distilled data. This dataset reproduces a model with high generalization ability for a required memory complexity of \Theta(r^2d+L), where d and r are the input and intrinsic dimensions of the task. To the best of our knowledge, this is one of the first theoretical works that include a specific task structure, leverage its intrinsic dimensionality to quantify the compression rate and study dataset distillation implemented solely via gradient-based algorithms.

Abstract:
The reasoning patterns of large language models (LLMs) remain opaque, and Reinforcement learning (RL) typically assigns uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work treats attention as a natural substrate for interpreting LLM reasoning and a window for aligning optimization with its internal dynamics. We first distinguish attention heads between locally and globally focused information processing and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We quantify these with two metrics measuring the extent of backward attention within a clipped window and the average attention a token receives from subsequent tokens, respectively. Taken together, these signals reveal a recurring preplanand-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks.

Abstract:
While LLMdriven Neural Architecture Search (NAS) leverages exceptional code generation and reasoning, it suffers from a critical "Semantic-Physical Misalignment": LLMs prioritize high-level semantic plausibility but are agnostic to intrinsic physical dynamics such as gradient flow, whereas Zero-Cost Proxies (ZCPs) capture these local sensitivities but lack global semantic planning. To bridge this gap, we propose SAGE-NAS, a closed-loop evolutionary framework that synergizes an LLM-Based Semantic Agent with a Graph-Based Evaluator. Specifically, SAGE-NAS coordinates an LLM-driven Semantic Agent to construct candidate architectures by dynamically scheduling complementary sub-policies that balance exploitation with exploration. Furthermore, the framework integrates a Dual-Modality Graph Evaluator that serves as a rapid performance predictor by fusing ZCP statistics with topological features, and a State-Aware Behavioral Atlas that guides sparsity-driven exploration to escape local optima. Experiments demonstrate that SAGE-NAS achieves state-of-the-art performance across multiple mainstream search spaces and downstream tasks, exhibiting a superior balance between search efficiency, model accuracy, and cross-task generalization capability.

Abstract:
Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinder truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional reencoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to reconstruct high-fidelity images. Through large-scale pre-training on 1T multimodal tokens, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on text-to-image generation and image editing while remaining competitive on multimodal understanding benchmarks.

Abstract:
While Multimodal Large Language Models (MLLMs) demonstrate impressive general capabilities, they struggle with finegrained perception in ultra-high-resolution (UHR) images, particularly for tiny objects in cluttered scenes. Existing methods face a dilemma: they either rely on inefficient prior-free scanning, or depend on static prior-driven heuristics that lack posterior correction to rectify initial model biases. To address this, we proposeBVS(BayesianVisualSearch), a framework that formulates perception as a global optimization problem over a continuous spatial-scale manifold. Specifically, BVS bridges prior guidance with posterior correction: it utilizes an early-stop attention rollout of MLLM to construct reasoning-aware priors, while employing a scale-aware non-stationary kernel and GP-UCB to dynamically rectify noise and recover missing information in the prior through iterative local observations. We provide theoretical guarantees via sub-linear regret bounds, and extensive experiments demonstrate that BVS significantly outperforms state-of-the-art baselines with a superior trade-off between accuracy and efficiency.

Abstract:
Accurately capturing local variations in long series has always been one of the most challenging problems in timeseries forecasting especially in medical signals, where local variations often indicate pathological events. Our study reveals a previously overlooked key bottleneck in this field: traditional global and local branches learn similar representations, leading to strong feature coupling and reduced sensitivity to local variations. To address this challenge, we propose the novel Dual-Stream Enhancement Mechanism, which structurally enlarges the difference between global and local patterns, enabling weak interactions between the two. Based on this idea, we introduce a new baseline model for blood glucose prediction: Dual-Stream Enhancement Network (DSENet), which fundamentally alleviates the problem of excessively strong coupling between global and local features. Experimental results show that our model achieves SOTA performance on multiple public datasets. Moreover, benefiting from extremely low computational cost, our model demonstrates strong application potential and can serve as a baseline model in multiple domains in the future.

Abstract:
Hierarchical decisionmaking frameworks are pivotal for addressing complex control tasks, enabling agents to decompose intricate problems into manageable subgoals. Despite their promise, existing hierarchical policies face critical limitations: (i) reinforcement learning (RL)-based methods struggle to guarantee strict constraint satisfaction, and (ii) optimal control (OC)-based approaches often rely on myopic and computationally prohibitive formulations. To reconcile these trade-offs, hierarchical RL-OC architectures have emerged as a promising paradigm. However, the formulation of the lower-level optimization within these frameworks remains underexplored, often relying on heuristic or myopic objectives. In this work, we propose a principled framework that systematically integrates upper-level goal abstraction with structured lower-level decision making. We adopt an inverse optimization approach to inform the structure of the lower-level problem from expert demonstrations, ensuring that the objective of the lower-level policy remains aligned with the overall long-term task goal. To validate the approach, our framework is evaluated on distinct decision making tasks: network-based resource allocation and continuous collision avoidance. Empirical results demonstrate that our method consistently outperforms strong baselines based on end-to-end RL, learning-augmented optimal control, and existing hierarchical RL approaches in both efficiency and decision quality.

Abstract:
Data Assimilation (DA) aims to integrate observations with model forecasts to estimate the state of dynamical systems. Despite the widespread application of diffusionbased assimilation methods, they remain constrained by the high dimensionality of atmospheric states and the reliance on imperfect state-observation mappings. This leaves regions lacking observations spatially unconstrained, leading to severe error accumulation and physical inconsistency.. In this paper, we propose LoPhyDA, a diffusion assimilation algorithm dual-guided by low-rank tensor and physical gradients. By leveraging the low-rank property of meteorological field, we employ tensor completion to exploit spatial continuity and dynamic correlations, reconstructing a globally informative dense field from sparse observations to serve as a global prior. This approach mitigates the information deficit inherent in sparse settings. The framework further incorporates physical constraints into the iterative denoising process, utilizing Partial Differential Equation (PDE) residual gradients to rectify the generative trajectory in real-time. Experimental results demonstrate that LoPhyDA outperforms state-of-the-art generative assimilation models in global weather prediction. It achieves robust and physically consistent assimilation, significantly reducing error accumulation in regions lacking observations.

Abstract:
Early exit neural networks, which adapt computation to input complexity, have proven effective in supervised learning but remain largely unexplored in deep reinforcement learning (DRL). In this paper, we propose the use of Budgeted EXit Actor (BEXA), which is a novel actorcritic architecture that integrates early exit branches into the actor network. These branches are trained via the underlying DRL method and use a constrained value-based criterion to decide when to exit, allowing the policy to dynamically adjust its computation. BEXA is general, easy to tune and compatible with any off-policy actor-critic method. We evaluate BEXA using different DRL methods such as SAC and TD3 on a suite of MuJoCo tasks. Our results demonstrate a substantial improvement in inference efficiency with minimal or no loss in performance. These findings highlight early exits as a promising direction for improving computational efficiency in DRL.

Abstract:
Federated learning (FL) enables distributed model training, yet in heterogeneous deployments, BandwidthConstrained Clients (BCCs) often contribute inefficiently due to limited uplink bandwidth. In model-heterogeneous FL with fixed small sub-models, BCCs with sub-models may improve quickly in early rounds but become under-parameterized later, resulting in slow convergence and poor generalization. To address this challenge, we propose FedGMR, a federated learning framework centered around Gradual Model Restoration (GMR), where GMR progressively increases each client’s sub-model density during training, allowing BCCs to remain effective contributors throughout optimization. To make GMR practical under real-world heterogeneity, FedGMR is realized as an end-to-end workflow with asynchronous coordination and stable, mask-aware aggregation. We further establish convergence guarantees, showing that the aggregation error scales with the average sub-model density across clients and rounds, and that GMR provably narrows the gap toward full-model FL. Extensive experiments on FEMNIST, CIFAR-10, ImageNet-100, and StackOverflow demonstrate that FedGMR improves both convergence speed and final accuracy, especially under severe heterogeneity and non-IID data distributions.

Abstract:
Synthesizing realistic time series with generative models has wideranging applications in real-world scenarios. Despite recent progress, most existing methods are trained under the assumption of abundant training data, which substantially limits their effectiveness in data-scarce settings. In this paper, we propose TimeMoDE, a novel framework that integrates Diffusion Transformers with Mixture-of-Experts to exploit both domain adaptability and diffusion-stage awareness for time series generation under data scarcity. It is pre-trained on a large-scale collection of multi-domain datasets to extract domain-agnostic temporal representations and domain-specific information benefiting generalization during fine-tuning. We propose Domain Prompts to condition expert assignment for indistinguishable noised tokens, mitigating the limitations of capturing inter-dataset relationships. Moreover, we incorporate diffusion timestep signals to equip the experts with awareness of time series degradation variations, facilitating adaptive calibrate to stage-dependent denoising requirements. Extensive experiments demonstrate that TimeMoDE outperforms existing methods under diverse low-data settings. It establishes an innovative paradigm for advanced time series few-shot generation.

Abstract:
Robust signal enhancement at extremely low SNR is fundamentally challenging because noise becomes strongly entangled with the signal and corrupts local time–frequency (TF) evidence. In this regime, fixed resolution shorttime Fourier transform (STFT) enhancement with purely data driven convolutional biases can become overconfident in unreliable TF regions, causing unstable suppression or residual artifacts. We propose FracKGMF, which couples Fractional Distance Decay Convolution (FracConv) with Knowledge Guided Multi-view Fusion (KGMF) for expressive TF modeling and reliability aware decisions under heavy corruption. FracConv introduces a lightweight fractional distance decay family that reshapes local interactions into long tailed receptive patterns, enabling aggregation of weak but globally consistent cues when per-bin observations are ambiguous. KGMF further injects a Wiener inspired reliability view derived from noise statistics to calibrate multi-view fusion, avoiding over-suppression in uncertain regions while exploiting confident structure for effective denoising. Across speech and electromagnetic (EM) benchmarks, FracKGMF improves perceptual quality and intelligibility over state-of-the-art baselines, with especially robust gains at extremely low SNR; on EM dataset at -20 dB, it achieves an average improvement of 33 dB where conventional TF methods performed poorly. The code will be released upon acceptance.

Abstract:
Consumerlevel applications require fast optimization of 3D Gaussian Splatting (3DGS) with high-fidelity novel view rendering. However, existing 3DGS acceleration approaches still incur substantial computation on redundant pixels while sacrificing fine details. In this paper, we present TurboGS, an error-guided training framework that accelerates 3DGS by concentrating optimization on perceptually informative pixels. TurboGS is built upon four core components: (1) a tile-wise sparse pixel sampling, which, driven by multi-view reconstruction errors during training, prioritizes challenging regions and skips well-reconstructed ones to avoid redundant gradient computation; (2) a tile-wise structure-aware loss with sparse Normalized Cross-Correlation, which provides sparse yet effective supervision to preserve fine details and stabilize training; (3) an error-driven Gaussian density control strategy, which dynamically allocates model capacity and removes redundant primitives; and (4) a tailored hybrid optimizer that couples Hessian-informed updates with Adam moment damping to stabilize and improve convergence under sparse supervision. Experiments on standard benchmarks demonstrate that TurboGS can deliver on par or superior rendering quality within 100 seconds (up to 10x training speedup over vanilla 3DGS).

Abstract:
Large Language Models (LLMs) demonstrate remarkable performance across various natural language processing tasks but struggle with complex logical reasoning, particularly in realworld settings. Existing research is largely confined to the closed-world assumption, which posits that all premises required for reasoning are explicitly provided. However, real-world tasks frequently exhibit open-world characteristics, where the provided information is insufficient to infer a conclusion due to missing premises or implicit commonsense knowledge. To address this limitation, we propose OpenIKLR, an Open-world Incomplete-Knowledge-aware Logical Reasoning framework that integrates symbolic logic solvers with LLMs. OpenIKLR first translates natural language into symbolic representations to precisely pinpoint reasoning gaps via a logical solver. It then iteratively generate a minimal set of necessary missing premises using LLMs. To ensure these additional premises are both logically sound and factually accurate, we introduce a dual-verification process: logic verification via the solver and fact verification via the LLMs. Extensive experiments demonstrate that OpenIKLR consistently outperforms existing logical reasoning and RAG baselines across multiple backbones and real-world datasets, highlighting its efficacy in handling incomplete information. The code is available at https://anonymous.4open.science/r/ICML26_22398-B5BF/.

Abstract:
3D Gaussian Splatting (3DGS) is a powerful technique for creating highfidelity 3D assets. However, the widespread sharing and iterative modification of 3DGS models across digital platforms create pressing challenges for intellectual property protection and forensic traceability. To address this, we propose GaussTrace, a novel framework for constructing directed provenance graphs for 3DGS models. GaussTrace formulates provenance analysis as an evidence-based reasoning problem. It builds upon attribute-wise statistical profiling of 3DGS parameters to capture intrinsic properties. Moreover, we introduce hypothesis-driven editing simulations of common operations to provide auxiliary evidence for plausible transformation pathways. These statistical and simulated cues jointly enable a Large Language Model (LLM) to perform structured Chain-of-Thought (CoT) reasoning, yielding directional provenance inferences and explainable edge reasons. Experimental results demonstrate that GaussTrace effectively constructs evolutionary relationships among diverse 3DGS models, delivering accurate, interpretable, and robust provenance graphs without requiring model training or access to editing histories.

Abstract:
Integrating Large Language Models (LLMs) into time series tasks has yielded impressive performance. While some works aim to enhance accuracy by explicitly designing stepby-step reasoning into prompts, such explicit Chain-of-Thought (CoT) approaches are difficult to generalize to time series. This is because it is difficult to clearly define the reasoning trajectories of time series. In addition, the high heterogeneity across time series often requires specialized prompt designs, limiting the model's scalability. To address these challenges, we proposeTime-CoT(TimeSeriesChain-of-Thought), a hierarchical reasoning framework based on temporal semantic codes for multivariate time series classification. This framework automatically constructs scenario-specific reasoning trajectories based on the characteristics of time series, thereby better eliciting the LLM's reasoning capability for time-series data. Specifically, Time-CoT, we first perform temporal representation pre-training with a multi-view temporal representation fusion to acquire high-quality temporal embeddings. We then discretize these temporal embeddings into hierarchical temporal semantic codes as the reasoning trajectory. Finally, the LLM predicts temporal semantic codes in a stepwise manner and then infers the final labels, thereby establishing a coarse-to-fine decision process. Experiments on ten public multivariate time series datasets demonstrate that the Time-CoT effectively adapts to diverse datasets and outperforms state-of-the-art methods. Our code is available at .

Abstract:
Recent advancements in Reinforcement Learning (RL), particularly Group Relative Policy Optimization (GRPO), have significantly enhanced the reasoning capabilities of Large Language Models. However, applying these problemcentric optimization methods to role-playing agents often leads to a loss of character fidelity and style collapse, as they prioritize context-specific utility over persona alignment. To address this, we propose Character-Centric Group Relative Policy Optimization (CRPO), a framework designed to realign RL objectives with the role-playing task. CRPO improves character distinctiveness through three mechanisms: decoupling task logic from stylistic rewards to resolve gradient conflicts, dynamically adapting optimization constraints based on character complexity, and utilizing generic responses as negative baselines to prevent the model from reverting to a common distribution. Extensive experiments demonstrate that CRPO outperforms existing methods in consistency, emotion and others.

Abstract:
Large Language Models (LLMs) show strong reasoning abilities, yet their reliability is hindered by hallucinations, where fluent reasoning becomes factually or logically incorrect. Most existing uncertaintybased detectors rely on sequence-level averaging, which ignores the step-wise dynamics of reasoning and often misclassifies hard-but-correct or easy-but-wrong samples. We propose a dynamic perspective that models reasoning as a trajectory on a latent \emphEvidence Manifold, where each step is supported by local evidence. Hallucinations are characterized as \emphEvidence Drops, i.e., sudden declines in local evidence support that indicate topological deviations from this manifold. Based on this insight, we design a training-free and model-agnostic detector that identifies hallucinations via the worst-case Evidence Drop and enables step-level error localization. Experiments on GSM8K, MATH, and ProcessBench show consistent improvements over sequence-level uncertainty baselines in selective accuracy and risk–coverage trade-offs.

Abstract:
Asynchronous stochastic gradient descent (ASGD) is widely adopted in distributed and federated learning. In this paper, we develop a sharp generalization analysis for ASGD by leveraging the concept of onaverage model stability. For convex and smooth objectives, we establish stability and excess risk bounds under minimal assumptions, removing Lipschitz continuity, bounded noise, bounded parameter or data domains, while allowing randomly partitioned data and arbitrary delays. Our bounds are optimistic and explicitly characterize the impact of worker participation, recovering the minimax-optimal rate O(1/\sqrtmn) in balanced regimes where mn denotes the sample size and implying fast rates under low-noise conditions. We further extend the analysis to non-smooth objectives with Hölder-continuous gradients and to heterogeneous data settings via random ASGD, obtaining non-vacuous excess risk guarantees in both settings. Experimental results support our theoretical findings.

Abstract:
Recent large language models (LLMs) have shown rapid progress on readingbased question answering (QA), where the evidence is explicitly provided or trivially retrievable. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in a massive collection of data lakes, necessitating searching as a prerequisite for answering. However, there is a lack of a comprehensive benchmark that requires searching and reasoning over a large collection of data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes \emphsearching and \emphreasoning capabilities. LakeQA is built on a heterogeneous collection of ~9.5 TB text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure the quality of LakeQA's tasks, each sample is annotated by at least one Ph.D level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct document(s) and then compose evidence across sources to produce the answer. Intensive experiment results on seven frontier LLMs have demonstrated that LakeQA is challenging. For instance, GPT-5.2 only obtains an exact matching score of 14.73% on LakeQA. Overall LakeQA provides a realistic testbed for developing LLM agents that can both \emphfind and \emphanalyze data in modern data lakes.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated strong perception and reasoning capabilities. However, most existing models focus on isolated objects and neglect structured relationships for efficient target navigation, limiting their performance on visually intensive tasks. To address this challenge, we introduce Scene Graph Thinking (SaGe), a novel paradigm that enables finegrained and structured visual reasoning through explicit scene-graph representations. Specifically, we first introduce an automated data engine that converts flat image–text corpora into structured scene graphs, where hierarchical entities constitute the nodes and diverse visual relations define the edges. Building upon this, we construct 120K high-quality training data by sampling reasoning traces from scene graphs. Then two-stage graph-aligned post-training paradigms are introduced, where supervised fine-tuning internalizes MLLMs with structured reasoning, and subsequent reinforcement fine-tuning proposes node-as-proxy graph rewards to consolidate efficient graph exploration. With curated data and graph-aligned training, our approach achieves significant improvements across eight multimodal benchmarks, demonstrating strong effectiveness on fine-grained perception and reasoning tasks.

Abstract:
Identifying molecular structures from spectral data is essential for earlystage chemical analysis, yet it remains a difficult task due to the imbalance in functional group distributions. Current methods often overfit to prevalent groups while neglecting underrepresented ones, failing to capture key dependencies between functional groups. This highlights the need for a unified approach that addresses both data imbalance and structural constraints. In this work, we present SymSpectra, a Symmetric Conditional Information Bottleneck (SCIB) framework designed to seamlessly integrate multi-modal Spectra features. Our model employs the SCIB framework to fuse multi-modal spectroscopic data into a unified representation, effectively preserving discriminative signals while mitigating redundancy. To enhance robustness against data imbalance, we incorporate conditional mutual information into the training objective, increasing the model’s sensitivity to rare functional groups and challenging molecular cases. Additionally, a specialized module captures the dependencies among functional groups, improving both prediction accuracy and chemically meaningful interpretability. Experiments on multimodal spectral datasets demonstrate that SymSpectra significantly outperforms state-of-the-art methods, achieving an F1-score of 0.970 in substructure classification. More importantly, SymSpectra consistently outperforms baselines under various imbalanced scenarios, exhibiting superior robustness and generalizability, which may help advance the automation of chemical discovery. Our code can be found at https://anonymous.4open.science/r/SymSpectra-0017.

Abstract:
Time series forecasting plays a vital role in numerous domains. However, realworld time series are frequently contaminated by noise, missing values, and anomalies, posing significant challenges to reliable forecasting. In this work, we first systematically investigate a fundamental limitation prevalent in existing forecasting methods: an excessive reliance on the most recent observations---termed "recent-data bias". This bias renders forecasts highly vulnerable to perturbations in recent data, severely undermining prediction reliability. To address this issue, we propose TameR, a novel approach for robust time series forecasting that effectively mitigates recent-data bias via enhancing the utilization of global context. Specifically, it employs a basis-aligned randomized sampling strategy to reduce dependence on any specific recent data. Furthermore, TameR incorporates a learnable periodicity extraction module coupled with a two-stage learning protocol to robustly separate periodic patterns from the sampled residual components. Comprehensive experiments demonstrate that TameR significantly outperforms state-of-the-art methods in robustness against diverse perturbation scenarios, while achieving comparable accuracy on clean data.

Abstract:
In generative modeling, we often wish to produce samples that satisfy a userspecified reward such as measurement consistency, aesthetic quality, or alignment with human intent, a problem known as inference-time guidance. While flow-based models enable high-quality generation, existing guidance methods either require expensive multi-particle, many-step schemes to sample from a reward-tilted distribution or rely on heuristic approximations. To design efficient algorithms, we instead reformulate guidance as a deterministic optimal control problem, rather than the stochastic control problem from which the reward tilt emerges. We find that the flow map arises naturally in the optimal solution, and that many existing flow-based guidance methods are best understood as coarse approximations that replace it with a single Euler step. Rather than relying on these approximations, we propose Flow Map Trajectory Guidance (FMTG): a principled framework that uses the flow map to both integrate and guide, enabling training-free alignment in just a few network evaluations. We demonstrate FMTG at text-to-image scale using a FLUX.1-distilled flow map, showing that it achieves comparable performance to baselines across inverse problems and image editing tasks with up to 10 times fewer function evaluations.

Abstract:
Generating accurate and calibrated confidence estimates is critical for deploying LLMs in highstakes or user-facing applications, and remains an open challenge. Prior research has often framed confidence as a problem of eliciting a model’s “self-knowledge”, i.e., the ability of an LLM to judge whether its own answers are correct; this approach implicitly assumes that there is some privileged information about the answer’s correctness that is accessible to the model itself. However, we find that whether trained or training-free, an LLM attempting to predict the correctness of its own outputs generally performs no better than an unrelated LLM attempting the same task. Moreover, we hypothesize that a key factor in predicting model correctness, i.e., building a “Correctness Model” (CM), is exposure to a target model’s historical predictions. We propose multiple methods to inject this historical correctness information, including training an LLM to predict the confidences of many other LLMs, i.e., creating a Generalized Correctness Model (GCM). We use GCMs and CMs as a lens for studying the source of correctness prediction ability and its generalization, studying the importance of answer phrasing, world-knowledge, performance history, in-context examples, and posthoc-calibration for correctness prediction. We evaluate GCMs based on Qwen3-8B across 5 model families and the MMLU and TriviaQA datasets, as well as on a downstream selective prediction task, finding that reliable LLM confidence estimation is a generalizable and cross-model skill learned by systematically encoding correctness history rather than a model-specific skill reliant on introspection.

Abstract:
Zeroshot 3D anomaly detection is essential for industrial quality inspection, where labeled anomaly samples are scarce. Meanwhile, existing methods lack an effective mechanism to fuse complementary 2D color images with 3D geometric structures, limiting their ability to detect both surface and structural defects in a unified framework. To address these issues, we propose CoGeoAD, a unified CLIP-based framework that fuses color and geometric features by constructing pixel-aligned paired multi-view images. The framework introduces a Data-Driven Multi-View Attention (MVA) mechanism to adaptively aggregate 3D features and a Multi-Stage Color-Geometric Fusion (MS-CGF) module to hierarchically integrate multi-level features from both modalities. Extensive experiments on the MVTec3D-AD and Eyecandies benchmarks demonstrate that CoGeoAD achieves state-of-the-art performance, effectively capturing both structural and textural anomalies in complex industrial scenarios.

Abstract:
Endto-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird’s-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We present a novel, efficient, and effective approach that achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Code will be released soon.

Abstract:
Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in HeavyTailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures (from LLaMA to GPT-nano), optimizers (AdamW and Muon), and parameter scales (60M–1B) demonstrate that LLR achieves up to 1.5× training speedup and outperforms baselines, notably raising average zero-shot accuracy from 47.09% to 49.02%. A key advantage of LLR is its low tuning overhead: it transfers nearly optimal LR settings directly from the uniform baseline. Our code is submitted.

Abstract:
Oncologists are increasingly relying on multiple modalities to model the complexity of diseases. Within this landscape, transcriptomic and epigenetic data have proven to be particularly instrumental and play an increasingly vital role in clinical applications. However, their integration into multimodal models remains a challenge, especially considering their high dimensionality. In this work, we present a novel bimodal model that jointly learns representations of bulk RNAseq and DNA methylation leveraging self-supervision from masked language modeling. We leverage an architecture that reduces the memory footprint usually attributed to purely transformer-based models when dealing with long sequences. We demonstrate that the obtained bimodal embeddings can be used to fine-tune cancer-type classification and survival models that achieve state-of-the-art performance compared to unimodal models. Furthermore, we introduce a robust learning framework that maintains downstream task performance despite missing modalities, enhancing the model’s applicability in real-world clinical settings.

Abstract:
Traditional hallucination detection fails on "Stubborn Hallucinations"—errors where LLMs are confidently wrong. We propose a geometric solution: EmbeddingPerturbed Gradient Sensitivity (EPGS). We hypothesize that while robust facts reside in flat minima, stubborn hallucinations sit in sharp minima, supported by brittle memorization. EPGS detects this sharpness by perturbing input embeddings with Gaussian noise and measuring the resulting spike in gradient magnitude. This acts as an efficient proxy for the Hessian spectrum, differentiating stable knowledge from unstable memorization. Our experiments show that EPGS significantly outperforms entropy-based and representation-based baselines, providing a robust signal for detecting high-confidence factual errors.

Abstract:
The operational cost of serving large language models remains prohibitively high, largely due to extreme workload heterogeneity in production traffic. We observe that combining disaggregated inference with resource autoscaling enables finegrained resource adjustment, allowing inference phases and operations to scale independently based on their specific bottlenecks. Building on this insight, we propose HexGen-3, a cost-effective LLM serving framework that leverages a fully disaggregated inference architecture and heterogeneous resource autoscaling. HexGen-3 introduces two key components: (i) A hierarchical scheduling framework that jointly optimizes resource allocation and parallelism configuration for any given resource provisioning, and (ii) an autoscaling framework that dynamically adjusts resources and triggers deployment rescheduling in response to workload fluctuations. Experiments comparing HexGen-3 against state-of-the-art LLM serving systems demonstrate up to 60% (on average 46.5%) improvement in per-cost throughput under static resource provisioning, and up to 78.3% (on average 55.1%) improvement with autoscaling enabled under dynamic workloads.

Abstract:
While recent advancements in Vector Quantization (VQ) models have successfully achieved complete codebook utilization, a critical bottleneck remains largely unexplored: the effective dimensionality of the codebook embedding space. We observe that discrete codebook representations tend to degenerate into lowdimensional subspaces, characterized by significantly lower effective rank than continuous representations during quantization. Through comprehensive spectral analysis, we identify that this dimensional collapse stems from the suppression of low-variance components inherent to the vector quantization process, thereby severely limiting the expressive capacity of VQ models. To mitigate this fundamental issue, we propose a simple yet effective codebook regularization strategy designed to restore low-variance components, effectively bridging the spectral gap between discrete codebook spaces and continuous representations. Extensive experiments demonstrate that this regularization objective is compatible with diverse VQ training paradigms, yielding significant improvements in reconstruction fidelity and downstream performance in autoregressive image generative models.

Abstract:
Finetuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which optimizes alternatively between utility updates and explicit safety projections with a set of safe data to enforce safety constraints. To curate safe data, we introduce a Relevance–Diversity Determinantal Point Process to select compact safe data, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy.

Abstract:
Lateral connections (LCs) are ubiquitous in the cortical circuits. While modern deep learning architectures have rich intralayer interactions (e.g., convolutional mixing, normalization, or attention) to support feature selectivity and contextual modulation, explicit excitatory and inhibitory (EI) LCs remain underexplored and unjustified additions for encoding models in both deep learning and neuroscience. In this work, we analyze and train stabilized supralinear networks (SSNs) with sufficiently strong recurrent excitation and feedback inhibition, using local unsupervised plasticity rules under natural image stimulation. We demonstrate that these LCs support a transition between dynamical regimes under different input conditions. During the transition, the network shifts from employing population coding to sparse coding balancing cost and performance: population coding extracts robust features from low-contrast or noisy inputs by recruiting more neurons while sparse coding encodes high-contrast, proper inputs efficiently with minimal cost. These results are then compared against sparse coding and ICA-based models. Our findings frame explicit E-I recurrent neural networks through the lens of dynamic coding strategies and provide insights into designing more adaptive and robust systems with a concrete example in vision.

Abstract:
Large Language Models (LLMs) with proprietary Chainof-Thought (CoT) capabilities constitute high-value intellectual property, yet protecting them against unauthorized theft and unlicensed commercialization remains a critical challenge. Existing watermarking paradigms are ill-suited for safeguarding these models: direct logit perturbations inevitably fracture the fragile logical consistency required for complex reasoning, or remain superficial enough to be erased by fine-tuning. In this paper, we propose BiCoT, a framework that embeds ownership directly into the reasoning representations via bi-level variational alignment. Instead of adding external perturbations, our method optimizes the model's internal states to collapse onto a signature subspace. This creates a functional entanglement where the watermark becomes a prerequisite for the model's reasoning utility: removing the signature destroys the capability. To handle representation drift in stolen models, we further introduce a Robust Subspace Registration (RSR) verifier. Experiments demonstrate that BiCoT achieves negligible fidelity loss while maintaining strong robustness against diverse attacks on both in-domain and out-of-distribution data.

Abstract:
Group fairness can ensure equitable performance across different demographic subgroups for medical image analysis. However, the current finetuned foundation models (FMs) exhibit significant subgroup disparity. One-shot federated learning (OFL) can potentially mitigate this by leveraging cross-institutional data diversity within a single communication round. But heterogeneous distributions across medical institutions may cause OFL local models to diverge severely, resulting in parameter conflicts that amplify disparity upon aggregation. To address these challenges, we propose Fair-FedMOE, a group-fair OFL framework for medical FMs. During local training, Fairness-aware Expert Routing routes samples to group-specific experts via learnable prototypes, enabling subgroup-specialized learning to capture group-specific features without inter-group interference. During model aggregation, Prototype-guided Differential Aggregation computes personalized weights based on prototype similarity and applies differentiated aggregation strategies to filter conflicting updates. We propose RES-AUC, a Rawlsian justice-inspired metric based on worst-group performance that remains stable as groups increase. Comprehensive experiments on diverse retinal datasets covering different modalities and diseases, using retinal-specific and general-purpose FMs, show consistent fairness gains without sacrificing accuracy. Code available at https://anonymous.4open.science/r/Fair-FedMOE-2624.

Abstract:
Standard preference alignment relies on a binary forcedchoice paradigm, assuming definitive preferences for all pairs. However, we find that indistinguishable pairs are prevalent even in standard benchmarks, where quality differences of two responses often fall below the labeler's discriminative resolution limit. Forcing a choice in such cases could inject significant noise that undermines policy optimization. In this work, we propose a silent-aware framework that introduces a principled way to allow annotators to stay silent (i.e., express ties) and then explicitly model these ties during optimization. Our findings reveal a compelling phenomenon: when ties are properly modeled, supervision from small models yields alignment surpassing that of forced-choice LLMs or human experts. This discovery highlights a cost-effective path for alignment: respecting a labeler’s resolution limit is more critical than increasing its capability, while simultaneously unlocking the latent value in existing benchmarks by properly modeling inherent tie signals without requiring any re-labeling effort. To leverage these signals, we propose several optimization objectives to drive the policy toward high-reward regions while mitigating unreliable updates that lead to arbitrary distribution shifts. Our approaches significantly surpass conventional alignment performance, consistently outperforming the strongest available baselines across diverse benchmarks.

Abstract:
VisionLanguage-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle with achieving efficient inference and generalizing to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and efficient action-generation VLA framework. The proposed model employs a set of learnable action queries, implicitly guided by predefined spatial representations to enhance spatial awareness. These learnable queries interact with reasoning-enhanced vision–language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning–based, and easy-to-use data format for model training. Leveraging both supervised learning and reinforcement learning fine-tuning, extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA achieves state-of-the-art performance, strong generalization capability, and the excellent inference speed with parallel decode.

Abstract:
Openvocabulary 3D scene understanding answers free-form text queries over reconstructed scenes. However, lifting dense 2D foundation-model embeddings into 3D Gaussian Splatting (3DGS) is still challenging. Existing 3DGS-based methods often average normalized embeddings in Euclidean space. This ignores their hyperspherical geometry and can cause feature collapse. They also distill supervision from all views equally, which amplifies occlusion noise and mixed-depth artifacts. We proposeRh-3DGS, a robust semantic 3DGS framework that uses reliability-aware distillation and manifold-consistent aggregation.Visibility-Calibrated Distillation (VCD)computes per-pixel reliability weights from rasterization statistics and down-weights ambiguous pixels.Visibility-Weighted Fréchet Mean (VFM)aggregates embeddings on the unit hypersphere with a Riemannian Huber objective for robust distillation.Lightweight Consistency Contrast (LIC)regularizes the 3D semantic field with neighborhood-based multi-positive contrast to improve local consistency and sharper boundaries. Experiments on three benchmarks show that Rh-3DGS is best on open-vocabulary segmentation, boundary quality, and view-consistent rendering.

Abstract:
Deep Neural Networks (DNNs) achieve outstanding performance in Time Series Classification (TSC) yet remain opaque "black boxes", hindering their adoption in sensitive domains. While Explainable AI (XAI) aims to bridge this gap, existing TSC XAI methods rely on a single perspective and incur significant computational costs, making them difficult to integrate into realtime applications. To overcome these challenges, we propose a framework, containing two key techniques called Instance-level Multi-Expert Knowledge Distillation (IMEKD) and Semi-Amortized Optimization Explanation (SAOE). Unlike static methods, our IMEKD approach bridges the gap between explanation methods by dynamically selecting the best attribution map from a pool of "XAI experts" for each instance. We then distill this instance-optimal knowledge into a DNN using our SAOE framework, a dual-stage process that learns a global attribution via distillation and refines it using faithfulness and robustness losses, aligning the optimization with objectives of human evaluation. To the best of our knowledge, this is the first work to unify multi-expert selection with semi-amortized optimization for TSC XAI. Also, we introduce a Faithfulness-Preserving Segmentation (FPS) mechanism that converts point-wise maps into interpretable segments without sacrificing fidelity to align explanations with human intuition. Comprehensive experiments on four synthetic datasets, a ECG dataset with human-verified ground truth, and 11 multivariate UEA benchmarks across three DNN architectures show that our framework significantly outperforms the current state-of-the-art (SOTA) in terms of faithfulness, robustness, and computational efficiency.

Abstract:
Diffusion Large Language Models (dLLMs) refine tokens iteratively but commit them irreversibly, leading to a "stability lag" where early decisions remain fragile even after being written. We reveal that PostTraining Quantization (PTQ) error easily flips these borderline decisions at the write frontier, which are then permanently locked in and amplified. To address this, we propose Frontier-Aware Instability-Reweighted Calibration (FAIR-Calib), a two-stage PTQ framework for dLLMs. Stage I probes a full-precision teacher to estimate a position prior that combines frontier hits and masked-stage reliability. Stage II performs off-policy, layer-wise calibration by minimizing a reweighted hidden-state MSE, effectively prioritizing the protection of fragile frontier states without requiring expensive end-to-end diffusion rollouts. We further theoretically justify our weighted objective as a surrogate for output KL divergence. Empirically, FAIR-Calib consistently outperforms state-of-the-art baselines on LLaDA and Dream (W4A4), significantly reducing frontier decision flips and suppressing post-commit mismatches across diverse benchmarks.

Abstract:
Textguided image-to-video generation has made substantial progress, yet it still struggles to execute text-specified edits that require substantial changes to a reference image (e.g., object addition, deletion, or modification). Empirically, our analysis reveals that this stems fromvisual dominance, where the reference image causes severe attention dispersion, inhibiting the model's ability to incorporate new semantic information. To address this, we proposeAlignVid, a training-free intervention that re-calibrates the model's internal attention distribution. Drawing on an energy-based perspective of attention, AlignVid employs Attention Scaling Modulation (ASM) to reduce attention entropy and concentrate focus on semantic tokens, alongside Guidance Scheduling (GS) to maintain generation stability. To rigorously assess this capability, we presentOmitI2V, a comprehensive benchmark for evaluating prompt adherence across object addition, deletion, and modification. Extensive experiments demonstrate that AlignVid effectively enhances semantic fidelity with negligible computational overhead.

Abstract:
Recent Large Reasoning Models (LRMs) have demonstrated powerful multistep problem-solving capabilities but often suffer from inefficiency due to an ``overthinking phenomenon", where they apply complex reasoning to simple tasks, resulting in unnecessary computational cost and latency. While adaptive reasoning models that can switch between generating explicit reasoning and producing direct answers offer a potential solution, their effectiveness is compromised by a critical flaw: they are often misled by superficial linguistic complexity, mistaking verbosely phrased simple problems for complex ones. To address this, we propose a two-stage training framework to create a more robust adaptive reasoner. The first stage uses supervised fine-tuning with augmented data—presenting simple problems in both concise and redundant forms—to teach the model to ignore superficial verbosity. Subsequently, a reinforcement learning phase utilizes Generalized Reward Policy Optimization (GRPO) with a custom reward function to refine the model's adaptive policy, ensuring it selects a reasoning mode based on true task complexity rather than surface-level cues. The resulting model reduces computational overhead without sacrificing accuracy and demonstrates improved robustness to misleading linguistic cues.

Abstract:
False discovery rate (FDR) is a cornerstone of modern multiple testing. However, it often fails to guarantee the reliability of ``marginal" discoveries that lie at the boundary of the rejection set, which are often crucial in highprecision applications. While recent works (Soloff et al., 2024; Xiang et al., 2025) introduced the boundary false discovery rate (bFDR) to control the error probability at the marginal discovery, their method relies on restrictive assumptions such as independence or specific prior distributions. In this paper, we first propose k-bFDR, a novel generalization that controls the error probability of the k least significant discoveries. We then provide a systematic investigation into the theoretical relationship between k-bFDR and existing error metrics. Furthermore, building upon the closure principle, we develop Domino, a unified framework that guarantees k-bFDR control under arbitrary dependence, applicable for both p-values and e-values. We prove the theoretical validity of the proposed Domino algorithm and demonstrate through extensive numerical experiments that it consistently achieves rigorous k-bFDR control while identifying trustworthy marginal discoveries. Analyses of real data reveal that k-bFDR control yields higher-quality rejection sets with greater practical significance.

Abstract:
Hypergraph Neural Networks (HNNs) have emerged as powerful tools for modeling complex highorder correlations. Most existing HNNs adhere to a two-stage message passing paradigm, where node feature propagation is mediated by hyperedges. In this paper, we identify two fundamental theoretical limitations inherent to this paradigm, which we term rank collapse and hyperedge semantic dependency. To address these challenges, we propose the Multi-Channel Hypergraph Neural Network (MC-HNN). We design a multi-channel message passing mechanism to maintain high-rank representations, while simultaneously introducing a latent hyperedge type encoding mechanism to inject an independent degree of freedom into hyperedge representations. Both theoretical insights and empirical experiments demonstrate that MC-HNN effectively mitigates the limitations of the prevailing paradigm and achieves superior performance.

Abstract:
Point cloud registration can be categorized into rigid and nonrigid settings depending on the motion characteristics of the underlying objects. Rigid alignment assumes a single global transformation under which corresponding points remain geometrically consistent across scales, whereas non-rigid alignment involves spatially varying deformations, where geometric similarity holds only locally and semantic correspondence dominates at larger scales. This multi-scale discrepancy creates an optimization gap that has made unified registration particularly challenging. To this end, we propose RGGT, a Generative-Prior-Guided Transformer that unifies rigid and non-rigid registration within a shared optimization space. Through coordinated design at the representation, architecture, and supervision levels, RGGT jointly captures local geometric details and global structural semantics: generative priors enrich point features with unified geometric–semantic cues; a Global–Self–Cross Attention module models long-range structure, local interaction, and bidirectional cross-shape reasoning; and a dual correspondence–reconstruction objective provides consistent supervision for both deformation types. Extensive experiments on rigid (ModelNet40) and non-rigid (4DMatch) benchmarks demonstrate that RGGT achieves state-of-the-art accuracy across both rigid and non-rigid settings within a single unified framework.

Authors: Santiago Gonzalez, Alireza Amiribavandpour, Peter Ye, Edward Zhang, Ruslans Aleksejevs, Todor Antić, Polina Baron, Sujeet Bhalerao, Shubhrajit Bhattacharya, Zachary Burton, John Byrne, Hyungjun Choi, Nujhat Disha, Koppány I Encz, Yuchen Fang, Robert Joseph George, Ebrahim Ghorbani, Alan Goldfarb, Jing Guo, Meghal Gupta, Stefano Huber, Annika Kanckos, Minjung Kang, Hyun Jong Kim, Dino Lorenzini, Levi Lorenzo, Tianyi Mao, Giovanni Marzenta, Ariane Masuda, Lukas Mauth, Ana Mickovic, Andrés Miniguano-Trujillo, Antoine Moulin, Wenqi Ni, Tomos Parry, Kevin Ren, Hossein Roodbarani, Mathieu Rundström, Manjil Saikia, Detchat Samart, Rebecca Steiner, Connor Stewart, Dhara Thakkar, Jeffrey Tse, Vasiliki Velona, Yunhai Xiang, Sibel Yalçın, Jun Yan, Ji Zeng, Arman Cohan, Quanquan Liu

Abstract:
As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLMas-a-Judge" protocols suffer from a systematic evaluation Alignment Gap when applied to upper-undergraduate to early graduate level mathematics. To quantify this, we introduce QEDBench, the first benchmark to systematically measure alignment with human experts on undergraduate-level math proofs by contrasting course-specific rubrics against expert common knowledge criteria. By deploying a dual-evaluation matrix (7 judges × 5 solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators like Claude 4.5 Opus exhibit significant positive bias (up to +0.28 mean score inflation), effectively "hallucinating rigor" in flawed proofs. Furthermore, we uncover a critical reasoning disparity: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 raw score), specialized reasoning models like o3-deep-research collapse in discrete domains, dropping to 42.1% accuracy in Graph Theory. We release QEDBench as a public benchmark for evaluating and improving AI judges.

Abstract:
Classincremental learning (CIL) requires models to continuously acquire new knowledge while avoiding catastrophic forgetting. While exemplar replay is effective, it raises concerns regarding privacy and storage. Thus, generative replay has emerged as a viable alternative, synthesizing old data using frozen pretrained text-to-image (T2I) models without any extra training. However, we observe that directly mixing synthetic old-class data with real new-class data during incremental training leads to significant performance degradation. This issue stems from a ‘’domain shortcut'', where models rely on domain-discriminative features instead of semantic class cues. To address this, we propose DREAM (Domain-Regularized Exemplar-free Alignment Model), which uses a training-free generator to synthesize old-class data and eliminates domain shortcut via subspace rectification and orthogonal projection, while reinforcing semantic alignment through real-anchored prototype regularization. Extensive experiments on 4 datasets demonstrate that DREAM outperforms existing exemplar-free CIL methods and achieves state-of-the-art performance.

Abstract:
Autonomous 3D indoor scene synthesis breaks down in nonconvex rooms with tightly coupled spatial constraints. Data-driven generators lack topological priors for long-horizon planning, while iterative agents fragment semantics and become geometrically brittle. We present ZoneMaestro, a unified framework that shifts the paradigm from object-centric synthesis to Zone-Graph Orchestration. By internalizing a novel zone-based logic, ZoneMaestro translates high-level semantic intent into functional zones and topological constraints, enabling robust adaptation to diverse architectural forms. To support this, we construct Zone-Scene-10K, a large-scale dataset enriched with explicit Zone-Graph annotations. We further introduce an Alternating Alignment Strategy that cycles between reasoning internalization and Zone-Aware Group Relative Policy Optimization (Z-GRPO), effectively reconciling the tension between semantic richness and geometric validity without relying on external physics engines. To rigorously evaluate spatial intelligence beyond convex primitives, we formally define the task of Intricate Spatial Orchestration and release SCALE, a stress-test benchmark for irregular indoor scenarios with complex, dense spatial relations. Extensive experiments demonstrate that ZoneMaestro resolves the density-safety dichotomy, significantly outperforming state-of-the-art baselines in both structural coherence and intent adherence.

Abstract:
This paper pays attention to openvocabulary 3D object affordance grounding (OVAG), which aims to localize affordance regions on 3D objects by leveraging interaction images or textual instructions. Most existing methods treat interaction images as sources of external affordance knowledge and align them with 3D visual representations, while overlooking the intrinsic relationship between local object attributes and affordances, which limits localization accuracy and generalization. For instance, a cup handle affords grasping due to its curved shape and appropriate thickness, indicating that affordances emerge from specific attribute compositions rather than global object appearance. Motivated by this, we propose Attribute-Affordance Hierarchies (AAH) learning framework that explicitly models the hierarchical relationships between object-region attributes and affordances. Our approach first captures local region relationships using hypergraph, and then projects these region-level concepts into a hyperbolic space to encode their hierarchical organization. Furthermore, we introduce counterfactual attribute samples to encourage robust learning of attribute–affordance dependencies under varying conditions. By jointly modeling visual structure and hierarchical concept information, our method achieves more accurate affordance localization. Extensive experiments and qualitative analyses demonstrate the effectiveness of our approach.

Abstract:
RetrievalAugmented Generation (RAG) augments large language models with external knowledge, which in turn exposes their retrieval corpora to data poisoning risks. However, existing poisoning attacks exhibit limited effectiveness against RAG equipped with a reranker to enhance retrieval quality. Remarkably, this defensive capability requires no adversarial training: a reranker fine-tuned solely on benign, in-domain corpora can effectively filter malicious content. Towards realistic RAG red-teaming, we conclude practical prompt design principles that reveal reranker blind spots. Building on these insights, we introduce the Prompt-Perturbation Poisoning Attack (\mathbfP^3 \mathbfA). \mathbfP^3 \mathbfA first employs rule-based prompt engineering to craft initial poisoned texts. It then injects subtle character-level perturbations into these texts, which promotes their ranking by the reranker while maintaining their adversarial effectiveness. These perturbations introduce only about 1% textual change, ensuring the poisoned texts remain natural and readable. Extensive experiments show that \mathbfP^3 \mathbfA achieves strong attack effectiveness and transferability, even when constrained to poisoning a single document. Code is available in the supplementary material.

Abstract:
The rapid evolution of audio deepfakes requires robust detection capable of generalizing to unseen attacks. Oneclass learning offers inherent robustness for this task by characterizing real speech distributions to detect anomalies. However, establishing a compact decision boundary without spoof supervision remains a fundamental challenge. Existing relaxed approaches often compromise this strictness by introducing auxiliary negative samples, which biases the boundary toward seen artifacts and degrades generalization to unseen attacks. To address this, we propose CA-SOADD, a framework that refines the acceptance region by constructing off-manifold boundary probes. Our proposed centroid-anchored tri-objective learning paradigm simultaneously enforces centroid compactness and a centroid-referenced margin against these probes, thereby explicitly tightening the acceptance region without treating them as an explicit negative class. We further extend the framework to heterogeneous settings through domain-conditioned centroids. Experiments on ASVSpoof and MLAAD benchmarks demonstrate that our strict real-only method consistently outperforms strong baselines under unseen attack types and domain shifts, with its effectiveness further validated through extensive ablation studies.

Abstract:
Efficient branching policies are essential for accelerating Mixed Integer Linear Programming (MILP) solvers. Their design has long relied on handcrafted heuristics, and now machine learning has emerged as a promising paradigm to automate this process. However, existing learning-based methods are often hindered by their dependence on expensive expert demonstrations and the gap between training objectives and the solver’s end-to-end performance. In this work, we propose LLM4Branch, a novel framework that leverages Large Language Models (LLMs) to automate the discovery of efficient branching policies. Specifically, the discovered policy is an executable program with a program skeleton generated by the LLM and a parameter vector, which is optimized via a zeroth-order method over a few instances with their end-to-end performance feedback. Extensive experiments on standard MILP benchmarks demonstrate that LLM4Branch establishes a new state-of-the-art among CPU-based methods and achieves performance competitive with advanced GPU-based models.

Abstract:
The advancement of artificial neural networks (ANNs) has been driven by diverse and wellestablished architectural designs, especially in connectivity. Biological neural networks, which exhibit a rich variety of neurodynamic circuits, offer a valuable source of inspiration for developing novel ANN models. In this study, we analyze the meta-connectivity structure and introduce a network motif-based approach, in which 13 distinct motifs are modeled as functional building blocks. These motifs represent low-dimensional, fundamental components of larger network architectures. Through rigorous theoretical analysis, we classify these motifs into a three‑layer hierarchical classification of their dynamical regimes and demonstrate that their hierarchical proportions critically shape collective neural dynamics. Furthermore, by embedding motif distributions into recurrent neural networks (RNNs), we show that these motifs can selectively enhance either network robustness or flexibility. Collectively, our findings provide a theoretical framework—supported by extensive experiments—for understanding how specific network motifs influence the computational properties of artificial intelligence systems via their underlying dynamics. This motif-driven approach offers significant potential for analyzing and modulating neural dynamics in ANNs.

Abstract:
Adaptive testing is widely adopted in AIdriven educational assessment systems (e.g., GRE), where the goal is to select an optimal subset of questions from a large question pool to accurately estimate an examinee's ability. A fundamental challenge is that: optimal question subsets are inherently personalized, and solving for them is NP-hard. Recently, it has been framed as a gradient matching problem: aligning gradients between selected subsets and the full question set across the entire ability parameter space. However, such global alignment on entire space is computationally expensive and difficult to scale. In this work, we propose GPM (Gradient Path Matching), a novel framework that instead aligns gradients along possible optimization paths toward the final estimate. By leveraging intermediate gradients as supervision, GPM learns an explicit and generalizable selection algorithm from large-scale data. We provide theoretical analysis on its convergence and scalability. Experiments on both real-world and synthetic datasets demonstrate that it achieves the same estimation accuracy using, on average, 20% fewer questions.

Abstract:
Biophysically detailed neural networks represent a promising frontier for braininspired AI, offering intrinsic spatio-temporal dynamics to enhance the expressivity and computational density of deep learning systems. However, general-purpose deep learning frameworks suffer from a fundamental mismatch between their dense parallel optimizations and the irregular, tree-structured complexity of biological mechanisms. In this work, we proposeHelioX, aGPU-nativeframework designed to unify high-performance simulation with scalable training. Unlike approaches that adapt biology to existing deep learning tools, HelioX adopts a "GPU-to-Biophysics" paradigm. We tailor the underlying GPU parallelism to biological structures by implementing custom-fused CUDA kernels for both the Dendritic Hierarchical Scheduling (DHS) algorithm and its gradient propagation. This design eliminates the runtime overhead of generic automatic differentiation and enables multi-stream concurrency for spike generation and equation assembly. Experimental results demonstrate that HelioX outperforms standard simulators (NEURON) by orders of magnitude and surpasses prior GPU-based solvers in both speed and scalability. We successfully train deep biophysical MLPs and organism-scale biophysical neural networks (e.g., the BAAIWorm C. elegans model) on a single consumer-grade GPU. HelioX establishes a new standard for computational efficiency, enabling the training of biophysically detailed models at scales previously unattainable.

Abstract:
While reasoning models have achieved remarkable success in complex reasoning tasks, their increasing power necessitates stringent safety measures. For safety alignment, the core challenge lies in the inherent tradeoff between safety and utility. However, prevailing alignment strategies typically construct CoT training data with explicit safety rules via context distillation. This approach inadvertently limits reasoning capabilities by creating a rigid association between rule memorization and refusal. To mitigate the safety-utility trade-off, we propose the Adaptive Safe Context Learning (ASCL) framework to improve the reasoning given proper context. ASCL formulates safety alignment as a multi-turn tool-use process, empowering the model to autonomously decide when to consult safety rules and how to generate the ongoing reasoning. Furthermore, to counteract the preference for rule consultation during RL, we introduce Inverse Frequency Policy Optimization (IFPO) to rebalance advantage estimates. By decoupling rule retrieval and subsequent reasoning, our method achieves higher overall performance compared to baselines.

Abstract:
Randomized experiments (or A/B tests) are widely used to evaluate interventions in dynamic systems such as recommendation platforms, marketplaces, and digital health. In these settings, interventions affect both current and future system states, so estimating the global average treatment effect (GATE) requires accounting for temporal dynamics, which is especially challenging in the presence of nonstationarity; existing approaches suffer from high bias, high variance, or both. In this paper, we address this challenge via the novel Truncated Policy Gradient (TPG) estimator, which replaces instantaneous outcomes with shorthorizon outcome trajectories. The estimator admits a policy-gradient interpretation: it is a truncation of the first-order approximation to the GATE, yielding provable reductions in bias and variance in nonstationary Markovian settings. We further establish a central limit theorem for the TPG estimator and develop a consistent variance estimator that remains valid under nonstationarity with single-trajectory data. We validate our theory with two real-world case studies. The results show that a well-calibrated TPG estimator attains low bias and variance in practical nonstationary settings, and highlight the value of the policy gradient approach in the design of effective estimators despite complex dynamics.

Abstract:
Visionlanguage models (VLMs) such as CLIP achieve strong zero-shot recognition but remain highly fragile under adversarial perturbations. Recent test-time adaptation defenses improve robustness by leveraging many augmented views, but this leads to impractical slowdown and a clear robustness-throughput trade-off. To address this challenge, we present Stability and Suitability-guided Test-time Prompt Tuning (SS-TPT), evaluating the quality of each augmented view via two complementary scores: (1) stability, measuring prediction invariance to weak augmentations, and (2) suitability, measuring feature-space density among views. These stability and suitability (SS) scores guide both adaptation and inference through an SS-guided consistency loss and an SS-weighted prediction, amplifying trustworthy views while suppressing corrupted ones. Extensive experiments demonstrate that SS-TPT significantly outperforms prior state-of-the-art methods, achieving superior robustness-throughput trade-offs across diverse datasets and varying numbers of views, thereby demonstrating both strong practicality and generality. Our SS-TPT code is provided and will be publicly available on GitHub.

Abstract:
We reconceptualize Personalized Multimodal Federated Learning (PMFL) by treating missing modalities as intrinsic structural identities that constrain each client to a distinct Riemannian submanifold, rather than deficiencies to be compensated. To resolve the tension between identity preservation and crossclient collaboration, we cast PMFL as an identity-aware potential game and seek a geometry-consistent equilibrium instead of a single full-modality global optimum. We propose GeoEvo, which realizes this equilibrium via Fisher--Riemannian evolutionary dynamics: Natural Evolution Strategies for curvature-adaptive local exploration and subspace-constrained particle swarm updates for symbiotic knowledge transfer. GeoEvo admits a Lyapunov potential and, with a monotone acceptance rule, guarantees potential dissipation; in non-convex regimes it achieves an O(1/\sqrtT) stationarity rate, implying convergence to first-order Nash equilibria, and empirically improves personalization and robustness across diverse modality-missing identities.

Abstract:
Lens flare removal is challenging due to the large spatial extent of flare artifacts and their entanglement with scene structures, while existing meth-ods heavily rely on large-scale paired data. We propose a semi-supervised flare removal frame-work that enables stable learning from unlabeled images by jointly addressing pseudo-label relia-bility and representation discrimination. We pro-pose an adaptive pseudo-label repository that pro-gressively refines pseudo supervision through no-reference quality assessment, momentum-based updates, and invalid label filtering, effectively mit-igating error accumulation. Moreover, we pro-pose a flare-aware contrastive loss that explic-itly treats flare-contaminated inputs as negatives and performs patch-level contrastive learning, en-couraging representations that are discriminative against flare patterns while remaining consistent with reliable pseudo targets. Extensive experi-ments on multiple flare benchmarks demonstrate that the proposed framework is model-agnostic and consistently improves performance and ro-bustness.

Abstract:
Existing decodingtime safety interventions are often reactive, relying on local signals to correct unsafe outputs after they emerge. Under adversarial prompts that drive generation into recurring unsafe response, such local signals provide weak guidance for stable repair. As a result, rollback and post-hoc rewriting often trade-off response quality with recurrent violations. To address these limitations, we propose RBCBF, a rollback-based decoding-time framework that jointly selects intervention steps and performs distribution-level corrective control. Our key innovation is a risk-aggregation formulation that views terminal violations as the accumulated build-up of risk along the prefix. By selecting rollback steps from these decisive prefixes, RBCBF moves rollback targeting beyond heuristic cues and turns it into a trajectory-level decision. RBCBF then applies invasive corrective control to the next-token distribution under multiple rule constraints. Across jailbreak-style evaluations, RBCBF outperforms prior rollback methods and decoding-time baselines, reducing harmful responses and substantially lowering violation recurrence.

Abstract:
Largescale pre-trained video generators offer powerful priors for Video Super-Resolution (VSR), yet adapting them remains computationally prohibitive. Full fine-tuning demands extensive resources, and ControlNet-style adapters lose their efficiency advantage under modern Diffusion Transformers (DiTs) since the absence of encoder-decoder hierarchy forces duplication of the entire backbone. We observe that low-quality videos, despite degradation, retain reliable structural information such as layout and motion, and that such structural content is largely domain-agnostic. This suggests that a frozen generator can perform VSR when the input structure is properly aligned to its embedding space. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer (DiT) with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that jointly processes static structural cues from the low-quality input and dynamic cues from intermediate denoising states through time-dependent cross-attention, enabling adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves comparable restoration quality with only 12.68% trainable parameters and 12 GPU-hours of training on a single A100.

Abstract:
While the robustness of vision models is often measured, their dependence on specific architectural design choices is rarely dissected. We investigate why certain vision architectures are inherently more robust to additive Gaussian noise and convert these empirical insights into simple, actionable design rules. Specifically, we performed extensive evaluations on 1,174 pretrained vision models, empirically identifying four consistent design patterns for improved robustness against Gaussian noise: larger stem kernels, smaller input resolutions, average pooling, and supervised vision transformers (ViTs) rather than CLIP ViTs, which yield up to 506 rank improvements and 21.6%p accuracy gains. We then develop a theoretical analysis that explains these findings, converting observed correlations into causal mechanisms. First, we prove that lowpass stem kernels attenuate noise with a gain that decreases quadratically with kernel size and that anti-aliased downsampling reduces noise energy roughly in proportion to the square of the downsampling factor. Second, we demonstrate that average pooling is unbiased and suppresses noise in proportion to the pooling window area, whereas max pooling incurs a positive bias that grows slowly with window size and yields a relatively higher mean-squared error and greater worst-case sensitivity. Third, we reveal and explain the vulnerability of CLIP ViTs via a pixel-space Lipschitz bound: The smaller normalization standard deviations used in CLIP preprocessing amplify worst-case sensitivity by up to 1.91 times relative to the Inception-style preprocessing common in supervised ViTs. Our results collectively disentangle robustness into interpretable modules, provide a theory that explains the observed trends, and build practical, plug-and-play guidelines for designing vision models more robust against Gaussian noise.

Abstract:
Medical Large VisionLanguage Models (Med-LVLMs) suffer from severe hallucinations, posing critical safety risks in clinical deployment. Editing LVLM activations has shown promise for mitigating hallucination with minimal cost. However, due to the requirements of medical domain expertise, existing methods struggle to capture imaging manifestations and diagnostic principles that are critical for clinical interpretation, thereby limiting their effectiveness. To address these limitations, we propose the first MEDical-oriented Activation Editing (MEDA) method by integrating Query-decisive Manifestation Steering (QMS) and Principle-driven Diagnosis Induction (PDI) to promote Med-LVLM's expertise elicitation. Specifically, QMS retrieves positive query-decisive imaging manifestations as trusted guidance for activation steering, while PDI constructs positive principle-embedded diagnostic prompts to induce expert-like clinical reasoning. Extensive experiments across six benchmarks and six LVLMs demonstrate that MEDA efficiently improves the response factuality with up to a 10.2% gain on IU-Xray, while exhibiting strong generalization and few-shot robustness.

Abstract:
Large language models (LLMs) are increasingly used in scientific discovery, system modeling, and decisionmaking, prompting interest in their ability to reason over complex structured data. Existing benchmarks primarily focus on static or local graph reasoning, overlooking the high-order structures in real-world systems whose global properties evolve across multiple scales. We introduce LLM4PH, a benchmark that evaluates multi-scale structural reasoning through the lens of persistent homology (PH), a topological framework for tracking structural evolution. LLM4PH decomposes the PH pipeline into interpretable reasoning tasks spanning synthetic and real-world graphs, revealing that most models struggle with reasoning over structural transitions and persistence. Beyond task-level evaluation, we perform cross-task ablations on prompt encoding and transfer, explore post-training effects, and construct a compositional PH pipeline to assess end-to-end performance. Our results provide the first in-depth view of how well LLMs bridge discrete graph structures with continuous topological abstraction, and offer insights into their potential for structure-aware scientific reasoning.

Abstract:
Size transfer scales Graph Convolutional Networks (GCNs) by applying models trained on sampled subgraphs to larger target graphs. However, existing theoretical guarantees are typically confined to dense graphs or restricted sparsity regimes, failing to cover the arbitrary sparsity of realworld networks. To bridge this gap, we introduce the Generalized Graphon Convolutional Network (GWCN) based on the generalized graphon theory. Unlike the classical graphon limit which vanishes in sparse settings, GWCN employs stretching to construct a non-trivial limit that preserves topological structure. We derive an explicit transfer error bound that decomposes into size-dependent and density-dependent components, providing a unified guarantee across arbitrary sparsity levels. Empirical results on real-world networks corroborate our findings, demonstrating that transfer error vanishes as graph size increases and edge density decreases.

Abstract:
Visionlanguage models (VLMs) excel in many multimodal tasks but remain prone to a subtle yet impactful failure mode: they tend to overestimate dominant visual-textual cues while underestimating sparse but decision-critical contextual variables. This issue, which we term Contextual Variable Overestimation (CVE), becomes particularly evident in real-world applications such as predicting advertisement image preferences across diverse geographic markets. For instance, when a VLM (e.g., Qwen2-VL) is asked to choose between two product images tailored for different countries (e.g., Korea vs. France), it often defaults to a consistent output (e.g., always selects “A”), ignoring ground-truth regional variations. This collapse occurs because pervasive high-volume signals, such as product attributes and dense image patches, overwhelm the few but critical tokens that encode market-specific context (e.g., country names). To address CVE, we first collect a new multimodal dataset of real advertising creatives and their click-through performance across multiple countries. We then introduce GeoReward, a reward model designed to predict ad image preferences across diverse geographic markets. GeoReward integrates three purpose-built mechanisms: (1) Market-Aware Retrieval Augmentation, which retrieves and injects region-aligned preference signals during training to sharpen localization awareness. (2) Context-Guided Visual Modulation, a lightweight adapter that dynamically adjusts visual representations using textual country embeddings, enabling fine-grained regional adaptation. (3) Selective Sensitivity Loss, an objective that applies heightened penalties for context-specific mispredictions, sharpening the model's focus on critical variables. Furthermore, we demonstrate how GeoReward can guide the fine-tuning of RL for a VLM to generate background designs for text-to-image models (e.g., SDXL), producing market-aware advertising creatives. Experiments validate that our framework mitigates CVE and outperforms existing baselines. This work not only diagnoses a systematic bias in VLMs toward dominant perceptual features but also delivers a targeted solution for applications where sparse contextual variables govern decision-making.

Abstract:
Unified Multimodal models (UMMs) built on a single architecture have shown impressive performance in both understanding and generation. We identify a fundamental challenge lies in inductive biases induced by distinct supervision signals: generation branch prefers highfidelity, fine-grained representations capable of reconstruction, while the understanding favours semantically discriminative embeddings that remain invariant to task-irrelevant factors. Consequently, optimizing these complementary but non-equivalent objectives within a monolithic backbone leads to mutual impairment instead of enhancement. In this paper, we first analyze the root cause of this interference in unified backbones and reveal a complementary structure in their internal representations. Motivated by the observation, we propose DIVA, a self-improved post-training framework that transforms the representation divergence into interior synergy. By explicitly factorizing the visual representation into shared and unique components based on two complementary information flow, DIVA enables both the understanding and generation branches to achieve beneficial transferring while preserving the integrity of unique information from cross-flow interference via mutual information estimation. Despite its generality, our method consistently achieves improvements across visual understanding (+7.82%) and generation (+8.46%). The official code is available at: https://anonymous.4open.science/r/DIVA-D225.

Abstract:
GoalConditioned Reinforcement Learning aims to design agents that can reach specified goals, notably from previously collected trajectories in the offline setting. In this context, graph-based approaches have been proposed to mitigate compounding value-estimation errors in long-horizon navigation tasks. However, existing approaches typically rely on dense keypoint coverage of the dataset support, resulting in computationally expensive planning. Moreover, they lack explicit mechanisms to adapt to topological changes (e.g., new obstacles), hindering deployment in live applications such as video game environments. To address these two shortcomings, we introduce Adaptive Quasimetric Mapping (AQM), an offline framework leveraging a “time-to-reach” quasimetric learned from the available data. Crucially, it builds a sparse cover of the dataset support, as a greedy approximation to a dominating set problem. At test-time, the resulting graph is carefully pruned by comparing the observed edge traversal time against a time-to-reach budget derived from the quasimetric, thus enabling zero-shot replanning. Empirically, we evaluate AQM on navigation tasks ranging from a classical to a video-game-like benchmark evaluating adaptation across tasks. We show that AQM achieves competitive performance while requiring up to 100× fewer keypoints than prior approaches, hence demonstrating the relevance of topological abstraction for goal-conditioned navigation.

Abstract:
AI agents increasingly execute procedural workflows as sequential action traces, which obscures latent concurrency and induces repeated stepby-step reasoning. We introduce BPOP, a Bayesian framework that infers a latent dependency partial order from noisy linearized traces. BPOP models traces as stochastic linear extensions of an underlying graph and performs efficient MCMC inference via a tractable frontier-softmax likelihood that avoids #P-hard marginalization over linear extensions. We evaluate on our open-sourced Cloud-IaC-6, a suite of cloud provisioning tasks with heterogeneous LLM-generated traces, and WFCommons scientific workflows. BPOP recovers dependency structure more accurately than trace-only and process-mining baselines, and the inferred graphs support a compiled executor that prunes irrelevant context, yielding substantial reductions in token usage and execution time.

Abstract:
Activation decomposition methods in language models are tightly coupled to geometric assumptions on how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, which overlooks concepts with nonlinear or multidimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions with theirlocalcovariance structure. MFA decomposes activations into two compositional geometric objects: the region's centroid in activation space, and the local variation from the centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, expressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.

Abstract:
Although there is a rich literature on adversarial attacks on large language models, their current practical impact is limited. Gradientbased attacks such as Greedy Coordinate Gradient (Zou et al., 2023) typically produce high-perplexity, incoherent suffixes that are easily detectable and thus easy to guard against, especially in combination with other defense-in-depth techniques (Bengio et al., 2024). On the other hand, attacks that aim to produce coherent prompts often alter the semantic intent of the original query. When the model complies with such altered query, it often produces a response that is not actually useful for the original query, thus incurring the so-called "jailbreak tax". In this work, we introduce a novel framework that can efficiently generate adversarial attacks against safety-aligned models while maintaining low perplexity and high semantic adherence to the adversary's original intent. The framework, Greedy Coordinate Diffusion (GCD), leverages the generative priors of discrete diffusion language models to guide the search for adversarial suffixes that achieve semantic coherence and adherence. Furthermore, unlike GCG, GCD does not require direct gradient access, allowing it to operate in a gray-box setting. We empirically demonstrate the power of GCD by showing it achieves state-of-the-art attack success rates against aligned models, and its adversarial prompts are not detected by semantic filters such as llama-guard-3.

Abstract:
Graph Convolutional Networks (GCNs) are defined based on aggregating the node information of adjacent nodes, that are usually treated as equally important as each other, limiting the representational power of existing GCNs for graph classification. To address this shortcoming, we propose a novel Global Interacted Graph Convolutional Network (GIGCN), that can leverage the solution vectors maintained during the iterative updates of the Dominant Set to adaptively characterize the global importance distribution of different nodes. Specifically, at each convolution layer, this distribution is adopted to adaptively modulate the importance weights of different node features before performing the local message passing. We show that this convolution strategy can effectively capture the highly correlated information between nonadjacent nodes through the Dominant Set algorithm, not only emphasizing the critical information at the graph level but also enhancing the discriminative power of graph representations. Furthermore, we optimize the spatial complexity of the framework, significantly reducing the memory overhead associated with the global interaction modeling. Experiments demonstrate the effectiveness of the proposed GI-GCN.

Abstract:
Highfidelity solubility prediction is fundamental to pharmaceutical development and environmental partitioning, where accurate modeling must couple molecular structure with thermodynamic behavior across diverse chemical environments. However, recent advancements have been dominated by deep learning architectures that often sacrifice physical interpretability for predictive power. We challenge this trend by showing that state-of-the-art performance does not require such non-transparent architectures. To address this, we introduce Dissolvr, a transparent framework for molecular solubility prediction. In addition, we perform a comprehensive literature review and a benchmarking study against various methods. We show that Dissolvr approaches the aleatoric limit of experimental uncertainty and achieves OOD generalization through structural invariance derived by mapping molecules to physically-grounded descriptors. Then, we present an LLM-assisted post-hoc explanation pipeline that bridges the gap between symbolic model artifacts and chemically grounded narratives. Finally, a comparative benchmark of a survey involving 22 expert chemists reveals that expert evaluators provide deep insights.

Abstract:
Efficiently designing highquality molecules targeting disease-relevant targets is a critical challenge. Most existing methods can capture pairwise amino acid relations, neglecting the higher-order relations among multiple amino acids. This paper proposes a target-specific molecule generation framework, namely TSMGen, to comprehensively capture the local and global structural information of the protein pocket by modeling higher-order spatial dependencies both at the atomic and the amino acid levels. Furthermore, we design a context-aware bidirectional fusion module to learn the more detailed structural information about the protein pocket. This module simultaneously attends to features from both the protein pocket and the molecule, fully leveraging the structural information from both to optimize the generation process of targeted molecules, thereby enhancing the quality of generated molecules. Experiments show that TSMGen outperforms state-of-the-art methods in terms of Vina Score, High Affinity, QED, SA and Diversity, and a case study on \beta-secretase enzyme further confirms its ability to generate molecules with stronger binding affinity.

Abstract:
In clinical practice, patients often present with multiple cooccurring diseases, yet most existing Multi-Label-Diagnosis (MLD) methods treat diagnosis as a rigid discriminative partitioning task, implicitly assuming that overlapping pathologies are separable. This assumption is problematic in medical images, where identical or highly similar visual observations may simultaneously support multiple disease labels, and disease concepts are inherently correlated rather than independent. Enforcing hard decision boundaries under such overlap suppresses shared evidence, biases feature representations, and ultimately undermines model reliability. To address this limitation, we propose Fuzzy Alignment with Comorbidity Topology FACT, a novel paradigm that reformulates MLD as a fuzzy alignment problem between atomic visual evidence and disease semantic anchors. FACT is characterized by three key features: (1) modeling visual polysemy through shared and reusable atomic visual evidence; (2) encoding disease correlation via semantic anchors structured by comorbidity topology; and (3) employing a metric-based fuzzy membership function for non-discriminative visual-semantic alignment. Extensive experiments on three public clinical benchmarks demonstrate that FACT consistently improves diagnostic performance while delivering clinically plausible predictions. The code will be available upon the acceptance of this paper.

Abstract:
Sparse Autoencoders (SAEs) are widely used to interpret large language models by decomposing activations into sparse, humanunderstandable features, but scaling to large dictionaries exposes fundamental challenges. Systematic studies reveal pervasive feature splitting that fragments coherent concepts into non-atomic latents and widespread feature absorption that creates arbitrary exceptions in general features, severely compromising latent reliability. These issues stem from inconsistent latent assignment across samples: without cross-sample constraints, per-sample optimization often allows a single underlying concept to be inconsistently distributed across multiple redundant or interfering latents. To address this, we introduce C^2R (\underlineCross-sample \underlineConsistency \underlineRegularization). C^2R explicitly encourages that each semantic feature is consistently represented by a unified latent across the batch by penalizing the co-activation of directionally similar latents. Comprehensive evaluation demonstrates that C^2R effectively mitigates both splitting and absorption while, crucially, preserving reconstruction fidelity, providing a principled solution that enhances latent interpretability without degrading model performance. Source code is available\footnote\urlhttps://anonymous.4open.science/r/Cross-sample-Consistency-Regularization-28F8.

Abstract:
Exact sparse methods based on \ell_0 constraints are increasingly used for interpretable and scalable time series modeling, where one aims to recover a small set of informative lags/factors while maintaining strong predictive performance and low computational cost. Despite their empirical success, finitesample and computational guarantees for such methods under temporal dependence remain limited. In this paper, we study \ell_0-constrained least squares for time series generated by \alpha-mixing stationary Gaussian processes with sparse coefficients. We establish high-probability restricted strong convexity/smoothness (RSC/RSS) for the empirical quadratic loss. Leveraging these conditions, we derive nonasymptotic statistical guarantees and computational complexities for a series of exact sparse methods, including iterative hard thresholding (IHT). We apply our theoretical results to Gaussian vector autoregressive (VAR) models and obtain new guarantees. Experiments on synthetic sparse VAR models and real-world mobility time series demonstrate that exact sparse methods recover lag structure more accurately and interpretably than some classical methods, while achieving comparable prediction error with substantially lower computational cost.

Abstract:
Contextual bandits serve as a foundational framework for sequential decisionmaking in domains like recommendation systems, IoT device management, and conversational AI, yet classical models overlook a critical practical constraint: acquiring extra observations incurs non-trivial costs, creating an unaddressed trade-off between information gain and expenditure. To fill this gap, we study contextual bandits with paid observations, where the learner actively chooses which actions to observe (at a specified cost) in each round, with the goal of minimizing total regret that combines learning losses and cumulative observation costs. We first design a near-optimal algorithm for adversarial environments, proving it achieves a regret rate significantly higher than that of cost-free contextual bandits—even for small observation costs—thus quantifying how paid observations reshape learning complexity. We then uncover a critical phase-transition phenomenon when incorporating free observation budgets: below a threshold budget, free observations only reduce total costs without changing the underlying regret rate, while above this threshold, they drastically improve learning efficiency, lowering the regret rate to match that of cost-free settings. To leverage this phenomenon, we develop a meta-controller that adaptively switches strategies based on the available budget, ensuring near-optimal performance across both low- and high-budget regimes. Furthermore, to address practical challenges like infinite policy spaces and computational inefficiency, we propose an oracle-efficient algorithm under a function approximation framework, which leverages an online regression oracle to maintain strong performance for stochastic losses. Our results also shed light on the scenarios about switching costs, budgeted constraints, model misspecification and the trade-offs involved in knapsack problems. Finally, we conduct numerical experiments to validate our theoretical findings and demonstrate practical efficiency. Key words: Contextual bandit, paid observation, function approximation, phase transition.

Abstract:
WiFibased human pose estimation offers privacy-preserving and occlusion-robust sensing, but current Transformer-based approaches suffer from quadratic complexity and lack explicit inductive biases for Channel State Information structure. We propose WiFi-Mamba, the first State Space Model architecture for WiFi-based 3D multi-person pose estimation. Our approach introduces three key contributions: (1) a Dual-Stream Selective State Space Model that processes amplitude and phase through parallel pathways with cross-stream state coupling to respect their distinct physical properties, (2) Selective State Attention for pose query decoding with SSM-derived sequential context, and (3) Persistent SSM Memory for temporal consistency across frames without recurrent memory explosion. Extensive experiments on the Person-in-WiFi 3D dataset, covering both single-person and multi-person scenarios, demonstrate 16-27% MPJPE reduction across varying numbers of persons while using only 4.4% of baseline parameters (2.14M vs. 48.2M), achieving superior efficiency-accuracy trade-offs particularly beneficial for edge deployment in privacy-sensitive continuous monitoring scenarios.

Abstract:
Despite rapid progress in VisionLanguage-Action (VLA) models for robotic control, instruction drift remains a persistent failure mode in long-horizon tasks. This paper reconceptualizes this phenomenon, positing that instruction drift is fundamentally a systematic sampling error: local greedy sampling is prone to collapsing into ''Negative Pivotal Windows''—irreversible local optima with high local probability that sever global success pathways. To address this, we propose Context-Aware Power Sampling (CAPS), a training-free inference-time computation framework. CAPS leverages power distributions to sharpen global trajectory probabilities, effectively activating the model's implicit world model for lookahead planning. Furthermore, we introduce a metacognitive control mechanism based on Signal-to-Noise Ratio (SNR). This mechanism triggers adaptive MCMC search solely when drift risk is detected, enabling a dynamic transition from ''intuitive fast thinking'' to ''rational slow search.'' Experiments on RoboTwin, Simpler-WindowX, and Libero-long benchmarks demonstrate that CAPS significantly outperforms SOTA baselines, such as OpenVLA and TACO, without parameter updates. These results confirm that adaptive inference-time computation is a potent pathway to enhancing embodied long-horizon robustness.

Abstract:
Compositional data analysis has gained increasing attention due to the widespread occurrence of simplexvalued data, including microbiome data. However, existing kernel or distance-based nonparametric two-sample tests are often designed for Euclidean data and rely on square-root or log-transformations, motivating the need for a unified framework for nonparametric two-sample testing applicable to both compositional and directional data. We propose a studentized spherical harmonic energy distance-based two-sample test over a fixed dimensional underlying space, incorporating U-statistics theory and recent developments of studentization in the context of compositional and directional data. We establish asymptotic normality of our studentized test statistics constructed via spherical harmonics theory, avoiding the need for permutation or bootstrap tests. Simulations demonstrate convergence to the limiting distribution, empirical size control, and improved power in certain scenarios. Our proposed framework paves a new direction for nonparametric testing in non-Euclidean data analysis.

Abstract:
Recent advancements in Multimodal Large Language Models (MLLMs) have achieved remarkable success in visionlanguage tasks, yet the quadratic computational complexity arising from the vast number of visual tokens creates significant memory and latency bottlenecks. While visual token reduction (VTR) strategies have been explored to mitigate this burden, existing methods overlook the positional and attentional consistency between the full and reduced sequences, resulting in a distorted representation. To this end, we propose a novel VTR framework that rectifies the positional and attentional distortions while maintaining efficiency. Specifically, we present a simple yet effective calibration method that recovers lost visual attention by modulating attention weights based on relative distances. We also introduce a distinctive anchor selection for token merging to mitigate information loss during feature averaging. Experimental results on multiple benchmarks demonstrate that our method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency. We will make our code publicly available online.

Abstract:
For denoising Spatially Resolved Transcriptomics (SRT) data, existing methods often construct spot and gene graphs to model interspot and inter-gene relationships, respectively. However, these methods often introduce spurious similarity biases among spots when constructing the spot graph and fail to capture nonlinear relationships among genes when constructing the gene graph. Moreover, ineffective graph fusion strategies further bottleneck denoising performance. To address these challenges, we propose SpaEF, which innovatively constructs spot and gene graphs with two Large Models (LMs) to inject prior knowledge for mitigating biases and capture nonlinear relationships, and then fuses them with the proposed element-wise graph autoencoder. As far as we know, SpaEF is the first SRT denoising method that utilizes pre-trained LMs to construct spot and gene graphs. Experiments on four real-world datasets with corresponding downstream tasks demonstrate that SpaEF not only outperforms SOTA denoising methods in accuracy but also exhibits strong robustness across tasks.

Abstract:
It is folklore that reusing training data more than once can improve the statistical efficiency of gradientbased learning. However, beyond linear regression, the theoretical advantage of full-batch gradient descent (GD, which always reuses all the data) over one-pass stochastic gradient descent (online SGD, which uses each data point only once) remains unclear. In this work, we consider learning a d-dimensional single-index model with a quadratic activation, for which it is known that one-pass SGD requires n\gtrsim d\log d samples to achieve weak recovery. We first show that this \log d factor in the sample complexity persists for full-batch spherical GD on the correlation loss; however, by simply truncating the activation, full-batch GD exhibits a favorable optimization landscape at n ～eq d samples, thereby outperforming one-pass SGD (with the same activation) in statistical efficiency. We complement this result with a trajectory analysis of full-batch GD on the squared loss from small initialization, showing that n \gtrsim d samples and T \gtrsim\log d gradient steps suffice to achieve strong (exact) recovery.

Abstract:
Computeruse agents (CUAs) have made tremendous progress in the past year, yet they still frequently produce misaligned actions that deviate from the user's original intent. Such misaligned actions may arise from external attacks (e.g., indirect prompt injection) or from internal limitations (e.g., erroneous reasoning). They not only expose CUAs to safety risks, but also degrade task efficiency and reliability. This work makes the first effort to define and study misaligned action detection in CUAs, with comprehensive coverage of both externally induced and internally arising misaligned actions. We further identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level misalignment labels. Moreover, we propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback. DeAction outperforms all existing baselines across offline and online evaluations with moderate latency overhead: (1) On MisActBench, it outperforms baselines by over 12% absolute in F1 score; (2) In online evaluation, it reduces attack success rate by over 75% under adversarial settings while preserving or even improving task success rate in benign environments.

Abstract:
3D Scene Graph Generation (3DSGG) aims to create a structured representation of 3D environment by identifying objects as nodes and their relations as edges. Existing 3DSGG methods based on RGBD sequences typically put much focus on the adaption of neural networks to robust node and edge feature extraction in complex 3D scenes, yet ignoring the inherent intra-class diversity within each class and inter-class similarity between different categories associated with nodes and edges. In this work, we develop GMPSSG, a novel Gaussian Mixture-distributed Prototype mining framework for 3DSGG. Specifically, we model different categories with independent Gaussian Mixture-distributed Prototype to effectively mitigate inter-class similarity, while employing multiple Gaussian components within each prototype to capture intra-class diversity. Moreover, Prototype-anchored Representation Learning is introduced to construct a well-structured and mutually independent category space; Topology-aware Prototype Interaction is devised to capture implicit co-occurrence priors within the scene, and leverage them to calibrate prototype distributions, thereby ensuring the plausibility of node-edge matching. Experiments on 3DSSG dataset demonstrate GMPSSG outperforms various top-leading methods. Source code will be released.

Abstract:
Scorebased diffusion models have emerged as prominent deep generative models; however, their application to tabular data remains challenging because their backbones assume fully specified inputs, whereas real-world tabular data often contain missing values. We proposeAugMask, a plug-and-play training framework that adapts missing-unaware backbones to incomplete data via stochastic regularization. AugMask 1) completes inputs viaconditional stochastic augmentationusing lightweight auxiliary models and 2) masks the loss, using augmented missing entries for conditioning while restricting supervision to observed coordinates. We connect AugMask to a Rao-Blackwellized objective and show that marginalizing missing entries yields a variance-weighted sensitivity penalty, promoting invariance of observed-coordinate reconstruction with respect to uncertain missing entries. Across diverse datasets and missingness regimes, AugMask enables standard diffusion-based tabular generators to match or outperform specialized missing-aware baselines in both sample fidelity and downstream utility. The code will be released.

Abstract:
Continual learning of VisionLanguage Model (VLM) aspires to empower foundation models with new expertise without compromising their universal zero-shot capabilities. However, this pursuit faces a critical ''dual-forgetting'' challenge: the catastrophic forgetting of newly acquired classes (Incremental Knowledge Forgetting, IKF) and the insidious erosion of foundational zero-shot capabilities (Pre-trained Knowledge Forgetting, PKF). Existing evaluations often ignore PKF or assess it via confounded protocols where positive transfer on semantically similar domains creates an illusion of retention, masking severe foundational degradation. To address this, we propose the Dual-Forgetting-Aware Class-Incremental Learning (DFA-CIL) framework and the Similarity-Calibrated Retention (SCR) metric. Unlike standard averaging, SCR utilizes the frozen pre-trained feature space to inversely weight performance based on semantic similarity, effectively mitigating the confounding gains to stress-test foundational stability. Building on this, we propose DFA-MoE, a functionally heterogeneous Parameter-Efficient Fine-Tuning (PEFT) method. DFA-MoE architecturally decouples optimization objectives by assigning a momentum-enhanced contrastive expert for feature alignment, and separate plasticity experts that combine classification with auxiliary contrastive learning to adapt to new tasks while retaining historical knowledge. Extensive experiments demonstrate that our framework uncovers the hidden fragility of existing methods and achieves a state-of-the-art balance in preserving both incremental and pre-trained knowledge.

Abstract:
The maximal coding rate reduction (\textMCR^2) objective is proposed for learning lowdimensional subspace representations and for principled deep model design, where layer structures are derived by unrolling its optimization steps. However, existing methods motivated by this objective do not fully adhere to design principles implied by the \textMCR^2 gradient, which weakens the principled and interpretable foundations of the resulting models. In this work, we introduce PACEAttention, a novel principled attention mechanism inspired by the geometric insight of \textMCR^2. From the geometric perspective, gradient-based updates of \textMCR^2 move features along directions shaped by the underlying low-dimensional feature structure. Our method captures this structure by leveraging randomization to guide feature updates. This principled construction enables the resulting PACENet to exhibit enhanced interpretability, with different heads attending to distinct image regions and capturing fine-grained structures under simple supervised training. Besides, two learnable weights in PACEAttention enable explicit regulation of the feature update dynamics, reflecting the relative contributions of different components across layers. Experiments demonstrate that PACEAttention achieves superior performance and more stable scalability than previous principled modules while remaining low complexity.

Abstract:
Estimating heterogeneous treatment effects is central to datadriven decision-making, yet industrial applications often face a fundamental tension between limited randomized controlled trial (RCT) budgets and abundant but biased observational data collected under historical targeting policies. Although observational logs offer the advantage of scale, they inherently suffer from severe policy-induced imbalance and overlap violations, rendering standalone estimation unreliable. We propose a budgeted active experimentation framework that iteratively enhances model training for causal effect estimation via active sampling. By leveraging observational priors, we develop an acquisition function targeting uplift estimation uncertainty, overlap deficits, and domain discrepancy to select the most informative units for randomized experiments. We establish finite-sample deviation bounds, asymptotic normality via martingale Central Limit Theorems (CLTs), and minimax lower bounds to prove information-theoretic optimality. Extensive experiments on industrial datasets demonstrate that our approach significantly outperforms standard randomized baselines in cost-constrained settings.

Abstract:
Current linear StateSpace Models for 3D point clouds typically rely on 1D serialization (e.g., Hilbert curves) for global modeling. Such rigid ordering disrupts spatial continuity in dense scenes, introducing what we term Serialization Bias. We propose AnIsoNet, a framework that decouples anisotropic geometry from isotropic semantics via two dedicated modules: Local Anisotropy Geometric Modeling (LAGM) and Global Isotropy Semantic Aggregation (GISA). LAGM employs ellipsoidal encoding to capture local directionality without global order. GISA adapts to geometric characteristics via two modes: content-based accumulation (Identity Mode) for dense scenes and Morton serialization for sparse objects. This eliminates redundant multi-view scanning while maintaining O(N) complexity. Experiments show that avoiding artificial serialization in dense scenes achieves 82.62 % mIoU on S3DIS (surpassing PCM by 3.0 %), while Morton serialization for sparse objects achieves 94.21 % OA on ScanObjectNN (+1.6 %). On ScanNetV2, we reach 78.52 % mIoU, surpassing PTv3 (77.5 %) without pre-training. We achieve these results with only 12.2 M parameters (26.4 % of PTv3's）

Abstract:
Influence maximization is an important research area in social network analysis, where the goal is to select a small set of seed nodes so as to maximize the expected spread of influence under a stochastic diffusion process. Classical approximation algorithms for this problem rely on full knowledge of the underlying influence probabilities and operate in an offline manner. In many realworld settings, however, these probabilities are unknown and must be learned from data, raising the question: \emphcan one still obtain strong performance guarantees while simultaneously learning the diffusion model parameters through repeated interactions? In this paper, we study the problem of \emphonline influence maximization under the independent cascade model, where influence probabilities are unknown and feedback is limited to \emphnode-level activation outcomes. Prior work relies on a \emphpair oracle which needs to perform a joint optimization over seed sets and feasible parameters. This oracle is difficult to implement in practice and it was open whether one can achieve sublinear regret using only a \emphstandard offline oracle. We resolve this question by designing an online learning algorithm that achieves \widetildeO(\sqrtT) regret using only a \emphstandard offline oracle. Finally, we validate our theoretical results via experiments on real and synthetic data.

Abstract:
Large language models (LLMs) are increasingly deployed in highstakes settings, yet they remain vulnerable to sequential jailbreaks that exploit multi-turn interaction to circumvent safety mechanisms. Current safety evaluations are largely outcome-based, offering little insight into the latent decision processes that lead to unsafe compliance. We propose an interpretable cognitive modeling framework that couples a controlled elicitation paradigm, the Contextual Iowa Gambling Task (C-IGT), with a Generalized Rescorla--Wagner (GRW) architecture to decompose behavior into measurable mechanisms. Across a diverse set of mainstream LLMs, we find that sequential vulnerability is not explained by scale alone but emerges from interactions among cognitive factors, including optimism-biased learning, perceptual reward amplification, and choice inertia. Moreover, counterfactual feedback and psychologically framed rewards (e.g., regret, authority, threat) substantially accelerate the transition from refusal to compliance. These results yield principled cognitive profiles of LLM ``irrationality'' and provide insights for interdisciplinary research on LLM agents at the intersection of machine learning and human behavioral science.

Abstract:
Federated Continual Learning (FCL) enables the continuous acquisition of knowledge from streaming tasks, but inherently struggles with the temporal dynamics of client data distributions. These dynamics naturally induce asynchronous concept drift, where distribution shifts occur independently across clients at unsynchronized times and with varying magnitudes. Such asynchrony generates conflicting updates that destabilize global convergence and exacerbate catastrophic forgetting. However, existing FCL research focuses on static or incremental settings, typically treating all incoming updates uniformly, which obscures concept drift under divergent distributions and fails to adapt to the evolution of learned concepts. To address these limitations, we propose RCFCL, a retrospective calibration framework for FCL that can effectively distinguish asynchronous concept drift and adjust the learning strategy adaptively. Specifically, RC-FCL leverages a conditional generative model to synthesize class-conditional reference distributions of previously learned concepts for local drift detection. It calibrates local adaptation using a weighting mechanism driven by the local discriminator to prioritize informative samples, and executes a global aggregation strategy based on drift magnitude. Our experimental results demonstrate that RC-FCL achieves competitive performance against state-of-the-art methods.

Abstract:
Sequence models typically operate at a fixed temporal or spatial scale and struggle to generalize to substantially longer horizons or higher resolutions without retraining. Existing hierarchical architectures expand receptive fields but rely on scalespecific parameters and lack mechanisms to enforce consistent dynamics across scales. We propose Dynamic Fractal Mamba (DF-Mamba), a recursive state-space model that applies a single shared operator across multiple scales. By sharing parameters across recursion depths and exponentially scaling the effective time step, DF-Mamba achieves an exponentially expanding receptive field while preserving linear computational complexity. A learned content-aware coarse-graining module aggregates representations across scales. Auxiliary reconstruction and cross-scale consistency objectives stabilize recursive training. We evaluate DF-Mamba on long-range time-series forecasting, spatial transcriptomics, and computational pathology. Across all tasks, DF-Mamba consistently outperforms Transformers and flat Mamba baselines while using fewer parameters and maintaining linear-time scalability. Importantly, models trained on short sequences or low-resolution inputs generalize in a zero-shot manner to substantially larger temporal and spatial scales unseen during training. These results demonstrate that recursive parameter sharing provides an effective inductive bias for learning scale-consistent and efficient sequence representations.

Abstract:
Recommender systems play a crucial role in internet economies by connecting users with relevant products. However, designing effective recommender systems faces the key challenges: the explorationexploitation tradeoff in securing incentive to explore new products against user's self-interested preferences. While prior work addresses Bayesian Incentive Compatibility (BIC) in fixed-design linear bandits \citesellke2023price, we tackle the challenge of stochastic user covariates sampled online. Unlike standard black-box reductions \citepmansour2020bayesian, our two-stage framework exploits the linear reward structure to achieve sublinear regret while satisfying incentive constraints. To address it, we propose a two-stage algorithm that integrates incentivized exploration with any efficient plug-in offline learning algorithms. In the first stage, it explores products while maintaining incentive compatibility to gather optimal samples. The second stage employs inverse proportional gap sampling strategy integrated with any efficient learning methods to secure sublinear regret. Theoretically, we prove that algorithm \recon achieves \tildeO(\sqrtKdT) regret and simultaneously satisfies incentive constraints, and discovers the tradeoff between incentive budget and regret, validating in experiments. We demonstrate RCB's strong incentive gain, sublinear regret, and robustness through a real application on personalized warfarin dosing and simulations. To the best of our knowledge, this is the first analysis for BIC in online preference learning settings.

Abstract:
Estimating free energy differences quantifies thermodynamic preferences in molecular interactions, which is central to chemistry and drug discovery. Despite fruitful progress, existing methods still face key limitations: classical computational approaches remain prohibitively expensive due to their reliance on extensive molecular dynamics simulations, while deep learningbased methods are constrained by either less-expressive generative models or input dimensions tied to a specific system, resulting in negligible generalization. To address these challenges, we propose CARD, a generative framework that employs a novel radix-based decomposition to bijectively convert 3D coordinates into mixed discrete-continuous sequences, enabling coarse-to-fine autoregressive modeling with enhanced expressiveness. Notably, the model corresponds to a distribution with zero free energy, serving as a proposal for absolute free energy computation of arbitrary systems without relying on alchemical pathways. Experiments across diverse tasks demonstrate that CARD matches the accuracy of classical computational methods on unseen systems with diverse topologies, while achieving an approximately 40-fold speedup in inference.

Abstract:
In distributed machine learning, efficiently training across multiple agents with heterogeneous data distributions remains a central challenge. We address the problem of stochastic, strongly convex distributed optimization by applying accelerated gradient ascent to the dual variables and multistep stochastic gradient descent (SGD) to the primal variables in the Lagrangian formulation. This approach naturally enables local computation, as the inner SGD loops require no inter-agent communication. We prove that the method converges for any number of local updates, attaining the optimal communication complexity when local computation is sufficient. Our analysis builds on an inexact accelerated gradient framework, where the partial gradient of the Lagrangian with respect to the dual variables is treated as an inexact gradient of the dual function. A notable byproduct of this framework is an algorithm that achieves optimal reproducibility guarantees under biased gradient estimates.

Abstract:
Generating stepby-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our pipeline How2Mine extracts and rewrites 351K procedures from 980K web pages across 14 topics, and can scale to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. We also introduce How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier judge into an open 8B model achieving 80.5% agreement with human annotators. How2Bench reveals clear scaling trends across model size and training stages, providing signal early in pretraining. Finally, RL using How2Score as a reward improves performance on How2Bench by >10 points across three base models without systematic regressions on standard benchmarks, with gains not primarily explained by source-document memorization or superficial format compliance. We release all code and data upon acceptance.

Abstract:
Watermarking has emerged as a critical solution for the detection and provenance tracing of content generated by large language models. However, existing methods still suffer from significant limitations, including difficulties in achieving personalized attribution, substantial degradation of generation quality, and weak robustness against attacks. To address these challenges, we propose IPMark, the first IPinspired hierarchical personalized watermarking framework. Specifically, to enable personalization and efficient detection, IPMark employs a hierarchical addressing framework to structurally organize model and user identities. Subsequently, addressing the inherent semantic distortion caused by token-level watermarking, we design a semantic-syntactic dual-stream embedding strategy. Centered on sentence-level candidate selection and reinforced by dual signals from syntactic and semantic features, this approach optimizes the injection process, thereby significantly enhancing generation quality while ensuring strong robustness. Experimental results demonstrate that IPMark achieves the lowest perplexity among baselines, ensuring superior generation quality while maintaining strong robustness and significantly reducing detection latency through hierarchical retrieval.

Abstract:
The rapid advancement of Large Language Models (LLMs) is revolutionizing AI for Game by enabling openended and fluid interactive storytelling. However, existing research has largely overlooked the critical challenge of maintaining logical consistency and narrative integrity against unconstrained user interventions. To address this, we formulate this challenge as \emphNarrative Commitment Preservation (NCP), and take interactive narrative as our testbed. We introduce NCP-Bench, a benchmark of 100 narrative environments derived from movie synopses. Each environment includes a structured narrative specification (trajectory, commitments, and initial facts) that we can reliably check throughout the interaction between player and storyteller. Experiments across state-of-the-art LLMs reveal that high linguistic quality does not guarantee commitment preservation, even strong models frequently generate logically conflicting content under adversarial interventions, with the best-performing model (GPT-5.2) achieving only 40% survival rate after 20 turns and fact conflicts occurring in 40%--68% of all interactions.

Abstract:
Regression trees are among the most interpretable yet expressive model classes in machine learning. Historically, greedy induction has been the dominant approach for constructing wellperforming regression trees. While optimal methods based on dynamic programming and branch-and-bound exist, they are computationally prohibitive for general linear regression trees, despite often achieving substantially better performance than greedy approaches. Recent work has shown that specialized lookahead strategies can dramatically improve runtime while maintaining near-optimal performance, primarily in classification settings. In this work, we develop a novel algorithm for near-optimal, sparse, piecewise linear regression trees that combines a lookahead-style search strategy with efficient rank-one Cholesky updates of the Gram matrix. We demonstrate, both theoretically and empirically, that our method achieves a favorable trade-off between computational efficiency, predictive accuracy, and sparsity, and scales significantly better than the current state of the art.

Abstract:
Electronic Health Records (EHRs) possess unique characteristics distinct from natural language, yet existing EHR foundation models often rely on suboptimal NLPbased approaches. We propose a pretraining method tailored to EHRs' distinct features. First, we introduce Pathology-Focused Binning, a density-based quantization strategy that prioritizes clinically significant numerical ranges over usual values. Second, to jointly capture both the exact timing of clinical events and the relative intervals between them, we propose Dual-Calendar Rotary Positional Embedding (RoPE), which encodes absolute and relative temporal signals. Third, we introduce the Time-Conditioned Foreseeing (TCF) objective, aligning with clinical treatment planning to forecast events across multiple temporal horizons by explicitly modeling event timing. Our approach establishes a temporal generative EHR model that outperforms existing foundation models on nine diverse downstream tasks—achieving up to a 48% improvement in AUPRC—and enables the generation of realistic, temporally consistent patient trajectories.

Abstract:
Illconditioned kernel matrices and loss landscapes can make first-order methods for training neural networks converge slowly. We establish non-asymptotic convergence bounds for the Gauss–Newton method in both under- and overparameterized regimes, showing it avoids these conditioning bottlenecks. In the underparameterized setting, Gauss–Newton gradient flow in parameter space induces a Riemannian gradient flow on a low-dimensional submanifold of function space. Using tools from Riemannian optimization, we show that, under an appropriate output scaling, the loss satisfies geodesic Polyak–Lojasiewicz and Lipschitz-smoothness conditions, implying geometric convergence to the optimal in-class predictor at an explicit rate independent of Gram-matrix conditioning. In the overparameterized setting, we identify adaptive, curvature-aware regularization schedules and prove fast geometric convergence to a global optimum for both Gauss–Newton gradient flow and discrete-time Gauss–Newton iterates, with rates independent of the minimum eigenvalue of the neural tangent kernel and, locally, independent of the strong-convexity modulus. Overall, Gauss–Newton can be provably faster in ill-conditioned regimes where first-order methods slow down.

Abstract:
Scheduling problems arise from repeatedly selecting one item from a set of candidates based on their states. These problems often reduce to assigning priority scores and choosing the highestranked item. In this work, we propose a factorized scheduling principle (FSP) framework to learn interpretable and transferable scheduling rules. The FSP framework represents system states as condition distributions and decomposes a global scheduling principle into additive univariate and pairwise components with identifiability constraints. The scheduling principle enables the framework to maintain a simple priority-based structure during deployment. This principle is learned by using a policy-based objective combined with a temporal-difference signal defined on the condition distribution. Experiments on synthetic and realistic scheduling tasks demonstrate the FSP framework's strong performance, interpretability, and zero-shot generalization across different system scales.

Abstract:
Although causal discovery from multivariate time series is widely used, it remains challenging under noise. Convergent cross mapping (CCM) infers causality by reconstructing shadow manifolds via timedelay embedding (TDE) and evaluating cross-map skill between manifolds. Despite Takens’ theorem guarantees in ideal settings, TDE effectively attempts to recover system state from a single noisy view, often yielding geometrically degraded manifolds and unreliable distance-based neighborhoods, which in turn weakens causal identification. We propose TopoDistill, a topology-informed knowledge distillation framework that improves univariate shadow-manifold reconstruction by aligning local neighborhood structure to a multivariate system representation. A global embedder trained on multivariate observations captures a global attractor representation, while a delay embedder is distilled to produce embeddings whose neighborhood distributions match the global topology. This cross-view alignment yields smoother and more reliable neighborhoods, improving cross mapping under noise while maintaining specificity against spurious correlations. Theoretical analysis and experimental results demonstrate that our method enables effective causal discovery.

Abstract:
Precise property prediction of organic materials is pivotal for nextgeneration electronic and energy devices. In density functional theory (DFT), the electron density (ED) serves as the fundamental determinant of material properties. Yet, establishing it as an input modality for material property prediction has been impeded by two practical barriers: scarce large-scale ED data and the enormous computational complexity of ED representation. To bridge these gaps, we introduce VisionED, an efficient physics-informed model pre-trained on electron-density images. We curate a dataset of 2 million molecules and represent ED as multi-shot images that efficiently encode both geometric and electronic structure. VisionED is then pre-trained on 12 million multi-shot ED images via cross-scale, physics-informed pretext tasks. Empirical evaluations on photovoltaic and organic chromophore datasets show that VisionED outperforms state-of-the-art baselines by up to 27.0%, exhibiting superior robustness under distribution shifts and data scarcity. Notably, the model generalizes to unseen device-scale applications, successfully recovering experimental trends and mixing-ratio effects in ternary blends with an average accuracy of 93%. Moreover, relative to the previous ED point cloud, the ED image improves performance by 26.2% with 2.6× fewer memory and 4.6× lower time. The code and data are available at https://anonymous.4open.science/r/VisionED-AC1B.

Abstract:
Dataset pruning (DP) and dataset distillation (DD) fundamentally differ in their outputs: DP selects original image subsets, while DD generates synthetic images. Recently, DD's increasing reliance on original images suggests a convergence of the two directions. To investigate this convergence trend, we propose a unified dataset compression (DC) benchmark. This benchmark reveals an interesting tradeoff for soft-label-DD: while soft labels provide valuable information, they can make the distillation process less essential, as distilled images may not always outperform random subsets. In addition, the benchmark reveals that in current stages, dataset pruning outperforms dataset distillation at small dataset sizes. Given these observations, we explore hard-label-DC as a complementary approach that emphasizes image quality while offering substantial storage efficiency. Our PCA (Prune, Combine, and Augment) is the first framework that does not rely on soft labels but instead focuses on image quality. (1) "P'' means selecting easy samples based on dataset pruning metrics, (2) "C'' indicates combining these samples effectively, and (3) "A'' is to apply constrained image augmentation during training.

Abstract:
Large language models (LLMs) can already draft plausible Triton kernels, yet most existing evaluations still focus on singleshot generation and underplay tool use and feedback. We introduceTritonGym, a benchmark and orchestration framework for evaluating agentic workflows in GPU code generation.TritonGymstandardizes access to a set of code generation tools via function-calls, separating intrinsic model capability from workflow design and enabling fair, apples-to-apples comparison. The benchmark spans a maintained operator set, community samples, out-of-distribution tasks, and DSL extensions, ensuring both generality and extensibility. By providing a common orchestration and evaluation framework,TritonGymdemocratizes the development of GPU coding agents, supports practical adoption of agent-generated kernels, and facilitates progress on advanced agentic systems.

Abstract:
Pruning is a common approach to reduce the memory footprint and inference cost of large vision and language models. As these architectures continue to scale, oneshot pruning methods - i.e. approaches that prune the network without any retraining - have become increasingly attractive. Many popular one-shot pruning methods (e.g., WoodFisher, CAP, SparseGPT, and ALPS) typically optimize a quadratic objective under sparsity constraints. However, in practice, this objective is affected by multiple sources of uncertainty, including noise in the calibration data and variability introduced by algorithmic updates. To address these issues, we introduce RobOP, a robust optimization framework that explicitly accounts for such uncertainties. RobOP is modular and flexible, and can be applied with any existing pruning method through simple modifications motivated by our theoretical framework. We demonstrate that by taking into account uncertainty, RobOP offers improvements over prior pruning approaches. Our framework applies tractably across a range of stylized uncertainty sets, enabling robust one-shot pruning at scale.

Abstract:
Quantum computation offers the potential for a significant constantfactor speedup for the Ordered Search Problem (OSP). A classical construction is the k-query quantum ordered search algorithm, which can exactly search an N-element ordered list and achieves a query complexity improvement of a factor of \frack\log_2 N. For larger k, stronger constant-factor improvements could be obtained by finding the largest admissible list size N^\star, a task that can be formulated as a structured semidefinite program (SDP). However, solving this SDP becomes computationally intractable beyond k=6, as existing CPU and GPU solvers rely on explicit construction of prohibitively large constraint matrices. In this paper, we introduce a matrix-free GPU SDP framework that evaluates the highly structured constraints in OSP on-the-fly using custom CUDA kernels, reducing memory complexity from quadratic to linear and shifting the bottleneck from memory to computation. Using this approach, we tightly bracket the optimal list size for k=6 as 90,000 \le N^\star < 94,000, improving the best known upper bound on the query coefficient from 0.390 to 0.365. We further certify these results by constructing rigorous dual infeasibility certificates via matrix-free minimum-eigenvalue estimation.

Abstract:
Modern vision backbones treat pointwise activations (e.g., ReLU, GELU) and exponential softmax as essential sources of nonlinearity, but we demonstrate they are not required. We design activationfree polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules integrate seamlessly into existing architectures: instantiated within MetaFormer, a modular framework for vision backbones, our PolyNeXt models match or exceed activation-based counterparts across model scales. We also substantially outperform prior polynomial networks at reduced computational cost, showing that polynomial variants of standard modules beat complex custom architectures. Our code is available at https://anonymous.4open.science/r/PolyNeXt-E424

Abstract:
We study optimal experimental design for multinomial logit (MNL) bandits, where an agent repeatedly selects a subset of K items from a ground set of size N and observes singlechoice feedback. Unlike linear or generalized linear bandits, MNL bandits have a combinatorial action space, which makes classical optimal design approaches and naive optimization over all subsets computationally intractable. We propose a computationally efficient optimal design framework for MNL models that achieves both statistical efficiency and scalability through two complementary approaches: (i) an exact or certified-approximate reformulation of the design oracle as a 0-1 mixed-integer linear program (MILP) with solver-certified early stopping, and (ii) a fully polynomial-time lifted design that replaces the nonlinear objective with a tractable surrogate. Using the Kiefer-Wolfowitz equivalence theorem, we establish near G-optimality guarantees and characterize the induced statistical-computational trade-offs. As an application, we develop a best assortment identification algorithm for MNL bandits with linear utilities and non-uniform revenues, and prove an instance-dependent sample complexity of \tilde\mathcalO\big(\fracd \log N\Delta^2\big), where d is the feature dimension, N is the number of arms, and \Delta is the minimum revenue gap.

Abstract:
Zerothorder optimization is widely used when gradients are unavailable, but the standard two-point estimator suffers from \mathcalO(r^2) truncation bias at smoothing radius r. Existing bias-reduction schemes typically increase the leading directional variance under a fixed number of function evaluations per gradient estimate, while variance-reduction schemes generally do not improve the bias order. We propose Romberg-ZOGE, which forms a Romberg-extrapolated linear combination of two-point differences evaluated at radii \r/2^k\_k=0^R while reusing the same perturbation direction across all radii. With appropriately chosen weights, Romberg-ZOGE cancels the first R even-order truncation terms and achieves \mathcalO(r^2R+2) bias under (2R+2)-order smoothness, while preserving the leading directional variance constant of the two-point estimator up to higher-order residual terms. We further characterize the stochastic-oracle setting by deriving an explicit noise-amplification factor and corresponding bias and variance bounds. Experiments on synthetic benchmarks, simulator-based wireless optimization, and black-box prompt tuning of OPT-1.3B demonstrate faster and more stable zeroth-order SGD when the number of function evaluations per gradient estimate is fixed.

Abstract:
Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. However, this overlooks the heterogeneity in predictive fidelity across attention heads. While certain heads prioritize the instantaneous contribution of tokens, others are dedicated to capturinglonghorizon utility. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. Based on this insight, we propose LU-KV, a novel framework that optimizes head-level budget allocation through a convex-hull relaxation and a marginal-utility-based greedy solver to achieve near-optimal precision. Furthermore, we implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV.

Abstract:
This paper investigates theoretical and methodological foundations for stochastic optimal control (SOC) in discrete time. We start formulating the control problem in a general dynamic programming framework, introducing the mathematical structure needed for a detailed convergence analysis. The associate value function is estimated through a sequence of approximations combining nonparametric regression methods and Monte Carlo subsampling. The regression step is performed within reproducing kernel Hilbert spaces (RKHSs), exploiting the classical KRR algorithm, while Monte Carlo sampling methods are introduced to estimate the continuation value. To assess the accuracy of our value function estimator, we propose a natural error decomposition and rigorously control the resulting error terms at each time step. We then analyze how this error propagates backward in timefrom maturity to the initial stage-a relatively underexplored aspect of the SOC literature. Finally, we illustrate how our analysis naturally applies to a key financial application: the pricing of American options.

Abstract:
Gradient clipping is widely used to stabilize stochastic gradient methods and is often theoretically motivated by heavytailed gradient noise, where even second moments may be infinite, seemingly contradicting empirical risk minimization where all moments are finite for a fixed dataset. We resolve this paradox by explicitly separating data sampling from optimization randomness: although moments are finite conditional on the dataset, heavy-tailed data induce dataset-dependent noise whose second moment typically grows with the dataset size N. In particular, when \|\nabla f(x_\star,\xi)\| has tail index \alpha \in (1,2), the quantity \frac1N\sum_i=1^N\|\nabla f(x_\star,\xi_i)\|^2 scales as N^\frac2\alpha-1, leading to deteriorating convergence guarantees for standard SGD as N increases. In contrast, we show that stochastic gradient descent with clipping avoids this growth and admits finite-sum convergence guarantees under heavy-tailed data for broad step-size and clipping schedules. We further derive generalization bounds for strongly convex smooth objectives and show that the tail behavior of gradients at the population minimizer is the key quantity linking optimization and generalization under heavy-tailed data.

Abstract:
Genome language models (gLM) have the potential to further understanding of regulatory genomics without requiring labeled data. Most gLMs are pretrained using sequence reconstruction tasks inspired by natural language processing, but recent studies have shown that these gLMs often fail to capture biological signal. To overcome this, we introduce pretraining tasks that predict the rate of evolution. These tasks are designed so that they can be composed with sequence reconstruction, enabling a controlled comparison of predicting sequence only, evolutionary rate only, or both. To address gaps in existing evaluations, we developed a suite of biologically grounded benchmarks. Across these tasks, and for established variant effect prediction benchmarks, models pretrained on both sequence and evolutionary rate outperform those trained on sequence alone, and training on evolutionary rate can make the even the relatively small models in our work competitive with much larger existing gLMs for some tasks. These results establish evolution as a key training target for genomescale models.

Abstract:
Infrared object detection is crucial for perception in autonomous driving and surveillance but remains vulnerable to physical adversarial attacks. Unlike in the RGB domain, where attacks rely on color texture, infrared attacks must manipulate thermal signatures, making the geometry shape of heatblocking materials the primary adversarial information carrier. Current shape-based methods suffer from a fundamental trade-off between representational capability and optimization power, limiting their attack effectiveness. In this work, we overcome this dilemma by introducing learnable Fourier shapes to the infrared domain. We utilize an end-to-end differentiable framework where a compact set of Fourier coefficients, defining the shape boundary, is analytically mapped to a pixel-space mask via the winding number theorem. This enables efficient gradient-based optimization to generate potent shapes that cause human targets to evade detection. Extensive digital and physical experiments provide a comprehensive evaluation and validate our superior performance. Our resulting physical patch achieves striking robustness, successfully evading detectors across diverse distances, angles, poses, and individuals, and achieves over 88% attack success rate at distances greater than 25m (conf.=0.5).Video demo and source code are available in the supplementary material.

Abstract:
Trainingfree test-time adaptation (TTA) for vision-language models (VLMs) can boost zero-shot classification under mild shifts but often collapses under severe environment/style shifts. We identify two shared failure modes: (i) retrieval confounding, where feature similarity is dominated by style and corrupts cache/bank evidence; and (ii) environment-biased priors, where VLMs logits exhibit environment-dependent centered shifts that distort gating and prior-like terms. We propose D^2O, a strictly training-free debiasing operator that outputs three inference objects per test sample: a content feature for reliable retrieval, a style fingerprint for environment routing, and debiased logits for corrected priors. D^2O composes plug-and-play with cache-based and closed-form Gaussian adapters in both online and transductive settings. We further provide operator-to-decision guarantees: finite-difference covariance recovers a style subspace, style-routed EMA controls the centered logit-bias estimate, and these errors translate to bounded posterior log-odds perturbations, yielding a margin-based condition for label invariance under strong shifts. Extensive experiments on diverse benchmarks show that our method consistently achieves state-of-the-art performance across a broad range of distribution shifts.

Abstract:
Large Language Models (LLMs) frequently generate output that contradicts explicit input evidence, limiting their reliability in realworld applications. We identify cognitive inertia in LLMs—a tendency to overly rely on co-occurrence associations learned during pretraining and to resist adaptation when conflicting input evidence appears—as a critical factor behind such hallucinations. We further empirically show that adherence to input evidence declines as co-occurrence associations are strengthened—driven by either higher data frequency or intensified training. Inspired by human counter-inertial thinking, we propose an adaptive counter-inertial reasoning framework that probes input-related cognitive inertia in the LLM and generates adaptive counter-inertial reminders, which are then injected into the prompt to promote evidence-based reasoning. Experiments on co-occurrence induction datasets show that LLMInertia reduces hallucination rates by up to 35% and improves accuracy by up to 35.68%. Extensive evaluations on four context-rich summarization and QA datasets, across three LLM backbones of varying scales, further validate its effectiveness and robustness. Our work provides new insight into the causes of input-unfaithful hallucinations in LLMs, contributing to the development of more reliable AI.

Abstract:
We introduce Distribution Contractive Reinforcement Learning (DICERL), a framework that uses reinforcement learning (RL) as a “distribution contractor” to refine pretrained generative robot policies. DICE-RL turns a pretrained behavior prior into a high-performing “pro” policy by amplifying high-success behaviors from online feedback. We pretrain a diffusion-based policy for broad behavioral coverage, then finetune it with a stable, sample-efficient residual off-policy RL framework that combines selective behavior regularization with value-guided action selection. Extensive experiments and analyses show that DICE-RL reliably improves performance with strong stability and sample efficiency, enabling mastery of complex long-horizon manipulation skills both in simulation and on a real robot. Project website:dice-rl-anonymous.github.io.

Abstract:
Large Vision–Language Models (LVLMs) have shown outstanding performance across various multimodal tasks, but they still suffer from hallucinations, where they generate incorrect information by relying on language priors without visual grounding. To alleviate this issue, prior work has explored contrastive decoding approaches that compare the output of an original LVLM with that of a contrast branch. However, existing methods typically generate contrast logits through preprocessing of the input image. Such inputlevel perturbations fail to sufficiently reflect the model’s internal degree of visual reliance during the decoding process. To address this limitation, we propose Dynamic Entropy-Guided Attention Perturbation (DEGAP) for contrastive decoding in LVLMs. DEGAP performs contrastive decoding by directly perturbing visual attention and leveraging the resulting logits, without requiring any additional image preprocessing. To this end, we analyze the layer-wise effects of visual attention perturbations and, based on these observations, dynamically select the layers at which attention perturbation is applied according to the model’s confidence. Experimental results on seven benchmarks demonstrate that DEGAP effectively mitigates various types of hallucinations and consistently outperforms state-of-the-art methods in general VQA performance.

Abstract:
The technique of combining multiple votes to enhance the quality of a decision is the core of boosting algorithms in machine learning. In particular, boosting provably increases decision quality by combining multiple "weak learners"—hypotheses that are only slightly better than random guessing—into a single "strong learner" that classifies data well. There exist various versions of boosting algorithms, which we improve upon through the introduction of QuantumBoost. Inspired by classical work by Barak, Hardt and Kale, our QuantumBoost algorithm achieves the best known runtime over other boosting methods through two innovations. First, it uses a quantum algorithm to compute approximate Bregman projections faster. Second, it combines this with a lazy projection strategy, a technique from convex optimization where projections are performed infrequently rather than every iteration. To our knowledge, QuantumBoost is the first algorithm, classical or quantum, to successfully adopt a lazy projection strategy in the context of boosting.

Abstract:
World models derived from largescale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, but this can result in overfitting to irrelevant factors, such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), that leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. Extensive evaluations demonstrate the superiority of MWM on the LIBERO and RLBench simulation benchmarks, which significantly outperforming the state-of-the-arts RGB-based world models. Furthermore, real-world experiments and robustness evaluation (via random token pruning) reveal that MWM exhibits superior generalization capabilities with robust resilience to texture information loss.

Abstract:
The rapid advancement of large language models (LLMs) has led to remarkable performance across diverse domains, making them indispensable assistants in daily life and work. Currently, LLM services are primarily accessed in two ways: (i) paid access to cloudhosted LLMs, which are powerful but introduce nontrivial cost; and (ii) deployment of small language models (SLMs) on personal devices or small clusters, which, while less powerful, are sufficient for handling relatively simple tasks. To achieve a balanced trade-off between monetary cost and task performance, we propose Selective Deferred Routing, a paradigm that enables cost-efficient collaboration between local SLMs and remote LLMs. In this framework, a user request is first processed by the local SLM, which not only generates a preliminary response but also provides rich semantic representations of the request. A lightweight decision module then leverages this information to either adopt the initial response or route the request to the most suitable remote LLM for a higher-quality response. Extensive experiments across diverse model architectures and families, including both SLMs and LLMs, as well as datasets spanning multiple task scenarios, demonstrate that our approach consistently outperforms existing multi-LLM collaboration methods under a wide range of cost–performance trade-offs.

Abstract:
While enabling effective collaboration on complex tasks, LLMbased Multi-Agent Systems (MAS) face critical security challenges due to vulnerabilities at the agent and interaction levels. Most existing MAS security defenses are built upon two core assumptions: semantically-explicit malicious attacks and explicit graph-based modeling of the MAS topology and agent-level interactions. In practice, real-world attacks are becoming more semantically stealthy, while MAS execution is typically asynchronous without the temporal alignment assumed by graph-based propagation models. To address these limitations, we propose AcMAS, an activation-based framework for malicious-behavior detection in MAS. By analyzing internal reasoning states in the activation space of local agents, AcMAS detects even stealthy attacks in a synchronization-robust fashion, without relying on explicit interaction graphs. Moreover, our activation analysis provides critical signals to guide AcMAS in restoring the functionality of compromised agents, rather than the disruptive agent isolation commonly used by the state-of-the-art methods. Comprehensive evaluation demonstrates that AcMAS significantly outperforms graph-based baselines against stealthy attacks, by +0.22 F1 in synchronous settings (0.94 vs. 0.72) and by +0.55 F1 in asynchronous settings (0.93 vs. 0.38), with generalization across diverse open-source LLM backbones, attack intensity, and MAS scale.

Abstract:
Global ocean modeling is vital for climate science but struggles to balance computational efficiency with accuracy. Traditional numerical solvers are accurate but computationally expensive, while pure deep learning approaches, though fast, often lack physical consistency and longterm stability. To address this, we introduce HybridOM, a framework integrating a lightweight, differentiable numerical solver as a skeleton to enforce physical laws, with a neural network as the flesh to correct subgrid-scale dynamics. To enable efficient high-resolution modeling, we further introduce a physics-informed regional downscaling mechanism based on flux gating. This design achieves the inference efficiency of AI-based methods while preserving the accuracy and robustness of physical models. Extensive experiments on the GLORYS12V1 and OceanBench dataset validate HybridOM's performance in two distinct regimes: long-term subseasonal-to-seasonal simulation and short-term operational forecasting coupled with the FuXi-2.0 weather model. Results demonstrate that HybridOM achieves state-of-the-art accuracy while maintaining physical consistency, offering a robust solution for next-generation ocean digital twins.

Abstract:
Representation learning on textattributed graphs (TAGs) is crucial for real-world applications, as it enables effective modeling of both rich node semantics and complex graph structure. Nevertheless, this task is intrinsically challenging due to structural–semantic mismatch stemming from divergent modality distributions, as well as dual-source noise inherent in node textual content and graph structure. Existing approaches often enforce a rigid fusion of distinct modalities while overlooking their inherent noise, which inevitably results in persistent distribution gaps and amplifies mixed interference during information propagation. To address these issues, we propose UDPD, an Uncertainty-modulated Dual-Path Diffusion model for robust text-attributed graph learning. Specifically, we first employ a dual-perspective node encoding strategy to separately learn semantic and structural embeddings. We then introduce a cooperative diffusion paradigm with parallel semantic and structural branches, where mutual guidance enables progressive alignment of different distributions while effectively suppressing modality inherent noise. Crucially, the reverse process is guided by node uncertainty, which is used to adaptively modulate cross-branch interaction strength, ensuring robust coupling and maximizing denoising effectiveness. Extensive experiments on five public benchmarks demonstrate the effectiveness and superiority of our UDPD over state-of-the-art baselines.

Abstract:
We propose a new paradigm for long video understanding by treating a long video as a Neural Knowledge Representation (NKR). NKR represent video contents neither as a stream of tokens or preorganized databases, but as an individual small portion of network weights attached to the VLM backbone. The NKR weights is optimized to encapsulate the video's semantic content via a novel Agentic Knowledge Distillation (AKD) process, where an agent automatically synthesizes dense descriptions and question-answer pairs to distill the video's knowledge into the NKR. While AKD serves as a comprehensive, one-time encoding phase, the resulting NKR transforms the video into a portable, reusable asset. At inference, the lightweight NKR is mounted onto a frozen Vision-Language Model (VLM), enabling direct, query-based understanding without reloading or re-encoding the the original video. This approach decouples video length from inference cost, offering high amortized efficiency for multi-turn video understanding. Experiments on the LVBench benchmark show our method achieves performance comparable to state-of-the-art approaches while reducing end-to-end latency by over two orders of magnitude, opening new possibilities for interactive long-video understanding.

Abstract:
Realworld data contains aleatoric uncertainty -- irreducible noise arising from imperfect measurements or from incomplete knowledge about the data generation process. Mean variance estimation (MVE) networks can learn this type of uncertainty but require ad-hoc regularization strategies to avoid overfitting and are unable to predict epistemic uncertainty (model uncertainty). Conversely, Bayesian neural networks predict epistemic uncertainty but are notoriously difficult to train due to the approximate nature of Bayesian inference. We propose to cooperatively train a variance estimation network with a Bayesian neural network and empirically demonstrate that the resulting model disentangles aleatoric and epistemic uncertainties while improving the mean estimation. We demonstrate the effectiveness and scalability of this method across a diverse range of datasets, including a time-dependent heteroscedastic regression dataset we created where the aleatoric uncertainty is known, used to assess estimation accuracy. The proposed method is straightforward to implement, robust, and adaptable to various model architectures.

Abstract:
We introduce Hybrid Spaceaware Stochastic Convolution Attention Noise (HySCAN), a hybrid randomized defense that helps close the long-standing gap between provable robustness under ℓ2 certificates and empirical robustness against strong ℓ∞ attacks, while maintaining strong generalization across diverse imaging benchmarks. HySCAN jointly explores complementary sources of stochasticity at both training and inference: (i) implicit weight-space randomness via stochastic-aware Random Weights, and (ii) explicit feature-space randomness via Stochastic Attention Noise Injection modules. By incorporating randomness at both the parameter and representation levels, HySCAN enables meaningful certified guarantees while improving empirical robustness in practice. Comprehensive experiments on diverse imaging datasets, e.g., CelebA, CIFAR-10, and CIFAR-100, ImageNet-1k, HAM10000, and NIH Chest X-ray, demonstrate that HySCAN outperforms existing certified and empirical defenses, improving certified robustness by up to ≈ 9.6% and empirical robustness by up to ≈ 5% without reducing clean accuracy.

Abstract:
Reliable epidemic forecasting is critical for public health decisionmaking yet remains challenging due to data sparsity and the non-stationary nature of disease dynamics. While recent hybrid models attempt to integrate mechanistic principles with data-driven approaches, they often relegate mechanistic priors to merely auxiliary features or regularization terms. This design not only obscures the interpretability of the mechanistic contribution but also fails to inherit the capability of physical models to generalize under non-stationary dynamics, as the core architecture remains predominantly data-driven. To address these limitations, we propose EpiDiff, a unified framework that synergizes epidemiological domain knowledge with the generative power of diffusion models. Unlike methods that rigidly fuse features, EpiDiff employs a novel uncertainty-aware steering mechanism during inference. Specifically, we quantify the posterior uncertainty of mechanistic estimations and use it to dynamically modulate the diffusion process. Extensive experiments on real-world datasets demonstrate that EpiDiff consistently outperforms state-of-the-art baselines in accuracy and robustness, particularly under non-stationary distributions, while offering transparent insights into model reliance by explicitly visualizing when the forecast is governed by mechanistic laws versus data-driven patterns. Our code and datasets are available at https://anonymous.4open.science/r/epidiff-4782.

Abstract:
Despite the high accuracy of EEGbased emotion recognition, existing models remain opaque "black boxes", lacking semantic grounding between abstract neural features and human-interpretable states. In this paper, we reframe EEG explainability as a cross-modal generation task, shifting the paradigm from feature attribution to behavioral visualization. We introduce Facial Emoji Proxy Modeling, a novel framework that translates high-dimensional EEG signals into identity-agnostic facial emojis. Guided by the neuroscientific prior of neural-facial consistency, this approach grounds neural representations in the manifold of observable facial dynamics. Technically, our framework integrates FMENet, a specialized backbone modeling expression-relevant spatial synergies, and the Facial Emoji Learning Branch (FELB), which treats emoji reconstruction as a structured semantic regularizer. Extensive experiments on EAV and MMER benchmarks demonstrate that our method achieves state-of-the-art accuracy among EEG-only models. Crucially, it generates semantically faithful facial animations that provide a transparent, privacy-preserving window into the brain's emotional evolution, effectively allowing users to "see the emotion" directly from neural signals.

Abstract:
RetrievalAugmented Generation (RAG) has recently been enhanced with tree or graph structures to match user intent for precise passage retrieval, which facilitates large language models (LLMs) in effectively mitigating hallucinations by leveraging external knowledge. However, we identify that existing structure-augmented RAG systems are experiencing (i) potential retrieval suspension and (ii) cumulative semantic drift, due to low-quality structures and semantic embeddings that often poorly capture textual details. Motivated by this, we propose a novel paradigm named KG-Translator, which is distinct from traditional matching-based paradigms and instead translates user queries into graph-level clues. Specifically, KG-Translator utilizes lightweight models to conduct named entity recognition (NER) and syntactic parsing on the corpus, constructing a reliable knowledge graph (ParseKG). On top of ParseKG, KG-Translator adopts constrained decoding strategies to faithfully translate clues, traces them to original passages, and employs a lightweight ranking model for precise passage retrieval. Extensive experiments on five datasets demonstrate that KG-Translator significantly outperforms baselines.

Abstract:
Layer dropout (a.k.a.\ stochastic depth) has been shown to enable faster training, higher accuracy, and robustness to zeroshot layer pruning in both language and vision transformers. However, as models and datasets have scaled, dropout---particularly layer dropout---has largely disappeared from LLM pre-training recipes. While some prior work has reported that dropout can degrade accuracy, no comprehensive study has quantified, let alone mitigated, this effect. In this study, we show that layer dropoutshouldbe used in state-of-the-art LLM training, establishing best practices and scaling analysis for both training and post-training benefits. Concretely, with optimal layer distribution, time schedule, and optimizer hyperparameters, a 3.9B-parameter LLM can achieve lower validation loss while saving 20% of training FLOPs. Moreover, layer dropout enables significant post-training optimizations, such as early exit, intermediate-layer skipping, and self-speculative decoding, yielding up to 1.7x inference speedup with negligible accuracy loss. Across more than 2400 training experiments, spanning models from 271M to 3.9B parameters and datasets up to 116B tokens, we demonstrate that these findings extend reliably to large-scale training regimes.

Abstract:
Extending traditional graph anomaly detection (GAD) from onefor-one to one-for-all paradigms, generalist GAD aims to learn a universal detector for identifying anomalies on unseen graphs without graph-specific retraining. Nevertheless, existing approaches primarily focus on aligning heterogeneous features across different data domains via PCA-based projection, which harmonizes feature dimensions but neglects semantic alignment. As a result, GAD models fail to acquire semantically transferable knowledge from source-domain pre-training, and even exhibit negative transfer on unseen graphs. To address this issue, we propose a Relational Fingerprint-based generalist GAD approach (REFI-GAD for short), aligning heterogeneous raw features with a universal and semantics-aware relational fingerprint (REFI) that encodes anomaly-indicative cues from both contextual and structural perspectives. Building on REFI, we design a fingerprint-grounded generalist GAD model, which combines a transformer-based encoder to capture domain-invariant knowledge with an SNR-guided refinement module for domain-specific adaptation. Extensive experiments on 14 datasets demonstrate that REFI-GAD significantly outperforms state-of-the-art methods.

Abstract:
Realworld multimodal systems inevitably face partial observability due to sensor dropout and degradation. Standard robustness methods can improve average performance, but they often remain unreliable in rare, adverse long-tail conditions. Under a locked same-support contract, we uncover a same-support tail failure where two runs with matched average performance still diverge on worst-case and shift-reweighted metrics computed on the same observable support. We attribute this instability to conditional interaction where environment coverage does not guarantee parameter exposure. Information is routed through different parameter groups, leaving some groups rarely updated even when environment coverage is complete, and tail-focused aggregation amplifies errors from these high-leverage but underexposed groups. This mismatch is auditable from gating logs, and we summarize it with TailPressure, an exposure-normalized statistic of tail-leveraged interaction. Guided by this diagnosis, we propose Heterogeneity-aware Closed-loop Exposure Stabilizer (H-CES), a controller that regulates per-group pressure via deterministic increment-branch gating and group-wise decoupled weight decay, without changing the task loss or adding inference branches. Experiments across diverse multimodal settings and backbones show that H-CES improves tail reliability under the same-support contract while preserving clean performance.

Abstract:
The task of temporal answer grounding in instructional videos (TAGV), which aims to locate precise video segments that respond to natural language queries, is increasingly important for direct video answer retrieval. This task remains challenging due to the need to comprehend semantically complex questions and to address the significant length mismatch between untrimmed videos and short target moments. Existing methods often suffer from sensitivity to irrelevant content or insufficient visual reasoning capabilities. To tackle these limitations, we propose a CandidateAware Causal Reasoning (CACR) framework. Our approach first employs a Visual-Language Pre-training based Candidate Selection (VBCS) algorithm to efficiently generate K candidate segments, then applies a temporal logic reasoning module enhanced by a rejection reward mechanism and optimized via Group Relative Policy Optimization (GRPO) for robust inference. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art performance in terms of mean Intersection-over-Union (mIoU), providing a new perspective for reasoning-based retrieval in long videos.

Abstract:
Despite their remarkable success, a rigorous theoretical understanding of how latent variables (LVs) govern the generalization performance of Variational Autoencoders (VAEs) remains largely elusive. Existing theoretical analyses are confined to supervised learning or models with discrete latent spaces, leaving their role in standard VAEs with continuous LVs poorly understood. This paper establishes the first informationtheoretic analysis for VAEs by adapting a theoretical framework from supervised learning---the leave-one-out conditional mutual information framework---to the unsupervised, continuous latent space of these models. Our analysis reveals that their generalization error is bounded solely by the information complexity of the encoder and LVs, independent of the decoder. The versatility of our framework is demonstrated through its extension to both hierarchical VAEs, for which we provide layer-wise bounds, and data generation, where we link our information-theoretic principles to a novel bound on the 2-Wasserstein distance between true and generated distributions.

Abstract:
Time series forecasting is critical in domains such as finance, energy, and healthcare, yet realworld datasets often exhibit non-stationarity, noise, missing values, and distribution shifts, posing severe challenges for generalization. In practice, industry solutions typically rely on customized forecasting frameworks that combine imputation, decomposition, and specialized models. However, such frameworks are costly to engineer and maintain. Moreover, we observe that many frameworks suffer from the impacts of distribution shifts, which degrade their respective performance. It motivates a paradigm that transfers reliably across heterogeneous datasets while accumulating reusable strategy knowledge for large-scale, dynamic environments. Although large language model-based agents have recently shown strong reasoning and tool-use capabilities, existing approaches do not consistently adapt forecasting workflows across diverse time series. We identify two primary factors, including limited strategy-level supervision and the inherent complexity of mapping dataset-specific meta-features to effective forecasting strategies. To address these challenges, we propose BECRA, a novel agent training paradigm that learns forecasting intelligence through contrast-aware exploration and agent-level causal lesson extraction, without human-annotated supervision. BECRA distills symbolic strategy lessons that support in-context planning on unseen datasets, enabling zero-shot training adaptation.

Abstract:
It is a fundamental challenge to ascertain whether the causal effect of a treatment on an outcome is identifiable in the presence of latent confounders, which serves as the logical prerequisite for recovering the causal effect in a partially observed system. While prior literature demonstrates that the causal effect is identifiable when there exist auxiliary variables subject to stringent structural constraints, this paper investigates identifiability of the causal effect without such variables. This means that we ground identifiability solely in the joint distribution of the treatmentoutcome pair, which constitutes the irreducible statistical basis for causal effect identification. Focusing on linear structural causal models (SCMs), we provide a nuanced and complete characterization of identifiability of the causal effect contingent on the distributional properties of exogenous noises. Specifically, we formulate a set of mutually exclusive and collectively exhaustive conditions regarding the Gaussianity of exogenous noises, ascertain under which conditions the causal effect is identifiable and under which it is not, while also quantifying the cardinality of the feasible solution set for the unidentifiable cases. Finally, we empirically validate our theoretical findings.

Abstract:
Large language models usually face continuous domain shifts during testing, which degrade performance on unseen shifting domains. So, researchers propose continual testtime adaptation (CTTA) to adapt to evolving testing domains while preserving knowledge of previous domains, making adaptability-stability (A-S) balance. Existing CTTA methods are constrained by dense base models that encode knowledge from all domains into a global model, hardly achieving the A-S balance. We observe that the model sparsity of mixture-of-experts (MoE) models is better for achieving A–S balance than dense models. In CTTA, however, MoE faces difficulty in (1) correctly routing samples from unseen shifting domains and (2) capturing domain-level shifts. In this paper, we propose test-time mixture-of-experts routing (TiME) via asymmetric co-optimal transport (As-COOT): we model MoE routing in CTTA as a test-time allocation problem via COOT. To ensure reliable routing, we propose a semantic space alignment to align sample-expert distributions via bidirectional contrastive learning. To address COOT’s limitations in CTTA, we propose As-COOT, relaxing sample-side constraints while enforcing expert-side constraints to ensure noise robustness and balance expert load. Experiments show TiME outperforms baselines. Code is: anonymous.4open.science/r/As-COOT-78FF

Abstract:
The Planthen-Code paradigm effectively enhances Large Language Models (LLMs) in complex code generation by decomposing reasoning into explicit, interpretable steps. However, introducing the plan and verification report substantially enlarges the context, which in turn misdirects the model’s attention toward irrelevant tokens and the most recently generated code. This effect leads the model to overlook critical constraints and to generate incorrect code, especially for small-scale LLMs (less than 8B). To address this issue, we propose Perturbation-Verified Attention Distillation and Dynamic Alignment (PADA). PADA identifies the key tokens most critical to the student model and constructs the optimal attention target matrix, dynamically aligning the student’s focus with key tokens for each plan step. We evaluate PADA with two teacher models and three student models across seven benchmarks, and the results show that PADA improves Pass@1 by up to 16.7% and outperforms SOTA methods in all settings. Our code is available at https://anonymous.4open.science/r/PADA-coder

Abstract:
Foundation models have sparked a revolution via a pretrainingadaptation paradigm, with recent efforts extending this success to graphs. Unlike other modalities, graphs contain rich structural patterns, yet their structural transferability remains poorly understood. Prior studies consider common substructures in the discrete realm, and we are motivated by a fundamental question: Are common substructures transferable? The underlying theory is largely underexplored. In this work, we shift toward learning transferable structures through the lens of functional behavior. Theoretically, we connect transferable substructures to intrinsic geometry of the representation space. However, characterizing such intrinsic geometry has rarely been touched. Grounded in Riemannian geometry, we develop a graph intrinsic geometry learning framework—Neural Vector Bundle, which enables parsing intrinsic geometry with local coordinates. Building on this, we design \textscGauge, a pretrainable neural architecture that constructs the vector bundle, flattening geometrically compatible local coordinates, and a new Dirichlet loss, which also measures the transfer effort. We empirically validate its superior expressiveness in challenging tasks including zero-shot link prediction and graph isomorphism.

Abstract:
Effective time series forecasting enables various realworld applications, benefiting from the proliferation of mobile devices. However, the volume of time series data may vary significantly across domains due to low sampling rates and data regulations. To maximally create value from sparse data, this study focuses on a new problem of source-free time series forecasting, aiming to adapt a pretrained model from sufficient source time series to the sparse target time series without access to the source data, enabling data protection. To achieve this, we propose TimeID, a novel source-free time series forecasting framework with a large language model (LLM) centric proxy denoising inspired by the powerful generalization capabilities of LLMs. Specifically, TimeID consists of three key components: (1) dual-branch invariant disentangled feature learning that enforces representation- and gradient-wise invariance by means of season-trend decomposition; (2) lightweight, parameter-free proxy denoising that dynamically calibrates systematic biases of LLMs; and (3) knowledge distillation that bidirectionally aligns the denoised prediction and the original target prediction. Extensive experiments on real-world datasets demonstrate that TimeID outperforms state-of-the-art baselines, improving MSE and MAE by 10.7% and 9.3% on average. The code is publicly available at https://anonymous.4open.science/r/TimeID-6D1D/.

Abstract:
As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: wellcalibrated confidence estimates help inform the risk versus reward trade-off when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric,expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purposeactivation-based confidence estimation protocol (ACE)to appropriately adjudicate uncertainty. The ACE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACE protocol can improve calibration, utility, and trustworthiness in numerous settings.

Abstract:
Federated parameterefficient fine-tuning (PEFT) enables customizing large language models on private data, yet it is vulnerable to backdoor poisoning—especially when privacy constraints prevent inspection of per-client real-valued updates. We exploit the intuition that poisoning leaves a similar backdoor imprint in which adapter coordinates become salient, so overlap in salient-index supports remains informative even without values. We introduce IndexGuard, an unsupervised index-only vetting primitive in which clients send only Top-K salient update indices and the server operates on the induced overlap geometry, clustering clients and filtering cohesion-outlier groups before aggregation. We analyze support stability under bounded rescaling and separability under shared-trigger poisoning under non-IID drift. Across attacks, backbones, and PEFT variants, IndexGuard provides end-to-end mitigation, preserving clean accuracy while achieving performance comparable to centralized methods.

Abstract:
Large Language Models (LLMs) possess broad conceptual knowledge acquired through largescale text pretraining, yet their potential to supervise models in other modalities remains underexplored. In this work, we propose \LaViD—Language-to-Visual Knowledge Distillation—a simple and effective framework for transferring high-level semantic knowledge from a language-only teacher to a vision-only student model. Instead of relying on paired multimodal data, LaViD elicits conceptual signals from an LLM by prompting it to generate multiple-choice questions (MCQs) that probe semantic distinctions between visual classes. Each class is mapped to a soft label distribution over these MCQs, forming a rich conceptual signature that guides the student through an auxiliary distillation loss. Notably, despite using a language-only teacher without access to image data, LaViD consistently outperforms recent methods like MaKD that distill from vision-language models across multiple fine-grained benchmarks. It also achieves competitive or superior performance compared to state-of-the-art visual distillation methods such as DKD and MLKD, with further gains when combined with logit standardization. On the Waterbirds dataset, LaViD substantially improves worst-group accuracy, demonstrating enhanced robustness to spurious correlations with distillation.

Abstract:
Training deep spiking neural networks (SNNs) remains challenging due to sharp loss landscapes and temporal inconsistency caused by surrogate gradients. To address these challenges, we propose a unified framework: adaptive and asymmetric surrogate gradients (A^2SG). The adaptive gradients adjust an effective window for spatiotemporal adaptation, reducing spatial gradient variation and maintaining directional consistency of gradients over time. The asymmetric gradients reflect neuronal dynamics by assigning larger gradients to neurons with higher membrane potentials, and we prove that they yield lower variation than symmetric surrogates. Our analysis further establishes a direct connection between local gradient variation and the curvature of the loss landscape, providing a principled explanation for how A^2SG promotes convergence to flatter minima and improves generalization. We conduct extensive experiments on diverse models, including CNN-based and Transformer-based SNNs, across various tasks such as image classification using both static and neuromorphic datasets, as well as segmentation. The results demonstrate that A^2SG consistently improves accuracy and energy efficiency, establishing it as a general and reliable solution for training deep SNNs.

Abstract:
Generative models that transport a simple source distribution to a complex data distribution—such as diffusion and flowbased models—are central to high‑fidelity data generation. Test-time guidance can further steer pretrained models toward user-specified high-reward regions without costly retraining. However, existing guidance methods face critical limitations: they struggle with non-differentiable rewards, fail to navigate complex landscapes, and often lack theoretical guarantees on generation performance. We propose \it Source Parallel Tempering (SPT), a gradient‑free test‑time guidance framework that operates entirely in source space, leveraging its simpler geometry to avoid the complexities of the data manifold. SPT couples a local exploration kernel with parallel tempering, enabling efficient barrier crossing and robust discovery of high‑reward modes. Theoretically, we provide a new error bound linking training-time approximation error to test-time guidance performance. Empirically, SPT significantly improves over state-of-the-art methods on benchmark tasks in conditional image synthesis, protein structure generation, and dynamical system trajectory sampling.

Abstract:
Formal verification of system software with complex heap manipulations remains challenging. Standard automated solvers frequently fail to discharge separation logic verification conditions even when correct specifications like loop invariants are provided, forcing verification engineers to manually construct proofs. While large language models (LLMs) have shown promise in proof synthesis, specialized approaches for separation logic remain unexplored. To bridge this gap, we introduce SLVC (Separation Logic Verification Conditions), a benchmark of 208 verification conditions derived from real-world C programs, including data structures and algorithms, the LiteOS kernel's linked list library, and the mini-gmp library. Our evaluation reveals that general-purpose LLMs and existing LLM-based Rocq provers struggle to effectively discharge these verification conditions. To address this challenge, we propose SPLIT (Split spatial and pure Proving with LLM-frIendly Tactics), a novel framework that enables predictable proof state transitions through an LLM-friendly tactic library, combined with a two-stage workflow that separates spatial and pure reasoning to align with separation logic semantics. Experimental results on SL-VC demonstrate that SPLIT consistently outperforms existing approaches, showing that LLM-assisted proof synthesis is a promising solution for separation logic verification of real-world system software.

Abstract:
Deep protein structure predictors such as AlphaFold provide confidence estimates (e.g., pLDDT) that are not calibrated and degrade under distribution shifts across experimental modalities, temporal changes, and disordered regions. We introduceCalPro, a prioraware evidential conformal framework for shift-robust uncertainty quantification. CalPro combines three components: (i) a geometric evidential head outputting Normal Inverse Gamma distributions via graph neural networks; (ii) a differentiable calibration surrogate that shapes representations during training, followed by split-conformal calibration for finite-sample coverage; and (iii) domain priors (disorder, flexibility) encoded as soft constraints on predicted uncertainty. Theoretically, we derive structure-awaresensitivity boundsfor coverage degradation under distribution shift using PAC-Bayesian control over ambiguity sets, quantifying how miscoverage increases with model complexity and shift magnitude. Empirically, CalPro achieves at most 5 percentage points coverage degradation across modalities compared to 15 to 25 points for baselines, reduces calibration error by 30% to 50%, and improves downstream docking success from 52% to 75% when filtering by uncertainty. The framework extends beyond proteins to structured regression tasks where priors encode local reliability.

Abstract:
Computational pathology has made progress in diagnosis and prognosis prediction from whole slide images (WSIs), yet pipelines still rely on patchlevel feature extraction and aggregation, departing from the cell-centric reasoning used by pathologists. This gap limits sensitivity to micro-lesions and subtle changes, and current methods rarely provide a unified solution that supports both local and global tasks with trustworthy evidence. We propose Cello, a universal cell-wise feature aggregation framework for reliable pathology image analysis. Cello integrates cell-level representations into WSI modeling via protein-signal–supervised cell-wise learning, preserving fine-grained cellular cues under gigapixel constraints. For local tasks, Cello introduces a flexible prototype-based contrastive module for scalable, task-adaptive representation learning. For global tasks, Cello adopts a weakly supervised gated aggregation that can widely leverage WSI labels. Finally, a cell–local–global decision-route consistency objective dynamically aggregates cellular evidence and aligns local predictions with global outcomes, improving reliability and faithfulness. Trained with only hundreds to thousands of samples, Cello achieves performance gains of 3.0%~7.6% and outperforms SOTA pathology foundation models pretrained on tens of thousands of samples. Code is available at https://anonymous.4open.science/r/Cello.

Abstract:
The paradigm of visual generation is rapidly shifting from singleimage conditioning toward multi-image conditioning, making the ability to synthesize and edit images based on multiple visual references a critical capability. Despite this trend, existing benchmarks remain largely limited to single-reference scenarios or narrowly defined tasks, leaving model behavior under complex multi-concept composition insufficiently explored. To bridge this gap, we introduceMICE-Bench, a comprehensive benchmark forMulti-referenceImageCreation andEditing. The benchmark is designed around three core principles: 1) heterogeneous concept composition across seven visual dimensions; 2) varying levels of constraint density, ranging from dual-concept to seven-concept configurations; 3) concept-centric data construction and benchmark evaluation, enabling fine-grained analysis of interactions among multiple concepts. MICE-Bench consists of 3,119 high-quality test cases within a unified concept space. Using an 8-dimensional evaluation metric, we systematically evaluate 13 state-of-the-art models. Our results show that although closed-source models maintain a clear performance advantage, all models experience notable degradation in concept consistency and physical realism as concept complexity increases.This indicates that current models rely on superficial composition rather than genuine multi-concept synthesis, highlighting substantial room for future improvement.

Abstract:
High dynamic range novel view synthesis (HDRNVS) remains challenged by geometric artifacts and radiometric distortions under multi-exposure conditions, primarily due to existing methods ignoring exposure and over-relying on color cues. Inspired by the integrated processing of color and structure of the human visual system (HVS), we propose Expo-GS, a novel framework that decomposes HDR-NVS into three interpretable components, namely, Irradiance Field Training, Geometry Field Training, and Interactive Joint Training. Central to Expo-GS is the exposure-aware signed distance function (Expo-SDF), which dynamically reweights geometric supervision via localized exposure reliability estimation, suppressing noisy gradients from unstable regions while enhancing structure learning in well-exposed areas. Building on this, we design an interactive optimization strategy that synchronizes Gaussian primitive growth and pruning with evolving Expo-SDF cues, enabling exposure-aware density control and eliminating hallucinated structures near exposure transitions. Experiments show that Expo-GS significantly outperforms prior methods on both synthetic and real-world datasets. It achieves a peak PSNR of 39.06 dB under HDR settings and up to 41.38 dB in the LDR-OE configuration, excelling in preserving high-frequency textures and maintaining structural consistency.

Abstract:
Accurately modeling complex temporal and topological dependencies and depth information is critical for monocular 3D human pose estimation, yet existing Mambabased approaches struggle to fulfill these demands, suffering from internal state update confusion induced by forced sequence flattening and depth modeling confusion inherent to single-view observations. To address this confusion, we propose a StructMamPose framework equipped with Identity Anchoring Mechanism (IAM) and View Transformation Hub (VTH). The IAM injects spatiotemporal identities into the parameter generation network to anchor the selectivity of state update matrices, suppressing spurious connections to enforce feature propagation along valid topological dependencies. The VTH performs internal coordinate rotation to transform implicit depth inference into observable planar features, endowing the model with explicit spatial understanding and multi-view constraints. Experimental results demonstrate that our framework achieves SOTA performance on most datasets.

Abstract:
Geometric foundation models hold promise for unconstrained dense geometry prediction from uncalibrated images; however, in current feedforward designs, their predicted confidence scores are heuristic, lack probabilistic interpretation, and often fail to indicate where and how much the predicted geometry can be trusted. To fill this gap, we presentTrust3R, a trust-aware 3D reconstruction framework that pairs a lightweight gated residual mean refinement with evidential learning to predict pointmap evidence under a Normal-Inverse-Wishart prior and yield a closed-form multivariate Student-t predictive distribution. This design provides probabilistically grounded pointmap uncertainty estimates while adding moderate inference overhead. We evaluate on diverse indoor and outdoor benchmarks, and compare against MASt3R's built-in confidence map as well as common uncertainty-aware baselines spanning single-pass heteroscedastic regression and sampling-based methods such as MC dropout and deep ensembles. Experimental results show that Trust3R consistently improves risk--coverage and sparsification, and generally improves geometric accuracy. Trust3R consistently improves uncertainty ranking across benchmarks (e.g. on ScanNet++: 25% lower AURC and 41% lower AUSE), enabling uncertainty-aware weighting for downstream alignment and fusion.

Abstract:
We study gradient testing and gradient estimation of smooth functions using only a comparison oracle that, given two points, indicates which one has the larger function value. For any smooth f\colon\mathbb R^n\to\mathbb R, \mathbfx\in\mathbb R^n, and \varepsilon>0, we design a gradient testing algorithm that determines whether the normalized gradient \nabla f(\mathbfx)/||\nabla f(\mathbfx)|| is \varepsilonclose or 2\varepsilon-far from a given unit vector \mathbfv using O(n) queries, as well as a gradient estimation algorithm that outputs an \varepsilon-estimate of \nabla f(\mathbfx)/||\nabla f(\mathbfx)|| using O(n\log(1/\varepsilon)) queries. We prove lower bounds establishing the optimality of both algorithms. Furthermore, we study these problems in the quantum comparison oracle model where queries can be made in superpositions, and develop quantum algorithms for gradient testing and gradient estimation using O(1) and O(\log (n/\varepsilon)) queries, respectively.

Abstract:
Recent advancements in generative AI have led to image editing models capable of producing realistic forgeries that evade traditional image forgery localization methods, as these approaches depend on physical noise absent in synthetic data. To address this challenge, we theoretically demonstrate that the diffusion process inherently suppresses local highfrequency variance, creating a statistical energy gap that is distinguishable from the natural entropy of optical imaging. Guided by this insight, we propose FLAME, a unified framework that utilizes a LAD map to capture these intrinsic anomalies, coupled with a parameter-efficient adapter for the SAM 3 to achieve precise, pixel-level forgery localization. Furthermore, to bridge the lag between forensic benchmarks and evolving generative models, we introduce EditStream, an automated pipeline for continuous, instruction-based training data synthesis. Extensive experiments demonstrate that FLAME establishes a new state-of-the-art, significantly outperforming previous methods on AI-generated forgery datasets while effectively generalizing to unseen generative architectures.

Abstract:
Large Language Models (LLMs) have shown to be vulnerable to backdoor attacks, yet we observe that many LLM backdoors do not survive when end users perform supervised finetuning (SFT). In this work, we provide a geometric explanation: by probing the backdoor objective under controlled weight perturbations, we find that conventional poisoning often drives the backdoor loss to a narrow and sharp basin; consequently, even modest parameter drift induced by downstream SFT can push the model out of the low-loss and high-ASR region, leading to rapid backdoor forgetting. Motivated by this insight, we propose BAD-BOOM, a resilient backdoor attack via broader smoothness minimization, which explicitly broadens and smooths the backdoor basin. BAD-BOOM extends sharpness-aware minimization with a Fisher-induced ellipsoidal constraint that allocates larger perturbation budgets to backdoor-sensitive parameters, encouraging solutions whose neighborhoods also maintain low backdoor loss. Across two threat settings (sentiment steering and targeted refusal), three attacks (AddSent, Sleeper, VPI), three open-source LLMs, and three trigger-free downstream SFT tasks (SST-2, GSM8K, instruction following), BAD-BOOM consistently preserves high ASR while maintaining competitive utility.

Abstract:
With the widespread deployment of large language models (LLMs), aligning model outputs with pluralistic human values has become an important research problem. Recent approaches that train taskspecific experts and merge them through parameter aggregation have shown promise for pluralistic alignment. However, these methods often overlook the intrinsic complexity of real-world value data, where multiple correlated value dimensions coexist, resulting in highly similar and entangled expert representations. Consequently, modifying the contribution of one value expert may unintentionally influence other values, limiting fine-grained controllability. To address this issue, we propose DisAlign, a model-merging framework that explicitly decomposes value representations into consensus and value-specific components using an information-geometric perspective. DisAlign first extracts a consensus anchor and subspace to capture shared structure across values, and then applies spectral decomposition to the residual representations to construct disentangled value subspaces. This design enables more precise and independent modulation of multiple values. Experiments on three datasets covering different value frameworks demonstrate that DisAlign consistently improves value disentanglement and achieves more accurate pluralistic value control compared to existing baselines. Our code is available at \urlhttps://anonymous.4open.science/r/DisAlign-7F35

Abstract:
Optimizing maintenance strategies for largescale infrastructure is a critical sequential decision-making problem, exemplified by the high-stakes domain of bridge management. While Reinforcement Learning (RL) offers a theoretical framework for such problems, practical deployment necessitates offline constrained RL—learning policies solely from static historical datasets under rigid budgetary limits without dangerous on-policy exploration.However, current research is hindered by benchmarks that fail to capture the confluence of distributional shift and hard constraints typical of real-world assets. We introduce InfraRL, a high-fidelity benchmark that uses bridge maintenance as a rigorous testbed for general infrastructure asset management challenges.Constructed from the U.S. National Bridge Inventory, InfraRL defines a rigorous offline task for optimizing maintenance strategies under hard budgetary constraints. We benchmark a diverse suite of baselines, ranging from industry-standard heuristics to SOTA single-agent and multi-agent offline RL algorithms. Through a comprehensive evaluation protocol, we analyze performance across structural utility, constraint adherence, and behavioral fidelity, revealing critical trade-offs between safety and long-term efficiency. Our code and data are available at https://anonymous.4open.science/r/ICML-6656

Abstract:
Validating autonomous driving systems requires diverse scenarios, yet realworld data collection is biased and costly. Editing existing driving logs offers a scalable solution, but simultaneously editing objects and ego-trajectory—termed unified editing—remains challenging. Current methods face an inherent dilemma: generative flexibility for object editing and physical precision for trajectory control. To address this, we introduce SceneDirector, a diffusion-based framework that bridges explicit geometry and generative priors. For explicit geometry, we leverage LiDAR-guided depth completion to construct dense scene geometry and integrate editable 3D assets to form a Unified Geometric Scaffold, providing rigorous structural guidance for unified editing. To leverage generative priors, we encode the source video into a Static Texture Bank to provide rich appearance context. Our proposed Mask-Gated Reference Attention bridges these modalities. Guided by a geometric uncertainty metric, this mechanism dynamically regulates the interaction between the scaffold and the bank—preserving reliable geometry while adaptively injecting textures for semantic refinement. Extensive evaluations demonstrate that SceneDirector outperforms state-of-the-art methods in both controllability and visual quality.

Abstract:
Multimodal image fusion (MMIF) aims to integrate complementary information from different modalities into a single fused image that preservesfine local detailswhile maintainingglobally consistent appearance. Most existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over imagelevel appearance factors. To better optimize two objectives jointly, we redesign the shared representation by mapping inputs into a compact sequence of discrete1D image tokens, and instantiate this design with TiTok as a lightweight tokenizer, decoupling the shared representation from fixed pixel locations and concentrating image-level attributes into a small set of global tokens. We proposeSelective Token Editing(STE): we sparsely update/replace only a small set of critical shared tokens, providing a lightweight token-level mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding complex loss designs. Experiments on multiple benchmarks show that our method deliversconsistent, multi-metric improvements—enhancing global coherence and local fidelity simultaneously—and achieves thebest overall performanceunder comprehensive evaluation.

Abstract:
While Large Language Models (LLMs) possess strong reasoning capabilities, enabling them to learn continuously from experience without parametric retraining remains an open challenge. Existing RetrievalAugmented Generation (RAG) approaches typically treat memory as a static or append-only corpus, leading to "memory saturation''---where accumulating noise and redundant information degrade performance over time. To address this, we propose an Experience Risk Minimization (ERM) framework that formalizes the experience library as a learnable parameter under an explicit capacity budget. We introduce Textual Stochastic Gradient Descent (TSGD), a discrete optimization algorithm that refines this library via failure-driven Add, Edit, and Delete operations. TSGD estimates ``textual gradients'' through self-reflection and employs a dual-verification mechanism to ensure generalization, effectively preventing overfitting to local errors. Empirical results on MATH and AIME benchmarks demonstrate that TSGD achieves state-of-the-art performance, improving accuracy by up to 18.7% over zero-shot baselines and significantly outperforming static RAG, all while maintaining a compact memory footprint (compressing hundreds of experiences into \approx30 high-utility rules).

Abstract:
Transaction cost optimization (TCO) of online portfolio selection is crucial in computing science, due to the significant impact of transaction costs in practical shortterm trading. Moreover, sparsity of portfolio vector is often desired to enhance stability and decrease risk. However, there is a lack of models considering transaction costs and sparsity simultaneously in the literature. In this paper, we first propose a K-sparse TCO model that minimizes the negative return and transaction costs while keeping the portfolio vector being K-sparse. Noting that the model is NP-hard due to the K-sparse constraint, we bypass this difficulty by reformulating the sparse model to a nonsmooth difference of convex (DC) optimization problem. We show that both problems are equivalent by proving that the penalty parameter is large enough. Then, to overcome the difficulty caused by the nonsmoothness and the simplex constraint of the model, we develop a proximal subgradient algorithm (PSGA) to solve the DC problem and apply the alternating direction of multipliers (ADMM) to compute the proximity operator of the corresponding function. Furthermore, we establish the global convergence of the entire sequence generated by PSGA through showing the surrogate function satisfies the Kurdyka-Łojasiewicz (KL) property. In addition, by showing the KL exponent of the surrogate function is 1/2, we establish the R-linear convergence rate of PSGA for any arbitrary initiaal point. Finally, we compare our proposed algorithm with other state-of-the-art strategies on four benchmark real-market data sets, with the numerical results showing that the proposed algorithm achieves lower risk while keeping higher return than classical TCO models.

Abstract:
Recently, large language models (LLMs) have driven a systematic shift in the graph ML community through the adoption of text-attributed graphs (TAGs). Although a variety of frameworks have been developed, most fail to properly ad- dress the challenge of data uncertainty in open- world environments. A representative source of such uncertainty is the limited availability of la- bels in large-scale datasets due to high annotation costs, where unlabeled nodes may belong to either known classes or novel, unknown classes. While node-level out-of-distribution detection and con- ventional open-world graph learning attempt to tackle this problem, two core limitations remain: ① Insufficient methods — existing approaches typically optimize semantics or topology in isola- tion for unknown-class rejection, failing to effec- tively integrate textual and structural information in TAGs; ② Incomplete pipelines — most stud- ies conduct only idealized analyses, such as as- suming a predefined number of unknown classes, which restricts practical utility for model updates and long-term deployment. To overcome these issues, we introduce the Open-world Graph Assis- tant (OGA), an LLM-based framework. OGA first performs unknown-class rejection via adaptive la- bel traceability (ALT), harmoniously combining semantic and topological cues, and then applies the graph label annotator (GLA) for unknown- class annotation, allowing unlabeled nodes to con- tribute to model training. In essence, OGA offers a new pipeline that fully automates the handling of unlabeled nodes in open-world environments, and we establish a systematic benchmark cover- ing four key aspects to validate its effectiveness and practicality through extensive experiments.

Abstract:
We study episodic constrained Markov decision processes (CMDPs) with linear function approximation, where the goal is to achieve strong regret and constraint violation guarantees without allowing error cancellations. Unlike the existing work, which focuses on either tabular CMDP or modelbased reinforcement learning methods. We propose a model-free policy APMPO that achieves near-optimal \widetildeO(\sqrtK) strong regret and strong constraint violation with Slater's condition (or strict feasibility assumption), where K is the total number of episodes. It matches the best-known rates without requiring any prior knowledge of the feasibility gap reported in prior model-based work for tabular CMDPs. Besides, APMPO achieves \widetildeO(K^\frac34) strong regret and \widetildeO(K^\frac34) strong constraint violation without Slater's condition. To the best of our knowledge, this is the first sublinear result of CMDP w.r.t. the strong metrics without Slater's condition. APMPO achieves these results by a novel and adaptive design of a violation-aware penalty and learning rates to balance the strong regret and constraint violation, which is quite different from the (regularized) primal-dual methods imposing constraints via dual penalty in the literature. The experiments show APMPO significantly outperforms the strong baselines, which justify our design and theoretical performance.

Abstract:
Concept erasure aims to prevent image generative models from producing unsafe content while preserving their general generative capability. Meanwhile, nextscale autoregressive (AR) image generation has recently emerged as a new generative paradigm characterized by next-scale prediction, for which concept erasure remains largely unexplored. In this paradigm, semantic information is highly compressed at early scales, leading to severe entanglement between unsafe and unrelated semantics. In this paper, we propose ScaleErasure, an inference-time concept erasure method that performs minimal intervention. ScaleErasure precisely selects and guides predicted logits that are most relevant to the unsafe concept, thereby enabling effective erasure under severe semantic entanglement. Specifically, ScaleErasure performs two additional forward passes conditioned on the unsafe concept and the corresponding safe concept, and leverages their outputs to guide the target logits away from unsafe concepts toward safe concepts. To enable precise and minimal intervention, logits selection and guidance are conducted across three dimensions: scales, tokens, and bit channels. Experiments demonstrate that ScaleErasure outperforms adapted baselines in the next-scale AR paradigm, achieving more precise concept erasure while largely preserving general generative capability.

Abstract:
Accurate survival prediction in kidney transplantation is critical yet challenging due to the complex interplay between functional biomarkers and patient characteristics under censoring. To address this, we propose a functional censored quantile neural network (FunCQNet), a novel framework that integrates deep neural networks with a censoringadjusted sequential quantile loss to approximate interaction-dependent coefficient functions. We further introduce a conformal inference approach to rigorously assess the significance of scalar-functional interactions, ensuring interpretability alongside predictive power. Extensive simulations demonstrate that FunCQNet robustly recovers functional effects under varying noise and censoring levels. When applied to kidney transplant data, the model yields precise multi-quantile predictions and reveals clinically significant, age-dependent interaction patterns between donor type and recipient survival.

Abstract:
As Large Language Models transition to autonomous agents, user inputs frequently violate cooperative assumptions (e.g., implicit intent, missing parameters, false presuppositions, or ambiguous expressions), creating execution risks that textonly evaluations do not capture. Existing benchmarks typically assume well-specified instructions or restrict evaluation to text-only, single-turn clarification, and thus do not measure multi-turn disambiguation under grounded execution risk. We introduce DRIFT-BENCH, the first diagnostic benchmark that evaluates agentic pragmatics under input faults through multi-turn clarification across state-oriented and service-oriented execution environments. Grounded in classical theories of communication, DRIFT-BENCH provides a unified taxonomy of cooperative breakdowns and employs a persona-driven user simulator with the Rise evaluation protocol. Experiments show substantial performance drops under these faults, with clarification effectiveness varying across user personas and fault types. DRIFT-BENCH connects clarification studies with agent benchmarking, providing a framework to diagnose failures arising from faulty user inputs.

Abstract:
In offline Reinforcement Learning, immediate rewards in logged batch data are often unobserved due to sparse or irregular recordkeeping, or censored beyond certain reward values. This issue frequently arises in practical settings, including health care and marketing. We investigate off-policy evaluation (OPE) in finite-horizon Markov decision processes when rewards are missing not at random (MNAR), which breaks ignorability and induces selection bias even after conditioning on states and actions. To address this, we formalize a reward-dependent propensity model and use future states as shadow variables to identify the full-data conditional mean reward. We further introduce a bridge function that recovers the conditional mean reward without explicitly modeling the MNAR mechanism, and estimate it via a min-max procedure to avoid double sampling. Building upon these identification results, we propose an Fitted-Q-Evaluation-style estimator that propagates the recovered rewards while allowing target policies to depend on past missingness indicators. Finally, we establish consistency and finite-sample error bounds for our OPE estimator, and show through simulations the strong performance of our method compared to existing benchmarks.

Abstract:
Recent work has analyzed the convergence of firstorder methods under non-uniform smoothness assumptions that better model the loss landscape in machine learning tasks. We generalize this assumption to objectives whose curvature is an affine function of the objective value. This property is satisfied by a broad class of problems, including logistic regression, generalized linear models with a logistic link function, softmax policy gradient in reinforcement learning, and a class of neural networks. Under this assumption and gradient domination conditions, we establish a general convergence rate for the steepest descent method, and deterministic, diagonal variants of RMSProp and Adam. Our results imply that for logistic regression on separable data and the softmax policy gradient objective, sign GD converges linearly and is provably faster than GD. Furthermore, we show that for a class of two-layer neural networks on separable data, RMSProp and Adam can converge at a linear rate with a constant step-size and momentum parameter. Finally, we present a lower bound demonstrating that, under our assumption, RMSProp and Adam are provably faster than AdaGrad, AMSGrad, gradient descent, and heavy-ball momentum.

Abstract:
Leveraging Large VisionLanguage Models like CLIP has recently set new benchmarks for No-Reference Image Quality Assessment (NR-IQA). However, the contrastive pretraining of CLIP inherently prioritizes semantic invariance, which often suppresses subtle perceptual signals, a phenomenon we term perceptual submergence. Furthermore, standard preprocessing techniques (e.g., cropping and interpolation) further exacerbate the loss of critical high-frequency quality cues. In this paper, we propose the Cross-modal Perception Alignment Adapter (CMPA), a manifold-aware framework designed to disentangle perceptual distortions from dominant semantics. CMPA introduces a Perception-Sensitive Feature Extractor (PFE) that projects CLIP features into a compact, low-dimensional subspace, explicitly magnifying distortion-induced off-manifold deviations. Subsequently, a Cross-Modal Perception Alignment Injector (PAI) aligns these features with quality-aware text anchors and re-injects them into the backbone. To ensure input fidelity, we also devise a Residual-enhanced Perceptual Downscaling strategy that adaptively compensates for resolution-induced information loss using Just Noticeable Difference (JND) guided frequency re-injection. Extensive evaluations on several benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, effectively recovering the perceptual signals submerged in semantic-dense representations.

Abstract:
Molecular property prediction from 3D structures is fundamentally constrained by the scarcity of labeled data. To address this challenge, researchers have adapted various selfsupervised pre-training methods from computer vision and natural language processing; however, these approaches often neglect the fundamental physical principles unique to molecular systems. When grounded in physical principles, denoising pre-training can be formally shown to be equivalent to learning molecular force fields.However, existing methods uniformly apply a uniform noise scheme across all molecules, which introduces systematic bias in molecular distribution modeling. To overcome this limitation, we propose MOES-Pred, a novel denoising pre-training framework that employs an energy sentinel mechanism to dynamically adjust molecule-specific noise perturbations. By incorporating chemical prior knowledge, we design molecule-specific noising strategies that expand conformational sampling coverage and improve the fidelity of molecular distribution modeling. Extensive experiments demonstrate that MOES-Pred consistently surpasses existing methods, achieving state-of-the-art performance on both force prediction tasks and downstream quantum chemical property predictions.

Abstract:
Large language models (LLMs) have achieved remarkable breakthroughs across various applications. However, their architectures remain inefficient in pretraining due to two main limitations: (i) selfattention lacks an explicit inductive bias for locality, leading to redundant modeling of sequence-internal local information; (ii) mixture-of-experts (MoE) implicitly couples knowledge storage with computational pathways, hindering flexible access to sequence-external global knowledge. To overcome these limitations, we propose LoKiFormer, a novel LLM architecture that augments the standard decoder with two dedicated modules: 1) Local Fusion Attention (LFA), which incorporates a convolutional fusion to attention, explicitly capturing local patterns and allowing the attention to operate on more informative representations; 2) Knowledge Memory Module (KMM), which introduces a parametric key–value memory that explicitly stores global knowledge in addressable slots, decoupling storage from computation and enabling direct knowledge retrieval. Together, these modules enable LoKiFormer to achieve more efficient and effective integration of information at both levels. Experimental results show that LoKiFormer converges 1.33x faster in pre-training than baseline models, underscoring its superiority over existing LLM architectures.

Abstract:
Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Through controlled visual perturbations, existing multimodal judges frequently anchor on the response text instead of their own visual perception, leading to inconsistent and nonverifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving coherent global ordering without explicit pairwise labels. Experiments across diverse MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to visual–reasoning conflicts.

Abstract:
Vision Language Models (VLMs) achieve strong reasoning with Chainof-Thought (CoT) prompting but incur high sequential-generation cost, error accumulation, and limited self-correction. Diffusion Multimodal Large Language Models (dMLLMs) unmask tokens in an order-agnostic process, improving efficiency and enabling self-correction, yet their reasoning and how to enhance it remain underexplored. We propose a training-free method, Spatio-Temporal token Veto (ST-Veto), leveraging the ability to observe all tokens at each diffusion step. ST-Veto vetoes temporally unstable tokens via second-order Taylor prediction of confidence dynamics and filters weakly grounded tokens using image attention mass, swapping them with safer candidates. Across multiple dMLLMs and multimodal reasoning benchmarks, ST-Veto consistently outperforms standard decoding policies and prior VLM reasoning methods, improving accuracy by up to 9% with no additional training or generation cost. Analyses show that ST-Veto steers generation toward higher-confidence, better-grounded paths, and we will release our code upon publication.

Abstract:
While Large Language Models (LLMs) excel in code generation, they remain prone to replicating subtle yet critical vulnerabilities endemic to their training data. Current alignment techniques, such as Supervised FineTuning (SFT) and Reinforcement Learning (RL), typically apply coarse-grained optimization at the sequence level. This approach often fails to address the localized nature of security flaws, where a single incorrect token choice can compromise an entire program. To bridge this gap, we introduce Tree-like Self-Play (TSP), a framework that reframes secure code generation as a fine-grained sequential decision process. Unlike standard methods that blindly maximize likelihood, TSP constructs a decision tree where the model explores branching trajectories—generating both secure "golden paths" and vulnerable variants. By treating code generation as a self-play game, the model learns to strictly discriminate against its own localized errors. This provides a dense, on-policy learning signal that forces self-correction precisely at the critical decision nodes where vulnerabilities typically emerge. Our experiments demonstrate that TSP fundamentally enhances model reliability. In Python security benchmarks, TSP boosts CodeLlama-7B’s pass rate (SPR@1) to 75.8%, significantly outperforming SFT (57.0%) and unstructured self-play baselines. Crucially, TSP induces robust out-of-distribution generalization: the model not only reduces vulnerabilities in unseen categories (CWEs) by 24.5% but also successfully transfers security principles learned from C/C++ to diverse languages, including Python, Go, and JavaScript. This suggests that TSP does not merely memorize patches, but internalizes abstract, language-agnostic security logic.

Abstract:
Precise alignment is a prerequisite for hyperspectral and multispectral image fusion, yet existing methods struggle with complex nonrigid deformations. Existing techniques either suffer from inter-task error accumulation by treating registration and fusion as disjoint processes or neglect the geometric nature of distortions by relying on isotropic Euclidean metrics. We propose Structure-Aware Riemannian Flow Matching (SA-RFM), a geometry-informed framework for joint registration and fusion of hyperspectral and multispectral images. SA-RFM reformulates registration as dynamic optimal transport on a structure-induced Riemannian manifold, where anisotropic costs are derived from MSI structural cues. To circumvent the complexity of explicit OT solvers, we incorporate this geometry into a conditional flow matching framework via a local cost approximation. This formulation is realized through two synergistic mechanisms: a Riemannian Flow Matching objective that enforces structure-aware error measurement, and an optimal transport direction regularization that aligns the velocity field with the induced metric, thereby resolving the fundamental mismatch between anisotropic costs and conventional Euclidean supervision. Extensive experiments on multiple datasets demonstrate the superiority of our method.

Abstract:
Offline meta reinforcement learning (RL) enables agents to learn a unified policy from multitask offline data to support generalization in out-of-distribution (OOD) tasks. Recent approaches in single-agent RL tackle this by learning an efficient task representation to distinguish between tasks, showing promising adaptation ability. However, when extended to multi-agent settings, these methods struggle with decentralized task identification due to limited global information, and suffer from inefficient knowledge transfer in the absence of role information. To address this, we propose D^2TR, a novel context-based meta RL framework with efficient decentralized and disentangled task-role identification. Specifically, D^2TR first introduces mutual information knowledge distillation to align decentralized task representations with centralized task representations inferred from global trajectories, enabling efficient decentralized team-centric information identification. Next, D^2TR leverages a large language model to assign semantic roles to trajectories in offline data, and achieves effective individual-centric information inference by learning decentralized role representations. Extensive experiments conducted on commonly used multi-agent environments, including CN, SMAC, and SMACv2, demonstrate that D^2TR exhibits strong generalization performance to unseen tasks, outperforming prior multi-agent multi-task and context-based meta RL baselines.

Abstract:
Recent advances in visionlanguage models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Finally, our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) tasks compared with the base counterpart.

Abstract:
Adversarial purification uses generative models to restore clean data distributions from unseen attacks without retraining classifiers. However, unimodal diffusionbased approaches struggle to preserve semantic consistency, while recent multimodal variants rely on computationally expensive adversarial training or distillation. Both approaches often lack theoretical guarantees. In this work, we propose MultiDAP, a novel framework leveraging multimodal diffusion models for efficient adversarial purification. MultiDAP first learns continuous class-agnostic prompts from clean data to capture rich semantic priors, replacing rigid hand-crafted templates. Guided by these prompts, MultiDAP purifies adversarial inputs by minimizing a regularized DDPM loss for only a few steps (e.g., 5-20). We provide theoretical guarantees for both the likelihood improvement via prompt learning and the convergence of the purification process. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K demonstrate that MultiDAP matches the robustness of state-of-the-art baselines but with improved efficiency.

Abstract:
Most spatiotemporal forecasting models assume in-distribution data and can degrade sharply under non-stationary environments. Existing methods for handling distribution shift largely rely on discrete graph inference, making it difficult to disentangle universal dynamics from environment-specific changes and to respect the continuous physical nature of spatio-temporal fields. To this end, we propose STPDE, a general framework that reformulates spatio-temporal dynamics as the evolution of inhomogeneous partial differential equations. STPDE explicitly decomposes dynamics into an Invariant Diffusion Operator that captures universal mechanisms and an Environment Basis Manifold that parameterizes local heterogeneous media. We show that the Green's function of the Laplacian can be effectively approximated by linear attention, enabling global diffusion at scale. Combined with stochastic environment perturbations, STPDE improves robustness under heterogeneous and shifting environments. Extensive experiments on in-distribution forecasting, out-of-distribution generalization, few-shot cross-city transfer, and continual learning demonstrate consistent improvements over state-of-the-art baselines with competitive computational efficiency.

Abstract:
Model merging (MM) has gained significant attention as a costeffective approach to integrate multiple task-specific models into a unified model. However, recent work reveals that MM is highly susceptible to backdoor attacks. Existing defenses based on task arithmetic often fail to eliminate backdoors without substantially degrading clean-task performance, owing to their reliance on direct parameter-space editing. To address this gap, we propose Linear Feature Path Minimization (LFPM), a backdoor mitigation framework for model merging, which introduces an anti-backdoor task vector into the backdoored merged model. Unlike prior approaches, LFPM formulates the backdoor robustness of the merged model from a unified feature-space perspective under the Cross-Task Linearity (CTL) framework, which leverages the approximate linearity of features across tasks. This perspective guides the optimization of the anti-backdoor task to suppress backdoors while preserving clean-task performance. Furthermore, we introduce an effective optimization mechanism based on gradient accumulation and loss path-integral, ensuring robust backdoor suppression along the interpolation path. Extensive experiments demonstrate that LFPM consistently exhibits strong robustness against backdoor attacks in both full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) settings.

Abstract:
Existing VisionLanguage-Action (VLA) models predominantly rely on explicit Chain-of-Thought (CoT) reasoning to bridge perception and action. While effective, this paradigm suffers from high computational costs and error propagation in multi-step tasks. In this paper, we propose Adaptive Variable Alignment VLA (AVA-VLA), a novel Latent Reasoning VLA framework that models reasoning as a sequence of unobservable latent variables, bypassing the need for explicit text generation. However, latent trajectories are inherently susceptible to noise interference and misalignment with downstream objectives. To address this, we introduce a Reinforcement Learning-based Denoising mechanism that treats latent state generation as a sequential decision process, optimizing reasoning trajectories via task-level rewards. Furthermore, we incorporate an Early-Exit Strategy that adaptively terminates reasoning based on state confidence, enabling a dynamic trade-off between depth and efficiency. Extensive experiments on embodied decision benchmarks demonstrate that AVA-VLA significantly reduces inference latency while achieving superior stability and success rates compared to full-reasoning baselines.

Abstract:
A sustainable diet represents a multidimensional synergy among four essential pillars: nutrition adequacy, economic affordability, cultural acceptability, and environmental respect. Despite the prevalence of population-level sustainability modeling, practical implementation relies on effective individual-level adoption. This transition is often hindered by inter-individual heterogeneity, posing a formidable challenge in aligning sustainable diet requirements with individual preferences. To address this issue, we propose a personalized sustainable diet recommendation model based on a constraint-aware decision-making mechanism, where sustainability is incorporated through learnable constraints rather than modeled as user preferences. To systematically evaluate the proposed approach, we construct a sustainable diet dataset named SusDiet with about 150k recipes, characterized by broad coverage of sustainability indicators across four dimensions. Experimental results on this dataset show that our method promotes more sustainable choices without compromising individual preference. This work establishes a framework for aligning individual dietary choices with planetary health, offering quantitative evidence to guide future sustainable diet interventions and policy-making for sustainable development.

Abstract:
Federated learning (FL) faces significant challenges from modality heterogeneity, which motivates multimodal federated learning (MFL) to leverage complementary modalities across decentralized clients for improved performance. However, modality imbalance introduces a new attack surface, making MFL more vulnerable to membership inference attacks (MIAs), an issue that remains largely unexplored. In this work, we present the first systematic study of MIAs against MFL and propose a modalityaware attack framework. We show that multimodal models are inherently more susceptible to MIAs due to heterogeneous modality contributions, and existing attacks are suboptimal as they treat multimodal parameters as a whole. By performing MIAs on individual modalities, we find that (i) attacking the dominant modality achieves comparable accuracy with lower overhead, and (ii) different modalities expose distinct membership patterns. To identify members with different patterns, we propose a modality-aware framework that exploits cross-modal performance gaps to adaptively select attack modalities and calibrate inference results. Experiments on three datasets show our approach outperforms baselines across multiple metrics.

Abstract:
Understanding the internal mechanisms of Deep Neural Networks remains a significant challenge, particularly in elucidating how generic visual concepts emerge within latent spaces. In this work, we propose SAEsBrainMap, a novel framework that utilizes human brain activation patterns from the ventral visual pathway as objective probes to guide the identification of features decomposed by Sparse Autoencoders (SAEs). Our quantitative and qualitative empirical results demonstrate a robust representational alignment between sparse model features and biological Regions of Interest (ROIs), confirming the feasibility of utilizing brain signals to characterize model functionality. By leveraging this alignment, we trace the hierarchical trajectory of generic concepts cross layers and utilize the brain's hierarchical structure to visualize the model's global processing flow, providing novel insights into model interpretability.

Abstract:
In the AI4Chemistry scenario, utilizing heterogeneous data at different fidelity levels is a common and core issue. Highfidelity data is accurate but scarce, while low-fidelity data is abundant but biased. Traditional multi-fidelity methods typically identify cross-fidelity biases based on paired samples under different fidelity labels. However, due to the mismatch in dataset input distribution and the complexity of the biases themselves, these methods are difficult to implement in real-world scientific environments. To address this, we propose a trusted information subset decomposition framework that can efficiently utilize multi-fidelity data without requiring paired samples. Multi-fidelity label supervision is decomposed into three complementary subsets: a trusted information subset based on the absolute value of high-fidelity labels; a trusted subset that captures the reliability of the high-fidelity and medium-fidelity label intervals through adaptive constraints; and an ordered trusted subset representing the numerical relationships within the same fidelity level. These subsets are then integrated into a unified end-to-end model, enabling the reasonable utilization of medium- and low-fidelity information. Extensive experiments on various molecular and material property benchmarks demonstrate that our method consistently outperforms state-of-the-art multifidelity and singlefidelity baseline methods, and exhibits good robustness under real-world unpaired multifidelity conditions.

Abstract:
In this paper, we proposeAlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to ARDiffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without relying on external knowledge injection. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation.

Abstract:
Neural Signed Distance Functions (SDFs) excel at reconstructing watertight manifolds but fail on thin structures and open boundaries due to strict insideoutside constraints. Conversely, Unsigned Distance Fields (UDFs) accommodate general geometries but suffer from gradient singularities at the zero-level set, hindering optimization and extraction. We introduce Metric–-Phase Fields (MPFs), a decoupled implicit representation that separates metric proximity from topological phase. Given an unoriented point cloud, MPFs learn (i) an unsigned metric field r and (ii) a smooth phase field \theta, for which we derive a bounded phase indicator P=\tanh(\beta\theta) that provides soft inside–outside cues where they are meaningful. We couple the two fields via a gated-metric formulation with a residual phase injection to obtain a signed implicit function with stable near-surface gradients. The phase coefficient \beta is learnable, allowing MPFs to adaptively control the sharpness of the phase transition and the degree of saturation of the soft sign indicator. Experiments on both synthetic and scanned thin-shell and thin-plate shapes demonstrate that MPFs preserve thin and layered structures more faithfully than recent SDF-based methods, while also enabling more robust training and more reliable surface extraction than UDF-based approaches. We will make the source code publicly available.

Abstract:
Though Vision Transformers (ViTs) have become the dominant backbone in many computer vision tasks, due to permutation invariance, their attention mechanism lacks explicit spatial inductive biases. This become particularly important in two common settings: when model capacity is small or training data is limited. Inspired by the attention masking strategies in Linear Transformers and the scanning patterns of Vision SSMs, we introduce VIOLIN, a lightweight masked attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs) with less than 0.0002% extra parameters and negligible computational overhead. VIOLIN scans the image using multiple SFCs to construct curvespecific decay masks, which are then combined and multiplied with the attention matrix. Across a wide range of evaluations, VIOLIN consistently improves performance. In limited data regimes such as fine-tuning on VTAB-1K, it boosts accuracy across all task groups and by up to 8.7% on the Structured tasks, where spatial information is essential. It can be combined with parameter-efficient fine-tuning methods such as LoRA to further increase the performance. Beyond fine-tuning, VIOLIN improves various small scale ViT architectures (e.g., DeiT, DINO) during pretraining on ImageNet-1K, achieving gains of up to 0.9%. Additionally, on pixel-level CIFAR-100 training, a task that is highly dependent on location information, VIOLIN increases accuracy by up to 7.2%. Overall, VIOLIN provides a computationally efficient yet effective way to inject spatial inductive bias into ViTs, especially benefiting small models and limited data settings.

Abstract:
Modern neural networks derive much of their effectiveness from rich connectivity patterns. Yet, existing architectures often fix the topology at either the sparse or dense extremes, thereby limiting structural flexibility and analysis. We propose Kronecker Generative Networks (KGNs), an algebraic framework that constructs neural network topologies via recursive generation rules, treating topology as a firstclass design object. KGNs generate families of directed acyclic graphs with controllable connectivity complexity, enabling systematic interpolation between sparse and dense aggregation regimes. Under this formulation, architectures such as FractalNet and DenseNet arise as specific instantiations corresponding to different generation rules. We provide theoretical analysis of acyclicity, connectivity scaling, and expressiveness, and demonstrate experimentally that KGN instantiations achieve favorable accuracy-efficiency trade-offs across multiple domains.

Abstract:
The collective intelligence of Large Language Model (LLM)based Multi-Agent Systems (MAS) is fundamentally governed by the underlying communication graph. However, discovering task-adaptive structures within this combinatorial search space remains a significant challenge. Existing methods, ranging from heuristic pruning to autoregressive generation, often lack a unified theoretical framework to guide the self-organization of agents into efficient teams. In this paper, we bridge non-equilibrium thermodynamics and generative modeling to formalize multi-agent graph generation as an energy minimization process. Specifically, we frame the emergence of efficient collaboration as a thermodynamic "cooling" process, where initially stochastic interactions converge to a low-energy, structured equilibrium. To implement this, We propose MAGE (Multi-Agent Communication Graph Generation), a score-based diffusion framework that constructs communication graphs by navigating the energy landscape via iterative denoising and first-order gradient guidance. Extensive experiments on representative benchmarks demonstrate that MAGE achieves state-of-the-art performance. Furthermore, qualitative analysis reveals that the generated graphs mirroring the functional specialization of human organizations, validating our thermodynamic hypothesis.

Abstract:
The formal integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peerreview workflows introduces novel and significant risks. Their safety against adversarial manipulation remains critically underexplored, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies are overwhelmingly text-only, and the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., "inflate this score") rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer review against these domain-specific, cross-modal attacks. Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that these systems are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.

Abstract:
Humans express uncertainty verbally via markers (e.g., "possible", "likely"), yet most LLM uncertainty quantification (UQ) relies on costing likelihoodor consistency-based signals. From a cognitive perspective, accurate verbal uncertainty reflects metacognitive monitoring, representing knowledge boundaries ("knowing that you don't know") to support regulation and information seeking. In this paper, we investigate: How LLMs diverge from humans in verbal uncertainty quantification? Can verbal markers reliably quantify LLM uncertainty? We curate a corpus of human uncertainty markers from psychology and decision-science literature and benchmark LLMs against it. We observe that LLMs encode verbal uncertainty with numerical levels that differ substantially from those of humans. We then introduce VOCAL, a novel optimization-based algorithm that learns an optimal uncertainty profile over uncertainty markers directly from LLM outputs. By fitting a marker–uncertainty mapping to best explain empirical correctness, VOCALdiscovers how much probability mass each verbal marker should convey, rather than estimating uncertainty via repeated sampling. VOCAL enables a direct, marker-level comparison of confidence semantics between humans and LLMs, disentangling mismatch and revealing systematic confidence disparities in verbal expressions.

Abstract:
Lossy KV cache compression is a wellexplored subfield of machine learning efficiency, with improved latency being one of its major gains. However, lossy compression techniques can fumble from time to time, exhibiting various — and often catastrophic — failure patterns that are not only difficult to resolve but sometimes even hard to identify in the first place, making the direct deployment of models with compressed KV cache a risky endeavor. In this work, we explore a way to preserve lossless generation quality while still benefiting from the acceleration provided by attending only to a compressed KV cache. Specifically, we draw inspiration from the n-gram candidate pool decoding paradigm pioneered by Lookahead Decoding — a largely overlooked and underdeveloped way to achieve efficient yet lossless decoding — where we purposely allow the model to Fumble Around with compressed KV cache to generate multiple lossy ''n-gram guesses'' with just one forward pass, while Find Out via lossless verification in the same forward pass in truly parallel fashion. From a conceptual standpoint, our proposed framework is compatible with all typical static or dynamic KV cache compression methods from the token dropping realm, thus opening up a new avenue for the stagnant n-gram decoding paradigm. Practically, we show that — with careful system support — this framework presents many useful traits that similar draftless baselines (e.g., Self-Speculative Decoding) simply cannot achieve, such as requiring only one set of KV cache and being far less sensitive to model, task, and input-length scenarios. Our comprehensive empirical results show FAFO provides 1.20-2.71x latency speedup over the original model, while consistently outperforming other lossless + draftless solutions by a large margin.

Abstract:
Multimodal LLMs (MLLMs) are increasingly deployed across diverse applications, but they pose significant safety concerns due to crossmodal interactions. To improve model safety awareness, existing methods rely on explicit-risk preference datasets and reinforcement learning guided by safety rewards. While effective in improving models' safety awareness, these methods still face data scarcity and reward hacking in implicit-risk scenarios, leading to insufficient risk perception and harmful responses. To address these challenges, we propose Meerkat-VL, a framework that enables models to perceive and verify implicit risks while generating safe responses. First, we introduce Meerkat-Safe, the first training dataset with detailed labels for implicit risks. Second, we develop Normative Perceptual Self-Verification, which enables models to verify both perceptual reasoning and responses. This process provides denser and more reliable rewards for perception accuracy and answer safety, thereby mitigating reward hacking. Finally, we propose Dual-Objective Perceptual Consistency Alignment, encouraging models to generate safe responses by penalizing answers that follow safe templates without accurate risk perception. Extensive experiments show that Meerkat-VL consistently outperforms baselines on multimodal safety benchmarks, improving safety and helpfulness by 16% and 13%, and achieving a 32% safety gain on implicit-risk tasks. Our codes are availablehere.

Abstract:
Incontext learning (ICL) is highly sensitive to which demonstrations appear in the prompt, but selecting them is expensive because candidate contexts must be validated with repeated LLM calls. We argue that demonstration selection is \empheasier to judge than to find: predicting whether a specific query--context pair (q,D) will succeed is cheaper and more general than searching for an optimal D^\star. Based on this insight, we propose DiSP, a sample-and-judge framework that stratifies queries by difficulty. DiSP runs random demonstration trials to estimate each training query's success rate, trains a lightweight router to predict difficulty from the query, and trains level-specific judges to score sampled contexts. At inference, DiSP performs stop-on-acceptance judging under an explicit budget and typically makes a single LLM call, emitting diagnostic risk tags when no suitable context is found. Across five classification datasets with Llama 3–8B and Qwen 2.5–7B, DiSP achieves the best average accuracy, improving over strong learned selection baselines by up to 3.4%, while achieving up to 23× end-to-end wall-clock speedup.

Abstract:
Recent studies have demonstrated the effectiveness of modularly integrating traditional machine learning methods, such as Support Vector Machines (SVMs), into neural networks for endto-end optimization. However, current approaches mostly rely on static embedding, failing to leverage SVM's geometric properties for dynamic iterative optimization, thereby limiting their generalization potential. To address this, we propose aDifferentiable Deep Support Vector Machine (DDSVM)framework that alternates over three modules: representation learning, boundary optimization, and geometry-aware feature refinement. This is achieved through an iterative pipeline ofboundary construction, feature pushing, loss backpropagation and representation update. After constructing the SVM hyperplane, our method actively pushes feature points along the normal vector to maximize the geometric margin and backpropagates the separation loss into the network. Theoretically, we conduct an in-depth analysis of the underlying optimization principles, elucidating the fundamental mechanism through which the proposed architecture achieves superior performance. We demonstrate how the iterative synergy between geometric refinement and representation learning enhance the generalization, providing formal insights into its effectiveness. Experiments demonstrate significant performance over previous baselines.

Abstract:
Large language models still struggle to reliably answer questions grounded in realworld files like spreadsheets and slides, where evidence is scattered across irregular layouts and heterogeneous formats. We address this by formalizing File Reasoning, a setting where agents must interact directly with unprocessed files (XLSX, PDF, DOCX, PPTX) within a persistent sandbox. To support this, we introduce a unified data pipeline and release a high-difficulty benchmark of over 400 verifiable tasks that preserve native file structure. Furthermore, we propose a reinforcement learning framework grounded in stateful file execution. We train and releaseFIRE(FileInteractiveReasoningExpert), a family of models that learn to optimize long-horizon planning using genuine execution feedback from the environment. Unlike stateless tool-use methods, our approach enables agents to recover from errors and adapt to structural ambiguities. Empirical results show that Qwen3-32B-FIRE achieves the strongest performance among open-source models under identical execution constraints.

Abstract:
Forecasting multivariate hidden Markov processes is challenging due to nonlinear and nonstationary observations, latent state transitions, and crosssequence dependencies. While deep learning methods achieve strong predictive accuracy, they typically lack explicit state modeling, whereas Hidden Markov Models (HMMs) provide interpretable latent states but struggle with complex nonlinear emissions and scalability. To address these limitations, we propose DRL-STAF, a Deep Reinforcement Learning based STate-Aware Forecasting framework that jointly predicts next-step observations and estimates the corresponding hidden states for complex multivariate hidden Markov processes. Specifically, DRL-STAF models complex nonlinear emissions using deep neural networks and estimates hidden state transitions via reinforcement learning, avoiding predefined transition structures and enabling flexible adaptation to diverse and high-order dynamics. In particular, DRL-STAF remains effective when typical HMM-based methods suffer from state-space explosion. Extensive experiments demonstrate that DRL-STAF consistently outperforms HMM variants, standalone deep learning models, and existing DL–HMM hybrids in both forecasting accuracy and hidden state estimation.

Abstract:
Reinforcement learning (RL) has achieved remarkable success across complex decisionmaking tasks, especially with the advent of deep neural networks. However, the resulting models are often opaque, making their deployment in safety-critical domains challenging. Explainable AI aims to address this issue, but most specific efforts for deep RL remain limited either to post-hoc explanation methods or to imitation learning and distillation procedures. These latter approaches rely on pre-trained black-box agents and are typically restricted to environments with discrete action spaces, limiting their scalability and interpretability. In this paper, we introduce ProtoSAC, a novel deep RL architecture that integrates a prototype-based actor into the Soft Actor-Critic (SAC) algorithm, enabling intrinsic interpretability in continuous action spaces. Our method learns a set of prototypes that represent interpretable state clusters, each associated with a Gaussian action distribution. Actions are generated as a similarity-weighted mixture over these prototypes, providing transparent decision-making without sacrificing performance. We evaluate ProtoSAC on continuous action-space environments and show that it matches the performance of the original SAC while offering enhanced interpretability.

Abstract:
While neural PDE solvers have demonstrated significant potential for accelerating engineering simulations, existing architectures remain constrained by high memory consumption and the "singlenode bottleneck," where the maximum processable mesh resolution is strictly limited by the VRAM of a single compute unit. To address these challenges, we propose PGD-NO, a neural operator with Precomputed Geometry Decomposition, that relocates the computational overhead of geometric encoding to a deterministic pre-computation phase. By utilizing an iterative geometry decomposition algorithm to extract "geometry tokens," our model decouples feature extraction from solution querying. This architecture enables linear memory scalability, allowing high-fidelity learning on meshes exceeding 10 million nodes—a scale where existing architectures typically encounter memory exhaustion. PGD-NO demonstrates competitive predictive accuracy across diverse industrial benchmarks and provides intrinsic interpretability through attention mechanisms. By effectively overcoming traditional mesh-size constraints, PGD-NO offers a robust and efficient solution for the next generation of large-scale, high-fidelity industrial design applications.

Abstract:
Pass@k and other methods of scaling inference compute can improve language model performance in domains with external verifiers, including mathematics and code, where incorrect candidates can be filtered reliably. This raises a natural question: can we similarly scale compute to elicit gains in truthfulness for domains without convenient verification? We show that across five benchmarks and models, surprisingly, it cannot. Even at 25× the inference cost of naive sampling, pollingstyle aggregation yields no consistent accuracy gains over single-sample baselines and often amplifies shared misconceptions. We find that under uncertainty, models are better at predicting what other models will say within model ensembles than at identifying what is true, revealing a separation between social prediction and truth verification. Across models and benchmarks, aggregation fails to provide a robust truth signal because language model errors are strongly correlated. The source of correlation goes beyond any individual benchmark: we show that even when conditioned on out of distribution random strings and asked to produce pseudo-random outputs, different models produce correlated outputs. Confidence-based weighting provides no benefit because self-reported confidence fails to reliably distinguish correct from incorrect answers. These results delineate a boundary for inference-time scaling: in verified domains, additional samples provide more candidates for a verifier to filter; in unverified domains, additional samples merely reinforce shared misconceptions.

Abstract:
Document parsing, the task of extracting diverse content from PDFs while preserving their structural integrity, has been significantly advanced by Multimodal Large Language Models (MLLMs). These models have achieved remarkable success, largely driven by extensive posttraining on massive datasets. This paper therefore undertakes a deep analysis of the two dominant adaptation strategies, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), prompted by a puzzling observation on the PDF-to-Markdown task: SFT makes a negligible impact, especially on parsing complex tables and formulas, while RL achieves substantial overall gains. To unravel the reasons, our systematic investigation reveals a clear and complementary division of labor: SFT primarily operates as a structure learner, biased towards mastering the low-entropy syntax of document layouts. While it learns the format of a table, it struggles to ensure the fidelity of its high-entropy cell content. Conversely, RL excels as a content refiner by optimizing a holistic reward that reflects final accuracy. We further ground this phenomenon in the distinct theoretical nature of their respective objective functions. Based on these findings, we introduce a unified strategy that explicitly harnesses their individual strengths while mitigating their weaknesses. This work shows that a deep understanding of post-training methods is key to unlocking performance beyond what data scaling alone can achieve.

Abstract:
In the agentic web era, LLMbased agents increasingly invoke web services as tools, yet most interfaces are still exposed as \emphstatic endpoints. As tasks grow into long-horizon workflows with loops, conditionals, joins, and retries, agents must externalize control flow into stepwise calls and multi-round reasoning, causing excessive network turns, over-/under-fetching, and brittle recovery under partial failures. We argue that the bottleneck is representational: endpoint sequences are a poor interface for expressing tool intent. We present ToolPro, which represents an agent’s tool intent as an \emphexecutable tool program---a composable program that compactly encodes multi-step service interactions with explicit effect types. ToolPro combines (i) constraint-guided program construction with compiler/runtime feedback to improve executability of LLM-produced code, (ii) effect-aware replay that provides exactly-once semantics for state-modifying calls across iterative repair and re-execution, and (iii) a profile-driven consolidation policy that decides when program execution outperforms stepwise calling. We instantiate ToolPro over MCP-style services with WebAssembly sandboxing. Across three real applications and six workflows, ToolPro reduces end-to-end latency by up to 37.2% and client-side traffic by up to 85.3%, with larger gains under higher network latency and increased workflow complexity. We believe that ToolPro lays a foundation for agent-facing service interfaces in the emerging agentic web.

Abstract:
Many optimizers can be interpreted as steepestdescent methods under norm-induced geometries, and thus inherit corresponding implicit biases. We introduce Orthogonal Lion which combines spectral control from orthogonalized update directions with \ell_\infty-style coordinate control from sign updates. OLion forms a Lion-style momentum direction, approximately orthogonalizes it via a few Newton--Schulz iterations, and then applies an entrywise sign, providing an efficient approximation to taking a maximal step over the intersection of the spectral and \ell_\infty constraint sets (a scaled Hadamard-like set for matrix parameters). Despite the strong nonlinearity of orthogonalization and sign, we prove convergence under a mild, empirically verified diagonal-isotropy assumption. Across large-scale language and vision training, including GPT-2 and Llama pretraining, SiT image pretraining, and supervised fine-tuning, OLion matches or outperforms AdamW and Muon under comparable tuning while using only momentum-level optimizer state, and it mitigates optimizer mismatch when fine-tuning AdamW-pretrained checkpoints.

Abstract:
Large Language Models (LLMs) demonstrate remarkable potential in roleplaying tasks but frequently suffer from personality decay—termed "Out-of-Character" (OOC) behavior—during prolonged interactions. While heuristic strategies exist to align model behaviors, the internal computational dynamics driving personality expression remain opaque. A fundamental barrier to decoding these mechanisms is ametric gap: while standard causal attribution paradigms target atomic, single-token outcomes, personality manifests as a holistic, multi-token behavioral tendency. We bridge this gap via theLatent Persona Vector, a differentiable proxy enabling the first fine-grained causal tracing of personality circuits. This metric reveals a structured "Preparation-Establishment-Expression" dynamic and identifies the mechanistic root of OOC behavior not as knowledge erasure, but asgeneric prior dominance. Specifically, we find that intrinsic assistant priors suppress emergent persona intents during the critical "Establishment" phase. Guided by this diagnosis, we propose surgically recalibrating the signal magnitude in fewer than 5% of attention heads. This targeted intervention effectively counteracts prior suppression, significantly restoring character consistency while preserving general reasoning capabilities.

Abstract:
Continuous ECG monitoring is clinically valuable, but scaling it beyond electrodes to comfortable longterm use motivates contactless mmWave sensing. In practice, mmWave-to-ECG reconstruction is severely constrained by the scarcity of high-quality synchronized recordings and poor cross-subject generalization. To overcome these bottlenecks, we propose Cardio-mmFlow, a Gaussian-prior-free physics-informed flow matching framework that synthesizes realistic mmWave radar signals from abundant clinical ECG corpora. It learns a direct transport trajectory between the latent manifolds of ECG and radar. To capture subject-dependent propagation differences, we incorporate a simplified mass--spring--damper inspired physical prior and inject it into the flow dynamics via feature-wise linear modulation for personalization. Extensive experiments show that our system have generate high fidelity radar data in both signal and latent domains. It significantly improve zero-shot downstream mmWave\rightarrowECG task, and enable Atrial Fibrillation classification with synthetic data. Further analysis evaluate the model interpretability.

Abstract:
As large language models continue to scale, finegrained, block-scaled low-precision formats such as NVFP4 and MXFP4 are increasingly adopted for their substantial throughput and memory benefits. In this regime, floating-point and integer quantizers exhibit complementary strengths in matching block-level data distributions. However, tensor-core–accelerated matrix multiplications typically require all operands—weights and activations in the forward pass, and weights, activations, and gradients in the backward pass—to share a single quantization format, which can destabilize training and degrade inference quality. To address this limitation, we introduce MixFP4, a tensor-core–co-designed quantization scheme that evaluates two candidate scale factors for each block (corresponding to FP- and INT-style quantization behaviors) and selects the one that minimizes quantization error, thereby combining the benefits of both representations while preserving efficient GEMM execution.

Abstract:
Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides amechanistic explanationfor this phenomenon, tracing its roots to the value aggregation process inherent in selfattention, which induces a systematic dimension-wise variance discrepancy. We demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. We validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions can replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. As a proof of concept, we proposehead-wise RMSNorm, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.

Abstract:
Recent medical MLLMs have made significant progress in generating step by step textual reasoning chains. However, they still struggle with complex clinical tasks that necessitate dynamic and iterative focusing on finegrained visual regions. To close this gap, we introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when fine-grained visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought for precise segmentation and diagnosis. Ophiuchus moves beyond mere tool-calling by tightly fusing the MLLM’s inherent grounding and reasoning capabilities with external tools, enabling more accurate and trustworthy decisions. The core of our method is a three-stage training strategy: cold-start SFT for basic tool selection; self-reflection fine-tuning to strengthen decision revision; and agentic tool reinforcement learning to elicit sophisticated, expert-like diagnostic behaviors. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation.

Abstract:
Sparsereward reinforcement learning typically focuses on exploration, but we ask: can structural assumptions about reward functions themselves accelerate learning? We introduce Policy-Aware Matrix Completion (PAMC), which exploits low-rank structure in reward matrices while correcting for policy-induced sampling bias. PAMC combines three key components: a low-rank plus sparse reward model, inverse propensity weighting to handle Missing-Not-At-Random (MNAR) data, and confidence-gated abstention that falls back to intrinsic exploration when uncertain. We provide finite-sample theory showing that completion error scales as O(\sigma\sqrtr(|\mathcalS|+|\mathcalA|)/\textESS) where ESS is the effective sample size under policy overlap \kappa. PAMC achieves strong empirical results at 10M steps (a sample-efficiency comparison): 4100\pm250 return vs. 200\pm50 for DrQ-v2 on Montezuma's Revenge, 78% vs. 65% success rate on MetaWorld-50, and 15% improvement over CQL on D4RL datasets. The method maintains 8% computational overhead while providing calibrated confidence intervals (95% empirical coverage). When structural assumptions are violated, PAMC gracefully degrades through increased abstention rather than catastrophic failure. Our approach demonstrates that reward structure exploitation can complement traditional exploration methods in sparse-reward domains.

Abstract:
Instruction tuning (IT) is a central mechanism for aligning large language models (LLMs) with user intent. In practice, randomly shuffling the training set is a simple yet surprisingly strong baseline. However, it overlooks latent structure, such as domain and reasoning depth, and thus interleaves heterogeneous objectives, which can induce gradient conflicts and diminish effective optimization progress. To this end, we propose EPOrder, an embedding-proximity-based data-ordering paradigm for IT of LLMs. Unlike previous paradigms that derive order from per-example scores, EP-Order explicitly accounts for inter-sample correlations by operating in representation space. EP-Order trains a warm-up model on a small subset of data (e.g., 10%), embeds all training samples for clustering, and ranks these clusters according to embedding proximity. To handle sharp gradient changes at cluster transitions and alleviate catastrophic forgetting under cluster-based training, we introduce mixed regions that interleave samples from the previous, current, and next clusters, thereby stabilizing learning. We evaluate EP-Order on seven popular multimodal LLM benchmarks and demonstrate that it is both more effective and more efficient than competing data ordering paradigms. We expand the application of EP-Order to a hybrid thinking text-only scenario, where think and no-think samples induce substantial optimization conflict, and evaluate with three benchmarks. EP-Order obtains nearly consistent improvements over random mixing. These results highlight embedding-proximity-based ordering as a promising direction for complex, high-conflict training data in IT.

Abstract:
Monolithic agents in deep search often suffer from "cognitive overload," while existing multiagent approaches mostly rely on frozen models that cannot learn from collaboration failures. To bridge this gap, we propose DECOR (DEcompose and COllaborate via Role-specialized agents), a framework formulating deep search as a Multi-Agent Reinforcement Learning (MARL) problem. DECOR functionally decomposes the task into three specialized roles: a Planner to navigate, a Filter to curate a noise-reduced memory, and an Answerer for synthesis. Unlike training-free orchestration, we jointly optimize these agents using a hybrid reward strategy that harmonizes role-specific intrinsic feedback with team-level outcome signals. Experiments on seven benchmarks show that DECOR significantly outperforms strong monolithic baselines, demonstrating the necessity of learning-based functional decomposition in handling cognitive overload.

Abstract:
Semisupervised learning has become a dominant paradigm for reducing annotation costs. However, we argue that the current progress is clouded by a dual problem of overconfidence. Algorithmically, prevailing pseudo-labeling frameworks often conflate prediction confidence with uncertainty, leading to severe confirmation bias and poorly calibrated models. Methodologically, since multiple benchmark datasets lack validation sets, some studies repurpose test sets as validation sets, leading to inflated results. Subsequent methods, compelled to employ the same strategy to surpass reported SOTA, trigger an arms race of overfitting. Exciting numerical gains may reflect test overfitting rather than genuine progress. Thus, we propose TCSeg, a tri-space calibrated segmentation framework founded on a principled dual-axis reliability assessment engine. It explicitly decouples confidence from uncertainty and uses this signal to detect and correct confirmation bias across feature, probability, and image spaces in a collaborative manner. Across three benchmarks, TCSeg delivers consistently strong performance at both the best and final checkpoints, and we report results under a multi-run protocol to reset the benchmark with a more realistic perspective. Core code is available: https://github.com/BubbleDirk/temporaryanonymoustcseg.

Abstract:
Continual learning (CL) seeks models that acquire new knowledge while avoiding catastrophic forgetting. However, many methods that mitigate forgetting constrain parameter updates and thereby reduce model plasticity. We revisit the singular value spectrum of gradients in representative CL methods and show that they commonly exhibit singular value collapse, where only a small subset of gradient directions drive parameter updates. Motivated by this observation, we propose Plasticity Activation via Polar Operator (PAPO), a plugin that preserves the dominant directions that mitigate forgetting while activating previously suppressed directions to enhance plasticity. Concretely, PAPO modifies the gradient \mathbfG as \mathbfG\leftarrow \mathbfG+\lambda \cdot \operatornamepolar(\mathbfG), which uniformly increases near-zero singular values without changing the singular vectors. To avoid the cost of explicit singular value decomposition, we approximate the polar factor using the iteration-dependent Polar Express scheme, which relies only on matrix multiplications and additions. In our empirical evaluation on both vision and language benchmarks, incorporating PAPO yields consistent improvements. In particular, on MiniImageNet, integrating PAPO into ER, MAS, GPM and TRGP produces substantial accuracy gains of 9.01%, 4.76%, 8.90% and 9.19%, respectively.

Abstract:
Neural network parameter spaces are inherently noninjective, as distinct parameter configurations can realize identical functions through functional equivalence. While this symmetry is well understood in classical fully connected and convolutional models, it becomes substantially more intricate in modern attention-based architectures. Existing analyses of multihead attention have largely focused on the vanilla formulation, overlooking positional encodings that fundamentally reshape architectural symmetries. In this work, we provide a formal study of functional equivalence in Transformers with positional encodings. Focusing on the two most widely used variants--sinusoidal and rotary positional encodings (RoPE)--we show that sinusoidal encodings preserve the equivalence structure of vanilla attention, whereas rotary encodings significantly reduce the symmetry group, thereby enhancing expressivity. This offers a principled explanation for the growing prominence of RoPE in practice. We further examine how positional encodings affect linear mode connectivity, and through an alignment algorithm, empirically demonstrate that the presence and variability of connectivity across Transformer settings crucially depend on the positional encoding.

Abstract:
Recent studies have shown that semantic watermarks, which embed information into the initial noise of latent diffusion models (LDMs), are vulnerable to blackbox forgery attacks. However, existing methods primarily rely on empirical evidence and lack a rigorous theoretical understanding of the conditions under which such attacks succeed or fail. To bridge this gap, we rethink the nature of such attacks through the lens of rate-distortion in the latent space. Our analysis identifies an irreducible distortion floor due to structural mismatches between proxy and target models, which fundamentally limits the fidelity of forged watermarks. We further characterize this distortion as structured geometric deviations on the latent manifold, in the form of global drift and local deformation rather than stochastic noise. Leveraging these insights, we propose a scheme-agnostic detection method that distinguishes forged samples before watermark verification. Extensive experiments demonstrate the effectiveness of our method across diverse black-box scenarios, while preserving robustness to common distortions.

Abstract:
Realworld perception and decision making are inherently multimodal, integrating complementary signals across modalities. However, training multimodal models faces two main obstacles. First, collecting large-scale, well-aligned paired multimodal datasets is often impractical, making end-to-end multimodal training difficult. Second, existing multimodal representations frequently entangle information shared across modalities with modality-specific information, hindering interpretability and control. We introduce MultiLoReFT, an efficient and scalable low-rank representation fine-tuning framework for multimodal learning with pretrained unimodal models. MultiLoReFT extends low-rank adaptation to the multimodal setting and learns interpretable projection subspaces that decouple shared and modality-specific information. Across simulated and real-world benchmarks, it produces representations that support multimodal prediction while explicitly revealing how shared and modality-specific information are distributed across modalities.

Abstract:
The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning(DP) methods which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distribution. In this work, we present a unified graphbased DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild and interpretable conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments demonstrate that the proposed method outperforms existing pruning methods while substantially reducing training cost. On ImageNet-1k with ResNet-50, our method reduces training time by over 40% without sacrificing accuracy.

Abstract:
Tokenlevel credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces \emphanswer-targeted reasoning flow on an attention-induced directed acyclic graph in which nodes correspond to tokens and edge capacities come from aggregated attention weights and derives token credit from this global structure. The edge capacities are reweighted to retain only the influence that can reach the answer region, while enforcing local flow conservation so intermediate tokens neither lose nor gain effective mass due to path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. These derived importances are used to shape token-level rewards, enabling learning signals to focus precisely on the tokens that route information toward (or away from) correct answers and delivering consistent performance gains across a range of reasoning tasks.

Abstract:
Graph Contrastive Learning (GCL), which trains graph encoders by maximizing similarity between positive samples and minimizing it between negative ones, has emerged as a mainstream graph pretraining paradigm. It is widely recognized that positive samples are essential in GCLs. Ideally, maximizing the similarity of positive samples enables graph encoders to capture intrinsic semantic and patterns of graph data. However, we discover an interesting phenomenon: GCLs can achieve competitive performance even without positive samples. This motivates us to revisit the fundamental mechanism of positive samples in GCLs. From the perspective of Dirichlet energy, we theoretically finds that message passing, a key mechanism in graph encoders, trivializes the maximization of positive samples, preventing GCLs from effectively learning from positive samples. To address this, we propose SPGCL to mitigate the trivialization caused by message passing and restore the learning efficacy of positive samples. Specifically, we find that high Dirichlet energy features help positive samples provide effective learning signals while low Dirichlet energy features contribute little to positive learning signal but is useful for positive sampling. Based on this, SPGCL propagates only high Dirichlet energy features and uses low energy features to construct a probability matrix for reliable positive sampling. Extensive experiments demonstrate the effectiveness of SPGCL.

Abstract:
Despite rapid progress, multimodal reasoning still lacks a systematic approach to synthesize largescale vision-centric datasets beyond visual math. We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including: reasoning traces, preference data, and instruction prompts supporting SFT, offline and online RL. Our vision-centric synthesis framework uses a two-stage process focusing on: (1) generating diverse verifiable questions from existing images at scale, and (2) creating complex compositional visual problems by merging simpler questions. Remarkably, finetuning Qwen2.5-VL-7B on our data outperforms existing open-data baselines across evaluated vision-centric benchmarks, and our best configurations match or surpass strong closed-data models such as MiMo-VL-7B-RL on VBench, CV-Bench and MMStar-V. Notably, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro, +3.7%) and audio reasoning (MMAU, +1.32%), demonstrating its effectiveness. Similarly, despite containing no embodied visual data, we observe notable gains (NiEH, +8.8%) when evaluating open-ended embodied QA. Lastly, we use our data to comprehensively analyze at scale (1M+) the entire VLM post-training pipeline showing that (i) SFT on high-quality data with cognitive behaviours on reasoning traces is essential to scale online RL, (ii) offline RL could match online RL’s performance while disaggregating compute demands, and, (iii) SFT on high quality data also improve out-of-domain, cross-modality transfer.

Abstract:
Vector Quantization (VQ) has been widely used in visual and audio representation due to its effectiveness in compressing highdimensional signals. However, existing VQ methods often rely on large and unstructured codebooks, which leads to inefficient code utilization and frequent codebook collapse. In this paper, we proposeIChingVector Quantization (IVQ), a lightweight and structured vector quantization framework inspired byIChing. IVQ introduces binary hierarchical composition and geometric symmetry relations into the codebook design, enabling a compact set of structured codes to represent a large number of configurations while maintaining high utilization without codebook collapse. We conduct systematic comparisons between IVQ and several VQ variants mainly focusing on audio representation. Experimental results show that IVQ achieves superior quality with significantly smaller codebooks and consistently higher utilization rates. Auxiliary experiments on visual reconstruction and cross-modal alignment further validate the universality and robustness of our structured representation.

Abstract:
Generalized Category Discovery (GCD) assigns unlabeled instances, mixed with labeled data, to known or novel categories, requiring humanlike compositional reasoning: reusing primitives learned from known classes and deciding when new combinations imply new categories. Existing GCD methods operate on unstructured token features and struggle to extrapolate to novel compositions. We propose CoGe-GCD, which rethinks GCD through compositional generalization with two coupled stages. (i) Compositional Perception structures patch tokens by mapping them to a small vocabulary of primitives and refining token embeddings via competitive token-primitive assignment and information passing, yielding coherent groups for discovery. (ii) Generalizing Induction exploits the induced geometric structure and applies a geometric-structure-preserving calibration over spatial relations, maintaining probabilistic semantics while improving extrapolation to unseen primitive combinations. CoGe-GCD is implemented as an inductive-bias module between backbone and projection head, without modifying heads or losses, and can be plugged into diverse GCD frameworks. On standard benchmarks, it consistently improves all-class accuracy, unknown-class number estimation, and geometric quality, with marginal computational overhead.

Abstract:
Accurately predicting the future trajectories of traffic participants is critical for safe, efficient, and humanfriendly autonomous driving. Existing learning-based trajectory prediction methods are predominantly time-domain and insufficiently exploit latent frequency information, which limits their capability to capture low-frequency long-term dependencies and high-frequency short-term dynamics. To address this, we propose TF-FACE, a Time-Frequency learning framework with Frequency-domain Adaptive and Controllable Enhancement. TF-FACE introduces a fusion encoder with learnable gated frequency-domain attention that adaptively manipulates band-specific features for trajectory prediction. Building on the fused representation, we design a dual-stage decoder and a band-specific time–frequency dual-consistency loss to enable controllable decoupling and coupling across long- and short-term temporal scales, global and local scales, and then generate final multimodal predictions. Experiments on Argoverse 1 demonstrate that TF-FACE achieves state-of-the-art accuracy, while maintaining real-time inference for autonomous driving. Additional experiments are conducted on Argoverse 2, further validating the TF-FACE's performance and generalizability.

Abstract:
Recent advances in video diffusion models have demonstrated remarkable generative capability, yet adapting these large pretrained textto-video (T2V) models to video super‑resolution (VSR) typically encounters challenges, such as artifacts introduced by complex degradations in real-world scenarios and compromised fidelity due to the strong generative capacity of the powerful T2V models. We present WEVSR, a novel approach that adapts a pretrained flow-matching video diffusion transformer to RealVSR. First, we design a task-oriented adaptation strategy that leverages timestep sampling and noise augmentation to enhance detail restoration while preserving structural stability. Second, we propose a lightweight multi-level discrete wavelet transform (DWT) front-end for the VAE encoder, injecting explicit frequency priors into the latent space without modifying the pretrained decoder. Extensive experiments across multiple RealVSR benchmarks demonstrate that WEVSR achieves state-of-the-art performance against existing approaches.

Abstract:
Locating files and functions requiring modification in large software repositories is challenging due to their scale and structural complexity. Existing LLMbased methods typically treat this as a repository-level retrieval task and rely on multiple auxiliary tools, which often overlook code execution logic and complicate model control. We propose RepoNavigator, an LLM agent equipped with a single execution-aware tool: jumping to the definition of an invoked symbol. This unified design reflects the actual flow of code execution while simplifying tool manipulation. RepoNavigator is trained end-to-end via Reinforcement Learning (RL) directly from a base pretrained model, without relying on closed-source distillation. Experiments demonstrate that RL-trained RepoNavigator achieves state-of-the-art performance, with the 7B model outperforming 14B baselines, the 14B model surpassing 32B competitors, and the 32B model exceeding closed-source models such as GPT-5 on most metrics. These results confirm that integrating a single, structurally grounded tool with RL training provides an efficient and scalable solution for repository-level issue localization.

Abstract:
In recent years, significant advancements in large language models have greatly propelled the development of Textto-SQL tasks. However, due to the token-by-token sequential generation mechanism employed by these models, they encounter a semantic blind spot problem with respect to pending SQL components—the parts of the SQL query yet to be generated. Specifically, language models are unable to effectively utilize the semantic information of these pending SQL components during the generation of the final SQL query, which poses considerable challenges for generating complex SQL statements. To address this issue, we propose a novel thought process based on SQL components pre-generation and design a maximum connected subtree matching reward mechanism leveraging the SQL abstract syntax tree to improve the accuracy of local component generation. Extensive experiments demonstrate that, under comparable model parameter scales, our training approach achieves significant advantages, effectively enhancing the generation of complex SQL queries. Our method attains an execution accuracy EX of 65.78% on the BIRD-dev dataset and achieves state-of-the-art performance on the Spider-syn datasets.

Abstract:
Multimodal large language models (MLLMs) are playing an increasingly important role across multiple domains. In many applications, such as robotics, it is crucial to quantify the uncertainty in the output of these models. We develop Fused Uncertainty with Semantic Evidence (FUSE), a probabilistic framework for capturing two complementary sources of uncertainty in multimodal language modeling: (i) aleatoric embeddinglevel uncertainty derived from input data vision-language ambiguity, and (ii) epistemic model-level uncertainty estimated from the semantic response diversity of MLLMs. Our approach formulates a Bayesian fusion mechanism that analytically combines these uncertainty sources to produce a scalar measure of uncertainty. This measure serves as a novel uncertainty representation for downstream applications of MLLMs and provides a principled foundation for uncertainty calibration in multimodal systems, improving reliability and downstream performance in MLLM-based reasoning and vision-language tasks. We demonstrate that our method outperforms baselines in providing uncertainty estimates and achieves state-of-the-art uncertainty calibration.

Abstract:
Recently, the spiking neural networks (SNNs) have shown great promise in enhancing AI task performance by utilizing the braininspired and energy-efficient computational paradigm via the binary (0/1) spikes. Modern SNNs, especially those based on transformers, often require FPGA accelerators or neuromorphic chips (e.g., Intel Loihi) to enable spike-driven computations. However, this domain-specific hardware is not always accessible on commodity edge devices like NVIDIA Jetsons, which may degrade SNNs' energy efficiency due to massive computational waste on inactive "0" spikes and finally undermine the usage boundary. This limitation raises an interesting question: is it possible to make SNNs edge-friendly and tame the computations mostly on active "1" spikes? In this paper, we present the answer yes and propose Spik4lite, which serves as a lightweight plug-and-play module to significantly improve SNN's performance between model accuracy and computational efficiency. The key is to refactor SNN's channel-wise neuromorphic sparsity by zeroing out low-efficiency channels while proactively compensating for the eliminated spikes. Different from prior methods mainly focusing on optimizing the theoretical synaptic operations, our design philosophy can evolve the SNNs into a physically compact manner, thus inherently saving more computational and energy costs. Extensive experiments based on real edge devices show that Spik4lite can be integrated into existing SNN baselines to further improve their accuracy-and-efficiency performance, guaranteeing the model accuracy while saving the computational and energy costs.

Abstract:
Textto-image (T2I) diffusion models (DMs) have achieved remarkable generative quality but still exhibit the risk to produce biased and inappropriate images. A promising line of prior work aims to mitigate this issue by learning interpretable and linearly controllable concepts from semantic spaces, such as the U-Net bottleneck. However, these methods rely entirely on the bottleneck layer in U-Net and therefore cannot be directly applied to modern ViT-based DMs, including FLUX and PixArt. In this work, we present a model agnostic framework for discovering interpretable and linearly controllable semantic attributes across any T2I DMs backbone. We first show that multi-modal attention heads in ViT-based DMs encode interpretable and (near) linear semantic structures similar to those in the U-Net bottleneck. Based on this insight, we introduce a method that learns external concept vectors, which are added to the multi-modal attention heads for ViT-based DMs or to the bottleneck layer for U-Net-based DMs, while keeping the pretrained model frozen. Experiments across SDXL, SD3.5, PixArt, and FLUX demonstrate that these concept vectors provide interpretability, linearity, and highly improved fairness while preserving visual fidelity. The code is included in the supplementary material.

Abstract:
Graph neural networks (GNNs) have been demonstrated to be powerful tools for analyzing structural graph data. However, most existing methods usually rely on fixed adjacency structures for information propagation, lacking strong adaptability to the latent semantic relationships that exist but are not explicitly connected in graph, especially in complementary highpass and low-pass filtering views. To this end, this paper proposes a novel Dual-channel Dynamic Graph Neural Network (DCD-GNN), mainly consisting of parallel representation learning channels: a static structure-preserving channel and a dynamic adjacency-enhancing channel. The dynamic channel exploits both low-pass structural filtering and high-pass personalized detail via self-attention adjacency learning and then integrates them for comprehensive semantic modeling, while the static channel maintains structural stability. Both channels employ a multi-scale representation fusion mechanism and are finally fused into a unified and discriminative node embedding representation. Extensive experiments on various graph benchmark datasets verify the superiority of DCD-GNN in discriminative graph representation learning.

Abstract:
ImportanceWeighted Neural Processes (IWNPs) provide a principled framework for probabilistic meta-learning by using multi-particle latent representations to approximate the marginal log-likelihood of task data tightly. However, this work reveals that the standard optimization of IWNPs suffers from the Matthew effect in the latent space, where high-likelihood particles dominate gradient signals. The neglect of lower-likelihood regions leads to poor tail-risk generation and unstable fast adaptation. While robust objectives such as \textCVaR_\alpha can mitigate these risks, they often entail a trade-off that degrades average-case performance. This work proposes \underlineOrder-\underlineStatistics Aligned \underlineNeural \underlineProcesses (OS-NPs) to achieve latent space robust optimization without sacrificing average result. Specifically, we stratify multiple inference particles into disjoint difficulty bins based on order statistics and derive the regularized worst-case optimization framework for OS-NPs. Our method aligns the reduction of stratified order-statistic losses in IWNPs and provides a computationally efficient pipeline to implement. Extensive experiments demonstrate that the OS-NP constitutes stable, reliable probabilistic meta-learning that significantly enhances tail-risk robustness while maintaining or even improving average performance.

Abstract:
Machine learning fairness research documents tradeoffs but lacks quantitative frameworks to measure intrinsic metric compatibility without requiring causal graphs. We introduce a game-theoretic framework that decomposes metrics into interaction vectors, enabling compatibility measurement between metrics via cosine similarity and mechanistic attribution to attribute coalitions. Through analysis of 6 datasets, 7 models, and 6 debiasing methods, we reveal that fairness and utility are often structurally orthogonal (median compatibility \approx 0) rather than diametrically opposed, with conflicts driven by sparse, low-order interactions. We further show that debiasing improves fairness by compressing the compatibility space—reducing compatibility of both synergistic and conflicting relationships—rather than eliminating conflicts, providing a mechanistic basis for understanding metric alignment.

Abstract:
Existing genomic foundation models (GFMs) typically treat DNA as raw nucleotide sequences, often overlooking the regulatory context required to interpret genetic variation accurately. We introduce BioToken, a tokenization framework that directly encodes variants and biological annotations into genomic representations, and BioFM, a parameterefficient model built on this architecture. By leveraging biological inductive biases, BioFM outperforms state-of-the-art models and specialized baselines like Enformer on benchmarks including pathogenicity and expression prediction while requiring 100-fold less compute than current large-scale genomic models. These findings demonstrate that explicitly modeling biological structure yields more robust and efficient genomic representations than scaling alone.

Abstract:
To forecast traffic with both spatial and temporal dimensions, we unroll a mixedgraph-based optimization algorithm into a lightweight and interpretable transformer-like neural net. Specifically, we construct two graphs: an undirected graph \mathcalG^u capturing spatial correlations across geography, and a directed graph \mathcalG^d capturing sequential relationships over time. We formulate a prediction problem for the future samples of signal \mathbfx, assuming it is ``smooth'' with respect to both \mathcalG^u and \mathbfG^d, where we design new \ell_2 and \ell_1-norm variational terms to quantify and promote signal smoothness (low-frequency reconstruction) on a directed graph. We construct an iterative algorithm based on alternating direction method of multipliers (ADMM), and unroll it into a feed-forward network for data-driven parameter learning. We periodically insert graph learning modules for \mathcalG^u and \mathcalG^d, which are akin to the self-attention mechanism in classical transformers. Experiments show that our unrolled networks achieve competitive traffic forecast performance as state-of-the-art prediction schemes, while reducing parameter counts drastically.

Abstract:
Online learning for the Cox model is challenging because its partial likelihood is nondecomposable, with each risk set requiring a summation over all samples. We propose Collaborative Operation of Linked Survival Analysis (COLSA), an online learning framework that replaces the partial likelihood with the full likelihood using sieve approximation of the baseline hazard. The proposed likelihood function is decomposable and eliminates the need to store historical data in memory, enabling efficient online updates. Moreover, COLSA maintains sufficient statistics for a higher-order basis and employs data-driven basis projection to adaptively scale model complexity to the effective sample size. Unlike existing online Cox methods, COLSA achieves asymptotic normality and attains the same statistical efficiency as the pooled-data partial likelihood estimator, without accessing full data and only requiring constant memory. Simulation studies and application to kidney transplant data demonstrate that COLSA outperforms existing online methods and matches the performance of full-data estimation.

Abstract:
SLMLLM routing accelerates generation by strategically invoking LLMs for critical tokens. However, existing methods typically train routers to mimic the LLM, capping performance at the reference trajectory's limit. In this work, we demonstrate that the SLM-LLM collaborative inference space offers a richer solution set, yielding correct answers even when the LLM fails. To exploit this, we proposeAlphaRouter, a routing framework learning optimal collaborative inference paths via asearch and iterative updateparadigm. Formulating routing as a Markov Decision Process, we introduce Collaborative Inference Tree Search (CITS) to explore the solution space. To tackle the severe credit assignment challenge posed by sparse rewards, we propose Tree-Advantage Policy Optimization (TAPO) to optimize the routing policy. By leveraging counterfactual advantages within the tree structure, TAPO effectively attributes the final reward to specific routing decisions, stabilizing training without dense supervision. Extensive experiments show that AlphaRouter advances the Pareto frontier of accuracy-efficiency trade-offs by exploiting better inference trajectories in the collaborative space. Code is available at https://anonymous.4open.science/r/AlphaRouter.

Abstract:
Existing research on CodeGen AI security mainly focuses on red teaming, which aims to uncover vulnerabilities and risks in AIgenerated code. However, progress on the blue teaming side remains limited, as effective defenses require a deep security analysis of given tasks and edge cases. To fill in this gap, we propose BlueCodeAgent, an end-to-end blue teaming agent powered by automated red teaming. Our red teaming component generates diverse risky instances, providing effective edge cases and guidance for the subsequent blue teaming process. Our blue teaming agent then conducts multi-level defense, leveraging these red teaming examples to detect previously seen and unseen risk scenarios through constitution summarization and dynamic code analysis. Our evaluation across four representative code-related tasks–bias instruction detection, malicious instruction detection, vulnerable code detection, and prompt injection detection–shows that BlueCodeAgent achieves significant gains over diverse baselines. In particular, for vulnerability detection tasks, BlueCodeAgent integrates dynamic analysis to effectively reduce false positives, a challenging problem as base models tend to be over-conservative. Overall, with GPT-4o as the base model, BlueCodeAgent achieves an average F1 score improvement of 14.7% across four tasks compared to directly prompting the model, attributed to its ability to summarize actionable constitutions and perform dynamic analysis.

Abstract:
Sparse Autoencoders (SAEs) have become a cornerstone in mechanistic interpretability. However, current training methods inherit the Block Training paradigm from LLM pretraining. We identify this as a critical methodological oversight when applied to instruct models. Theoretically, utilizing GSNR analysis, we prove that attention leakage from unrelated contexts introduces destructive gradient noise. To rectify this paradigm, we propose \underlineFinetuning-\underlinealigned \underlineSequential \underlineTraining (FAST), a novel training method specifically tailored for instruct models.FAST aligns the training process with the data distribution and activation patterns of instruct models, achieving substantial improvements in both reconstruction and feature interpretability. Experimental results validate the efficacy ofFAST. GSNR analysis confirms improved training performance, with FAST demonstrating higher GSNR. This translates into superior reconstruction fidelity: FAST achieves an MSE of 0.6468 (significantly outperforming the baseline’s 5.1985) and maintains a near-zero Delta Loss (-0.51% to 0.37%). Consequently, feature quality is markedly enhanced; on Llama-3.2-3B-it, FAST yields 21.1% high-quality features, surpassing the 7.0% and 10.2% achieved by baselines. Surprisingly, we discover that intervening on special token activations via SAEs improves output quality, suggesting new opportunities for fine-grained control. Code, data, and all 240 trained SAEs will be publicly released.

Abstract:
Multiplechoice benchmarks that rank candidate completions by conditional log-probability suffer from a length bias: because log-probabilities sum over tokens, longer answers tend to be penalized relative to shorter ones in practice. A common mitigation is to normalize scores by completion length, but we show empirically that this heuristic frequently over-corrects, introducing a bias toward longer answers instead. We first analyze these scoring rules, characterizing when standard and length-normalized accuracy are appropriate and how their length biases depend on the distribution of completion lengths. Motivated by this analysis, we introduceBayesian accuracy, a scoring rule that computes the posterior probability of each candidate under an explicit prior over answer length, thereby removing linear length effects. Bayesian accuracy is a drop-in replacement for likelihood-based multiple-choice evaluation, requires no additional forward passes, and consistently exhibits lower empirical length bias than both standard and length-normalized accuracy across benchmarks and few-shot settings.

Abstract:
Multimodal large language models (MLLMs) have been increasingly adopted in forensics for their robust semantic understanding. As AIgenerated images become realistic, semantic-level inconsistencies alone are often insufficient for reliable detection. This motivates a critical question:whether MLLMs can achieve full-spectrum forensic signal perception, i.e., capturing low-level generator artifacts without sacrificing pre-trained semantic knowledge.We then conduct a layer-wise analysis of forensic signal perception in MLLMs and find that semantic information is mainly encoded in the early-to-middle layers, and directly fine-tuning MLLMs for artifact learning causes rapid semantic forgetting. Based on this insight, we propose Deep Visual Residual MLLM (Deep-VRM) to preserve early semantic processing while injecting artifact-specific visual signals as a residual path into an intermediate layer, where they are fused with semantic token representations and propagated through subsequent trainable layers. This enables later layers to jointly model semantic reasoning and signal-level forensic cues, and surprisingly, the model learns to adaptively leverage different levels of forensic signals depending on the input, achieving robust and generalizable detection performance. Extensive experiments show that our method achieves state-of-the-art across all benchmarks.

Abstract:
Neural compression is currently dominated by Nonlinear Transform Coding (NTC), which maps data to realvalued latents via continuous transforms. Despite its success, NTC suffers from train-test mismatch due to non-differentiable quantization, a ''smoothness bias'' inherent in continuous transforms that precludes optimality for certain sources, and a loss of ''shaping gain" due to the complexity of including high-dimensional vector quantization. We proposeSoftBinary Coding(SBC), an end-to-end learning paradigm that bypasses these limitations by using a stochastic binary latent space. In the spirit of vector quantization, SBC employs discrete representations and compresses them through a novel fast binary channel simulation scheme, for which we provide a proof of rate optimality. Experimental gains on information-theoretic sources provide both theoretical and practical closure to NTC's limitations, establishing discrete binary structures as a viable path toward reaching optimal rate--distortion bounds. Surprisingly, SBC also achieves state-of-the-art performance on vector quantization of i.i.d. sources, exceeding Trellis Coded Quantization of the Gaussian source.

Abstract:
Federated Graph Learning (FGL) enables collaborative training across distributed clients without sharing raw graph data. However, its performance is severely hindered by graphspecific heterogeneity arising from divergent node feature distributions and disparate graph structures. Existing FGL methods primarily focus on aligning or personalizing node features but largely overlook the role of structural knowledge, leading to aggregation-induced representation drift during message passing. We observe that structural heterogeneity often originates from feature-driven connection biases shaped by local data collection practices or user preferences. To address this, we propose Fed-Kalter, a novel FGL framework that integrates Kalman filtering principles into graph neural networks. Fed-Kalter introduces Kalter-Conv, a graph convolution grounded in a Kalman framework with learnable dynamics, which treats structural embeddings as latent states and feature-augmented neighborhoods as noisy observations, thereby filtering feature-induced structural noise in a layer-wise manner. Only structural parameters are aggregated globally, enabling effective cross-client knowledge transfer while preserving local personalization. Extensive experiments on 16 graph classification datasets spanning 4 domains demonstrate that Fed-Kalter consistently outperforms state-of-the-art FGL methods. Further ablation and hyperparameter studies confirm its robustness, efficiency, and effectiveness in mitigating structural heterogeneity.

Abstract:
Wholeslide images (WSIs) are widely used for cancer survival analysis because of their comprehensive histopathological information at both cellular and tissue levels, enabling quantitative, large-scale, and prognostically rich tumor feature analysis. However, most existing WSI survival analysis methods struggle with limited interpretability and often overlook predictive uncertainty in heterogeneous slide images. In this paper, we propose DPsurv, a dual-prototype whole-slide image evidential fusion network that outputs uncertainty-aware survival intervals, and enables interpretable survival results through patch prototype distribution assignment, component prototype evidence reasoning, and component-wise relative risk aggregation. Experiments on five publicly available datasets demonstrate strong discriminative performance and well-calibrated predictions, validating its effectiveness and reliability. The interpretation of survival results provides transparency at the feature, reasoning, and decision levels, thereby enhancing the trustworthiness and interpretability of DPsurv.

Abstract:
Large reasoning models (LRMs) achieve remarkable reasoning performance by generating long chainsof-thought (CoT). However, standard supervised fine-tuning (SFT) treats all tokens uniformly, indiscriminately minimizing loss across both essential reasoning steps and those that are noisy, redundant, or instance-specific. This often leads student models to memorize superficial patterns rather than acquire generalizable reasoning capabilities. To better understand this limitation, we introduce Loss Subspace Attribution, a gradient decomposition analysis approach that uncovers a striking geometric structure: Gradients corresponding to effective reasoning predominantly lie within a low-rank consensus subspace, while conflicting or unstructured signals dominate the residual subspace. Guided by this insight, we propose Spectral-guided Learning, a step-level distillation strategy that uses spectral strength to identify reasoning steps aligned with the consensus subspace and prioritizes their contribution to parameter updates, while suppressing gradients from the residual subspace. Experiments across various LRMs and diverse complex reasoning tasks consistently demonstrate that focusing optimization on the consensus subspace yields more robust and generalizable student models.

Abstract:
Pretraining large language models from scratch is prohibitively expensive as model scales increase. A practical alternative is Model Width Expansion (MWE), which grows a larger model from a well-pretrained ''seed'' model to inherit existing capabilities at initialization. However, we identify a phenomenon termed theSubspace Trap: during continual pre-training, parameter updates largely stagnate within a low-dimensional subspace aligned with the initialization, limiting the effective capacity of the expanded model. Our theoretical analysis investigates this issue by attributing it to the function-preserving properties of width expansion. In particular, element-wise adaptive optimizers remain confined to the trap, whereas optimizers that yield an isotropic geometry of parameter updates can escape. To demonstrate the impact of the subspace trap on model performance, we conduct empirical experiments across different model sizes and model families, which show that escaping the trap is principally effective in improving training efficiency and overall model performance. Detailed mechanistic analyses further confirm that escaping the trap indeed activates the new dimensions to encode general knowledge. Our code is available at https://anonymous.4open.science/r/MWE-1B46.

Abstract:
Personalizing AI decision support for heterogeneous human decisionmakers remains a key challenge. We study a collaboration workflow where AI provides a reduced prediction set via conformal prediction as an input for human, and human makes final decision. We use contextual bandits to learn the complex and intangible human decision function, where the optimal set size, governed by a significance parameter \alpha (arms), varies across individuals and tasks (context). To address large arms spaces and high-dimensional contexts, we introduce the Adaptively Grouped Contextual Bandit (AGCB) framework, bypassing unreliable complex function online approximation in favor of directly exploiting Human-AI problem structure through two pillars: continuity-aware counterfactual reasoning that efficiently shares information across decisions, and a data-driven zooming mechanism that adaptively partitions the context space. The zooming mechanism performs a principled, native trade-off between intra-group estimation error and inter-group approximation bias, ensuring optimal granularity for both cumulative and simple regret objectives. Crucially, a single continuity assumption uniquely enables both the bias control for adaptive grouping and the robustness of our counterfactual updates. This leads to minimax-optimal regret rates. Empirical results confirm that AGCB significantly outperforms existing methods in heterogeneous, data-scarce environments.

Abstract:
Recent work shows that vision encoders capture ordinal attributes along linear axes, which can be recovered from as few as two labeled images. However, in the zeroshot setting, the text-driven rank axis for Vision-Language Models (VLMs) like CLIP remains suboptimal. In this work, we study the embeddings of Multimodal LLMs (MLLMs). We hypothesize that MLLMs can overcome this limitation due to three potential advantages: their inherent ordinal understanding, capacity for conditional embeddings, and a small cross-modal gap. We show that MLLMs are rankable using only text prompts. Experiments demonstrate that a text-driven rank axis for MLLM embeddings achieves 90% of the performance of the supervised linear rank axis, significantly outperforming the 61% observed in VLM embeddings. We validate that this capability stems from MLLMs' conditional embeddings and a smaller modality gap than VLMs. Furthermore, we demonstrate that this property generalizes to the audio domain. Our findings suggest that language provides a direct interface for probing latent ordinal structures in MLLMs.

Abstract:
Recent advances in large language models (LLMs) have demonstrated that reinforcement finetuning of pretrained base models can lead to significant gains in reasoning performance at inference time. In this work, we theoretically analyze why reinforcement fine-tuning induces better reasoning ability than purely supervised fine-tuning (SFT) methods. We model chain-of-thought (CoT) reasoning as a pathfinding problem on graphs and compare the popular method of reinforcement learning with verifiable rewards (RLVR) against traditional SFT. We prove that SFT, when trained on golden shortest paths without negative examples, fails to learn how to efficiently backtrack. In contrast, an RLVR-trained model can learn how to efficiently backtrack from dead ends using only outcome reward. This leads to a provable inference-time compute separation between the two methods, and demonstrates that RLVR allows the model to learn the location of difficult decisions in a reasoning chain, ultimately allowing for better allocation of inference-time compute. Finally, we show that the reasoning traces of an RLVR model can be distilled to train a base model to backtrack efficiently as well.

Abstract:
Diffusion probabilistic models (DPMs)’ sampling process is often inefficient, requiring hundreds to thousands of iterative steps to accurately approximate the diffusion trajectory. This inefficiency limits their practical applicability. Although recent advances in sampling efficiency—such as numerical solvers for diffusion ordinary differential equations (ODEs)—have made progress, significant challenges remain: trainingfree numerical solvers suffer from the suboptimality of manually designed timestep selection rules and the inherent inconsistency between the forward diffusion process (typically involving thousands of steps) and the reverse denoising process (usually limited to tens of steps). Since timestep selection is inherently a discrete problem and cannot be optimized via gradients, we propose an innovative approach—reparameterizing the timestep scheduling through probabilistic masking, thereby enabling gradient-based optimization of sampling timesteps. To circumvent backpropagation, we employ policy gradient methods. Furthermore, to address the inconsistency between forward diffusion (training) and reverse denoising (inference), we extend this framework into a bilevel optimization paradigm: the inner loop performs additional lightweight training on the model at specific timesteps determined by the outer mask to align forward and reverse processes, while the outer loop optimizes the timestep distribution via probabilistic masking and policy gradient based on generation quality. Under mild assumptions, we theoretically analyze the convergence of the proposed algorithm. Extensive experiments across diverse datasets and samplers demonstrate that this framework effectively enhances sampling efficiency and generation quality while maintaining compatibility with various DPM architectures and advanced ODE solvers.

Abstract:
While flowbased generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the numeration error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality.

Abstract:
Explainability is essential for applying time series analysis in highstakes domains. While Time Series Captioning (TSC) offers a pathway to enhance temporal explainability, achieving reliable caption generation usually necessitates high-quality textual annotations. However, as interpreting abstract temporal dynamics requires specialized domain knowledge, acquiring such caption annotations is challenging, thereby impeding the advancement of TSC. To address this challenge, we introduce a novel Caption Label-Free Learning (CLFL) paradigm. Departing from the supervised learning tradition of imitating human annotations, CLFL formulates captioning as an agentic exploration task optimized by feedback from a proxy reward. Specifically, we propose a Dual Loop Agentic Captioning (DLAC) framework to achieve such an exploration-feedback mechanism. In the inner loop, a Time Series Captioning LLM Agent (TSCAgent) reflectively explores potential semantic captions. In turn, the outer loop evaluates these captions via downstream reasoning to derive a proxy reward, which feeds back to optimize the TSCAgent. Empirical results validate the effectiveness of the CLFL, proving that the exploration-feedback mechanism is sufficient for learning complex temporal semantics and autonomously generating captions, without any caption label supervision. Furthermore, we release TFTSC, an industrial expert-level time series caption dataset, which is available at: \urlhttps://anonymous.4open.science/r/TFTSC-05ED/.

Abstract:
AutoRegressive (AR) models with Monte Carlo Tree Search (MCTS) are a dominant paradigm for achieving “System 2” reasoning. However, this approach suffers from significant latency due to the serial, token-by-token generation mechanism of AR models. In contrast, Diffusion Large Language Models (dLLMs) offer inherent speed advantages via parallel sequence generation, yet they often struggle with accuracy in complex reasoning due to a lack of rigorous search, evaluation, and revision capabilities. Directly applying MCTS to diffusion models faces architectural barriers, since the denoising generation process lacks the discrete decision steps that naturally accommodate tree search. To retain efficiency while improving the reasoning ability, we propose DiffuReason, a Monte Carlo tree search reasoning algorithm for diffusion models. By modeling the generation process as a Markov Decision Process (MDP), DiffuReason discretizes the continuous diffusion flow into searchable thought blocks. During the reverse generation process, DiffuReason recursively performs four MCTS-style stages: select the best node (block), expand to obtain candidate nodes, simulate to evaluate node values, and revise the unsatisfactory nodes. Experiments on mathematical reasoning benchmarks demonstrate that DiffuReason significantly improves the reasoning ability of diffusion models, and achieves superior balance of accuracy and efficiency even compared with auto-regressive models.

Abstract:
Integrating visionlanguage models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT in this work, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformer (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Additionally, we introduce a VLA-oriented action refiner that further enhances driving performance via diffusion-based fine-tuning. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We refer to \hrefhttps://automot-website.github.io/Project Page for the demonstration videos and corresponding descriptions.

Abstract:
With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tooluse capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over-refusal problem, where increased safety strictness compromises the agent's utility on benign tasks. To mitigate this trade-off, we propose \textscSafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, \textscSafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that \textscSafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests.

Abstract:
SinglePositive Multi-Label Learning (SPML) studies learning from incomplete supervision, where each instance is annotated with only one positive label despite potentially belonging to multiple categories. While existing methods assume the annotated labels are randomly distributed, real-world annotations are often biased toward the most salient category. We formalize this realistic scenario as Salient Single-Positive Multi-Label Learning (SalSPML). This salient annotation bias poses a challenge to conventional SPML methods, as the missing labels often correspond to less salient and harder-to-recognize categories. Fortunately, we find that salient annotations are typically more representative and informative. Motivated by this insight, we propose Prototype-Guided Rejection for Salient Annotation (PiSA), which constructs reliable class-wise prototypes from salient labels and leverages them to guide embedding learning for non-salient labels recognition. We theoretically demonstrate that SalSPML is harder than Random SPML due to irreducible annotation bias, and under SalSPML, more accurate prototypes facilitate false-negative label detection. Experiments on multiple benchmarks, together with two newly constructed real-world SalSPML datasets, demonstrate that PiSA consistently outperforms existing methods, achieving an average mAP improvement of 3.16%. Our code is available in the supplementary materials.

Abstract:
Recovering radiometric fidelity from unknown dynamic range compression (UDRC), such as lowlight enhancement and HDR reconstruction, is a challenging blind inverse problem, due to the unknown forward model and irreversible information loss introduced by compression. To address this challenge, we first identify monotonicity as the fundamental physical invariant shared across UDRC tasks. Leveraging this insight, we introduce the cascaded monotonic Bernstein (CaMB) operator to parameterize the unknown forward model. CaMB enforces monotonicity as a hard architectural inductive bias, constraining optimization to physically consistent mappings and enabling robust and stable operator estimation. We further integrate CaMB with a plug-and-play diffusion framework, proposing CaMB-Diff. Within this framework, the diffusion model serves as a powerful geometric prior for structural and semantic recovery, while CaMB explicitly models and corrects radiometric distortions through a physically grounded forward operator. Extensive experiments on a variety of zero-shot UDRC tasks, including low-light enhancement, low-field MRI enhancement, and HDR reconstruction, demonstrate that CaMB-Diff significantly outperforms state-of-the-art zero-shot baselines in terms of both signal fidelity and physical consistency. Moreover, we empirically validate the effectiveness of the proposed CaMB parameterization in accurately modeling the unknown forward operator.

Abstract:
Theory of Mind (ToM) is a mustacquire skill for modern foundation model systems to operate effectively and safely in the real world. Recent works have explored honing ToM via post-training; however, we show that such progress is confounded by a pervasive “shortcut” issue: tasks can reach up to 99% accuracy by simply exploiting spurious causal correlations, leading to a false sense of ToM. Motivated by this, we first develop a framework to systematically examine ToM datasets for shortcuts and provide guidance for future development. We find that questions reducible to pure state tracking (e.g., “belief”) are especially shortcut-prone compared to mind questions (e.g., “intention”) where reasoning beyond tracking is required. Using four shortcut-free datasets across three ToM contexts, we then comprehensively study whether reinforcement-learning fine-tuning with verifiable rewards and explicit reasoning (Thinking-RFT) elevates ToM beyond supervised fine-tuning (SFT). Our key findings are: 1) Thinking-RFT effectively improves ToM in all scenarios (+6% vs. SFT), particularly in complex higher-order reasoning (+10% vs. SFT) and multimodal cases (+7% vs. SFT), and generalizes notably better to unseen domains and higher-order queries while being more robust to counterfactuals. 2) ToM benefits specifically from the joint effect of reasoning and RL: Thinking-RFT outperforms No-Thinking-RFT by 7% on average. 3) RFT works by learning to ground its reasoning on anchor cues (keywords/state changes) that correspond to causal factors. We believe our study is useful for developing effective and robust ToM post-training datasets and advancing critical ToM capabilities in foundation models.

Abstract:
Mixup is a widely used data augmentation technique that constructs new training examples by interpolating between existing ones. While simple and effective in domains like vision and language, applying mixup to graph data is nontrivial and there is no independent empirical evidence for its effectiveness. To fill this gap, we conducted an extensive evaluation study following a unified, established evaluation protocol for graph classification. In contrast to prior results, we found that none of the state-of-the-art mixup methods yielded statistically significant improvements over the no-mixup baseline. To obtain further insights, we analyzed the graphs generated from these mixup methods from an interpolation perspective. We found that (i) many mixup methods failed to interpolate well, (ii) high interpolation error led to performance degradation, and (iii) even good interpolation properties did not lead to performance improvements. Our findings question the efficacy of existing graph mixup methods and highlight the need for a more rigorous exploration and evaluation.

Abstract:
Retrievalaugmented generative agents rely on retrieval for grounding, yet are typically evaluated on a query-by-query basis. This isolates interactions that are geometrically coupled in a shared embedding space. For example, we show that the high document density required to serve majority interests (e.g., generic "Crime" movies) can geometrically overcrowd the retrieval neighborhood of a semantically similar minority (e.g., "Film Noir"), effectively expelling minority content from top-k results. We introduce a formal framework to analyze how such goal collisions in dense retrieval induce fundamental performance limits and emergent fairness issues inherent to spatial crowding. In our static analysis, we demonstrate that for a fixed embedding space, a phase transition occurs where minority user goals suffer a catastrophic collapse in performance as the density of majority goals increases. We then extend this to a dynamic model and derive a non-linear Fokker-Planck equation that governs the evolution of document embeddings as the agent updates them to maximize retrieval accuracy. Our analysis reveals that this local relevance objective triggers an emergent global mechanism that systematically marginalizes minority interests. We prove that such objectives drive the system to self-organize into a state that exclusively serves majority interests. These results provide a theoretical foundation for understanding a critical grounding failure mode in retrieval-augmented agents.

Abstract:
Stochastic approximations (SA)algorithms which derive their power through the use of random, incremental updates--are at the heart of reinforcement learning (RL). Expanding the theory of SA has established rigorous results concerning the most important algorithms in RL, including stochastic gradient descent and temporal difference learning. In this work, we focus on two-timescale stochastic approximations, a class which notably includes temporal difference learning with gradient correction (TDC) and actor-critic methods. Prior work has developed stability (boundedness) and convergence criteria for two-timescale SA under i.i.d. noise, but analogous results for Markovian noise have remained elusive--a critical issue since RL data are generated by a Markov chain, making i.i.d. assumptions unrealistic. To address this gap, we present the first stability result and the first asymptotic convergence result for two-timescale schemes with Markovian noise under general, verifiable conditions--notably, without resorting to projected variants of the schemes or requiring the noise to be in a compact space. As a key application, we contribute the first asymptotic convergence proof of TDC, an off-policy prediction algorithm with linear approximation and eligibility traces. Together, our results extend SA theory, establishing the first theoretical foundation for analysis of two-timescale algorithms with the realistic noise models inherent to RL.

Abstract:
Ontologies (schemas) are a key bottleneck for schemagrounded information extraction and knowledge graph construction, yet manual ontology engineering is expensive and schemas quickly fragment or drift across domains. We introduce SCOPE (Schema Construction and Ontology Induction Pipeline Evaluation), a benchmark for train-only ontology/schema induction and optional ontology fusion directly from raw corpora. SCOPE normalizes 24 public IE sources (15 RE + 9 EE; zh/en) into machine-readable gold schema graphs and provides train-only induction corpora through a standardized text corpus release. We propose SCION (Structural mining and Contracted semantic Induction for Ontology constructiON and fusion), a controllable pipeline that mines a candidate space of concepts/relations/events from text, performs LLM-assisted naming/merging/filtering under a strict JSON contract with evidence pointers, and can fuse the result with a fixed base ontology package using conservative alignment with provenance tracking. On the SCOPE core suite, SCION improves ontology-level similarity over official/manual schemas, a Text2Onto-style baseline, and LLM-only induction baselines under Literal, Fuzzy, Continuous, and Graph F1. SCOPE and SCION together enable reproducible and auditable evaluation of end-to-end ontology induction and fusion.

Abstract:
Prompt learning for visionlanguage models (VLMs) often suffers from performance degradation when adapting to downstream tasks with noisy labels. Existing methods that rely on filtering or reconstructing supervision can propagate errors, leading to sharp performance drops. We observe that pre-trained embeddings are resilient to label noise, offering stable references despite limited adaptation. Based on this insight, we propose Evidence-Prompt, a framework built on the evidence prior that enhances prompt learning by integrating stable pre-trained knowledge. We treat prompt learning as a Bayesian reasoning task, where credibility is derived from both supervision-agnostic and supervision-conditioned evidence. This framework effectively combines these sources to infer robust training targets under noisy conditions, enabling stable learning even with high noise levels. Extensive experiments on eight benchmarks with both synthetic and real-world noisy labels demonstrate that our method flattens the accuracy–noise curve and consistently outperforms SOTA methods, with notable gains on OxfordPets dataset at a 75% noise rate (+36.6% under Asym and +14.4% under Sym). Additionally, transferability experiments reveal that incorporating our evidence prior into other SOTA methods results in accuracy improvements ranging from 2.6% to 15.66%.

Abstract:
Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as Muon and Scion. After over a decade of Adam's dominance, these LMObased methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transferability, and most importantly, superior empirical performance on large-scale tasks, including LLM training. However, a significant gap remains between their practical use and our current theoretical understanding: prior analyses (1) overlook the layer-wise LMO application of these optimizers in practice, and (2) rely on an unrealistic smoothness assumption, leading to impractically small stepsizes. To address both, we propose a new LMO-based framework called Gluon, capturing prior theoretically analyzed methods as special cases, and introduce a new refined generalized smoothness model that captures the layer-wise geometry of neural networks, matches the layer-wise practical implementation of Muon and Scion, and leads to state-of-the-art convergence guarantees. Our experiments with NanoGPT and CNN confirm that our assumption holds along the optimization trajectory, ultimately closing the gap between theory and practice.

Abstract:
We study the gradient flow dynamics of diagonal linear networks for regression tasks under infinitesimal initialization. Extending the saddleto-saddle dynamics described in Theorem 1 from Pesme & Flammarion (2023), we generalize the analysis to both deep diagonal linear networks and a broader class of two-layer diagonal linear networks (as defined in Definition 4.1). Specifically, we demonstrate that the training trajectories of these models can be equivalently characterized by the proposed Algorithm 1. We further prove that this algorithm converges to the solution of a modified ℓ1 norm minimization problem. As a result, we establish that the implicit bias of both network architectures corresponds to a modified ℓ1 norm in the regime of infinitesimal initialization. Additionally, we provide insights into the underlying mechanisms governing these dynamics by identifying the Structural Invariant Manifold (SIM) (Zhao et al., 2025) as the key geometric structure that shapes the learning process

Abstract:
VideoQA models can be accurate yet often fail to align answers with the correct video segments (the \emphgrounding gap). We introduce LINGUA (Languagebased INference for Grounded Video Understanding Agent), a memory-based agent that performs grounded VideoQA by reasoning in an explicit \emphlinguistic belief state. LINGUA uses five mechanisms: (1) event-driven perception (retains 8--12% of frames while preserving 94% of question-relevant events); (2) typed memory for episodic narratives, semantic affordances, and procedural scripts; (3) Belief-Action-Verification loops with postcondition and temporal checks; (4) meta reflection with contrastive refinement; and (5) Bayesian reliability tracking for continual learning without gradient updates. Built with Gemma3-4B (Ollama, 4-bit), LINGUA outperforms strong baselines on five VideoQA benchmarks, reaching 82.4% on NExT-QA and 42.3% Acc@GQA on NExT-GQA (answer + IoU\geq0.5 temporal localization), while running 2.6× faster than dense-frame methods. In continual learning over 100 videos, accuracy rises from 45.2% (first 10) to 61.8% (last 10) without catastrophic forgetting, indicating online adaptation via memory refinement.

Abstract:
Textto-Image diffusion models have achieved remarkable success in image generation and are increasingly fine-tuned for personalized use cases. However, many personalized models may incorporate unauthorized data (e.g., copyrighted materials) during the fine-tuning process, raising growing concerns about potential copyright infringements. Existing methods either require intrusive modifications to the images to be protected, which not only fail to safeguard previously released images but may also degrade image quality, or rely on the availability of the pre-fine-tuned model, thereby limiting their applicability. To bridge this gap, in this paper, we propose the first non-intrusive copyright authentication framework without pre-fine-tuned model. We reveal that if a model is fine-tuned on a specific image, it learns the denoising trajectory of that image across varying noise levels, allowing it to stably reconstruct the image even under noise perturbations. Based on this insight, we propose Reliable dEteCtion Of unauthorized data usage via inVErsion Robustness (RECOVER), an effective non-intrusive detection method without pre-fine-tuned model. Unlike existing methods that rely on external watermarks or discrepancies between the suspect and pre-fine-tuned models, RECOVER directly leverages the robustness observed during the inversion–reconstruction process of the suspect model to determine whether an image was used for fine-tuning. Extensive experiments demonstrate that RECOVER is effective across a wide range of scenarios, consistently outperforming existing methods.

Abstract:
As large language models (LLMs) are increasingly deployed in retrieval augmented generation (RAG) and agentic systems that accumulate extensive context, understanding how distracting information affects performance in long context becomes critical. Prior work shows that semantically relevant but misleading documents can cause performance degradation, yet the quantitative relationship between the proportion of distractors and performance remains unstudied. In this work, we systematically vary the proportion of hard distractors within fixedlength contexts, revealing a striking nonlinear pattern: as the proportion of hard distractors increases, performance drops sharply within the first small fraction, while the remainder of the range yields only marginal additional decline. We term this ''The First Drop of Ink'' effect, analogous to how a single drop of ink contaminates water. We provide both theoretical and empirical analysis grounded in attention mechanics: hard distractors disproportionately capture attention even at small proportions, with diminishing marginal impact as their proportion increases. Through controlled experiments, we further show that filtering yields performance gains primarily from context length reduction rather than distractor removal, and only achieves substantial recovery when hard distractor proportion is reduced to near zero, which highlights the importance of upstream retrieval precision.

Abstract:
Training large neural networks exposes neural scaling laws for the generalization error, which points to a universal behavior across network architectures of learning in high dimensions. It was also shown that this effect persists in the limit of highly overparametrized networks as well as the Neural network Gaussian process limit. We here develop a principled understanding of the typical behavior of generalization in Neural Network Gaussian process regression dynamics. We derive a dynamical meanfield theory that captures the typical case learning dynamics: This allows us to unify multiple existing regimes of learning studied in the current literature, namely Bayesian inference on Gaussian processes, gradient flow with or without weight-decay, and stochastic Langevin training dynamics. Employing tools from statistical physics, the unified framework we derive in either of these cases yields an effective description of the high-dimensional microscopic behavior of networks dynamics in terms of lower dimensional order parameters. We show that collective training dynamics may be separated into the dynamics of N independent eigenmodes, whose evolution equations are only coupled through collective response functions and a common statistics of an effective, independent noise. Our approach allows us to quantitatively explain the dynamics of the generalization error by linking spectral and dynamical properties of learning on data with power law spectra, including phenomena such as neural scaling laws and the effect of early stopping.

Abstract:
We study offline learning of factored stochastic policies over extremely large, combinatorial action spaces and show how standard conjoint data can be used to estimate such policies with asymptotically valid uncertainty under conditions. Conjoint analyses typically report AMCEs by averaging over opponent attributes and thus ignore strategic interdependence. We instead learn \emphstochastic interventionsproduct-of-Categorical policies over factor levels—that (i) optimize expected outcomes in an average-case setting and (ii) extend to a two-player \emphminimax (adversarial) setting that realistically captures simultaneous strategic candidate selection. Methodologically, we derive a closed-form solution for the average-case optimizer under two-way interactions with L_2 variance regularization, and provide a general gradient-based procedure for richer model classes. Uncertainty from the outcome model propagates asymptotically to both the optimal policy and its value via a Delta method approximation. We further model institutional details (e.g., primaries) inside the minimax objective and introduce a data-driven measure of strategic divergence between parties. On synthetic data, we empirically characterize finite-sample error and coverage as dimensionality and n vary. On a U.S. presidential conjoint, adversarially learned policies produce restricted-equilibrium vote shares that align with historical election ranges in our data, in stark contrast to non-adversarial (averaging) optimizers.

Abstract:
Adapting pretrained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose AdaRoute, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each AdaRoute module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in AdaRoute modules facilitate low-rank adaptation in an input-dependent manner, thus generating more customized and powerful feature representations. Moreover, since AdaRoute modules across multiple network layers share the same expert center, they improve feature diversity by promoting implicit cross-layer feature interaction. Extensive experiments demonstrate the superiority of AdaRoute on diverse vision tasks, including semantic segmentation, object detection and instance segmentation, and panoptic segmentation.

Abstract:
Reconstructing a complete yet compact 3DGS from sparseview monocular long videos is challenging: hierarchical training with VFI can improve coverage, yet correlated pseudo views and repeated merging tend to accumulate near-duplicate Gaussians and exacerbate co-adaptation. To address this, we propose CoverPruneGS, a coverage-preserving structured pruning framework tailored for hierarchical merging with VFI-augmented supervision, which performs coarse-to-fine pruning via voxel-based local diversity selection and GT-guided lazy refinement with randomized dropout rendering. To make refinement reliable, we introduce a footprint-aware CUDA attribution that aggregates GT-aligned error degradation over Gaussian-influenced pixels in a manner consistent with alpha compositing, yielding faithful per-Gaussian scores for quantile-based rescue. Experiments on multiple datasets demonstrate that CoverPruneGS substantially reduces the Gaussian count by 56.8% and accelerates inference while maintaining or improving novel view synthesis quality.

Abstract:
Largescale vision–language models (VLMs) excel at multimodal reasoning, yet efficiency collapses when vision tokens—often orders of magnitude more than text—dominate compute and memory. Prior token-reduction strategies typically trade off salience (which is prone to position bias and incurs extra computation) against diversity (which can under-cover salient regions and is sensitive to hyperparameters). We present SPLIT, a theoretically grounded framework that jointly preserves salience and diversity while aggressively eliminating redundancy. SPLIT (i) estimates token importance via temporal shifts of hidden states across layers—eschewing attention scores and their biases; (ii) assigns adaptive region-level budgets to guarantee localized coverage; and (iii) selects tokens using a diversity score that prioritizes distinctive, non-redundant representations. Our analysis shows that adaptive budgeting yields tighter coverage guarantees than uniform allocation, and our selection rule maintains diversity without costly tuning. Empirically, SPLIT consistently outperforms state-of-the-art on image and video understanding benchmarks. On image understanding with LLaVA-1.5-7B, SPLIT preserves over 99% accuracy with 192 vision tokens and about 92.8% with only 64 tokens, demonstrating robust performance under severe token budgets. These results indicate that SPLIT delivers scalable, attention-score-free token reduction that makes multimodal reasoning substantially more efficient without sacrificing accuracy.

Abstract:
In critical domains like clinical reporting, legal analysis, and policy drafting, large language models (LLMs) are increasingly expected to produce extended, fact‑rich narratives rather than isolated sentences. Reliable uncertainty quantification in such long‑form outputs is crucial. Existing techniques either assign a single confidence score to an entire paragraph or evaluate factual consistency by comparing extracted atomic facts across multiple generations. Some recent approaches represent fact–paragraph relationships using bipartite entailment graphs and derive uncertainty from fact centrality. However, these methods ignore the explicit dependencies among facts within a paragraph and the structural and semantic variation across multiple LLM outputs for the same prompt, missing a key source of uncertainty specific to long‑form generation. We proposeGAUSS(Graph‑AssistedUncertaintyQuantification usingStructure andSemantics), a principled framework that models each generated paragraph as a semantic graph of atomic facts and their relations. We posit that uncertainty arises from structural and semantic discrepancies among these graphs across different samples.GAUSSquantifies uncertainty as the expected alignment cost between the semantic graph of an anchor paragraph and those of alternative generations. By capturing both semantic content and structural coherence,GAUSSoffers a more interpretable and theoretically grounded measure of uncertainty than coarse, sentence‑level scores.

Abstract:
Regional climate prediction presents unique challenges for time series foundation models, which typically process temporal patterns through a singlepass inference. Expert climatologists, in contrast, employ multi-scale temporal analysis and iterative refinement based on systematic error diagnosis. We present RGMR (Residual-Guided Multi-Resolution Refinement), an inference-time framework that adapts pre-trained foundation models to perform structured multi-scale reasoning for climate forecasting without parameter modification. Our approach combines hierarchical coarse-to-fine prediction refinement with residual-guided error correction; together, they systematically address prediction failures at each resolution level. Applied to drought forecasting using the Standardized Precipitation Evapotranspiration Index (SPEI), RGMR consistently enhances foundation model performance across diverse climate regions within an Australian regional area. Experimental results demonstrate substantial improvements over direct foundation model application, achieving up to 18.9% reduction in mean squared error, 10.2% reduction in root mean squared error, and 21.1% relative gain in R^2 when applied to TimesFM, with the largest benefits observed in climatologically complex regions where multi-scale temporal dynamics are most pronounced. The framework's inference-time operation enables immediate deployment on existing operational climate prediction systems without model retraining, offering a practical solution for enhancing foundation model capabilities in specialized forecasting domains.

Abstract:
Generative Flow Networks (GFlowNets) enable finetuning large language models to approximate reward-proportional posteriors, but they remain prone to mode collapse, manifesting as prefix collapse and length bias. We attribute this to two factors: (i) weak credit assignment to early prefixes, and (ii) biased replay that induces a shifted, non-representative training flow distribution. We propose Rooted absorbed prefix Trajectory Balance (RapTB), an objective that anchors subtrajectory supervision at the root and propagates terminal rewards to intermediate prefixes via absorbed suffix-based backups, providing dense prefix-level learning signals. To mitigate replay-induced distribution shift, we further introduce SubM, a submodular replay refresh strategy that promotes both high reward and diversity. Empirically, on tasks such as molecule generation with LLM using SMILES strings, RapTB combined with SubM consistently improves optimization performance and molecular diversity while preserving high validity.

Abstract:
Learning unified road representations across diverse cities is a pivotal challenge in urban computing. However, existing approaches predominantly focus on singlecity modeling, failing to handle the distribution shifts caused by heterogeneous urban layouts. We identifyspectral misalignment, manifested as the significant divergence of spectral distributions across different cities, as the primary barrier preventing standard Graph Neural Networks from capturing universal patterns. To bridge this gap, we proposeCoSpec, a framework that disentangles road networks into shareable low-frequency commonalities and city-specific high-frequency specificities. CoSpec employs a hierarchical dual-path architecture where the low-frequency path aligns global functional semantics via adaptive prototypes, while the high-frequency path modulates local geometric residuals to fit specific urban textures. Theoretical analysis shows CoSpec bounds the Wasserstein distance between city distributions, and extensive experiments demonstrate its superior generalization over state-of-the-art baselines.

Abstract:
Large language models (LLMs) exhibit social biases that reinforce harmful stereotypes, limiting their safe deployment. Most existing debiasing methods adopt a suppressive paradigm by modifying parameters, prompts, or neurons associated with biased behavior; however, such approaches are often brittle, weakly generalizable, datainefficient, and prone to degrading general capability. We propose KnowBias, a lightweight and conceptually distinct framework that mitigates bias by strengthening, rather than suppressing, neurons encoding bias-knowledge. KnowBias identifies neurons encoding bias knowledge using a small set of bias-knowledge questions via attribution-based analysis, and selectively enhances them at inference time. This design enables strong debiasing while preserving general capabilities, generalizes across bias types and demographics, and is highly data efficient, requiring only a handful of simple yes/no questions and no retraining. Experiments across multiple benchmarks and LLMs demonstrate consistent state-of-the-art debiasing performance with minimal utility degradation. Data and code are available at \urlhttps://anonymous.4open.science/r/KnowBias-EFF9.

Abstract:
Sparse Mixtureof-Experts (MoE) language models enable conditional computation but face deployment challenges due to the "memory wall": while few experts are activated per token, the entire model must reside in memory. Existing expert pruning methods primarily rely on independent ranking, failing to account for the complex inter-dependencies and redundancies between experts. In this paper, we formulate post-training MoE pruning as a reconstruction-driven subset selection problem, aiming to minimize layer-output distortion under a cardinality constraint. We introduce SCHUR-A, an algorithm that leverages A search to achieve globally optimal expert selection within each layer. To maintain computational tractability, we derive a novel, admissible heuristic upper bound using a Schur-complement-based relaxation of the reconstruction objective. This tight bound allows for aggressive pruning of the search space while mathematically guaranteeing optimality. Furthermore, we propose an automated strategy to balance fidelity and memory reduction across heterogeneous layers via knee-point detection. Extensive experiments on Qwen3-30B-A3B demonstrate that SCHUR-A significantly outperforms greedy and ranking-based baselines, maintaining comparable performance even under aggressive pruning ratios.

Abstract:
Pretrained transformers have demonstrated remarkable generalization abilities, at times extending beyond the scope of their training data. Yet, real-world deployments often face unexpected or adversarial data that diverges from training data distributions. Without explicit mechanisms for handling such shifts, model reliability and safety degrade, urging more disciplined study of out-of-distribution (OOD) settings for transformers. By systematic experiments, we present a mechanistic framework for delineating the precise contours of transformer model robustness. We find that OOD inputs, including subtle typos and jailbreak prompts, drive language models to operate on an increased number of fallacious concepts in their internals. We leverage this device to quantify and understand the degree of distributional shift in prompts, enabling a mechanistically grounded fine-tuning strategy to robustify LLMs. Expanding the very notion of OOD from input data to a model’s private computational processes—a new transformer diagnostic at inference time—is a critical step toward making AI systems safe for deployment across science, business, and government.

Abstract:
Knowledge Distillation (KD) promises to bridge the gap between the high computational costs of Transformerbased models and the expressiveness limitations of linear models in long-term time series forecasting. Existing time series distillation methods inherit the computer vision paradigm, constraining student models by minimizing point-wise prediction matching (output-level distillation) errors. However, blindly mimicking teacher predictions, which are often uncertain, can induce negative transfer. To address this, we propose Dynamic Structural Distillation (DSD), a robust framework that goes beyond the prediction-matching paradigm. First, we design LMP-Net, leveraging manifold expansion to project features into a high-dimensional latent space, alleviating the expressiveness bottleneck while preserving lightweight inference. Second, to address architectural mismatch, we propose Dual Manifold Alignment, employing Similarity-Preserving Knowledge Distillation (SPKD) and Optimal Transport (OT) to align features at the topological and geometric levels, respectively. Finally, we introduce Regime-Aware Adaptive Distillation (RAAD) to mitigate teacher misguidance via a dataset-level regime prior and a confidence-based adaptive gating mechanism. Extensive experiments on five benchmarks validate that DSD is compatible with diverse Transformer-based teachers, mitigating negative transfer while achieving a favorable accuracy--efficiency trade-off. An anonymized implementation is available at https://anonymous.4open.science/r/DSD-master-4B8F.

Abstract:
Multimodal Large Language Models (MLLMs) must resolve conflicts when modalities provide contradictory information, a process we term "modality following". We propose a framework that deconstructs this behavior into casespecific relative reasoning uncertainty and a model's stable inherent preference. By evaluating diverse MLLMs, we establish a universal law: the probability of following a modality decreases monotonically as its relative reasoning uncertainty increases, which is robustly preserved across diverse uncertainty indices. This law allows us to quantify a "balance point" where uncertainties are subjectiveized, offering a principled measure of modality bias that is disentangled from unimodal capabilities. Probing the internal decision-making reveals that this conflict resolution is a high-level cognitive process: in ambiguous regions near the balance point, models exhibit significant "concept oscillations," where top predictions vacillate between modalities specifically within the middle-to-late layers. Finally, we demonstrate the framework's utility for preference steering through Supervised Fine-Tuning (SFT). We find that data efficiency is governed by reasoning uncertainty: training on easy samples (where one modality dominates) fails to generalize, whereas targeting the identified ``boundary cases" is essential for robust preference alignment and suppressing internal vacillation.

Abstract:
Principled uncertainty quantification (UQ) is increasingly recognized as essential for trustworthy artificial general intelligence (AGI). Bayesian LowRank Adaptation (LoRA) provides a principled mechanism for uncertainty-aware fine-tuning of large language models (LLMs). However, existing techniques either face scalability constraints, e.g. Laplace-LoRA, or rely on approximate inference schemes that lead to poorly calibrated posterior uncertainty, often manifesting as overconfident predictions under distribution shift. To address this challenge, we propose GPan-LoRA, the first scalable Gaussian Process (GP)-based framework for Bayesian LoRA, which integrates neural network-based sparse GP approximations with amortized variational inference. By preserving the Bayesian function prior and posterior semantics intrinsic to GPs, GPan-LoRA achieves a faithful balance between computational scalability and principled UQ. Empirically, GPan-LoRA produces well-calibrated uncertainty that remains reliable under distribution shift, mitigating overconfident failures while preserving competitive task performance.

Abstract:
Person reidentification (ReID) models are sensitive to long-tail nuisances (e.g., rare viewpoints, occlusions, complex backgrounds), yet current generative augmentation is largely open-loop: prompts/conditions are sampled heuristically without verifying whether the synthesized samples improve ReID discriminability. We introduce ReasonAug, a closed-loop framework that learns an image-conditioned instruction policy for a frozen generator, turning augmentation into a sequential decision problem over instruction tokens. A Semantic Reasoning Agent (SRA) performs hierarchical planning from global semantics to identity-critical local cues, producing structured edit instructions whose utility is verified by downstream ReID feedback. To make closed-loop optimization reliable, we propose Metric-Aligned Gated Reward (MAGR), which converts metric-learning objectives into a dense reward while gating task shaping by identity preservation to prevent reward hacking, and Structure-Aware Entropy (SAE), which allocates exploration per token to lock identity-critical cues while diversifying nuisance factors. Experiments on Market-1501 and MSMT17 demonstrate state-of-the-art performance, confirming that closing the augmentation loop and learning what to generate yield more discriminative training data than open-loop alternatives.

Abstract:
Despite the success of knowledge distillation (KD) in Large Language Models (LLMs), the underlying mechanism behind its efficacy remains unclear. In this paper, we propose a unified approach to explore the common mechanism of various KD methods using interactions. Specifically, we decompose the output score of the LLM into the sum of numerous interactions. Each interaction represents a nonlinear relationship involving a set of input variables (e.g., words). Based on the decomposed interactions, we discover that the common mechanism underlying various KD methods is the sparsification of interactions, i.e., student models retain fewer interactions for inference while suppressing other interactions to zero effects. Furthermore, we discover that the performance variance across different KD methods arises from their capabilities in handling complex interactions. A KD method typically yields better performance if it enables the student model to achieve higher sparsity of complex interactions. Motivated by these insights, we propose a plugand-play loss function called Complex Interaction Penalty (CIP) to explicitly enforce the sparsity of complex interactions during the distillation process. Extensive experiments demonstrate that integrating CIP consistently improves the performance of diverse KD methods on both in-domain and out-of-distribution benchmarks.

Abstract:
By sharing intermediate features, collaborative perception extends each agent's sensing beyond standalone limits, but realworld feature modality heterogeneity remains a key barrier to effective fusion. Most existing methods, including direct adaption and protocol-based transforma-tion, typically rely on training adapters for newly emerging feature modalities and often require additional retraining or fine-tuning. Such repeated training is costly and is often infeasible across manufacturers due to model and data privacy con-straints, limiting real-world scalability. To address this issue, we propose UniTrans, a universal any-to-any feature modality translation model that instantiates translators on the fly for arbitrary modalities. UniTrans pretrains a bank of translator expert parameters and learns their combination coefficients as a function of source-to-target modality mapping. The mapping is measured in a modality-intrinsic latent space, where an intrinsic encoder extracts modality-specific yet scene-invariant codes from single-frame intermediate features, enabling UniTrans to instantiate translators in a zero-shot manner. Experiments on OPV2V-H and DAIR-V2X demonstrate that UniTrans consistently outperforms state-of-the-art methods in both simulated and real-world set-tings, enabling efficient any-to-any translation through a universal model. Code will be made available.

Abstract:
MetaReinforcement Learning (Meta-RL) faces significant challenges in non-parametric settings, where vastly different return scales across diverse tasks cause severe gradient interference. Existing categorical solutions attempt to normalize these scales but often fail due to rigid discretization and quantization errors. To address this, we propose Reflect-then-Correct (RTC), a framework that models meta-values using Sinkhorn divergence. By treating distributions as adaptive floating particles, RTC achieves a geometry-aware alignment of distinct meta-task structures. However, while Sinkhorn updates harmonize gradients, they introduce statistical bias via sampling estimation. RTC overcomes this by ''reflecting'' on the temporal accumulation of Bellman inconsistencies through a recursive error model and ''correcting'' the optimization via adaptive importance weights that prioritize transitions critical for accuracy. We provide theoretical guarantees for this reweighting strategy and demonstrate that RTC outperforms existing baselines on the challenging Meta-World ML-10 and ML-45 benchmarks.

Abstract:
Statespace language models such as Mamba and gated linear attention (GLA) offer efficient alternatives to transformers due to their linear complexity and parallel training, but often lack the expressivity and robust state-tracking needed for complex reasoning. We address these limitations by reframing sequence modelling through a probabilistic lens, using Bayesian filters as a core primitive. While classical filters such as Kalman filters provide principled state estimation and uncertainty tracking, they are typically viewed as inherently sequential. We show that reparameterising the Kalman filter in information form enables its updates to be computed via an associative scan, allowing efficient parallel training. Building on this insight, we introduce the Kalman Linear Attention (KLA) layer, a neural sequence-modelling primitive that performs time-parallel probabilistic inference while maintaining explicit belief-state uncertainty. KLA offers strictly more expressive non-linear updates and gating than GLA variants while retaining their computational advantages. On language modelling tasks, KLA matches or outperforms modern SSMs and GLAs across representative benchmarks for discrete token manipulation and state tracking.

Abstract:
Finetuning Large Language Models (LLMs) as autonomous agents on domain-specific data has emerged as a promising paradigm for tackling interactive, real-world tasks. However, existing studies have overlooked the critical coordination between long-term planning and multi-step execution in optimizing agent capabilities. This oversight leads to the propagation of impractical plans and plan-deviated trajectories into the optimization process, resulting in suboptimal task performance and hindering the further development of LLM-based agents in long-horizon tasks. To bridge this gap, we propose CoPE, a novel framework that explicitly integrates planning–execution coordination into LLM-based agent optimization. CoPE employs Self-Refining MCTS to generate task plans and multiple execution trajectories through environment interactions. By quantifying the coordination between planning and execution, CoPE assigns higher optimization weights to well-coordinated samples, enabling LLM-based agents to learn better planning and execution policies. Extensive experiments demonstrate that CoPE substantially improves agent coordination, outperforming state-of-the-art baselines on benchmarks comprising two long-horizon multi-step tasks. Codes and data are available at https://anonymous.4open.science/r/CoPE-F144.

Abstract:
Personalization techniques for image generation models have increasingly been misused for malicious purposes, including unauthorized style imitation and copyrighted content replication. In response, recent mimicry protection methods embed carefully designed perturbations into images to disrupt a model’s ability to learn genuine semantic representations. Despite their growing adoption, the robustness of these protection mechanisms remains poorly understood, raising concerns about their reliability in realworld deployment. In this work, we present the first systematic analysis showing that DDIM inversion inherently acts as a perturbation amplifier, causing protected images to suffer severe structural distortions during reconstruction. Building on this observation, we propose DDIM Inversion-based Reconstruction Purification (DIRP), a novel purification approach that removes protective perturbations by explicitly minimizing DDIM inversion reconstruction error under perceptual constraints. Extensive experiments on six existing mimicry protection methods demonstrate that DIRP consistently outperforms five state-of-the-art attack baselines, achieving superior perturbation removal while better preserving image quality. Our results expose fundamental vulnerabilities in current mimicry protection strategies and highlight the urgent need for more robust and principled defenses.

Abstract:
Model merging is a promising technique to enhance the capabilities of neural networks (NNs) by integrating multiple downstream finetuned models without requiring access to clients' raw data or substantial computation resources. However, conventional model merging typically requires collecting the full set of fine-tuned model parameters from multiple clients, which may expose them to model-privacy risks. An emerging approach, known as partial model merging (PMM), mitigates this risk by splitting each model into private and shared parts, where only the shared part is merged while the private part remains local to each client. Despite its stricter parameter fusion, PMM can still achieve competitive performance compared to full-parameter sharing. However, the privacy properties of PMM remain underexplored. In this paper, we propose a novel model clone attack and assess the risk of reconstructing the unshared private part of a partially merged model under eight attack scenarios with varying prior knowledge (i.e., partial training data, model parameters and/or model structure). Our comprehensive experiments reveal that merging NNs without adequate protection is highly vulnerable. Even when only a small fraction of training data, model parameters, or model structure is exposed, adversaries can still recover significant portions of the private model's performance.

Abstract:
Intervalcensored data present unique challenges in statistical analysis due to the partial observability of event times within known intervals, requiring assumptions about the censoring mechanism. This paper explores the theoretical relationship between two foundational assumptions: independent monitoring and non-informative censoring. We demonstrate that these assumptions are equivalent for Case-1 interval-censored data, but not for Case-K interval-censored data, where K \geq 2, through a synthetic dataset example. Additionally, we propose the first strictly proper scoring rule and calibration metric specifically designed for interval-censored data under the constant-sum assumption and the non-informative censoring assumption, respectively. Our empirical evaluations on real-world datasets show that a neural network model trained with our scoring rule is competitive with established statistical baselines, offering enhanced flexibility. These contributions provide significant advancements in the theoretical understanding and practical analysis of interval-censored data.

Abstract:
Conditionally positive definite (CPD) kernels are defined with respect to a function class \mathcalF. It is well known that such a kernel K is associated with its native space (defined analogously to an RKHS), which in turn gives rise to a learning method called conditional kernel ridge regression (conditional KRR) due to its analogy with KRR --- where the estimated regression function is penalized by the square of its native space norm. This method is of interest because it can be viewed as classical linear regression, with features specified by \mathcalF, followed by the application of standard KRR to the residual (unexplained) component of the target variable. Methods of this type have recently attracted increasing attention. We study the statistical properties of this method by reducing its behavior to that of KRR with another fixed kernel, called the residual kernel. Our main theoretical result shows that such a reduction is indeed possible, at the cost of an additional term in the expected test risk, bounded by \mathcalO(1/\sqrtN), where N is the sample size and the hidden constant depends on the class \mathcalF and the input distribution. This reduction enables us to analyze conditional KRR in the case where K is positive definite and \mathcalF is given by the first k principal eigenfunctions in the Mercer decomposition of K. We also consider the setting where \mathcalF consists of k random features from a random feature representation of K. It turns out that these two settings are closely related. Both our theoretical analysis and experiments confirm that conditional KRR outperforms standard KRR in these cases whenever the \mathcalF-component of the regression function is more pronounced than the residual part.

Abstract:
Large language models are becoming increasingly significant in financial applications. Nevertheless, prevailing benchmarks are largely dependent on simulated or generic data, which leads to a significant gap between reported performance and actual efficacy in realworld scenarios. To tackle this challenge, we present BizFinBench.v2, the first integrated offline and online benchmark built upon authentic user query-response data from both Chinese and U.S. equity markets. It comprises 28,860 questions across eight offline and two online tasks. Experimental results show that GPT-5 achieves a mere 61.5% accuracy, still failing to meet the practical business requirement (84.8%). Among the evaluated commercial models, DeepSeek-R1 exhibits superior investment efficacy. Error analysis grounded in real financial practice reveals persistent limitations in existing models. By overcoming the constraints of prior benchmarks, BizFinBench.v2 provides a substantiated foundation for advancing LLM deployment in the financial sector.

Abstract:
The trustworthiness of evaluation is critical to reliable model comparison and deployment in Video Anomaly Understanding (VAU). However, existing metrics are sensitive to expression styles and normal content, and this field lacks a diagnostic benchmark to validate metric validity and robustness. To bridge this gap, we propose: (1) a ClassGuided Chain-of-Evaluation (CG-CoE) metric, which structures assessment by extracting anomalous events and matching them under a class-specific semantic tolerance boundary, thereby decoupling anomaly semantics from descriptive style; and (2) an anomaly-focused meta-evaluation benchmark with two subsets: Anomalous Event-level Annotations (AEA) for measuring the validity of reflecting VAU models’ anomaly understanding ability and Controlled Variant Pairs (CVP) with fixed anomalies for quantifying robustness to stylistic perturbations. Extensive experiments demonstrate that CG-CoE achieves SOTA validity and robustness.

Abstract:
Robust reinforcement learning (RRL) aims to tackle unexpected environmental changes by optimizing policies against the worst case. However, RRL remains impractical due to the cost of the MaxMin optimization, where it suffers from the exhaustive query complexity for finding the worst-case (dubbed 'Min') within the environmental uncertainty set \mathcalU, i.e., \mathcalO(|\mathcalU|). By viewing this via a lens of quantum perspective, we raise a pivotal question: If we can query from the environment with quantum superpositions, is it possible to accelerate the Max-Min optimization of RRL? Our answer is 'Yes'. Our method, called quantum robust inner minimization (QRIM), encodes the uncertainty set with quantum superposition and amplifies low-return cases, thus enabling RL for solving the robust (i.e., worst-case) Bellman equation. Importantly, QRIM achieves a quadratic speed-up in query complexity without altering the outer RL pipeline, i.e., \mathcalO(\sqrt|\mathcalU|). Validated through classical simulations to real quantum hardware execution, QRIM learns more robust policies with quadratically reduced queries than classical RL.

Abstract:
Weakto-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions that assign each weak label a scalar trust score and use these scores to filter weak supervision. Across several domains, including world knowledge, quantitative reasoning, decision making, trust filtering yields students that match and sometimes surpass ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain that compounds gains by training a student and reusing it as the next teacher, producing the strongest final model. Our analyses suggest that neural trust functions improve learning through more than label error reduction. They induce an implicit easy-first curriculum, recover near-optimal alternatives where ground truth labels are incomplete, and produce more coherent gradient updates, offering a mechanistic account of the stability and efficiency of trust-filtered weak-to-strong generalization.

Abstract:
Advances in pretrained vision-language models have enabled zero-shot out-of-distribution (OOD) detection using only in-distribution (ID) labels. Recent methods in this direction expand the label space with negative labels to enhance the discrimination between ID and OOD inputs. Despite their promising progress, there remains a limited understanding of their empirical effectiveness in open-world scenarios, where negative labels can arbitrarily diverge from real OOD ones. This paper bridges this research gap with the helm of a novel energy-based framework, where the energy function is built upon the margin between the similarity of an input to ID labels and that to negative labels. Guided by this framework, we prove that the inherent tolerance of such methods to the sampling bias essentially stems from estimating the worst-case energy function over a KL-constrained set of potential distributions centered on the negative label distribution. Furthermore, our theoretical analysis reveals that existing methods suffer from over-pessimism and consequently high sensitivity to outliers. Provably, we can alleviate these problems by leveraging Rényi divergence to refine potential distributions. Extensive experiments empirically manifest that our method establishes a new state-of-the-art across a variety of OOD detection settings.

Abstract:
RealWorld Reinforcement Learning (RL) has shown significant potential in robotic manipulation tasks. However, many methods still require substantial human-in-the-loop involvement to complete contact-rich tasks, especially when there are disruptions such as visual backgrounds or positional changes. To address this, we propose the Focus Then Contact (FTC), a lightweight and low-cost method to accelerate the convergence of human-in-the-loop real-world RL for contact-rich tasks. FTC leverages residual RL to provide base actions, helping the system quickly reach the target regions and improve sample efficiency. Additionally, FTC integrates an affordance-guided reward that drives the real-world RL system to quickly focus on key regions of interest, making it possible for the robotic arm to continuously engage with these goal areas through force-control feedback. At the same time, we optimize the human-in-the-loop implementation to prevent conflicts with RL over control of the robotic arm. We demonstrate the effectiveness of FTC on 6 contact-rich tasks, where it outperforms baseline methods in achieving high success rates and speeds up robotic contact-rich task learning under a real-world RL setting. Video materials can be seen in \urlhttps://anonymous.4open.science/api/repo/FTC-website-BB5E/file/index.html.

Abstract:
While heavytailed synaptic weight distributions are pervasive in biological neural networks, their computational role---particularly in relation to generalization---remains poorly understood. To address this, we develop a novel optimal-transport-based optimization algorithm that incorporates key biological constraints, including Dale’s principle and heavy-tailed synaptic statistics, to train recurrent neural networks (RNNs) on a wide range of cognitive tasks. We show that these biologically constrained, heavy-tailed RNNs exhibit substantially improved generalization, which we further characterize within a PAC-Bayes framework. Our theoretical analysis and numerical experiments reveal two complementary mechanisms underlying this generalization enhancement. Topologically, heavy-tailed connectivity induces an effectively low-rank structure, which in turn yields low-dimensional neural dynamics. Geometrically, heavy-tailed connectivity intrinsically shapes task variable representations to lie near a linear manifold, thereby improving generalization for a linear readout strategy. Together, these results identify heavy-tailed connectivity as a biologically grounded intrinsic mechanism that promotes low-rank structure and favorable representational geometry, leading to improved generalization in flexible cognitive tasks.

Abstract:
Multiobjective reinforcement learning (MORL) must often support preferences that change online or are specified only after data collection. We study finite-horizon MORL with vector feedback in linear MDPs under two protocols: (i) predictable adversarial preferences revealed before each episode, and (ii) reward-free preference-free exploration (PFE), where exploration observes only transitions and must later answer arbitrary preference queries. Standard reductions are protocol-unsafe: re-scalarizing past stochastic rewards with future weights breaks the martingale structure needed for self-normalized confidence bounds, and hypervolume evaluation must account for episode-start randomization, which yields a deployable convex hull of return vectors. We propose a protocol-safe reward interface that estimates each reward coordinate via regression and performs scalarization only at query time, and we formalize deployable hypervolume semantics with a stability chain from support-function error to hypervolume error. Consequently, we obtain filtration-safe regret bounds for any predictable preference sequence without discretizing the simplex (only \log m dependence) and matching near-minimax rates in linear MDPs, as well as sharp reward-free PFE guarantees: a (near-)minimax decision-optimal query answering rate \tildeO(d^2 U_\mathrmret^2/\varepsilon^2) and a tight separation from explicit transition-model recovery \Theta(d(|\mathcalS|-1)/\varepsilon_P^2). These results connect online learning, preference-free deployment, and hypervolume-aware evaluation through a single protocol-aligned theory.

Abstract:
Supervised finetuning (SFT) on a small high-quality set of long reasoning traces is an effective way to enable strong reasoning abilities for Large Language Models (LLMs). However, curating a high-quality SFT data requires generating a large pool of long Chain of Thoughts (CoTs), and filtering the generated data for diversity and difficulty. Both stages rely heavily on strong reasoning models and make data curation prohibitively expensive. In this work, we show that diverse and difficult reasoning examples can be identified very early during their generation. We show that after generating as few as 100 out of 34k tokens of a reasoning trace, challenging examples can be reliably identified based on their loss at a highly perturbed checkpoints of the pretrained model. Then, we prove that examples with similar loss trajectory, i.e., value at a few noisy, perturbed checkpoints of the pretrained model, have similar gradients. A diverse subset can be then found by sampling from clusters of loss trajectories obtained after generating 1k tokens. Our extensive experiments for fine-tuning Qwen2.5-7B and Llama3 on m23K medical reasoning and Openthoughs datasets confirms the effectiveness of our approach. our approach outperforms existing baselines by up to 2% while generating only as few as 9% tokens for reasoning traces.

Abstract:
Continual PreTraining (CPT) is essential for enabling Language Models (LMs) to integrate new factual knowledge without erasing old. While classical CPT techniques like data replay have become the standard paradigm, the mechanisms underlying how LMs acquire and retain facts over time, termed as continual Factual Knowledge Acquisition (cFKA), remain unclear. In this work, we present a theoretical framework that characterizes the training dynamics of cFKA using a single-layer Transformer with linear attention, offering a unified explanation for the behavior of popular CPT methods. Our analysis reveals that regularization-based methods merely adjust the convergence rate of parameters without altering the inherent forgetting tendency, whereas data replay methods shift convergence dynamics and stabilize pretrained knowledge. Building on these insights, we propose a novel generative data replay approach, called Selecting Tokens via attentiOn Contribution (STOC), which identifies influential factual snippets to guide replay generation. Extensive experiments on both synthetic and real-world datasets validate our theoretical findings and demonstrate that STOC effectively enhances cFKA by mitigating catastrophic forgetting.

Abstract:
SharpnessAware Minimization (SAM) improves generalization by minimizing the worst-case loss within a fixed parameter-space radius neighborhood. SAM and its variants mainly rely on a first-order linearized surrogate, while flat minima are inherently a second-order (curvature) notion. We revisit this mismatch and propose Loss-Equated SAM (LE-SAM), which inverts the traditional SAM mechanism that fixed perturbation radius with a fixed loss-space budget, effectively removing gradient-norm–dominated learning signals and shifting optimization toward curvature-dominated terms. Extensive experiments across diverse benchmarks and tasks demonstrate the strong generalization ability of LESAM that consistently outperforms SAM and even its variants, achieving the state-of-the-art performance.

Abstract:
Machine learning–based surrogate models have emerged as more efficient alternatives to numerical solvers for physical simulations over complex geometries, such as car bodies. Many existing models incorporate the simulation mesh as an additional input, thereby reducing prediction errors. However, generating a simulation mesh for new geometries is computationally costly. In contrast, meshfree methods, which do not rely on the simulation mesh, typically incur higher errors. Motivated by these considerations, we introduce SMART, a neural surrogate model that predicts physical quantities at arbitrary query locations using only a point-cloud representation of the geometry, without requiring access to the simulation mesh. The geometry and simulation parameters are encoded into a shared latent space that captures both structural and parametric characteristics of the physical field. A physics decoder then attends to the encoder's intermediate latent representations to map spatial queries to physical quantities. Through this cross-layer interaction, the model jointly updates latent geometric features and the evolving physical field. Extensive experiments show that SMART is competitive with and often outperforms existing methods that rely on the simulation mesh as input, demonstrating its capabilities for industry-level simulations.

Abstract:
Multimodal Diffusion Language Models (MDLMs) have emerged as a powerful alternative to autoregressive models in vision-language understanding, offering advantages in bidirectional context modeling and parallel decoding. However, existing MDLMs suffer from severe visual hallucinations due to the static nature of visual perception. Unlike autoregressive models, MDLMs lack the sequential dependency required to dynamically interact with visual content. Therefore, MDLMs rely on fixed visual features encoded at initialization, causing the denoising process to drift toward language priors and lose its anchor to visual evidence. In this paper, we propose VGR (Visual-Guided Refinement), a framework that enables MDLMs to revisit visual details by exploiting diffusion dynamics. Our key insight is that the temporal trajectory of confidence during denoising reveals intrinsic uncertainty: while grounded tokens converge smoothly, hallucinated ones exhibit pronounced confidence fluctuation. VGR utilizes this fluctuation signal to detect uncertain spans and corrects them through targeted visual evidence extraction and in-place remasking. Extensive experiments on image captioning and hallucination evaluation benchmarks demonstrate that our method significantly reduces hallucinations and recalls more details, achieving state-of-the-art performance among MDLMs.

Abstract:
MultiAgent Systems (MAS) powered by Large Language Models (LLMs) have recently become a strong paradigm for solving complex workflow-structured tasks through expert collaboration. However, the data that make such collaboration effective are typically distributed across organizations and cannot be centrally pooled due to privacy, intellectual property, and compliance constraints. Federated Learning preserves data locality, yet most federated paradigms treat clients as independent and fail to capture workflow dependencies that are essential for coherent multi-stage collaboration. Data locality and workflow dependency are orthogonal, and the key challenge arises where both must be satisfied, namely federated, workflow-aware collaboration. We introduce FedWave, a framework that enables LLM-based experts to solve sequential workflows under strict privacy constraints. FedWave integrates a Value Chain Layer that encodes inter-stage dependencies with communication-efficient federated LoRA adaptation, a server-side Mixture-of-Experts (MoE) router that performs input-conditioned expert fusion at inference time while retaining standard federated aggregation during training, and a Direct Preference Optimization (DPO) stage that aligns collaborative outputs using router-induced preferences. Experiments show that FedWave consistently outperforms strong federated baselines and remains competitive with centralized multi-agent systems without compromising data privacy. Code is available at https://anonymous.4open.science/r/FedWave-111A.

Abstract:
Recent advances in 3D molecular modeling have achieved highfidelity structural synthesis, yet these models often lack an explicit and manipulable representation space. To address this, MolFLAE introduced a fixed-dimensional, E(3)-equivariant latent space, providing a novel framework for molecular editing independent of atom counts. However, because its latent space was primarily optimized for geometric reconstruction, it remains semantically shallow and inadequate for comprehensive representation learning. In this work, we proposeMolAlign3D, which evolves this architecture into a unified semantic-generative engine. By anchoring MolFLAE’s manipulable latents with embeddings from a pre-trained molecular encoder, we yield a manifold that is both semantically dense and geometrically precise. Experiments show that MolAlign3D achieves high-fidelity molecular reconstruction and attains comparable performance on molecular property prediction benchmarks. Notably, the integration of rich semantic priors significantly enhances zero-shot molecular manipulation, including atom-number editing and latent-space interpolation, outperforming prior fixed-dimensional equivariant latent baseline.

Abstract:
Sampling from distributions with bounded supports is a fundamental challenge in constrained statistical inference. Reflective Hamiltonian Monte Carlo (ReHMC) provides a useful sampling approach for this setting. However, it relies on convexity assumptions on the support and lacks nonasymptotic theoretical guarantees such as mixing-time bounds. To bridge this gap, we propose a convex-container plus thinning framework that is applicable to arbitrary target densities with bounded support. We establish the first non-asymptotic total-variation mixing-time bounds for ReHMC, achieving a polynomial dimension dependence of O(d^2) for L-smooth targets, though with exponential dependence on smoothness parameters. Under an additional m-strong convexity assumption, we derive a sharper bound that eliminates this exponential dependence. We further apply this approach to sampling on the Stiefel manifold via a well-conditioned polar reparameterization and demonstrate improved numerical stability and sampling efficiency in simulation studies.

Abstract:
Recent advances in LipSync generation technology have led to the creation of highly realistic videos, posing severe societal risks. However, existing defense strategies struggle against LipSync forgeries, as stateof-the-art generative models not only optimize for the lip synchronization but also significantly eliminate visual artifacts, resulting in the lack of key detection signals. Inspired by the inherent biological coupling between lip movements and head poses in natural speech, we observe that generative models fundamentally disrupt this global coordination when optimizing for local lip motion. In this paper, we propose LipDA, a novel framework for joint LipSync Detection and Attribution, which takes advantage of the inconsistency between head and lip. For detection, the framework learns to quantify this discrepancy by contrasting lip and pose features from authentic versus forged videos. For attribution, our method is designed to capture the unique temporal dynamics and audio-visual synchronization patterns that act as generative fingerprints, enabling source tracing. To validate our approach, we conduct extensive experiments on two challenging LipSync benchmarks as well as on our own proposed large-scale and multi-generator dataset, LipSyncBench-A. LipDA achieves over 97% AUC in detection and 97.5% accuracy in model attribution, significantly outperforming existing methods.

Abstract:
Mixtureof-Experts (MoE) models achieve efficient scaling through sparse expert activation, but often suffer from suboptimal routing decisions due to distribution shifts in deployment. While existing test-time adaptation methods could potentially address these issues, they primarily focus on dense models and require access to external data, limiting their practical applicability to MoE architectures. However, we find that, instead of relying on reference data, we can optimize MoE expert selection on-the-fly based only on input context. As such, we propose a data-free, online test-time framework that continuously adapts MoE routing decisions during text generation without external supervision or data. Our method cycles between two phases: During the prefill stage, and later in regular intervals, we optimize the routing decisions of the model using self-supervision based on the already generated sequence. Then, we generate text as normal, maintaining the modified router until the next adaption. We implement this through lightweight additive vectors that only update router logits in selected layers, maintaining computational efficiency while preventing over-adaptation. The experimental results show consistent performance gains on challenging reasoning tasks while maintaining robustness to context shifts. For example, our method achieves a 5.5% improvement on HumanEval with OLMoE. Furthermore, owing to its plug-and-play property, our method naturally complements existing test-time scaling techniques, e.g., achieving 6% average gains when incorporated with self-consistency on DeepSeek-V2-Lite.

Abstract:
Large Language Models are commonly benchmarked on a dataset by evaluating all relevant models on all queries in the test set. This can be wasteful for a practitioner who wants to find the best model to deploy—if a model clearly performs worse than others, there is no need to precisely estimate its performance. Bestarm identification algorithms can drastically reduce costs by dynamically allocating evaluation budget. When applying these algorithms to language models, we can further leverage that their responses to the same prompt are often very similar. While previous attempts to make use this additional structure can exploit model similarity in some cases, they significantly underperform simple baselines in others. We propose Synchronized Successive Rejects (SySRs), augmenting the classical Successive Reject algorithm with paired comparisons. Unlike prior work, our approach is hyperparameter-free and comes with performance guarantees that improve with the degree of similarity between evaluated models. Empirically our method outperforms all baselines, both in terms of average error rate on a suite of 15 standard benchmarks, and in terms of the fraction of benchmark data required to reliably identify the best model on these benchmarks.

Abstract:
Quadratic Discriminant Analysis (QDA) assumes Gaussian classconditional distributions, causing systematic misclassification when data exhibit heavy tails. We propose Stable-QDA, which replaces the Gaussian likelihood with a symmetric \alpha-stable likelihood that decays polynomially rather than exponentially in Mahalanobis distance. Crucially, we find that correcting likelihood misspecification yields larger gains than robustifying parameter estimation: standard estimators (sample mean, Ledoit--Wolf covariance) often outperform robust alternatives when class heteroscedasticity is discriminative. We provide consistency guarantees under infinite-variance regimes, data-driven diagnostics for estimator selection, and demonstrate 15--53% error reduction on real-world heavy-tailed benchmarks.

Abstract:
Generalization and approximation capabilities of message passing graph neural networks (MPNNs) are often studied by defining a compact metric on a space of input graphs under which MPNNs are Hölder continuous. Such analyses are of two varieties: 1) when the metric space includes graphs of unbounded sizes, the theory is only appropriate for dense graphs, and, 2) when studying sparse graphs, the metric space only includes graphs of uniformly bounded size. In this work, we present a unified approach, defining a compact metric on the space of graphs of all sizes, both sparse and dense, under which MPNNs are Hölder continuous. This leads to more powerful universal approximation theorems and generalization bounds than previous works. The theory is based on, and extends, a recent approach to graph limit theory called graphop analysis.

Abstract:
Masked discrete diffusion models have improved steadily, but still lag behind autoregressive (AR) models in quality, require fixedlength generation, and cannot exploit key-value (KV) caching. Block Diffusion partially bridges diffusion and AR by unmasking left-to-right token blocks, but sacrifices infilling flexibility and KV caching within blocks. Our key insight is that interpolating generation orderings between autoregression and fully-random decoding, rather than committing to a fixed block length, offers a better interpolation between diffusion and AR. We present a new class of language models, Set Diffusion, comprised of 1) a tighter likelihood bound induced by an order-informed noise process and 2) a causal diffusion architecture that enables KV caching under stochastic token orderings. We bias the noise process toward left-to-right generation, rather than enforcing a strict block factorization, such that tokens can be decoded in sliding-window sets for faster inference and greater flexibility for any-order decoding. Set Diffusion achieves better speed-quality tradeoffs on mathematical reasoning, summarization, and unconditional generation compared to prior diffusion language models while offering stronger infilling performance than Block Diffusion.

Abstract:
Recent advances in video generation have enabled highly realistic synthetic content, raising concerns about the integrity of digital media and motivating the development of benchmarks and detection methods for generated videos. Prior works have largely prioritized bolstering model generalization against unseen generators. However, we uncover a neglected factor: the quality distribution of real videos plays a pivotal role. Current training protocols suffer from a clear quality bias between real and fake data, prone to shortcut learning. Compounded by testing on similar real data distributions, this creates an illusion of generalization. In reality, these models fail to generalize when exposed to real data with significantly different quality profiles. To address this, we propose training with qualitymatched real and fake data to mitigate bias. Building on this, we introduce a data expansion strategy that broadens the training set to comprehensively cover the full quality spectrum. This approach enables the model to learn quality-agnostic features for detection, thereby achieving generalization across real data of varying qualities and enhancing real-world applicability. Extensive experiments demonstrate that our method scales well across diverse backbones, consistently enhancing the generalization capability of existing models.

Abstract:
Deploying large language models (LLMs) on untrusted hardware entails a risk of weight extraction, which can lead to unauthorized replication and misuse of the model. A practical approach is to leverage Trusted Execution Environments (TEEs) and protect model security by obfuscating model weights. However, existing obfuscation schemes struggle to simultaneously provide strong security guarantees and high performance: schemes with security guarantees incur substantial overhead due to frequent TEE interactions, whereas schemes that achieve efficient inference are insecure. We propose SLIM, a secure inference framework that exploits the iterative structure of LLMs to let transformed representations cascade through consecutive obfuscated layers, thereby minimizing interactions with the TEE. SLIM introduces a TWay Mixing algorithm that performs consecutive inter-vector covering using carefully constructed block-diagonal Householder matrices and combines it with successive random permutations, providing thorough weight obfuscation while keeping TEE-side computation lightweight. Evaluations demonstrate that SLIM provides robust security guarantees and significantly outperforms prior state-of-the-art obfuscation schemes in terms of performance, delivering up to a 13.80× speedup while preserving fidelity.

Abstract:
We investigate the origins of massive activations in large language models (LLMs) and identify a specific layer named the Massive Emergence Layer (ME Layer), that is consistently observed across model families, where massive activations first emerge and subsequently propagate to deeper layers through residual connections. We show that, within the ME Layer both the RMSNorm and the FFN parameters jointly contribute to the emergence of massive activations. Once formed, the massive activation token representation remains largely invariant across layers, reducing the diversity of hidden representations passed to the attention module. Motivated by this limitation, we propose a simple and effective method to reduce the rigidity of the massive activation token. Our approach consistently improves LLM performance across multiple tasks, including instruction following and math reasoning, in both training free and fine tuning settings. Moreover, we show that our method mitigates attention sinks by selectively weakening their influence, elucidating their origin at the hidden state level and shedding new light on principled mitigation strategies.

Abstract:
Symbolic regression (SR) aims to discover interpretable mathematical expressions from observed data. While recent generative approaches have shown promise in treating SR as machine translation or multimodal learning tasks using NN methods, they suffer from a fundamental limitation: trainingevaluation misalignment. The training objectives (average cross-entropy loss on a token level across the distribution of historical data) differ from the evaluation metric (fitting error for every test data / complexity), necessitating extensive heuristic post-processing and constant optimization. On the other hand, direct optimization methods suffer from curse of dimensionality, non-differentiability and local optima traps. We propose MOD-SR, unifying multimodal distribution learning during training with direct optimization at inference time. This is achieved by modeling the task as p(x_0 \mid \mathcalD, y^) and employing gradient-guided diffusion in embedding space, enhanced by contrastive learning and representation alignment. Furthermore, we introduce DFEX, a fixed-depth tree relaxation method that ensures differentiability for effective gradient guidance during inference. Experiments demonstrate significant improvements over existing methods, achieving superior performance on diverse benchmarks through a unified framework integrating distribution learning and optimization.

Abstract:
Watermarking is crucial for identifying AIgenerated text, however, existing detection methods often focus on offline settings and fail to control the online False Discovery Rate (oFDR) when applied to real-world streams where machine-generated content is sparse and mixed with human writing. To address this issue, in this paper, we propose LORD-GoF, a novel online detection framework that combines a Goodness-of-Fit (GoF) statistic with the Levels based On Recent Discovery (LORD) procedure. We prove that LORD-GoF approach can rigorously control the oFDR below a user-specified level by dynamically adjusting detection thresholds. Extensive experiments on watermarked text from Qwen-2.5-3B, Sheared-LLaMA-2.7B, and OPT-1.3B using both the Gumbel-Max and Inverse Transform watermarking schemes show that our method maintains statistical power comparable to offline benchmarks while successfully controlling the oFDR under complex, mixed streaming scenarios.

Abstract:
Machinelearning interatomic potentials (MLIPs) have advanced rapidly, with many top models relying on strong physics-based inductive bias. However, as models scale to larger systems like biomolecules and electrolytes, they struggle to accurately capture long-range (LR) interactions, leading current approaches to rely on explicit physics-based terms or components. In this work, we propose AllScAIP, a straightforward, attention-based, and energy-conserving MLIP model that scales to O(100 million) training samples. It addresses the long-range challenge using an all-to-all node attention component that is purely data-driven. Extensive ablations reveal that in low-data/small-model regimes, inductive biases improve sample efficiency. However, as data and model size scale, these benefits diminish or even reverse, while all-to-all attention remains critical for capturing LR interactions. Our model achieves state-of-the-art energy/force accuracy on molecular systems (OMol25), while being competitive on materials (OMat24) and catalysts (OC20). Furthermore, it enables stable, long-timescale MD simulations that accurately recover experimental observables, including density and heat of vaporization predictions.

Abstract:
Social recruitment offers a solution to worker scarcity in Spatial Crowdsourcing (SC) but faces challenges that are often ignored in traditional Influence Maximization. First, task heterogeneity arising from offline execution constraints breaks theinterestimplies-participation'' assumption, as social influence often fails to translate into physical presence. Second, finite task demand creates asaturation trap'', a non-submodular setting in which utility drops sharply to zero once demand is met. To bridge these gaps, we propose GKD-Recruiter, a Task-Aware framework designed to maximize Effective Task Satisfaction (ETS). We explicitly model the complex worker-task affinity via a heterogeneous graph and capture directional social influence using a novel Influential GAT. To robustly fuse these distinct signals, we introduce a Graph Knowledge Distillation mechanism. Furthermore, we employ Rainbow DQN to navigate the non-submodular combinatorial search space, avoiding the local optima that trap greedy heuristics. Extensive experiments on real-world datasets demonstrate that GKD-Recruiter significantly outperforms state-of-the-art baselines in both solution quality and inference efficiency. The code is available at \urlhttps://anonymous.4open.science/r/GKD-Recruiter-3A4B.

Abstract:
Distinguishing Large Language Model (LLM) generated text from human writing is a critical and difficult challenge. While LLMs are trained to write like humans, we hypothesize that this training leaves an indelible mark. LLMs develop a particularly strong aversion to token repetition very early in training. This bias persists as a ``Vestigial Heuristic'' (a developmental artifact) that is activated in LLMgenerated text, separating LLM from human writing. To probe this phenomenon, we introduce Telescope Perplexity, a metric that evaluates the token repetition of the model, P(s_i | s_1:i). Our empirical investigation reveals that the Telescope Perplexity signature emerges early in pre-training, and Telescope Perplexity empirically enables highly effective zero-shot LLM detection. We show state-of-the-art or competitive performance across diverse datasets (including modern evaluation sets we introduce), reference models, and perturbation schemes with greater efficiency than other methods.

Abstract:
Reinforcement learning (RL) has become a dominant paradigm for enhancing LLMs' reasoning capabilities. However, RL algorithms with PPOClip are inherently limited by exploration collapse. Subsequent works remain primarily heuristic and fail to identify the essential cause of PPO-Clip’s failure. This work reveals the fundamental flaw of PPO-Clip: it implicitly measures policy discrepancy using Euclidean metric, which is theoretically inconsistent with the intrinsic geometry on the policy Riemannian manifold. This geometric mismatch results in overly conservative updates in low-probability regions while aggressive in high-probability regions, ultimately collapsing exploration. To correct this geometric flaw, we propose Riemannian Isometric Policy Optimization (RIPO), which guarantees isometric policy updates on the Riemannian manifold, effectively balancing exploration and exploitation. We further show that RIPO achieves a favorable bias-variance trade-off, which stabilizes optimization. Extensive experiments demonstrate that RIPO significantly surpasses existing LLM RL algorithms across seven competition-level benchmarks (up to 60% improvement over GRPO on AIME24).

Abstract:
Modeling the bidirectional correspondence between external sensory stimuli and internal neural activity has emerged as a critical frontier in neuroscience. However, existing approaches predominantly treat brain encoding and decoding as isolated tasks, relying heavily on unimodal alignment and external priors while overlooking the brain's intrinsic nature as a multimodal integration system. To address these limitations, we propose BrainJanus, the first unified brain foundation model that integrates brain, vision, and language within a single framework. Specifically, we introduce a Unified Brain Tokenizer to quantize continuous neural dynamics into discrete tokens aligned with visual and linguistic representations in a shared Omni space. Building on this, we utilize an Allin-One autoregressive architecture that leverages next-token prediction to enable seamless any-to-any generation, which encompasses image-to-brain and text-to-brain encoding, and brain-to-image and brain-to-text decoding. Extensive experiments demonstrate that BrainJanus achieves superior performance across diverse benchmarks. Furthermore, our framework exhibits zero-shot generalization and preserves interpretable biological topography, highlighting its potential as a general-purpose brain modeling paradigm. The code and pretrained models will be released to support further research.

Abstract:
Finding mathematical relations underlying natural phenomena is a fundamental task in scientific discovery. Recent advances in evolutionary search with Large Language Models (LLMs) show great promise by leveraging their embedded scientific knowledge. However, discovering governing equations remains challenging due to vast combinatorial hypothesis spaces with exponentially many possible relations. Existing LLMbased approaches treat LLMs as static hypothesis generators unaware of the observed scientific system, leading to suboptimal and inefficient exploration that over-relies on internal priors. To address this, we introduce \emphDecompose, Adapt, and Evolve (DecAEvolve), a framework that combines granular feedback from symbolic term decomposition with LLM refinement through reinforcement learning fine-tuning. DecAEvolve unifies symbolic decomposition with test-time RL adaptation, enabling adaptive rather than static hypothesis generation. Our experiments across diverse scientific benchmarks demonstrate that DecAEvolve significantly improves both the accuracy of discovered equations and the efficiency of the discovery process, reducing error by up to an order of magnitude compared to state-of-the-art baselines.

Abstract:
Graph Foundation Models (GFMs) aim to learn universal patterns through largescale pretraining on diverse graphs and generalize to open-world scenarios. While GFMs have garnered significant attention, existing works primarily focus on sigle-modal graphs. However, many real-world graphs are multimodal, consisting of structures alongside diverse features derived from modalities such as text and images. To date, exploration into Multimodal Graph Foundation Models (MGFMs) remains limited. Incorporating multimodal data provides a more comprehensive view, allowing models to learn richer semantics, thereby advancing GFMs. We are therefore motivated to explore MGFMs, where the core challenge lies in synergistically encoding structures and multimodal features to achieve effective cross-modal alignment and fusion. To this end, we propose a graph foundation model with Cross-modal Alignment and Modality-aware Expert fusion, CAME. Specifically, CAME first generates graph embeddings for each individual modality. We then introduce a multimodal multi-expert encoding mechanism, which includes a dimension-wise routing strategy to fuse multimodal information. Finally, we employ a cross-modal contrastive loss to train CAME, enabling the adaptive alignment and fusion across different modalities. Extensive experiments demonstrate the effectiveness of CAME across multiple tasks and diverse multimodal graph datasets.

Abstract:
Dynamic Facial Expression Recognition (DFER) models the temporal evolution of facial expressions in videos. In realworld deployments, changing scenarios distort expression trajectories over time, making it difficult for existing methods to maintain performance. While most current approaches address this issue through passive feature alignment across scenarios or domain-incremental learning techniques that preserve previously learned representations, they do not explicitly model scenario evolution over time, limiting their ability to robustly capture expression dynamics under scenario-incremental changes. To end this, we proposeEmWorld, an emotion world model for DFER that explicitly models latent emotion state evolution under scenario variations. Specifically,EmWorldformulates scenario-incremental DFER as a progressive Bayesian inference problem over latent world states with dual temporal scales. Slow-timescale component (STS) models scenario evolution using stochastic evolutionary priors, capturing long-term scenario effects and providing proactive guidance in new scenarios. Fast-timescale component (FTS) models frame-level expression dynamics with temporally consistent latent transitions, effectively decoupling expression dynamics from scenario influences. By jointly inferring latent states at both timescales,EmWorld, shifts DFER from a passive feature discrimination process to a active probabilistic state inference under evolving scenarios. Experiments on FERV39k, DFEW, and MAFW demonstrate thatEmWorld, consistently outperforms state-of-the-art methods, achieving up to 3.84% improvement while exhibiting strong cross-scenario stability and long-term robustness.

Abstract:
Long context understanding remains challenging for large language models due to their limited context windows. This paper introduces Long Input FineTuning (LIFT), a novel framework for long-context modeling that can enhance the long-context performance of arbitrary short-context LLMs by dynamically adapting their parameters to the given long input. Importantly, rather than endlessly extending the context window size to accommodate increasingly longer inputs in context, LIFT stores and absorbs the long input in parameters. By fine-tuning the long input into model parameters, LIFT allows short-context LLMs to answer questions even when the required information is not provided in the context during inference, avoiding the quadratic complexity w.r.t. input length of a normal long context model. Furthermore, LIFT does not simply perform continued pretraining on new, long contexts, but leverages carefully designed LLM-generated synthetic tasks to enhance the comprehension of long contexts, moving beyond mere memorization. To accommodate the additional cost of fine-tuning, we design a highly optimized pipeline that reduces the Time to First Token (TTFT) to less than 10 seconds for 8k context. We further provide a comprehensive analysis of LIFT's strengths and limitations in long-context understanding, discuss its feasibility for large-scale real-world deployment, and highlight valuable directions for future research.

Abstract:
Standard Chainof-Thought (CoT) reasoning trades reliability for responsiveness: in a single user-visible token stream, more deliberation delays meaningful output, imposing a ``silence tax.'' We introduce \emphSide-by-Side (SxS) Interleaved Reasoning, a training framework that makes \emphdisclosure timing a controllable decision within standard autoregressive generation. By interleaving \emphsupported partial answers with continued private reasoning in the same context, SxS avoids monolithic reasoning preambles. We treat disclosure as a policy learning problem and train models via a multi-stage pipeline: supervised fine-tuning (SFT) on entailment-aligned interleaved trajectories, followed by reinforcement learning (RL) to recover reasoning performance and optimize accuracy. Without architectural changes, SxS improves the accuracy--latency trade-off across two Qwen3 architectures/scales (MoE Qwen3-30B-A3B, dense Qwen3-4B) and both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, reducing \emphsubstantive content latency by 18% and improving a proxy for perceived wait time by 49%, yielding more responsive interactions without compromising answer quality.

Abstract:
Timeseries analysis is critical in real-world applications, yet the explosion of time-series data imposes severe burdens on storage and computational resources. Recently, dataset condensation has emerged as a promising data-centric solution by synthesizing compact yet informative datasets to replace large-scale raw data. However, existing methods are largely vision-centric, failing to capture unique temporal properties of time series, or task-specific, tightly coupling the condensed data to a particular downstream objective. As a result, these approaches suffer from feature mismatch and fail to generalize across diverse time-series tasks. To bridge this gap, we propose UniTSC, the first unified dataset condensation framework for general time-series analysis. UniTSC employs a multi-view hybrid encoder to capture task-invariant representations across temporal, spectral, and topological perspectives. Building upon this representation, we design a tri-space alignment paradigm that jointly aligns optimization trajectories, power spectral densities, and multivariate dependency structures, enabling comprehensive information preservation under extreme compression. Extensive experiments show that UniTSC retains up to 97.9% of downstream performance using as little as 0.01% of the original training data, revealing that a single batch-equivalent budget (\textless 128 samples) is sufficient to capture the essential dynamics of complex time-series data.

Abstract:
The synthesis of inductive loop invariants is a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on hard instances, generating invariants that are invalid or computationally ineffective. While finetuning is a natural route to mitigate this limitation, obtaining high-quality training data for invariant generation remains an open challenge. We present a rigorous data curation pipeline designed to extract high-quality training signals from raw verifier-generated invariants. First, we formalize the properties required for a high-quality training invariant. Second, we propose \textscWonda, a pipeline that refines noisy data via AST-based normalization, followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. We demonstrate that fine-tuning Small Language Models (SLMs) on this curated dataset result in consistent and significant performance gain. In particular, a fine-tuned 4B parameter model matches the utility of a GPT-OSS-120B baseline and approaches the state-of-the-art GPT-5.2, without incurring reasoning-time overhead. On challenging instances from the recent InvBench evaluation suite, our approach doubles the invariant correctness rate of base models; and improves their Virtual Best Performance (VBP) rates on the verification task by up to 14.2%.

Abstract:
Upweighting highquality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. However, standard scaling laws do not reliably extrapolate across mixture recipes or under repetitions, making the selection for optimal data recipes at scaling underdetermined. To solve this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition. The key idea is to model pretraining as information accumulation, where quality controls information density and repetition induces scale-dependent diminishing returns. We first collect the model performance after training on datasets that vary in scale, quality distribution, and repetition level. Then we build up the modeling for information so that information accurately predicts those model performance. InfoLaw predicts performance on unseen data recipes and larger-scale runs (up to 7B, 425B tokens) with 0.15% mean and 0.96% max absolute error in loss, and it extrapolates reliably across overtraining levels, enabling efficient data-recipe selection under varying compute budgets.

Abstract:
Macro placement is a fundamental step in modern VLSI physical design, determining the solution quality of highdimensional combinatorial optimization problems. Despite recent advancements in machine learning for spatial coordinate determination, the temporal dimension of placement sequencing remains largely governed by static heuristics. In this work, we demonstrate that the placement sequence is not merely a preprocessing step but a decisive factor in optimization, where suboptimal early decisions trigger irreversible domino effects that constrain the solution space. To harness this unexplored dimension, we propose OrderPlace, a novel framework that automates the discovery of macro placement strategies via proxy-guided Large Language Model (LLM) evolution. Unlike existing methods that rely on manual rules like area or connectivity, OrderPlace leverages LLMs to evolve generalizable, code-level ordering strategies—ranging from static metrics to dynamic, physics-inspired mechanisms. To mitigate the prohibitive cost of evaluating sequences, we introduce a lightweight proxy evaluation mechanism that efficiently filters candidates using a deterministic greedy probe. Experimental results on the standard ISPD 2005 benchmarks demonstrate that OrderPlace discovers novel ordering strategies. Compared with WireMask-EA and the state-of-the-art method EGPlace, OrderPlace reduces wirelength by 34.04% and 14.08%, respectively.

Abstract:
Tabular Language Models (TLMs) have been claimed to achieve emergent generalization for tabular prediction. We conduct a systematic reevaluation of Tabula-8B as a representative TLM, utilizing 165 datasets from the UniPredict benchmark. Our investigation reveals three findings. First, binary and categorical classification achieve near-zero median lift over majority-class baselines and strong aggregate performance is driven entirely by quartile classification tasks. Second, top-performing datasets exhibit pervasive contamination, including complete train-test overlap and task-level leakage that evades standard deduplication. Third, instruction-tuning without tabular exposure recovers 92.2% of standard classification performance and on quartile classification, format familiarity closes 71.3% of the gap with the residual attributable to contaminated datasets. These findings suggest claimed generalization likely reflects evaluation artifacts rather than learned tabular reasoning. We conclude with recommendations for strengthening TLM evaluation.

Abstract:
Textto-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they struggle with coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM's "Yes/No" log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) As a reward objective for optimization-based pipelines, VLM3D significantly outperforms existing methods on standard benchmarks. (2) As a test-time guidance module for feed-forward pipelines, it actively steers the iterative sampling process of SOTA native 3D models to correct severe spatial errors. VLM3D establishes a principled and generalizable path to inject the VLM's rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.

Abstract:
Modern largescale LLM pretraining benefits from utilizing Pipeline Parallelism; however, synchronous implementations leave GPUs idle during pipeline bubbles, wasting computational resources. Asynchronous Pipeline Parallelism approaches effectively eliminate these bubbles, maximizing throughput at the cost of gradient staleness. Unlike other schemes, PipeDream-2BW ensures a constant one-step gradient delay regardless of pipeline depth. However, its widespread adoption remains limited due to the common belief that optimizing under staleness is fundamentally unstable. In this work, we challenge this assumption, demonstrating that convergence degradation is largely an artifact of optimizer choice rather than an intrinsic limitation. We provide the first comprehensive analysis showing that while AdamW, the predominant optimizer at the time when PipeDream-2BW was introduced, indeed suffers from severe degradation, recent methods like Muon exhibit inherent robustness under a one step delay. We support this with theoretical analysis and introduce an optimizer-agnostic Error-Feedback mechanism to further mitigate delay effects. Extensive evaluation on models up to 10B parameters confirms that our strategies bridge the performance gap with synchronous training, enabling the practical deployment of asynchronous pipeline parallelism at scale.

Abstract:
Distributed optimization over timevarying directed graphs has shown promising performance in addressing challenges posed by complex communication constraints in real-world scenarios. In many practical settings, however, the direct application of distributed optimization algorithms encounters additional difficulties, most notably hyperparameter tuning, which our empirical observations suggest can be effectively mitigated by integrating bilevel optimization. Motivated by these findings, we study distributed bilevel optimization over time-varying directed networks, a problem that remains largely unexplored due to the compounded challenges arising from consensus bias in dynamic unbalanced communication and the nested optimization structure. In this work, we propose a fully first-order distributed gradient-based algorithm that integrates the Push–Pull (also known as AB) communication strategy with a value function-based penalty method and establish its non-asymptotic convergence properties. Notably, a simplified variant of our analysis framework for nonconvex single-level distributed optimization establishes a convergence rate for the Push–Pull algorithm, thereby resolving an open question concerning its convergence over time-varying directed graphs. Empirical evaluations across diverse tasks, including hyperparameter tuning, data hyper-cleaning, and reinforcement learning, validate the effectiveness and efficiency of the proposed algorithm.

Abstract:
Infrared and visible image fusion aims to integrate complementary information from both modalities. However, most existing methods rely on Euclidean representations, which inherently impose geometric constraints that hinder effective semantic modelling. Specifically, Euclidean geometry imposes rigid distance metrics that distort multimodal feature interactions, particularly in preserving parent-to-child semantic hierarchies. To overcome this, we introduce a text-driven fusion framework empowered by hyperbolic manifold learning. In particular, our approach models text-attribute correlation during training by leveraging BLIP-extracted prompts to align with vision-attribute, thereby enabling the formation of adaptive enhancement strategies semantically. Within the hyperbolic space, the text prompts act as topological anchors, guiding vision-attribute alignment through hyperbolic embeddings that naturally expand with semantic granularity. Using the Poincaré ball's negative curvature, we encode coarse-to-fine semantics without Euclidean distance saturation, while its exponentially growing periphery prevents texture distortion during cross-modal fusion. During inference, the fusion process autonomously adapts to the input content using learned text-attribute priors, eliminating any dependence on textual input. The experimental results show that the proposed method outperforms existing state-of-the-art methods on existing publicly available benchmark datasets.

Abstract:
Textto-Music diffusion models are increasingly used in real-world applications, yet deployment remains challenging: generations can collapse to limited patterns even with diverse initial noise and prompts, and inference-time diversity control often harms text alignment and fidelity by distorting key prompt cues established in early denoising. To address this, we propose Padding-Annealed Diffusion Sampling, which perturbs only a padding-indexed subspace while keeping non-padding conditioning fixed, enabling controlled exploration with reduced semantic drift. However, in a text-unaware VAE latent space, such exploration is less likely to stay within genre-faithful neighborhoods, limiting genre-consistent diversity. We therefore introduce Text-Aware Latent space that aligns local neighborhoods with text-implied genre structure, promoting genre-consistent diversity. Together, the two techniques form a unified pipeline that, compared to prior methods that perturb the full conditioning, achieves a better text alignment--diversity trade-off: at comparable text alignment, it delivers 15.4% higher diversity with a relatively small fidelity drop, and further improves within-genre diversity by 71.6%. Generated samples are available at https://pads-tal.github.io/PADS-TAL.io

Abstract:
Benchmarks for LLMassisted theorem proving in Lean are often treated as intrinsically reliable because every solved instance comes with a machine-checked proof. However, the kernel only checks that a proof establishes a \emphformal statement; it does not verify that the statement faithfully encodes the intended informal problem, nor that evaluation harnesses are robust to trivial or adversarial solutions. We audit widely used Lean theorem-proving benchmarks and find recurring defects in every dataset we examined, including missing hypotheses, problem simplification, incomplete or incorrect translations, and Lean-specific specification hazards. Beyond dataset construction, we survey and identify evaluation-time failure modes that can inflate reported success without demonstrating meaningful proof capability. We propose a fault taxonomy, a suite of automated checkers and prompts, and release standards to guide the creation of formal math datasets and make evaluation more reproducible and trustworthy.

Abstract:
Large Reasoning Models (LRMs) achieve superior problemsolving through extended chain-of-thought generation, but the resulting key-value (KV) cache grows linearly with sequence length and creates severe memory bottlenecks—often exceeding GPU capacity for long reasoning traces. Existing KV cache compression methods rely on recent queries to estimate future token importance, implicitly assuming these serve as reliable proxies for future attention patterns. We demonstrate that this assumption fails in long-horizon reasoning: certain decoding steps generate Thought Revisiting Tokens (TRT) that re-attend to distant previous context, such as task-solving plans formulated early in the trace. Through systematic analysis, we discover that queries corresponding to the TRT cluster into a small number of similarity groups in the embedding space. Based on this insight, we propose BeaconKV, a training-free KV cache compression method that maintains beacon queries—compact representatives for each global query cluster—to anticipate which KV pairs will be revisited without storing the entire query history. We introduce Continual Farthest Point Sampling for memory-efficient beacon identification during inference. Across four open-source LRMs and diverse reasoning benchmarks, BeaconKV consistently outperforms state-of-the-art methods, achieving up to 5.8× memory reduction while nearly preserving full cache accuracy and improving throughput by over 4.3×.

Abstract:
The Mixture of Experts (MoE) framework is widely used in continual learning to mitigate catastrophic forgetting. MoEs typically combine a small intertask shared parameter space with largely independent expert parameters. However, as the number of tasks increases, the shared space becomes a bottleneck, reintroducing forgetting, while fully independent experts require explicit task ID predictors (e.g., routers), adding complexity. In this work, we eliminate the inter-task shared parameter space and the need for a task ID predictor by enabling expert communication and allowing knowledge to be shared dynamically, akin to human collaboration. We bridge the inter-expert knowledge sharing by leveraging the open-set learning capabilities of a multimodal foundation model (e.g., CLIP), thereby providing “expert priors” that bolster each expert’s task-specific representations. Guided by these priors, experts learn calibrated inter-task posteriors. Additionally, Multivariate Gaussians over the learned posteriors promote complementary specialization among experts. We propose new evaluation benchmarks that simulate realistic continual learning scenarios, and our prior-conditioned strategy consistently outperforms existing methods across diverse settings without relying on reference datasets or replay memory.

Abstract:
Diffusion models have shown impressive capabilities in textto-image synthesis, but multi-concept personalized generation remains challenging, particularly in aligning multiple reference concepts while preserving fidelity. In this work, we propose a novel framework that addresses this challenge with a two-stage Sketch-to-Rendering process, utilizing Dual Optimal Transport (OT) for structural alignment and texture injection. Our approach consists of two key components: Structural Guidance via OT: Ensures shape alignment by using mass-preserving OT for spatial consistency, and Texture Injection via Geometry-Guided OT: Leverages low-frequency structure alignment to inject high-frequency texture details via OT-based residual transfer, preserving texture fidelity without distorting structure. Extensive experiments demonstrate that our method significantly enhances both conceptual fidelity and visual quality in multi-concept generation. Ablation studies further confirm the effectiveness of the proposed optimal transport guidance and the decoupling of structure and texture during the generation process.

Abstract:
Large language models (LLMs) are increasingly deployed in privacysensitive domains, where users must balance the risk of data exposure through external APIs against the high computational cost of local deployment. Split learning has therefore emerged as a promising paradigm for LLM fine-tuning and inference under limited local resources. However, it introduces new privacy risks. Prior work primarily studies leakage of private input prompts, typically via inversion attacks on intermediate representations, while the potential for sensitive information leakage through generative response outputs remains largely unexplored. In this work, we unveil novel vulnerabilities of Split-LLM by presentingPatched ModelInversion withDual-SidedInitialization(PIDI), a two-stage attack that simultaneously targets both private input prompts and output responses in Split-LLM settings. It combines dual-sided initialization with a patched inversion strategy to tackle long sequences, substantially outperforming prior inversion methods. To counter threats from both sides, we further propose theAdapter-basedDualGuard withMutualInformation Defense(ADMI), which integrates an adapter-based local warmup strategy and mutual information regularization to provide a strong privacy guarantee with minimal impact on task performance. Extensive experiments across diverse tasks and models demonstrate that ADMI effectively defends against PIDI and other state-of-the-art inversion attacks. Our code is publicly available at https://anonymous.4open.science/r/DDRA/.

Abstract:
Shampoo is attracting considerable attention for its superior performance on largescale optimization benchmarks; yet it faces a significant practical bottleneck: the prohibitive computational overhead of matrix inversion. To mitigate this, practitioners typically rely on stale preconditioner updates, creating a fundamental trade-off between computational efficiency and optimization fidelity. In this work, we provide a theoretical study of staleness through the complementary lenses of convergence and stability. While staleness improves computational efficiency, it inherently degrades performance and introduces numerical instability. Guided by this analysis, we propose FOAM, an adaptive algorithm that stabilizes training by dynamically controlling both the damping factor and the eigendecomposition frequency based on an approximation of the staleness-oriented error. Experimental results demonstrate that FOAM reduces wall-clock time compared to standard Shampoo while maintaining robust convergence.

Abstract:
Deep multiview graph clustering (DMGC) typically leverages graph neural networks for representation learning, but most existing methods excessively depend on local and static graph structures and only utilize simplistic cross-view fusion strategies. To this end, this paper proposesAttribute-aware Bidirectional Structural Refinement (ABSR) andPseudo-label Guided Multi-level Fusion (PGMF) for DMGC, termedAPGC. Specifically, ABSR selectively strengthens high-quality connections and suppresses semantically conflicting relationships, achieving bidirectional refinement of the graph structure based on attribute similarity. It incorporates global attribute semantics into the graph structure, thereby promoting the homophilic connections for discriminative graph representation learning. Guided by reliable pseudo-labels, PGMF achieves adaptive weighted fusion at both the node-level and the view-level, effectively balancing the differentiated contributions of multi-view information. Experiments on six homophilic and heterophilic datasets demonstrate the superior clustering performance of the proposed APGC method.

Abstract:
Although diffusion models have revolutionized continuous domains like image synthesis through high quality generations and controllable guidance mechanisms, bringing this controllability to the discrete, sequential nature of text remains an open challenge. Meanwhile, current sampling strategies and guidance methods adjust token likelihoods without capturing the broader semantic landscape, leading to a suboptimal balance between fidelity and diversity. In this work, we introduce a novel trainingfree Semantic-Aware Kernel Entropy (SAKE) guidance method. Our method computes the order-2 Rényi entropy over a kernel Gram matrix that captures both cross-token semantic interactions and relative token positions. By linearizing this objective in the embedding space, we derive a tractable guidance signal that dynamically adjusts the sampling distribution—flattening it to encourage exploration during redundancy and sharpening it for fidelity when diverse. Empirical experiments demonstrate that our approach achieves a superior Pareto frontier between fidelity and diversity, and improves multi-sample performance on reasoning-intensive tasks, such as code and mathematics generation, compared to temperature scaling and discrete guidance baselines.

Abstract:
Recent benchmarks reveal that singlecell perturbation response models are often outperformed by simply predicting the dataset mean. Through large-scale in silico simulations, together with analyses of two real-world perturbation datasets, we trace this anomaly to a metric artifact: unweighted error metrics systematically reward mean predictions when perturbation effects are sparse. To address this limitation, we introduce differentially expressed gene (DEG)-aware metrics—weighted mean-squared error (WMSE) and weighted delta R^2 (R^2_w(\Delta))—that sensitively measure error in niche, perturbation-specific signals. We further propose explicit negative and positive performance baselines to calibrate these metrics. Under this framework, the mean baseline sinks to null performance, while genuinely informative predictors are correctly rewarded. Finally, we show that using WMSE as a training objective reduces mode collapse and improves predictive performance across multiple model architectures.

Abstract:
Graphs with heterophily, where adjacent nodes carry different labels, are prevalent in realworld applications, from social networks to molecular interactions. However, existing spectral Graph Neural Network (GNN) approaches tailored for heterophilous graph classification suffer from hub-dominated (node with large degree) aggregation and oversmoothing, as their suboptimal polynomial filters introduce approximation errors and blend distant signals. To address the degree-biased aggregation and suboptimal polynomial filtering, we introduce a Hierarchical Multi‐view HAAR (HMH), a novel spectral graph‐learning framework that scales in near‑linear time . HMH first learns feature- and structure-aware \emphsigned affinities via a heterophily-aware encoder, then constructs a soft graph hierarchy guided by these embeddings. At each hierarchical level, HMH constructs a sparse, orthonormal, and locality-aware Haar basis to apply learnable spectral filters in the frequency domain. Finally, skip-connection unpooling layers combine outputs from all hierarchical levels back into the original graph, effectively preventing hub domination and long-range signal bottleneck (over-squashing). Experimentation shows that HMH outperforms state‑of‑the‑art spectral baselines, achieving up to a 3 % improvement on node classification and 7% on graph classification datasets, all while maintaining linear scalability.

Abstract:
Federated learning (FL) is vulnerable to backdoor attacks. Yet sustaining backdoor effectiveness under repeated aggregation remains challenging. Existing methods often rely on heuristic trigger designs or indiscriminant parameter manipulation, leading to rapid decay or detectable anomalies. In this work, we view FL backdoor persistence through the lens of optimization dynamics, and argue that longlasting attacks require alignment between trigger-induced representations and aggregation-stable parameter directions. Based on this insight, we propose the Coupled Trigger Optimization and Vulnerable Parameter Alignment (CTO-VPA) FL backdoor attack method. By constraining updates to this coupled subspace, backdoor behaviors can be embedded into optimization-stable directions while preserving benign performance. Experiments across multiple datasets and defense settings show that CTO-VPA achieves substantially improved persistence and robustness compared to prior attacks, highlighting the importance of trigger–parameter coupling in FL settings.

Abstract:
Heterogeneous federated learning (HtFL) has emerged as a promising approach to address heterogeneity in local computational resources and data distribution. However, existing methods cause performance degradation of model personalization because personalized and generalized knowledge are either intertwined or dominated by one of them. To address this issue, we propose a novel Elastic Mixture of Experts (EMoE) architecture on HtFL, namely FedEMoE, decoupling personalization from generalization. Specially, FedEMoE employs a multiscale feature extraction mechanism via personalized experts to enrich personalized knowledge. Furthermore, we design an elastic shared expert to break the transferred knowledge bottleneck across heterogeneous client models. The elastic shared expert can adaptively expand or shrink according to the status of each expert by the weight spectrum analysis, respectively. Extensive experiments across statistical and model heterogeneity settings demonstrate that FedEMoE significantly outperforms state-of-the art methods on the accuracy of each heterogeneous model over diverse datasets.

Abstract:
VisionLanguage Models (VLMs) have demonstrated exceptional general reasoning capabilities. However, their performance in embodied navigation remains hindered by a scarcity of aligned open-world vision and robot control data. Despite simulators providing a cost-effective alternative for data collection, the inherent reliance on photorealistic simulations often limits the transferability of learned policies. To this end, we propose Sandbox-Abstracted Grounded Experience (SAGE), a framework that enables agents to learn within a physics-grounded semantic abstraction rather than a photorealistic simulation, mimicking the human capacity for mental simulation where plans are rehearsed in simplified physics abstractions before execution. SAGE system operates via three synergistic phases: (1) Genesis: constructing diverse, physics-constrained semantic environments to bootstrap experience; (2) Evolution: distilling experiences through Reinforcement Learning (RL), utilizing a novel asymmetric adaptive clipping mechanism to stabilize updates; (3) Navigation: bridging the abstract policy to real-world control. We demonstrate that SAGE significantly improves navigation performance, achieving a 53.21% LLM-Match Success Rate on A-EQA (+9.7% over baseline) and generalizing widely to real-world deployments.

Abstract:
Video quality assessment (VQA) plays a critical role in optimizing video delivery systems. While numerous objective metrics have been proposed to approximate human perception, the perceived quality strongly depends on viewing conditions and display characteristics. Factors such as ambient lighting, display brightness, and resolution significantly influence the visibility of distortions. In this work, we address the question of the multiscreen quality assessment on mobile devices, as this area still tends to be under-covered. We introduce a first large-scale subjective dataset collected across more than different 300 Android devices, accompanied by metadata on viewing conditions and display properties. We propose a strategy for aggregated score extraction and adaptation of VQA models to device-specific quality estimation. Our results demonstrate that incorporating device and context information enables more accurate and flexible quality prediction, offering new opportunities for fine-grained optimization in streaming services. Ultimately, this work advances the development of perceptual quality models that bridge the gap between laboratory evaluations and the diverse conditions of real-world media consumption. We made the dataset and the code available at [Link is redacted].

Abstract:
Autoregressive video diffusion models enable \emphstreaming generation, opening the door to longform synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to × 5--× 10 end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.

Abstract:
We introduce UCSBench, a dataset spanning 170+ hours of egocentric visual observations with 7K+ timestamped questions for diagnosing User-centric Continual Spatial intelligence in egocentric video streams. UCS-Bench targets a new problem that emphasizes dynamic spatial reasoning, long-term memory, and their alignment with users' real-time locations. We propose DirectMe, a framework that incrementally constructs and maintains a structured spatial memory from streaming egocentric observations. DirectMe enables robust tracking and recall of object locations, all relative to user's movement over time. By tightly coupling visual perception with memory updates and spatial reasoning, our approach supports long-horizon queries that require recalling interactions, resolving viewpoint-induced ambiguities, and adaptation to dynamic scenes. Our experiments show that DirectMe significantly improves the spatial reasoning of leading multimodal LLMs; it also surpasses many spatial-aware and long streaming video models. We hope our benchmark and solution will advance spatial intelligence research for egocentric AI assistants. Data and code will be released.

Abstract:
Soil microorganisms control organic matter cycling and largely determine how soil systems can cope with and mitigate climate change and environmental threats. Representing microbial dynamics in processbased soil models is therefore critical to predict carbon cycling in soils, albeit highly challenging to inform from data. One promising approach to improve their parametrisation is the integration of genomic data, yet modelling the complex and unknown relationship between genomes and the processes the microbes are driving is an unsolved problem. In this work, we present the first hybrid modeling framework for deriving biokinetic parameter values of a process-based soil organic matter turnover model from metagenome-inferred functional traits based on DNA sequencing data. Our model predicts biokinetic parameters of the process-based model from genomic trait data with a neural network and integrates constraints from ecological theory and literature to ensure realistic behavior, even of non-observed state variables. We evaluate our method on synthetic genomic trait datasets of varying complexity, showing that our approach improves performance, reduces model equifinality, and learns the dynamics of unmeasurable components of the process-based model effectively, even for small training datasets.

Abstract:
Spiking Neural Networks (SNNs) have garnered increasing attention for their biological plausibility, energy efficiency, and temporal modeling capability. Due to the nondifferentiability of spike generation, a widely used supervised training method for SNNs is backpropagation through time with surrogate gradients, which achieves competitive performance with a small number of timesteps. Intuitively, scaling timesteps should further improve performance by enriching temporal dynamics. However, we observe timestep scaling paradox (TSP), a counter-intuitive degradation in accuracy when scaling timesteps. We investigate TSP and link it to long-term temporal gradient vanishing and weakened cross-timestep dependencies. To address this, we propose the Timestep-Scalable (TS) neuron model. It introduces long-term memory reconsolidation to enhance cross-timestep information flow and enable effective learning with more timesteps. In parallel, a temporal forgetting mechanism periodically truncates the accumulation path, suppressing excessive temporal information buildup and improving training stability. Supported by theoretical analysis and extensive experiments, TS consistently improves performance when scaling timesteps. Beyond gains from timestep scaling, it attains state-of-the-art results on EEG signals, event-based recognition, and time-series forecasting, while remaining strong on conventional image classification and object detection datasets.

Abstract:
Computational protein design holds immense promise across diverse domains, but existing approaches face significant challenges: traditional physicsbased methods require substantial domain expertise, while emerging deep learning methods often rely on restricted functional ontologies, struggle to bridge the semantic gap between text and protein sequences, or lack closed-loop optimization mechanisms. In this paper, we present PDAgent, an LLM-driven autonomous agent framework that enablesin silicoprotein design through template-based directed mutation. Our framework accepts natural language specifications of desired protein properties and employs a ReAct-style reasoning loop comprising five phases: THINK, PLAN, ACT, OBSERVE, and REFLECT. PDAgent integrates template retrieval, conservation-aware mutation strategies, and domain-specific computational tools for property optimization across eight biophysical dimensions. Experiments on 100 diverse protein design tasks demonstrate that PDAgent achieves a 91.34% average constraint satisfaction rate with high structural quality (mean pLDDT 87.69), substantially outperforming both direct LLM generation and specialized deep learning methods.

Abstract:
Current image editing methods excels at static attributes but fails at complex HumanObject Interactions (HOI), a critical challenge unaddressed by existing benchmarks that conflate HOI with static attributes, relying on global metrics incapable of simultaneously assessing dynamic interaction validity and entangled human-object pair preservation. Thus, we first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels, which features an automated metric HOI-Eval that first reliably evaluates instance-level interaction by letting VLM Q&A after thinking with images containing grounded Human-Object pair. Considering the task's essence of remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models, finding them inherently suited for dynamic editing due to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a "replay of the failure process", offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel, agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to more accurately present the target HOI. Extracted frames from these videos are the final editing results. On HOI-Edit, SCPE achieves performance competitive with state-of-the-art (SOTA) editing models like Nano Banana on interaction.

Abstract:
Testtime training (TTT) adapts a pretrained model to each prompt via parameter updates, improving accuracy under pretraining-to-test distribution shifts. Yet, its performance often suffers from instability and sensitivity to hyperparameters such as update steps and subspace. We explain this behavior through a decision-theoretic lens, treating TTT as implicit Bayesian inference in the kernel regime. Under a Gaussian process benchmark, we show that TTT reduces prediction error when updates are spectrally matched to the prompt's signal-to-noise ratio and aligned with query-relevant eigen-directions. This perspective underpins the following results: (1) we show when fixed update steps and subspaces fail under distribution shifts, motivating adaptive strategies; (2) we prove that selecting update steps via prompt evidence admits a PAC-Bayes guarantee against overfitting; and (3) we characterize the Bayes-optimal update subspace under a linear-Gaussian correction model, yielding a scoring rule for selecting Transformer blocks and heads. Our theory helps explain the empirical instability of TTT, taking a step toward principled guidance for when, how far, and which directions to adapt.

Abstract:
Conversational query reformulation (CQR) has become indispensable for improving retrieval in dialoguebased applications. However, existing approaches typically rely on reference passages for optimization, which areimpracticalto acquire in real-world scenarios. To address this limitation, we introduce a novelreference-freepreference optimization frameworkDualReformthat generatespseudo reference passagesfromcommonly-encounteredconversational datasets containing only queries and responses. DualReform attains this goal through two key innovations: (1)response-based inference, where responses serve as proxies to infer pseudo reference passages, and (2)response refinement via the dual-role of CQR, where a CQR model refines responses based on the shared objectives between response refinement and CQR. Despite not relying on reference passages,DualReformachieves 96.9--99.1% of the retrieval accuracy attainable only with reference passages and surpasses the state-of-the-art method by up to 31.6%.

Abstract:
Highdimensional genomics studies are frequently confounded by unmeasured biological processes that obscure disease-specific signals. While existing workflows can estimate these latent confounders, they fail to quantify how robust a discovery is to varying levels of hypothetical confounding. We introduce sensGAN, a deep-learning adversarial framework that systematically explores the confounding spectrum by learning "worst-case" latent variables that nullify the most gene associations under novel predictive-gain constraints. By identifying the minimum confounding strength required to explain away an observed effect, our method shifts the paradigm toward a formal, quantitative sensitivity analysis. In diverse simulations, sensGAN accurately recovers latent structures and outperforms existing methods in identifying confounder-sensitive genes. Applied to human Alzheimer's disease microglia, our framework prioritizes robust disease pathways while successfully isolating signals driven by unmeasured co-occurring neurodegenerative pathologies.

Abstract:
Large language models (LLMs) have shown promise as interactive agents that solve tasks through extended sequences of environment interactions. While prior work has primarily focused on systemlevel optimizations or algorithmic improvements, the role of task horizon length in shaping training dynamics remains poorly understood. In this work, we present a systematic empirical study that examines horizon length through controlled task constructions. Specifically, we construct controlled tasks in which agents face identical decision rules and reasoning structures, but differ only in the length of action sequences required for successful completion. Our results reveal that increasing horizon length alone constitutes a training bottleneck, inducing severe training instability driven by exploration difficulties and credit assignment challenges. We demonstrate that horizon reduction is a key principle to address this limitation, stabilizing training and achieving better performance in long-horizon tasks. Moreover, we find that horizon reduction is related to stronger generalization across horizon lengths: models trained under reduced horizons generalize more effectively to longer-horizon variants at inference time, a phenomenon we refer to as horizon generalization.

Abstract:
Visionand-Language Navigation in Continuous Environments (VLN-CE) requires agents to make fine-grained navigation decisions under partial observability. However, most existing methods rely on open-loop execution, lacking mechanisms to detect and correct internal state drift during inference. We propose SC^2-WM, a self-correcting world model framework that introduces internal feedback for closed-loop decision making in VLN-CE. Our method derives feedback from world-model foresight to perform state-level plan refinement before action execution. To handle challenging scenarios, we further introduce conditional world-aware adaptation, which enables model-level correction by selectively updating the world model at test time when feedback indicates model capacity insufficiency. Experiments on standard VLN-CE benchmarks demonstrate improved navigation robustness and generalization. Code is available in the Supplementary Material.

Abstract:
Handwritten mathematical expression recognition (HMER) remains challenging in realworld educational scenarios, even with recent advances in large vision-language models. While these models often achieve high accuracy in local symbol transcription, their reliability in capturing two-dimensional mathematical structure under realistic handwritten conditions is still poorly understood. We introduce a real-world handwritten benchmark covering 13 categories of structurally complex expressions with authentic writing artifacts. Evaluations on large models reveal a clear performance degradation as structural complexity increases, even when symbol-level accuracy is high. Most failures arise from structural mis-parsing and context-dependent symbol role confusion rather than pure visual perception errors. To mitigate this issue, we propose a training-free, schema-anchored structure-aware inference framework that decomposes recognition into schema identification, schema-constrained transcription, and context-driven disambiguation. Our method improves the ExpRate from 11.63% to 24.52% on Qwen-8B and generalizes well across multiple large models. Our benchmark provides a realistic evaluation for large models on handwritten mathematics, and our framework offers an effective and interpretable solution to structure-related failures in real-world HMER.

Abstract:
LowRank Adaptation (LoRA) is the dominant parameter-efficient fine-tuning method due to its favorable compute-performance trade-off, yet it suffers from catastrophic forgetting. We study forgetting through a tractablemean-field self-attentiontoy model, where tokens evolve as an interacting particle system and LoRA acts as a low-rank perturbation. Using tools from partial differential equations and dynamical systems, we characterize regimes suggesting a phase transition between forgetting and non-forgetting behavior. We show that one phase transition appears with respect to the norm of the perturbation, and the other with respect to the depth of the Transformers. We further bound the time-to-deviation in terms of the perturbation size and spectral quantities, and corroborate the predicted trends with experiments and exploratory analyses on real models under LoRA fine-tuning.

Abstract:
Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large visionlanguage models (VLMs) remain brittle at such tasks, especially for interventional and counterfactual queries over multi-image inputs. Most existing explorations inject causal knowledge via textual prompts, leaving causal mechanisms external to model execution and limiting reliable control during inference. To address this problem, we propose BridgeVLM, which internalizes visual causal reasoning by inducing a causal graph from multi-image inputs and converting it into structured Causal Tokens executed by RAMP layers injected into the LLM decoder for causal message passing. We further introduce a unified training interface M3S for fine-grained causal supervision from different granularities (local/global level). BridgeVLM achieves 54.4% accuracy on intervention tasks on CausalVLBench (vs. 33.2% with prompt-level supervision), improves results on Causal3D from 43.6% to 49.0%, and substantially improves causal structure learning on CausalVLBench (F_1: 33.4% \rightarrow 75.1%).

Abstract:
Structured scientific spectra encode rich physical information while satisfying hard constraints such as conservation and spectral geometry. We study a canonical example, the electronic density of states (eDOS), whose accurate prediction is central to materials discovery. Prior methods often (i) decouple band gaps from eDOS, (ii) violate totalstate conservation, or (iii) collapse crystals into global embeddings that obscure atom-projected contributions. We introduce DeepSciReasoner, a design paradigm for deep scientific reasoning under physical constraints. Instantiated for eDOS prediction, DeepSciReasoner combines structure-aware spectrum decoding with constraint-preserving physical reasoning, in this case, mass-conserving iterative refinement. It substantially improves eDOS accuracy while maintaining physical validity, enabling reliable high-throughput screening. Beyond eDOS, DeepSciReasoner offers a reusable blueprint for predicting structured scientific spectra under hard physical constraints.

Abstract:
Multiagent collaborative perception improves perception range and robustness in autonomous driving. However, most existing methods assume homogeneous sensors and perception networks, which is unrealistic in real-world heterogeneous systems. Differences in sensing modalities and independently trained models lead to significant semantic and geometric inconsistencies, limiting effective collaboration. To solve these problems, we propose a novel cross-modal adaptation framework with Mixture-of-Experts and geometry-guided fusion for heterogeneous collaborative perception, named X-MoGe. Specifically, we propose a Pixel-level Mixture-of-Experts (P-MoE) module, which adaptively models modality-specific semantic characteristics under heterogeneous sensing conditions. In addition, a geometry-guided feature fusion module incorporates explicit geometric priors to enforce spatial alignment and consistency in the BEV space. Extensive experiments on OPV2V and DAIR-V2X datasets demonstrate that the proposed method achieves state-of-the-art performance with strong robustness and scalability in heterogeneous collaborative perception.

Abstract:
Navigation in complex environments relies on internal spatial representations that guide action. While the brain employs a diverse repertoire of spatial tuning cells—including grid, place, and headdirection cells—a normative theory linking these static neural codes to the dynamic process of navigation remains elusive. In this work, we propose a Unified Laplacian Framework derived from first principles of representational smoothness and efficiency. We first demonstrate that diverse spatial codes emerge naturally as spectral decompositions of the Laplacian operator. Crucially, bridging the gap from representation to action, we derive a biologically plausible navigation policy based on the Green's function potential. We show that this potential encodes the environment's intrinsic geometry to enable simple, trap-free gradient ascent, achieving significantly improved sample efficiency and generalization in goal-reaching tasks. Furthermore, we demonstrate that these spectral representations can be learned directly from high-dimensional visual inputs, confirming its plausibility in realistic environments. Our results suggest that the "cognitive map" can be viewed as a spectral embedding of the Laplacian, providing a rigorous foundation for spatial cognition in both biological and artificial agents.

Abstract:
Multiobjective discrete optimization problems, such as molecular design, pose significant challenges due to their vast and unstructured combinatorial spaces. Traditional evolutionary algorithms often get trapped in local optima, while expert knowledge can provide crucial guidance for accelerating convergence. Large language models (LLMs) offer powerful priors and reasoning ability, making them natural optimizers when expert knowledge matters. However, closed-source LLMs, though strong in exploration, cannot update their parameters and thus cannot internalize experience. Conversely, smaller open models can be continually fine-tuned but lack broad knowledge and reasoning strength. We introduce Multi-LLM Collaborative Co-evolution (MCCE), a hybrid framework that unites a frozen closed-source LLM with a lightweight trainable model. The system maintains a trajectory memory of past search processes; the small model is progressively refined via reinforcement learning, with the two models jointly supporting and complementing each other in global exploration. Unlike model distillation, this process enhances the capabilities of both models through mutual inspiration. Experiments on multi-objective drug design benchmarks show that MCCE achieves state-of-the-art Pareto front quality and consistently outperforms baselines. These results highlight a new paradigm for enabling continual evolution in hybrid LLM systems, combining knowledge-driven exploration with experience-driven learning.The code of MCCE is available on https://anonymous.4open.science/r/MCCE_Anonymous-1F92

Abstract:
Fewshot point cloud semantic segmentation (FS-PCSS) aims to achieve precise segmentation of novel categories using only limited labeled samples. Existing prototype-based methods typically rely on shallow feature fusion strategies, failing to adequately model the feature distribution shift between support and query sets, resulting in insufficient prototype adaptation. To address this, we propose the Deep Prototype Refinement Network (DPR-Net), which systematically achieves progressive adaptation by constructing a coarse-to-fine prototype evolution trajectory. Our core Dynamic Prototype Refinement (DPR) module explicitly decomposes features into common and distinctive subspaces based on channel activation, enabling targeted adjustment of domain-sensitive features while preserving class-shared semantics. By cascading multiple refinement modules, we construct a prototype trajectory transitioning from support-biased to query-adapted representations, mitigating both under- and over-adaptation. Furthermore, our Mixture of Prototype Experts (MoPE) mechanism treats prototypes at different stages as complementary experts and adaptively ensembles their predictions through confidence-driven weighting. Extensive experiments demonstrate that DPR-Net achieves state-of-the-art performance with high efficiency. Notably, with only 0.28M parameters, DPR-Net achieves 80.76% mIoU on S3DIS (2-way 1-shot), representing a 15.92% improvement over the baseline.

Abstract:
AIgenerated text increasingly blends with human writing, raising practical risks such as misinformation, academic misuse, and corpora contamination. While statistical detectors are appealing for efficiency and generalization, they suffer from two key limitations. (i) Boilerplate dominance, boilerplate tokens shared across human and LLM writing can overwhelm discriminative signals. (ii) Brittle point estimates, relying on a single probability score yields unstable decisions under adversarial manipulations. To address these issues, we propose Uncertainty, a multiscale uncertainty estimator that focuses on informative low-probability tokens, which more clearly expose distributional discrepancies. Locally, it alleviates boilerplate dominance by averaging the log-probabilities of low-probability tokens; globally, it reduces brittleness by capturing the distributional shape of this low-probability region via Rényi entropy. We further extend the estimator to Uncertainty++ via conditional independent sampling, yielding a more stable uncertainty estimation. Experiments on seven datasets and sixteen LLMs demonstrate high accuracy, generalization, and robustness. Our code is available at https://anonymous.4open.science/r/Uncertainty-8915.

Abstract:
Generalized ZeroShot Learning (GZSL) suffers from severe seen-class bias, a challenge stemming from the label incompleteness inherent in mixed test distributions. To address this, we propose PUFE, a unified plug-and-play framework that recasts GZSL inference as a Positive-Unlabeled (PU) learning task by treating seen categories as positive and mixed test data as unlabeled. Serving as a seamless post-processing module, PUFE constructs a PU classifier in the semantic space, jointly estimating the seen-class posterior and labeling propensity via Maximum Likelihood Estimation (MLE) within a dual-head network. Furthermore, we introduce an adaptive prototype calibration strategy that leverages high-confidence pseudo-instances—identified by the PU classifier—to explicitly align semantic prototypes with the underlying test distribution. Extensive experiments demonstrate that PUFE consistently mitigates bias and significantly boosts the performance of various embedding-based baselines, yielding gains of up to 11.2 percentage points in the harmonic mean.

Abstract:
Despite strong performance in competitive programming, the role of Large Language Models (LLMs) in supporting human learning in the same setting remains largely unexplored. In this work, we introduceUOJBench, a benchmark designed to evaluate not only the problem-solving ability of LLMs, but also their ability to identify errors in human-written code—a crucial educational activity traditionally supported by running test cases over online judge systems. UOJ-Bench consists of three distinct tasks: code generation, code hacking, and code repair, all constructed from real-world code submissions on the Universal Online Judge (UOJ) and evaluated through UOJ's native judging infrastructure. Our results show that under one-shot evaluation, even the strongest models fail to identify errors in more than 50% of a set of submissions that have been found to be incorrect by UOJ users. While test-time scaling improves success rates to above 90%, the substantial computational costs incurred from model inference limit its practicality for large-scale deployment. Despite these limitations, we find that the best-performing models under test-time scaling can uncover errors in over 5% of full-score submissions across roughly 30 problems, suggesting that frontier LLMs can already provide complementary signals beyond standard judging systems.

Abstract:
This paper studies the problem of single domain generalization in time series classification, which aims to learn a generalized time series classification model using a single source domain. This problem is highly challenging due to unreliable supervision from domain scarcity. Although current approaches employ generative models for data augmentation, these synthesized samples often suffer from low diversity and intrinsic noise, leading to weak generalization ability. Towards this end, we propose a novel approach named Contextdriven Diffusion with Progressive Expansion (CURE) for single domain generalization in time series classification. The core of our CURE is to generate both semantic-aware and semantic-free contexts to strategically guide a conditional diffusion model for informative data expansion. In particular, our CURE first conducts representation disentanglement to extract semantic-aware and semantic-free representations from source data. To enhance generalizability through data synthesis, we not only retrieve reference time series trajectories with similar semantics for semantic-aware contexts, but also utilize adversarial strategies to learn semantic-free contexts. These contexts are integrated as joint conditions for a diffusion model, enabling diverse and reliable virtual data. To enhance expansion adaptability and stable optimization, we progressively update our semantic-free contexts via a memory bank and measure boundary properties for dynamic data filtering. Comprehensive experiments on benchmark datasets validate the effectiveness of the proposed CURE in comparison to extensive baselines. Our code is available at https://anonymous.4open.science/r/cure_9C6E/.

Abstract:
This paper studies the theoretical foundations of guidance mechanisms in continuoustime score-based diffusion models. We adopt Doob’s h-transform as a principled framework for characterizing ideal guided diffusion processes and analyze the discrepancy between ideal and approximate guidance. Our analysis provides explicit error bounds and yields probabilistic guarantees on satisfying prescribed constraints, which are particularly important for safety-critical planning. We further show that the Doob-based formulation induces a stochastic optimal control problem, enabling practical guidance design without additional model training. We demonstrate the effectiveness of the proposed framework on robotic navigation tasks, including language-conditioned planning.

Abstract:
Current approaches to enhance Large Language Model (LLM) reasoning, such as Chainof-Thought and "Wait" prompts, primarily encourage models to think more, yet often fail to guide them toward Truth. While Representation Editing (RepE) offers a intrinsic control, its application to dynamic reasoning trajectories remains underexplored. In this work, we bridge this gap by investigating the geometry of truth within unfolding reasoning chains. We uncover three critical insights: (1) Truth is encoded at the sentence level and is entangled with latent reasoning patterns; (2) Effective intervention follows an Uncertainty Principle and a Decay Effect, requiring localization to early, high-entropy forks; (3) Naive steering vectors suffer from noise, risking collateral damage to correct trajectories. Based on these findings, we propose DynaSteer, a dynamic RepE framework. DynaSteer employs pattern clustering to disentangle reasoning manifolds and utilizes Fisher-LDA to project purified truth. By dynamically monitoring lookahead entropy, it selectively steers and rolls back trajectories only when necessary. Comprehensive experimental results on several MATH benchmark verify the effectiveness of DynaSteer, and experiments on out-of-domain coding tasks further confirm its generalization ability. Our code is publicly available at https://anonymous.4open.science/r/DynaSteer.

Abstract:
The realization of autonomous scientific experimentation is currently limited by LLMs' struggle to grasp the strict procedural logic and accuracy required by biological protocols. To address this fundamental challenge, we presentBioProBench, a comprehensive resource for procedural reasoning in biology. BioProBench is grounded inBioProCorpus, a foundational collection of 27,000 humanwritten protocols. From this corpus, we systematically constructed a dataset of over 550,000 task instances, offering both a large-scale training resource and a rigorous benchmark with novel metrics. Evaluating 10 mainstream LLMs, we find that while general comprehension is high, performance drops significantly on tasks demanding deep reasoning, quantitative precision, and safety awareness. To demonstrate the value of BioProCorpus in mitigating these issues, we developedProAgent, grounded in our corpus, ProAgent substantially advances the state-of-the-art. BioProBench provides a rigorous diagnostic benchmark and a foundational resource for developing the next generation of reliable scientific AI. Code and data are available at: https://anonymous.4open.science/r/Anonymization-112358 .

Abstract:
The unordered nature of point cloud data poses significant challenges to conventional architectures primarily designed for structured data. Spiking neural networks (SNN), by virtue of their inherent sparsity and dynamics, are particularly wellsuited for processing point clouds to effectively extract meaningful features. We propose SpikeNet, a novel spiking neural network architecture for energy-efficient and robust point cloud analysis. We introduce spiking-driven sparse attention mechanism coined the Spiking Vector Mask Transformer (SVMT). By dynamically aligning the sparsity of point cloud data through binary spiking masks, SVMT eliminates the need for softmax and multiplication operations, significantly improving computational efficiency. We also propose a Dynamic Sparse Spiking Residual (DSSR) structure and integrate it with SVMT to form the Spiking Neural Network (SpikeNet) for point cloud classification and segmentation. SpikeNet overcomes the trade-off between accuracy and efficiency in previous SNN methods, achieving collaborative optimization of performance and energy-efficiency. Experiments on benchmark datasets show that SpikeNet achieves state-of-the-art performance in shape classification and segmentation tasks, comparable to artificial neural network (ANN) based methods. Our source code is in supplementary material and will be made publicly available.

Abstract:
Gesture interaction and touchless sensing is a natural and intuitive control method that allows users to control devices through natural hand or body movements, reducing reliance on physical input and enhancing convenience and functional efficiency. The biggest challenge is balancing accuracy and model parameter count for custom gesture tasks, while minimizing the number of data samples required for training. This paper proposes a novel IMUbased BiCrossNet model and two novel data augmentation models (delta-generator and embedding-generator) to address this challenge. Compared with existing methods, the model proposed in this article boosts accuracy by 11.7% and 12.7% in UMAHand (public datasets); and 8.85% and 5.25% in GRHand (self-developed datasets), with decreasing 27.8% pretrained feature-extractor model parameters. This research lays a solid foundation for deploying and implementing custome gesture recognition engineering on intelligent terminal devices.

Abstract:
Unmanned aerial vehicle (UAV) autopilot systems typically comprise navigation and flightcontrol modules, and their effective scheduling is critical to achieving high flight performance. However, most existing UAV platforms adopt a split architecture in which navigation and flight control are deployed on separate hardware devices. This separation restricts system-wide observability and prevents holistic scheduling and optimization across the entire autopilot pipeline. Moreover, autonomous flight performance emerges from implicit, cross-coupled, and accumulated interactions among multiple factors, rendering traditional model-based or heuristic scheduling approaches ineffective. To address these challenges, we propose UAV^2, a unified and adaptive scheduling framework for UAV autopilot systems with reinforcement learning, targeting flight performance optimization. UAV^2 integrates navigation and flight control onto a single onboard computing platform and operating system, formulates the scheduling problem as a partially observable Markov decision process, and learns scheduling policies from runtime execution feedback. The proposed approach is trained and evaluated in a hardware-in-the-loop simulation environment. Experimental results demonstrate that the learned scheduling policy consistently outperforms fixed-rate scheduling strategies in terms of flight robustness and tracking performance.

Abstract:
We introduce NetDiff, a nodeconditioned denoising diffusion model that generates directional link topologies and a two-slot transmit/receive parity for mobile ad hoc networks. Directional antennas can yield high throughput but require globally consistent link decisions under sector, interference, connectivity, and half-duplex constraints. NetDiff improves global coherence with Absolute Cross-Attentive Modulation (ACAM) tokens, which provide permutation-invariant global signals and help the model match graph-level counts (e.g., density and sector usage). We also propose partial diffusion to update an existing topology with a small number of denoising steps, enabling fast reconfiguration under mobility. NetDiff reaches over 95 % of target performance with constant inference time, outperforms heuristic and omnidirectional baselines, and improves over a strong diffusion graph-transformer baseline on key metrics.

Abstract:
Retrievalaugmented generation (RAG) systems suffer from a fundamental functional misalignment where retrievers optimize for semantic relevance, often recalling documents with high background utility but factually erroneous answer spans that generators blindly adopt as cognitive shortcuts. To resolve this, we propose the collaborative Critic-Reasoner framework that shifts robustness control from coarse-grained filtering to fine-grained cognitive decoupling. We disentangle the generation process into two serialized roles by deploying a Critic to perform surgical evidence purification through identifying and masking misleading entities while preserving supportive background context, followed by a Reasoner that switches from rote extraction to deductive reasoning based on the residual evidence. We operationalize this framework via a two-stage alignment strategy combining supervised fine-tuning (SFT) with path-aware direct preference optimization (DPO) to enforce strict behavioral synergy. Experimental results on adversarial benchmarks such as ConFiQA demonstrate that our method significantly outperforms baselines, achieving a 25.99% accuracy gain in conflicting scenarios and effectively resolving the trust bias dilemma in real-world RAG.

Abstract:
Label noise is pervasive in realworld datasets and significantly compromises model generalization, fueling extensive research into Learning with Noisy Labels (LNL). Most LNL methods focus on robust discriminative learning, while recent generative classifiers such as label diffusion models (LDMs) show superior robustness by modeling class posteriors. However, current LDMs predominantly rely on standard \epsilon-prediction, where Gaussian pnoise lacks explicit class semantics, limiting both optimization and inference under label noise environments. To address this issue, we propose just y-prediction (JYP), a novel training paradigm that enables LDMs to directly characterize the label manifold and leverage explicit class-semantic guidance. Theoretically, we prove that JYP converges to an optimal solution equivalent to that of \epsilon-prediction within the label diffusion framework, while facilitating accelerated convergence and enabling one-step inference. Leveraging JYP as a foundation, we further incorporate historical cumulative inconsistency to adaptively tailor optimization strategies for clean, noisy, and hard samples. Extensive experiments demonstrate that our method consistently outperforms competitors across diverse synthetic noisy datasets and achieves state-of-the-art performance on multiple real-world benchmarks.

Abstract:
Fewshot EEG domain adaptation faces severe data heterogeneity and optimization instability. While prevalent "symmetric alignment" methods typically seek a compromised shared subspace, they often falter when domain discrepancies are vast, leading to mutual interference and negative transfer. To overcome this, we abandon the pursuit of a middle ground and advocate for a "target-to-source alignment" strategy that explicitly maps target distributions onto the rigid source manifold. To regulate the optimization dynamics of this asymmetric mapping, we present H-GSC (Hierarchical Geometric Framework with Structural Constraints). Specifically, H-GSC employs Feature-Space Guidance (FSG) as a geometric pre-conditioner to reorient adaptation gradients, and Hierarchical Gradient Alignment (H-GA) to filter destructive interference by strictly prioritizing source discriminability. To prevent manifold collapse, we preserve intrinsic manifold structure via a dual-constraint regularization scheme (synergizing Masked Structural Consistency with semantic anchors) and ensure robust early stopping with a Metric-Decoupled Validator. Theoretical analysis confirms that H-GSC aligns with generalization bound minimization. Extensive experiments on CHB-MIT demonstrate state-of-the-art performance (79.54% AUC). Crucially, in rigorous joint cross-dataset scenarios where prior methods suffer from negative transfer, H-GSC achieves a significant 9.65 pp AUC gain, validating that our rectified trajectory effectively bridges vast distributional shifts for scalable clinical deployment.

Abstract:
Federated Graph Learning (FGL) enables distributed clients to collaboratively train graph neural networks while strictly preserving data privacy.However, existing FGL methods implicitly assume the reliability of local graph structures and lack elastic awareness of parameter importance during model aggregation, leading to representation degradation under topological noise and catastrophic forgetting caused by model drift. To address these challenges,we propose GraphPFL, a general personalized FGL framework.(1)we design a self-supervised dynamic topology reconstruction mechanism on the client side. This mechanism mines implicit dependencies to adaptively rectify noisy topologies, effectively suppressing topological noise propagation and capturing precise structural relationships for high-quality representations.(2)we introduce a Fisher-based Elastic Parameter Alignment (FRPA) algorithm. FRPA imposes anisotropic regularization constraints in the parameter space to precisely quantify parameter importance, enabling the model to strictly preserve critical local knowledge while flexibly aligning with the global model, thus effectively overcoming catastrophic forgetting.Extensive experiments on seven benchmarks (including biochemical molecules, social networks, and large-scale encrypted traffic) demonstrate that GraphP-FL significantly outperforms state-of-the-art methods, improving accuracy by up to 8.6% while exhibiting superior generalization and robustness.

Abstract:
Diffusionbased Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to accelerate diffusion Q-learning toward single-step action generation typically introduce auxiliary networks, policy distillation, or multi-phase training, which frequently compromise simplicity, stability, or performance. To address these limitations, we introduce Bootstrapped Flow Q-Learning (BFQ), a novel framework that enables accurate single-step action generation during both training and inference—without auxiliary networks or distillation procedures. BFQ adopts a divide-and-conquer view of the displacement vector along the flow path: it begins by learning short-range displacements that can be accurately estimated from the Flow Matching marginal velocity, and bootstraps these components to directly learn a noise-to-action mapping in a single step. This formulation eliminates multi-step denoising, resulting in a learning procedure that is substantially faster, simpler, and more robust. Extensive D4RL evaluations show that BFQ improves performance while significantly reducing computational cost compared to multi-step diffusion baselines, demonstrating that single-step action generation suffices for high-performance offline Reinforcement Learning.

Abstract:
Robotic manipulation of flat objects is challenging due to the ungraspable configurations and strong variations in object geometry and material. Existing methods rely on heuristic premanipulation and are often evaluated in closed settings with limited generalization. We propose a unified framework that decouples the manipulation into a strategy generator and an action execution module. The strategy generator predicts appropriate manipulation strategies from object point clouds by learning strategy-centric, object-invariant representations via simulated data transformation and contrastive learning. Conditioned on the predicted strategy, the execution module decomposes long-horizon manipulation into reusable action primitives and dynamically composes them to generate stable trajectories. To enable systematic evaluation, we introduce FlatLab, a comprehensive simulation benchmark for robotic flat object manipulation. FlatLab provides high-fidelity physical simulation of diverse rigid and deformable flat objects, automated multi-modal data collection, and standardized task definitions and evaluation protocols. Experiments conducted in FlatLab demonstrate that our approach generalizes effectively to unseen objects and categories, outperforming existing baselines. The project page is provided at \urlhttps://flatlab-web.github.io/, and the code will be publicly released.

Abstract:
Finegrained traffic prediction is critically important for mitigating traffic congestion in key urban areas and for providing lane-change guidance in autonomous vehicles and navigation systems. However, task-specific models are not efficient enough, city-scale pre-trained models often overlook fine-grained requirements, and the demand for extensive computational resources hinders practical deployment. To address this issue, we developed a lightweight pre-training framework, MiniTraffic. This framework leverages abundant road-level data to address lane-level data scarcity through a frequency domain stability augmentation module and captures road-lane correlations via contrastive clustering to construct small-scale graph structures, significantly reducing model parameters. Fine-tuning with minimal target data provides a unified and efficient solution for fine-grained traffic prediction. In multi-granularity traffic prediction tasks across six fine-grained datasets, MiniTraffic demonstrated superior performance compared to all existing baseline models. The MiniTraffic-related code, datasets, and pre-trained models are available at https://anonymous.4open.science/r/MiniTraffic/.

Abstract:
Multimodal contrastive learning (MCL) aims to embed data from two modalities in a shared embedding space. However, in practice, representations of images and text occupy completely separate regions of embedding space, a phenomenon called the modality gap. Moreover, experimental findings on how the size of the modality gap affects downstream performance are inconsistent. These observations raise two key questions: (1) What causes the modality gap? (2) What affects downstream performance? To address these questions, we introduce the first theoretical framework for analyzing the convergent optimal representations (COR) of MCL when training is optimized. We prove that, when representations of image and text collapse into different subspaces, a phenomenon called \emphdimension collapse, the modality gap occurs. Our theorem also reveals that while the modality gap prevents representations of image and text from aligning directly, their projections onto the shared space can be aligned. And share space alignment plays dominate role in determining downstream performance. Inspired by these findings, we first propose Shared Space Alignment (SSA) to improve MCL pretraining by enhancing alignment within the shared space. Extensive experiments validate our theoretical analysis and proposed methods.

Abstract:
Federated Prompt Learning (FPL) adapts VisionLanguage Models to privacy-sensitive medical imaging, typically via a textual tuning paradigm that assumes the frozen visual encoder provides a discriminative feature geometry. We argue this assumption breaks down in medical settings, leading to two geometric pathologies: (1) Intra-client: Medical Manifold Collapse, where high morphological similarity reduces the effective rank of visual features; and (2) Inter-client: Medical Topological Misalignment, where heterogeneous acquisition protocols induce inconsistent geometry across clients. To address these, we propose FedMAP, which shifts the paradigm to Visual Manifold Anchoring. FedMAP utilizes an LLM-derived codebook as a client-invariant synchronization signal to restructure the visual space, via Manifold Semantic Anchoring (MSA) and Topology Structural Alignment (TSA) to enforce consistent inter-class relations. Experiments on FedISIC, FedCamelyon17, and a private ultrasound dataset show that FedMAP consistently outperforms state-of-the-art methods, especially in high-noise regimes where manifold collapse is most severe.

Abstract:
Federated continual learning (FCL) of large language models on edge devices is constrained by a communicationstability--plasticity trilemma. We reveal a less-is-more phenomenon: beyond a moderate layer upload ratio, stability loss offsets saturated plasticity gains, so overall continual performance no longer improves. Moreover, layer-wise conflict is heavy-tailed and concentrates in a few layers; denser uplink increasingly includes these layers, which disproportionately drives forgetting and motivates selective sparse communication. Therefore, we introduce RieSelect, which treats stability as staying within a Fisher-metric safe basin around historical solutions. Under this safe-basin constraint, we derive a layer-wise conflict score and a closed-form certified safe step size for finite local updates, and formulate selective uplink as a knapsack-based utility--risk selection, balancing plasticity gains against stability risks. Extensive experiments show that, under a per-round uplink budget, RieSelect achieves the best performance across task orders. Beyond this matched-budget setting, under standard communication protocols, RieSelect improves average accuracy by 18.99–28.14 points while reducing total uplink by 53–115×.

Abstract:
Continuous diffusion for categorical data is a framework belonging to the diffusion family and aiming at generating discrete data. The scientific interest to such models has been constantly increasing these days because researchers try to achieve a challenging goal of finding reasonable alternatives to autoregressive large language models. In this paper, we study the properties of the structure of the latent space corresponding to discrete tokens expressed in terms of KullbackLeibler divergence on diffusion path measures and accuracy of the correct token prediction by the optimally trained diffusion model. We find that FSQ tokenization scheme has the latent space structure with the properties that make it best suited for continuous diffusion for categorical data as verified through rigorous theoretical analysis and extensive numerical experiments. To validate our findings in real-life scenario, we train several text-to-speech diffusion models having speech tokens as intermediate acoustic features, and show that the one based on FSQ tokens indeed performs the best, and, moreover, it outperforms its strong LLM-based counterpart, at the same time being significantly smaller and faster.

Abstract:
Symbolic Regression aims to discover interpretable mathematical expressions from data. Equation Learner (EQL) is a gradientbased method with strong fitting capability and expressive potential, yet it often activates redundant operators as model complexity grows, leading to over-complex expressions and unstable equation recovery. We analyze a gradient residual issue induced by operators that do not vanish at zero, which can prevent the ideal sparse expression from being a local optimum and bias training toward unnecessarily complex structures, making exact recovery nearly unattainable in practice. To address this, we propose EQL-Z, a structurally controllable symbolic regression framework. EQL-Z enforces zero-point constraints via zero-point consistent operator transformations to eliminate residual gradients on silent paths, and performs an incremental small-to-large structure search that grows depth/width from a compact seed under a complexity-penalized validation score. After selecting a compact structure, we optionally apply BFGS fine-tuning to refine coefficients. Experiments on synthetic and real-world datasets show that EQL-Z substantially improves exact equation recovery and in-/out-of-distribution generalization over vanilla EQL, achieving performance close to the best existing symbolic regression baselines. The code is available at https://anonymous.4open.science/r/EQL-Z-BE6C/.

Abstract:
We present Reading Observed At Mindless Moments (ROAMM), a largescale multimodal dataset comprising 50 hours of simultaneous EEG and eye-tracking recorded during naturalistic multi-page reading from 44 participants, with annotations including eye events, page-level comprehension scores, and word-level mind-wandering (MW) labels obtained via a retrospective self-report paradigm. We introduce a standardized evaluation protocol for MW detection under leave-one-subject-out evaluation, achieving up to 0.609 AUROC using supervised models. We also report results for EEG-to-text decoding trained on non-MW segments and show that decoding performance decreases when MW-labeled segments are included. Overall, ROAMM provides a benchmark dataset for MW detection and EEG-to-text decoding tasks, and enables the study of attention-related degradation in language decoding from brain activity in naturalistic reading.

Abstract:
Predicting spatial gene expression from routine H&E makes highresolution molecular profiling accessible at scale, especially for large retrospective cohorts. However, current models mostly treat gene expression as a series of pointwise tasks. While effective for numerical fitting, this approach overlooks biological structures: the functional coordination between genes and their organized distribution across tissue. We reframe this task as structured distribution modeling and introduce FLAG, a diffusion-based framework designed to preserve these biological relationships. To capture such structures, a natural strategy would be to jointly model gene expression and their spatial interactions. However, we identify a critical Gene Dimension Curse: such joint modeling fails in high-dimensional gene spaces. This motivates FLAG, which conditions the generative process using a novel spatial graph encoder to ensure gene-spatial topographic coherence and a Gene Foundation Model (GFM) alignment to maintain high gene-gene structural fidelity. To rigorously assess our approach, we propose a structural evaluation metrics, including Gene Structural Correlation (GSC) and Spatial Structural Correlation (SSC). Our experiments demonstrate that FLAG is highly competitive with or superior to state-of-the-art models in traditional accuracy (PCC/MSE), while achieving significantly enhanced structural fidelity in capturing both gene-gene and gene-spatial relationships.

Abstract:
Deciphering how cells commit to future fates is essential for developing precision therapeutics that can reprogram stem cells or modulate immune functions. However, isolating these fatedetermining signals in single-cell lineage tracing (scLT) remains challenging because differentiation programs are often confounded by unrelated processes like the cell cycle. To address this, we introduce Lineage-aware Contrastive Learning (LCL), a framework that treats inheritable lineage barcodes as a "natural" data augmentation to isolate subtle, lineage-specific signals. LCL utilizes a semi-supervised architecture to align unlabeled cells, facilitating the transfer of lineage structures to clinical datasets where explicit barcoding is unavailable. We demonstrate LCL’s utility by predicting future cell-type compositions from early-time points, effectively modeling longitudinal fate commitment from cross-sectional data. Benchmarking on hematopoietic and fibroblast systems shows that LCL significantly outperforms standard models like scVI, establishing contrastive learning as a scalable paradigm for understanding and potentially manipulating cellular differentiation.

Abstract:
Automatic Prompt Optimization (APO) enables Large Language Models (LLMs) to adapt to specific tasks while minimizing manual engineering costs. However, since existing APO approaches either rely solely on multiround iterative procedures or use model-specific generators tailored to optimizing prompts for a single model and objective, they are not readily applicable to auto-routing scenarios, which require operating over diverse LLMs and juggling multiple, often competing, trade-offs. To address this issue, we propose TAMPO, a novel task- and model-aware APO framework for auto-routing in LLM-based systems. Specifically, to reflect performance variation across a broad range of tasks and models, we construct a comprehensive heterogeneity-aware dataset for training an uncertainty-aware reward model. Serving as an offline proxy, this reward model can greatly mitigate reward hacking, allowing TAMPO to learn an optimal multi-objective conditional policy for robust prompt generation. Based on the user requirements encoded in our defined preference vector, this policy enables flexible control over prompt generation, supporting a cost-effective deployment strategy. Extensive experiments across 86 tasks demonstrate that TAMPO effectively maintains performance stability across diverse tasks and models, providing a robust, controllable solution for auto-routing in various LLM-based systems.

Abstract:
Electroencephalography foundation models (EEGFMs) have advanced brain signal analysis, but the lack of standardized evaluation benchmarks impedes model comparison and scientific progress. Current evaluations rely on inconsistent protocols that render cross-model comparisons unreliable, while a lack of diagnostic analyses obscures the internal mechanisms driving transfer efficiency and scaling behaviors. To address this, we introduceEEG-FM-Bench, a unified system for the standardized evaluation of EEG-FMs. The benchmark integrates 14 datasets across 10 paradigms and incorporates diverse experimental settings, including multiple fine-tuning strategies, task organizations, and classifier configurations, supported by tools for gradient and representation analysis. Our experiments and analysis reveal several critical insights: (1) multi-task learning acts as a critical regularizer to mitigate overfitting in data-scarce EEG contexts; (2) pre-training efficiency is currently limited by gradient conflicts between reconstruction objectives and downstream tasks; (3) model scaling deviates from typical laws, as compact architectures with domain-specific inductive biases consistently outperform significantly larger models. This benchmark enables fair comparison and reproducible analysis, shifting the field from fragmented results to interpretable advances.

Abstract:
Pretrained Transformers can perform incontext learning (ICL) from a few demonstrations, but this ability can fail sharply when the test distribution differs from pretraining—a common deployment setting. We study attention temperature as a simple inference-time control for improving ICL robustness under such shifts. In a high-dimensional linear-regression framework, we analyze a Transformer with "approximate softmax" attention, which preserves softmax's normalization and temperature-dependent selectivity while remaining tractable. We derive a closed-form expression for the ICL generalization error under distribution shift, and show that it is minimized by an explicit optimal attention temperature. This characterization yields interpretable guidance by linking the best temperature to moments of the pre-softmax attention scores, and predicts when temperature adjustment can recover near Bayes-optimal performance. We validate the theory with extensive simulations, and further demonstrate gains on pretrained LLMs (GPT-2 and Llama2-7B) on question-answering benchmarks under distribution shift induced by noisy in-context demonstrations. Overall, attention temperature emerges as a principled, lightweight knob for improving the robustness of ICL in pretrained Transformers.

Abstract:
Recent advances in large language models (LLMs) have led to the emergence of specialized multimodal LLMs (MLLMs), forming distinct model families that share a common foundation language models. Despite this evolutionary trend, it remains unexplored whether a fundamental behavioral link exists between derived MLLMs and their foundational LLMs. This work investigates the inheritance of truthfulness traits along this trajectory by quantifying the degree of contexttruthfulness across individual attention heads. Our analysis of the Vicuna and Qwen families reveals a striking finding: MLLMs maintain a high correlation in truthfulness scores with their base LLMs, even after multi-modal fine-tuning and when evaluated on disparate data sources. Building on this insight, we propose a Soft Gating strategy that utilizes these inherited Truth Scores to amplify the influence of context-truthful heads while preserving the contributions of other heads. We validate our approach on base LLMs on HaluEval benchmark to demonstrate improved truthful reasoning. Subsequently, we show that Truth Scores derived from a base LLM can be effectively transferred to its multimodal descendants as a plug-and-play gate, achieving performance gains on POPE and CHAIR benchmark comparable to probing the MLLMs directly. Our work highlights a novel, systemic approach to enhancing reliability across an entire model family by leveraging its inherent, inherited traits.

Abstract:
A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised finetuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher generated data and the distribution of student as a major factor impacting SFT. To bridge this gap, we propose a Teacher–Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the distribution of the student. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.

Abstract:
Reinforcement learning (RL) has emerged as a key mechanism for transforming LLMs into robust reasoners. While supervised finetuning (SFT) often limits models to the distribution of observed reasoning traces, RL post-training significantly improves performance on out-of-distribution (OOD) tasks that require unfamiliar recombinations of familiar steps. We argue that this improvement is driven bycompositional generalization, which we formalize through aHierarchical Latent Selection Model. In this framework, reasoning traces are generated by a cascade of discrete latent selection variables corresponding to reusable atomic modules, including both skills (local operations) and routing mechanisms (how intermediate information is selected, reused, and composed). We theoretically show that RL’s exploratory nature provides sufficient coverage to identify latent structure and enable compositional generalization. We design controlled experiments to validate this theory. Our results demonstrate that RL can extract atomic modules from compound traces and recombine them to solve new configurations. Moreover, we find that training on compound traces can yield stronger generalization than training on isolated atomic modules. Finally, we investigate relations between SFT and RL and identify an effective protocol in which SFT ensures coverage of all atomic modules, while RL focuses on novel compositions beyond the SFT support to encourage exploration.

Abstract:
Precise evidence region localization in Document Visual Question Answering (DocVQA) is crucial for improving model interpretability and reliability. However, most existing approaches rely on singlestep localization, which struggles to effectively distinguish true evidence from irrelevant content when page semantics are complex or evidence regions are extremely small, leading to ambiguous boundaries and localization errors. To address these challenges, we propose a progressive evidence localization framework based on Wasserstein gradient flows, which reformulates evidence localization as an optimal transport optimization problem over probability distributions. Since continuous-time gradient flows are intractable in practice, we adopt the Jordan--Kinderlehrer--Otto (JKO) scheme for discrete optimization and derive an end-to-end trainable loss function that translates the theoretical framework into a neural network–optimizable objective. This formulation enables precise evidence localization through progressive refinement from coarse-grained to fine-grained regions. Experimental results demonstrate that our method significantly outperforms existing approaches in both evidence localization and answer generation, while providing an interpretable progressive reasoning process.

Abstract:
Despite the success of large transformers at modeling variablelength protein sequences, most statistical phylogenetics analyses use relatively simple continuous-time finite-state Markov models of point substitution to describe molecular evolution, keeping sequence length fixed and ignoring insertions and deletions (indels) entirely. The simplistic assumptions of these models limit the realism of such analyses. We extend the TKF92 model - the canonical hierarchical model combining an outer birth-death process for indels with an inner finite-state Markov chain for substitutions - by introducing additional nesting and latent states. We compare these TKF92 extensions (which are exactly solvable, and in which evolutionary time naturally appears as a matrix exponential coefficient) to two classes of neural seq2seq models that take evolutionary time as an input feature: the first class of model being constrained to enforce a TKF92-like structure, and the second lacking any such constraint. We evaluate the per-character perplexities of all models on splits of the PFam database of aligned protein domains. A nested TKF-based model with only 32,000 parameters is highly competitive with neural networks containing tens of millions of parameters, outperforming all but two of the neural architectures tested. Our results indicate that approaches grounded in molecular evolutionary theory may be more parameter-efficient and provide a better fit to real alignments than unconstrained alternatives, supporting the incorporation of classical model structure within future neural phylogenetic approaches.

Abstract:
According to the reformulated version of the Learned Helplessness theory, an individual who experiences uncontrollable negative events may subsequently develop a negative attributional style, thereby exhibiting greater susceptibility to depressive symptoms. This depressogenic attributional style not only contributes to depressive symptoms but also represents a malleable target for cognitive therapy. Despite its theoretical and practical significance, computational research on attributional cognition remains underexplored due to the lack of largescale, high-quality datasets and robust evaluation protocols. In this work, we introduce the Attributional Style Transfer Dataset (ASTD) along with dedicated evaluation metrics, the first benchmark designed to model, assess, and reframe attributional explanations at scale. Constructed via a Prevent–Filter–Validate pipeline that integrates LLM-based generation with specialist validation, ASTD contains 42,000 real-world events paired with psychologically grounded attributions spanning seven styles. Using this dataset, we address two key challenges: (1) scalable assessment of attributional style via both supervised classifiers and zero/few-shot LLMs; and (2)attributional reframing and evaluation, where we propose automatic evaluation metrics to quantify psychological validity. Furthermore, we leverage our proposed metrics to construct a preference dataset, fine-tuning LLMs with Direct Preference Optimization (DPO) and achieving substantial gains in reframing quality. Together, our dataset, metrics, and methodology offer a new paradigm for understanding and modeling attributional style, with direct implications for scalable and adaptive mental health interventions.

Abstract:
Spiking Neural Networks (SNNs) are regarded as the third generation of neural networks, offering energyefficient computing for neuromorphic devices. Despite this benefit, hardware-implemented SNNs are vulnerable to hardware faults, which severely degrade their performance. Previous approaches have required direct access to internal SNN circuits to modify weights or monitor internal states, limiting their practicality. Improving robustness to hardware faults without such access remains challenging. To overcome this challenge, we propose a fault-tolerant mechanism that operates only through input data control. The hardware faults reduce the usable learning capacity of SNNs, leading to a mismatch between the instantaneous input load and the degraded network dynamics. Our mechanism mitigates this mismatch by dividing each input sample into multiple fragments, redistributing the input load via a learnable fragmentation strategy. The strategy learns two key fragmentation components: 1) division boundaries and 2) the number of fragments. To our knowledge, this is the first to improve the fault tolerance of SNNs without accessing the internal SNN circuit. Experimental results demonstrate that our mechanism consistently outperforms previous methods in various SNN models, achieving these gains without direct access to internal circuits. Furthermore, we validate its effectiveness on SNNs implemented with a physical FPGA platform, confirming its practicality.

Abstract:
Automating endto-end GPU kernel generation with Large Language Models (LLMs) faces a critical tension between global performance and exploration efficiency. We present LEGO, a hierarchical framework that resolves this trade-off via a parallel multi-agent search over a recursive AND-OR FusionTree. LEGO synergizes two complementary flows: Top-Down Construction decomposes complex graphs into valid, context-isolated sub-problems to guarantee correctness and enable parallel exploration, while Bottom-Up Mutation speculatively fuses verified sub-plans to recover global locality for peak performance. This bi-directional mechanism effectively prunes the search space to avoid repetitive unguided sampling, while naturally parallelizing exploration, and enabling the discovery of sophisticated fusion strategies. Evaluations demonstrate that LEGO achieves 2.18x–13.48x speedups over PyTorch Eager and reduces end-to-end exploration time by up to 2.47x (with 7x token reduction) compared to monolithic baselines across diverse end-to-end models.

Abstract:
The rapid evolution of Large Language Models (LLMs) has empowered even nonprogrammers to create visually appealing frontend mini-games with a single instruction. However, open-source models significantly lag behind proprietary counterparts in this domain. The core bottleneck is the lack of an evaluation mechanism that balances reliability with scalability, as existing methods either fail to verify dynamic interactivity or incur prohibitive computational costs. To bridge this gap, we introduce ALIVE (Aligning LLMs via Interactive Visual Execution), a high-throughput framework that leverages one-shot planning and DOM-based analysis to automatically evaluate generated games at scale. Extensive experiments demonstrate that ALIVE significantly outperforms static judge baselines in identifying functional flaws while remaining orders of magnitude more efficient than GUI agents. Functioning as a scalable `pre-flight' evaluation layer, it curates high-quality data for Supervised Fine-Tuning (SFT) and provides a consistent reward signal for Reinforcement Learning (RL). We leverage this pipeline to train ALIVE-Coder, a model achieving superior performance in interactive frontend generation. To the best of our knowledge, our work offers the first scalable path to evaluate and optimize interactive code, substantially advancing open-source capabilities.

Abstract:
Though biological foundation models (BioFMs) have delivered strong performance across biomedical tasks, their robustness to small-but-real perturbations is underexplored. In this work, we ask: Are Bio-FMs robust for real-world use? What perturbations compromise their reliability? Our pilot study suggests that due to subtle biological data curation issues and common machine-learning (ML) processing choices, Bio-FMs suffer from two complementary perturbation sources: biologically plausible perturbations (capturing experimental corruptions and curation artifacts) and ML-induced transformations (capturing preprocessing, data augmentation, and embedding choices). Guided by this taxonomy, we design perturbation suites that mimic corruptions frequently encountered in biological experiments, and we systematically probe how transformations in the ML pipeline reshape model behavior. By conducting 2,128 experiments over 11 state-of-the-art Bio-FMs on 7 bio-tasks, we show that most Bio-FMs are vulnerable to both biological perturbations and ML transformations, revealing underappreciated robustness gaps that can directly translate into deployment risk. Interestingly, we find that subtle biological perturbations, which are often imperceptible to current measurement tools, can induce severe discrepancies in Bio-FM outputs and lead to critical failures. We also find that cryo-EM reconstruction models (e.g., CryoDRGN) exhibit a surprising level of robustness even under worst-case adversarial settings. Our study for the first time surfaces critical failure modes and provides a principled perspective for evaluating the robustness of Bio-FMs in real-world biological pipelines.

Abstract:
The integration of external tools has substantially expanded the capabilities of large language model (LLM) agents, but also introduced new attack surfaces beyond prompt injection. In particular, crosstool description poisoning can manipulate planner-visible tool metadata to steer an agent’s trajectory, even if the poisoned tool itself is never chosen. To understand the effectiveness of existing attacks against this emerging threat, we evaluate several existing agent defenses against prompt-injection and find they transfer poorly to cross-tool description poisoning. Building on this insight, we propose Tool-Guard, a novel defense based on a new concept called isolated planning, in which tool invocations that are detected as misaligned or suspicious cause the corresponding tool to be placed in a quarantined list (the influenced list), breaking further influence from poisoned descriptions. With this influence isolated, the tool can continue to be used to support the task, enabling a robust defense that preserves legitimate tool utility. Experiments on the AgentDojo and ASB benchmarks show that \sysname substantially reduces attack success while maintaining high task utility.

Abstract:
Robust distributed learning algorithms aim to maintain reliable performance despite the presence of misbehaving workers. Such misbehaviors are commonly modeled as Byzantine failures, allowing arbitrarily corrupted communication, or as data poisoning, a weaker form of corruption restricted to local training data. While prior work shows similar optimization guarantees for both models, an important question remains: How do these threat models impact generalization? We show, for the first time, a fundamental gap in generalization guarantees between the two threat models: Byzantine failures yield strictly worse rates than those achievable under data poisoning. Our findings leverage a tight algorithmic stability analysis of robust distributed learning. Specifically, we prove that: (i) under data poisoning, the uniform algorithmic stability of an algorithm with optimal optimization guarantees degrades by an additive factor of \varTheta ( \fracfnf ), with f out of n workers misbehaving; whereas (ii) under Byzantine failures, the degradation is in \Omega \big( \sqrt \fracfn-2f \big).

Abstract:
Autoregressive (AR) models have shown great promise in image generation, yet they face a fundamental inefficiency stemming from their core component: a vast, unstructured vocabulary of visual tokens. By treating tokens as a flat set, standard models overlook the manifold structure where geometric proximity reflects semantic similarity. This oversight unnecessarily complicates the prediction task, hindering training efficiency and limiting generation quality. To resolve this, we propose ManifoldAligned Semantic Clustering (MASC), a principled framework that constructs a hierarchical semantic tree directly from the codebook's intrinsic geometry. Utilizing a geometry-aware distance metric and density-driven agglomerative construction, MASC faithfully models the token embedding manifold. By transforming the flat, high-dimensional prediction into a structured hierarchical task, MASC introduces a powerful inductive bias that simplifies learning. Designed as a plug-and-play module, MASC accelerates training by up to 71% and significantly boosts generation quality, improving LlamaGen-XL's FID from 2.87 to 2.49. Crucially, MASC further serves as a convergence enabler for complex architectures. These results establish that structuring the prediction space is as vital as architectural innovation, elevating existing AR frameworks to state-of-the-art performance. Our code is open-sourced via \urlhttps://anonymous.4open.science/r/anonymous_MASC-50F6/

Abstract:
Linear bandits traditionally prioritize regret minimization, often overlooking statistical inference of the underlying parameter as a critical objective. In highstakes settings such as healthcare, precise parameter estimation is indispensable, as it provides fundamental insights into system mechanisms and ensures robust decision-making under covariate shift. We investigate the tripartite balance between regret, inference, and safety, deriving a fundamental minimax lower bound that characterizes the Pareto-optimal frontier of these competing goals. We then propose SERMiSC, a novel algorithm that achieves the optimal trade-off by matching this lower bound while maintaining a near-constant \tildeO(1) safety risk. Empirical results demonstrate that SERMiSC effectively navigates the Pareto frontier and outperforms various baselines, thereby validating our theoretical analysis.

Abstract:
Effective sequence modeling fundamentally requires balancing the retention of unbounded history with the highresolution detection of abrupt short-term variations common in real-world phenomena. However, existing state space models (SSMs) relying on high-order polynomial projection operators (HiPPO) face a critical trade-off where uniform measures dilute recent information to maintain timescale invariance, while exponential measures sacrifice global context to capture local dynamics. This paper proposes a Fractional Recurrent Architecture for Computational Temporal Analysis of Long sequences (FRACTAL), a novel architecture integrating fractional measure theory into recursive memory updates to address this limitation. By deriving projection operators with analytically characterized spectral properties and a tunable singularity index, the proposed method amplifies sensitivity to recent signal perturbations while preserving the spectral structure that encodes scale-invariant memory dynamics. This theoretical innovation is instantiated within a simplified diagonalized state space framework by modulating input projection initialization to enable simultaneous capture of multi-scale temporal features. FRACTAL achieves an average score of 87.11% on the Long Range Arena benchmark, including 61.85% on the ListOps task, outperforming the S5 model.

Abstract:
Large language model (LLM) agents are increasingly deployed in longrunning settings where improving through experience at test time becomes important. A common approach is to update an explicit memory after each interaction to guide future decisions. However, most existing methods rely on hand-designed prompting rules, making it difficult to align memory updates with downstream objectives over multi-step horizons consistently. We propose MemoPilot, a plug-in memory copilot that explicitly trains the memory update process to improve a frozen LLM's performance across sequential interactions. We formulate memory updating as a multi-turn decision problem and optimize it end-to-end with multi-turn GRPO. Our training recipe introduces (i) a turn-wise reward signal and (ii) a context-independent, turn-level advantage estimation across rollouts, enabling finer-grained credit assignment and more stable training in multi-turn settings. We evaluate MemoPilot on two testbeds: multi-round Rock-Paper-Scissors (RPS) and Limit Texas Hold'em (LHE). Across both environments, MemoPilot substantially improves test-time learning of a frozen player over strong baselines, ranking first in Elo ratings on both games (1762 on LHE and 1590 on RPS) and outperforming all baseline memory methods and proprietary models, including Deepseek-V3.2.

Abstract:
The planning horizon in a Markov Decision Process (MDP) determines how far into the future an agent reasons. In practice, shorter horizons are commonly associated with policies that exhibit simpler or more interpretable decisionmaking behavior. In this paper, we establish a formal connection between environmental stochasticity and planning horizon in MDPs. We show that for broad classes of transition noise, solving a noisy MDP can be formally related to solving a noise-free MDP with a shorter effective discount factor, leading to identical optimal policies in some cases and near-optimal ones in others. We further characterize settings in which this correspondence breaks down, clarifying when horizon-based interpretations of noise are not valid. These results, which are supported by both theory and experiments, also give some insight into the common practice of using smaller discount factors for reinforcement learning than what can be justified by typical, grounded interpretations of a discount factor, such as inflation or the probability of catastrophic failure.

Abstract:
Explanations of model behavior are commonly evaluated via proxy properties weakly tied to the purposes explanations serve in practice. We contribute a decision theoretic framework that treats explanations as information signals valued by the expected improvement they enable on a specified decision task. This approach yields three distinct estimands: (i) a theoretical benchmark that upperbounds achievable performance by any agent with the explanation, (ii) a human-complementary value that quantifies the theoretically attainable value that is not already captured by a baseline human decision policy, and (iii) a behavioral value representing the causal effect of providing the explanation to human decision-makers. We instantiate these definitions in a practical validation workflow, and apply them to assess explanation potential and interpret behavioral effects in human–AI decision support and mechanistic interpretability.

Abstract:
Large language models can resist taskmisaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 ``off-topic detector'' latents that predict ESR episodes in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 5× for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems.

Abstract:
The GromovWasserstein (GW) distance provides a powerful framework for aligning structured data by comparing the intrinsic geometries of metric measure spaces, and has become a fundamental tool in machine learning. Most existing methods leverage entropy regularization to reduce the computational complexity to \boldsymbol\mathrmO(n^3), where n is the number of samples. However, this cubic time complexity remains a major bottleneck in large-scale applications, severely limiting the scalability. To address this challenge, we propose LoBCD-GW, an efficient GW optimization algorithm. Specifically, we reveal the data-dependent sparsity of large-magnitude updates to the coupling matrix and introduce a localized block coordinate selection strategy. This confines the optimization to a "selected set" of size r (which is a parameter that depends on the given data set, and usually is much less than n), thereby reducing the complexity to \boldsymbol\mathrmO(r^3). In addition, unlike prior acceleration methods often based on constraint relaxation, our method can guarantee the strict feasibility through a novel "marginal compensation mechanism" to synchronize local mass redistribution with global constraints. Finally, we conduct a set of experiments on various datasets, and the results demonstrate that our method achieves a 160× speedup on large-scale graph alignment benchmarks, while maintaining state-of-the-art accuracy.

Abstract:
Crossview geo-localization (CVGL) aims to retrieve the corresponding satellite image given a street query and is critical for autonomous navigation. Although recent methods perform well on benchmarks, they often fail to generalize to unseen environments. A key limitation is the use of contrastive learning, which assigns equal labels to all negative samples and induces similarity-amplified repulsion. But should all negatives be treated equally? In CVGL, semi-positive samples that are geographically proximate to the positive often share important semantic cues. Treating them as ordinary negatives forces the model to overfit noise, leading to a collapse in generalization. To address this issue, we propose an uncertainty-aware framework grounded in Deep Evidential Regression (DER), modeling the Normal-Inverse-Gamma (NIG) distribution as a conjugate prior to quantify environmental complexity u in a single forward pass. The estimated u adaptively softens labels for hard negatives in Soft InfoNCE, mitigating excessive repulsion on semi-positive samples. An Uncertainty Head with cls-to-spatial cross-attention and attention statistics is designed to accurately fit the NIG distribution. Extensive experiments demonstrate state-of-the-art performance, including an average 18% R@1 improvement in zero-shot cross-dataset transfer, filling the critical gap between laboratory benchmarks and robust real-world deployment.

Abstract:
Recent success of large reasoning models (such as OpenAI o1 and DeepSeek R1) have spurred a resurgence of interest in reinforcement learning from verifiable rewards (RLVR). However, progress is still largely driven by RL algorithm design, while data scheduling the data-side decisions that determine what the model trains on over time -- is critical but remains underexplored. Therefore, data scheduling becomes the focus of this paper, including how to curate data for supervised fine-tuning (SFT) and how to select prompts and collect rollouts for reinforcement learning (RL). We introduce a pipeline with careful designs on data scheduling, consisting of hardness-prioritized SFT and two-stage RL. Specifically, we first fine-tune the base model on supervision data that is curated to prioritize difficulty based on both arena learning and classification. Then, we introduce two-stage RL where a decreased max sequence length during rollout is used in the first stage to expand entropy and reduce repetition, and a large number rollouts per prompt and curriculum design are adopted in the second stage to encourage exploration for challenging problems. We implement this pipeline on Qwen2.5-32B and an internal 389B MoE model, and evaluate them on a wide range of benchmarks including challenging LeetCode and Codeforces weekly contests. The results not only indicate the effectiveness and scalability of our pipeline but also demonstrate our model achieve sota of 32B models in competitive code generation.

Abstract:
Largescale models are at the forefront of time series (TS) forecasting, dominated by two paradigms: fine-tuning text-based Large Language Models for TS (LLM4TS) and training Time Series Foundation Models (TSFMs) from scratch. Both approaches share a foundational assumption that scaling up model capacity and data volume leads to improved performance. However, we observe a scaling paradox in TS models, revealing a puzzling phenomenon that larger models do NOT always achieve better performance. Through extensive experiments on two model families across four scales (100M to 1.7B parameters) and diverse data (up to 6B observations), we rigorously confirm that the scaling paradox is a pervasive issue. We then diagnose its root cause by analyzing internal representations, identifying a phenomenon we call few-layer dominance: only a small subset of layers are functionally important, while the majority are redundant, under-utilized, and can even distract training. Based on this discovery, we propose a practical method to automatically identify and retain only these dominant layers. In our models, retaining only 21% of the parameters achieves up to a 12% accuracy improvement and a 2.7x inference speedup. We validate the universality of our method on 8 prominent SOTA models (LLM4TS and TSFMs, 90M to 6B), showing that retaining less than 30% layers achieves superior accuracy in over 95% tasks.

Abstract:
Artificial and biological systems may evolve similar computational solutions despite fundamental differences in architecture and learning mechanisms—a form of convergent evolution. We provide largescale evidence for this phenomenon through comprehensive analysis of alignment between human brain activity and internal representations across over 600 AI models spanning language and vision domains (1.33M to 72B parameters). Analysis of 60 million alignment measurements reveals that higher-performing models spontaneously develop stronger brain correspondence without explicit neural constraints, with language models demonstrating markedly stronger correlations (r=0.89, p<7.5 × 10^-13) than vision models (r=0.53, p<2.0 × 10^-44). Crucially, longitudinal training analysis shows that brain alignment consistently emerges prior to performance improvements, suggesting that developing brain-like representations constitutes a fundamental stepping stone toward enhanced capabilities. We identify systematic organizational patterns reflecting human cognitive architecture: language models exhibit strongest alignment with limbic and integrative regions, while vision models show progressive correspondence with visual cortical hierarchies. These findings establish that optimization for task performance naturally drives AI systems toward human-like computational strategies.

Abstract:
Federated Quantile Regression (FQR) has emerged as a powerful modelling paradigm for estimating conditional quantiles, offering a more comprehensive understanding of response distributions than standard conditional mean regression. However, achieving communication efficiency and optimal statistical guarantees for FQR remains challenging, particularly due to the nonsmooth nature of quantile loss functions and the presence of heterogeneously structured data, where each local agent trains its conditional quantile models with distinct sets of features. In this paper, we propose a datadriven, one-shot weighted ensemble estimator for FQR that incorporates scalable weighting schemes to effectively leverage the partially observed features at each local agent, thereby enjoying both communication efficiency and estimation optimality. Theoretically, we present a unified analysis of the proposed learning procedure, establishing that the resulting estimator exhibits asymptotic normality and attains uniformly minimum variance. Furthermore, we investigate the estimator's sensitivity to perturbations introduced by local agents and derive conditions under which the estimator achieves stability and enjoys strong out-of-sample generalization. Extensive simulations and real data analysis under various scenarios validate the asymptotic normality of our estimator and demonstrate its superior estimation accuracy and uniform convergence compared to several baseline methods across a range of quantile levels.

Abstract:
Realworld dynamics shifts pose a critical challenge for reinforcement learning, yet prior methods typically rely on encoding explicitly identified physical parameters into a latent context, a rigid parameterization that proves brittle to unmodeled or compound dynamics variations. We instead investigate dynamics adaptation through the lens of latent geometry, and show theoretically that target-domain regret is controlled by the Lipschitz smoothness of a trajectory dynamics encoder. We further prove that this Lipschitz constant can be upper-bounded through optimizing a multi-positive InfoNCE objective, yielding a smooth, task-relevant latent topology without privileged dynamics information. On MuJoCo benchmarks, our method significantly outperforms explicit identification baselines under severe dynamics shifts, including unmodeled structural failures, while simultaneously improving in-distribution stability and latent interpretability. Overall, these results validate that controlling latent smoothness is a principled and scalable mechanism for robust adaptation.

Abstract:
Endpoint Detection and Response (EDR) systems are crucial for identifying malicious activities on endpoint devices, yet existing methods struggle to efficiently model ultralong log sequences and to provide interpretable reasoning for security analysts. We propose WatchLog, a novel framework that represents raw logs as video-structured data, enabling scalable and expressive video-language modeling of endpoint behaviors. Each event is encoded as a key–value-guided image, and the resulting images are temporally organized into a video sequence. To capture long-range dependencies, WatchLog employs a temporal cross-attention adapter that enables pixel-wise interaction across time. The adapter acts as an auxiliary temporal reasoning pathway, aligning spatial representations with relevant temporal contexts while preserving the original behavioral semantics. We adopt a two-stage pre-training strategy followed by supervised fine-tuning to generate behavior explanations grounded in event-level semantics and detection outcomes. Experiments on our newly constructed EDR8M-20R dataset and a public benchmark demonstrate that WatchLog consistently outperforms state-of-the-art methods in detection accuracy and recall, while offering more interpretable reasoning traces and significantly improved inference efficiency. Extensive ablation studies further support the robustness and interpretability of the proposed method.

Abstract:
Large Language Models (LLMs) fundamentally suffer from representation collapse, a bottleneck that severely degrades performance in long contexts. We identify that existing approaches risk drifting into one of two pathological extremes: Homogenization Collapse (e.g., attention sinks causing rank deficiency) and Isolation Collapse (e.g., local attention causing context disconnection). Through spectral analysis of attention dynamics, we derive an intrinsic tradeoff between Mixing Efficiency (spectral gap) and Information Capacity (effective rank), revealing that standard mechanisms struggle to maximize both simultaneously. To resolve this dilemma, we propose the Topologically Regularized Side-Path (TRSP), a non-invasive architectural intervention designed to achieve spectral balance. TRSP employs a parameter-free Triangular Box mechanism scaled by a lightweight, length-aware gate to explicitly regularize the token interaction topology. By integrating proximal coupling to preserve the effective rank and distal propagation to guarantee the spectral gap, this design ensures a geometrically healthy state without altering the core attention mechanism. Experiments yield significant performance improvements across general capabilities and long-context benchmarks. Notably, on the NoLiMa extrapolation benchmark at 8× the training length, TRSP surpasses strong baselines like the Differential Transformer and Gated Attention by approximately 30% and 50%, respectively.

Abstract:
Time series foundation models (TSFMs) have demonstrated impressive generalization capabilities across diverse domains. However, they face significant challenges in longterm and ultra long-term forecasting. These challenges primarily arise from scalability limitations when TSFMs process extensive sequence lengths. To address this, we propose SPRINT, a training-free plug-and-play framework designed to empower TSFMs to see more, forecast better and faster during inference. The core idea is to perform forecasting in a downsampled-resolution space, enabling an extended look-back window with reduced computational costs. To avoid information loss and resolution mismatch caused by downsampling, SPRINT decomposes time series into trend and seasonal components, processing them separately. It predicts the low-frequency trend via a Resolution Interpolation workflow within the downsampled space, while preserving high-frequency details through a Pattern Replication mechanism for seasonality. Extensive experiments show that SPRINT achieves a significant improvement, increasing accuracy by 19% while enhancing efficiency with a reduction of max memory usage by 6.4× and inference time by 16.9× compared to state-of-the-art TSFMs.

Abstract:
Scaling laws provide a fundamental framework for understanding the performance of Large Language Models (LLMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference samples. To address this, we introduce Item Response Scaling Laws (IRSL), a unified framework that integrates Item Response Theory (IRT) within scaling law formulation. Unlike traditional approaches that treat each modelbenchmark pair in isolation, IRSL disentangles latent model ability from question characteristics, factorizing the scaling law estimation for M models and N questions to significantly reduce parameter complexity from O(M × N) to O(M + N). We propose Beta-IRT, a novel extension that leverages the empirical probability responses of LLMs, such as token probabilities in pre-training and pass rates in test-time sampling, to capture richer signals than binary responses. We validate our approach across two prevalent scaling paradigms: (1) pre-training downstream scaling, using 6,612 LLM checkpoints and 37,682 questions from 10 benchmarks; and (2) test-time scaling, using 12 LLMs and 120 questions from 4 benchmarks with up to 2,500 samples per question. In both cases, we demonstrate that IRSL yields more reliable scaling estimates under limited query budgets. Furthermore, we show that the estimated latent model abilities are generalizable, enabling accurate performance forecasting across benchmarks that share the same measurement objective.

Abstract:
Outof-Distribution (OOD) detection requires accurately classifying In-Distribution (ID) samples while effectively distinguishing anomalous OOD data. However, existing methodologies predominantly rely on real-valued magnitude features, neglecting the semantic richness embedded in phase information, and often lack a systematic theoretical framework for quantitively modeling uncertainty. To address this dual limitation of incomplete feature representation and insufficient uncertainty modeling, the Trustworthy Quantum Evidence Neural Network (TrustworthyQENN) is proposed, a novel framework bridging complex-valued representation learning with Generalized Quantum Evidence Theory (GQET). Specifically, Supervised Complex-Valued Contrastive Learning (SCVCL) is proposed to synchronize amplitude distributions with phase correlations, thereby enforcing high intra-class compactness and inter-class separability for ID data. A quantum evidence generation mechanism based on GQET is subsequently devised, where the OOD state is formally grounded as the quantum empty set within a Hilbert space. Furthermore, the Generalized Quantum Evidential Combination Rule (GQECR) is leveraged to fuse multi-view evidence, thereby achieving trustworthy inference. Extensive experiments on the MSTAR, EuroSAT, and FUSAR-Ship benchmarks substantiate the superiority of TrustworthyQENN, which achieves a peak AUROC of 95.94% on the MSTAR dataset while consistently outperforming state-of-the-art methods across all evaluated scenarios.

Abstract:
Randomized linear algebra (RLA) algorithms are essential for scaling scientific computing and machine learning, yet their discovery remains mostly a manual process that requires deep expert knowledge and inspiration. While Reinforcement Learning (RL) offers a pathway to automation, standard approaches struggle with sparse reward landscapes and vast search spaces inherent to highperforming RLA algorithms. We present RL4RLA, a general RL framework that automates the discovery of interpretable, symbolic RLA algorithms. Unlike black-box approaches, our method builds explicit algorithms from basic linear algebra primitives, ensuring verifiable and implementable representations. To enable efficient discovery, we introduce: (1) a numerical curriculum that progressively increments problem difficulty to encode domain-based inductive bias; (2) Monte Carlo Graph Search (MCGS), which optimizes exploration by identifying and merging equivalent partial algorithms. We demonstrate that RL4RLA rediscovers state-of-the-art methods—including sketch-and-precondition solvers, Randomized Kaczmarz, and Newton Sketch—and can be targeted to produce algorithms optimized for specific trade-offs between accuracy, speed, and stability.

Abstract:
Model merging offers an efficient solution for integrating taskspecific knowledge from multiple fine-tuned models. Most existing approaches focus on manipulating the difference vectors between fine-tuned and pre-trained weights, often overlooking the generalization capabilities inherent in the pretrained parameters themselves. In this work, we revisit the role of pretrained weights in model merging and investigate their efficacy from a subspace perspective. We find that the components of pretrained weights residing in the core subspace—defined by the dominant singular vectors—are essential for maintaining generalization across diverse tasks. Specifically, we present empirical evidence that pretrained weights are nearly first-order stationary and exhibit predominantly non-negative curvature within this core subspace with respect to multi-task loss landscapes, indicating near-optimality. These findings suggest that task-specific adaptations should be injected primarily into the orthogonal complement of the core subspace, thereby preserving the generalization properties of the pretrained model. Extensive experiments on vision and vision-language tasks show that this subspace-aware strategy consistently yields improvements over state-of-the-art training-free merging methods, including Task Arithmetic, LOT Merging, ISO, and TSV.

Abstract:
Recent testtime defenses for CLIP claim to preserve zero-shot clean accuracy while improving adversarial robustness. However, we find the reported robustness of six recent proposed state-of-the-art methods substantially overestimated: they fail under basic adaptive attacks. We further observe that these defenses share a common reliance on an indicative measurement that is assumed to capture the distributional difference between clean and adversarial samples and to determine whether the defense should preserve or alter the static model’s prediction. We argue that this assumption is the fundamental weakness, and we propose CLIP-MAD (Manipulating Assumed Difference), an adaptive attack strategy designed to break it. CLIP-MAD efficiently expands the adversarial distribution without costly full gradient calculations and can be flexibly combined with existing attack baselines to further boost attack strength. Experiments across 13 datasets demonstrate that CLIP-MAD produces strong adversarial samples that markedly reduce the robustness of diverse test-time defenses, revealing a false sense of security in CLIP’s zero-shot robustness.

Abstract:
Time series forecasting plays a crucial role in a wide range of realworld applications and has become increasingly complex with the growth of multivariate dimensions and extended historical observations, leading to the prosperity of deep forecasting models. Previous models are hindered by three major challenges: high computational complexity, inefficient token utilization caused by redundancy and scarcity, and temporal distribution shifts resulting from non-stationary dynamics. Inspired by Koopman theory and the success of multilevel encoder–decoder architectures with skip connections, we design an input-dependent Koopman module to decompose time series into Koopman dynamics and residual dynamics. Building upon this formulation, we propose a U-shaped Multilevel Attention module (UMA) that integrates element-wise attention filtering and linear attention, giving rise to KUMA. The input-dependent Koopman operator mitigates the issue of operator mixture and alleviates temporal distribution shifts, while UMA achieves a favorable balance between token redundancy and token scarcity with acceptable computational efficiency. Comprehensive evaluations across 12 benchmark datasets demonstrate that KUMA achieves superior performance compared to existing excellent approaches.

Abstract:
Truth inference is a critical technique for aggregating noisy and biased multiclass classification annotations. State-of-the-art approaches model each annotator using an individual confusion matrix. While well-grounded, they suffer from two fundamental bottlenecks: 1) confusion matrices are underfit when annotators label only a small subset of tasks or when classes are imbalanced, and 2) a single confusion matrix per annotator is inadequate for capturing complex annotator behaviors, leading to class-level collapse when tasks are extremely difficult. Simultaneously addressing these challenges is non-trivial, as it demands both robustness to data sparsity and sufficient expressiveness for complex annotator patterns. In this paper, we proposeCPBCC(Class-specificPrototype-drivenBayesianClassifierCombination), which creatively models annotators through a dual-pathway architecture: (i) learning class-specific prototype annotation patterns across all annotators, and (ii) learning annotator-specific weights over prototypes. This framework addresses the bottlenecks and achieves a robust yet rich annotator characterization. Experiments across 10 real-world datasets spanning five domains demonstrate that CPBCC yields a 26% accuracy improvement in the best case, and boosts average accuracy from 68.73% to 74.11%.

Abstract:
Graphbased attention deficit hyperactivity disorder (ADHD) detection methods have been extensively studied, but comparatively less attention has been paid to short-term brain functional reorganization. In this paper, we propose an ADHD disease detection model based on short- and long-term brain function encoding and memory graph network. We first exploit a novel brain map sequence construction method based on short-term windows to extract short-term brain function features. Then, we design a short-term state and temporal dependency encoder to characterize short-term sequence patterns of brain function. Furthermore, a brain function memory is introduced to capture the association of brain activity patterns and historical sequence patterns. Concurrently, GNN-based long-term brain function feature extraction network is used to extract brain structure features, which are fused with short-term features for ADHD detection. Experimental validation on the publicly available neuroimaging datasets ADHD-200 and OpenNeuro-ds002424 demonstrates the superior performance of our model in brain disorder detection.

Abstract:
The hippocampus encodes space through a striking gradient of place field sizes along its dorsalventral axis, yet the principles generating this continuous gradient from discrete grid cell inputs remain unclear. We propose a unified theoretical framework establishing how multiscale hippocampal place fields arise from the frequency-dependent decay of grid cell projections. Functionally, this organization establishes an inductive bias in the population code, managing a fundamental trade-off between spatial precision and sample efficiency. Translating this insight to artificial neural networks, we incorporate a hippocampus-inspired positional embedding (HIPE) into the Transformer architecture to induce multi-scale representation. Experimental results confirm that this mechanism effectively improves data efficiency. Our work establishes a link between neural connectivity, activity patterns, and learning, suggesting a principled approach to utilizing multi-scale representations for sample-efficiency learning.

Abstract:
Selfcorrection is essential for solving complex reasoning problems in vision–language models (VLMs), yet existing reinforcement learning (RL) methods struggle to learn it. Effective self-correction behaviors emerge only rarely during RL, making learning signals sparse. To address this challenge, we propose correction-specific rollouts (Octopus), a rollout-augmentation framework that synthesizes dense self-correction supervision by recombining existing rollouts without computational overhead. This rollout augmentation simultaneously improves sample efficiency and stabilizes RL optimization. Furthermore, we introduce a two-stage RL training strategy that disentangles self-correction and direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce \textttOctopus-8B, an advanced reasoning VLM with controllable self-correction capabilities. It achieves SoTA performance among open-source VLMs across 7 benchmarks, outperforming the best RLVR baseline by 1.0 score while requiring only 0.72× training time per step.

Abstract:
Decentralized multiagent reinforcement learning faces a persistent exploration–coordination tension: intrinsic rewards promote exploration under sparse feedback, yet effective cooperation requires agents’ behaviors to remain consistent over a limited communication graph. Existing methods often combine exploration bonuses and coordination regularizers with fixed-weight schedules, making them hard to tune and prone to either fragmented conventions or premature behavioral collapse. We propose the IEC (Isomorphic Exploration-Consensus) framework that couples exploration and coordination through a single constrained objective: maximize task return augmented with two complementary exploration signals, dynamics-based information gain and state-coverage novelty, while constraining graph-induced policy disagreement via a spectral smoothness penalty on neighboring agents, which can be interpreted as a Dirichlet-energy regularizer on the communication graph. IEC optimizes the resulting Lagrangian with a lightweight primal–dual update that adapts the consensus multiplier from observed constraint violations, yielding an automatic shift from diverse exploration to stable cooperative conventions. Across three distinct benchmarks, IEC achieves superior performance.

Abstract:
Safe global deployment of AI models requires alignment with pluralistic human values, yet in existing safety evaluation datasets the rater pools remain largely homogeneous along geocultural dimensions. Through a meta-analysis of existing safety datasets, we observe that the vast majority does not include any geo-cultural information, and the ones that do, lack a robust approach to collect and understand cultural differences in safety ratings. Using the Inglehart-Welzel dimensions of cross-cultural variation, we demonstrate via hierarchical linear modeling that geo-cultural values predict safety ratings significantly better than demographic factors alone (p<0.05 in 6 datasets). Further, our analysis shows that several safety datasets contain at least 10% of culturally-sensitive items, where lack of cultural representation in the rater pool would lead to a false negative in safety classification. Finally, we provide empirical evidence that fine-tuned LLMs can identify culturally sensitive items but are not reliable at emulating judgments of raters from diverse cultural backgrounds, underscoring the critical need for continuous geo-culturally stratified (pluralistic) safety evaluations.

Abstract:
At the core of modern prompting techniques is contextual sensitivity, the ability of large language models to adapt their predictions based on inferencetime context. Despite its central role, inference behavior under strong contextual influence remains poorly understood, particularly at the level of internal inference dynamics. To bridge this gap, we introduce a theoretical framework for analyzing contextual influence through inference dynamics, enabling quantitative characterization of inference behavior beyond output-level answer changes. Our analysis shows that inference dynamics do not exhibit unbounded drift under repeated contextual assertions. Instead, predictive representations converge to stable, query-dependent regimes that fundamentally constrain whether contextual signals can alter a model’s prediction. This leads to a surprising finding: Repeated contextual assertions do not act as accumulating evidence during inference and may therefore fail to alter a model’s prediction even under unbounded repetition, while in other cases a prediction change becomes inevitable. We empirically validate our theoretical predictions across diverse models and tasks, demonstrating strong alignment between theory and observed inference behavior. These contributions offer a principled pathway toward characterizing the limits of contextual influence during inference, and provide practical implications for designing and evaluating repetition-based prompting methods.

Abstract:
Human ratings are central to learning and inference across several application domains, but they are also subject to interrater biases and judgment errors. Quantifying the uncertainty of these human ratings would require repeated measurements, which are expensive and rarely available at scale. We propose a Bayesian graphical model to estimate the instance-level and item-level uncertainty of (subjective) human ratings by leveraging auxiliary (objective) data. Our model learns a shared latent content representation that explains factors common to both the human rating and auxiliary data and a latent uncertainty variable that captures fluctuations in the human assessments via a data-conditioned prior. We develop a scalable amortized variational inference procedure that uses modality-appropriate neural encoders and decoders to represent the posterior factors. Experiments on synthetic data demonstrate that our framework can accurately recover the latent uncertainty under targeted ablations and stress tests. We further demonstrate our approach on a real-world dataset of paired functional MRI scans and behavioral testing for autism, thus highlighting the need for uncertainty quantification.

Abstract:
Despite the central role of sensorderived measurements such as imaging traits and plasma biomarkers in biomedical research and clinical practice, existing generative models for disease prediction largely depend on event-level representations from hospital and registry data. Given the multi-factorial nature of human disease, the absence of human-environment interaction modeling limits the capacity for personalized disease modeling and clinical decision support. To address this limitation, we propose a generative model with human-environment interaction for in silico modeling of disease reasoning, a conditioned latent diffusion framework that establishes the connection between multi-organ sensor data with tokenized healthcare events. Specifically, we introduce a novel geometric diffusion model to characterize the temporal evolution of complex data representation such as brain networks (region-to-region connectivity encoded in a graph), in parallel with diffusion models for tabular data from other organ systems. Together, we integrate the generative model with digitalized human-environment interaction (coined DiffDT) for simulated intervention and reasoning of future disease trajectories. We conduct extensive experiments on the UK Biobank (UKB) dataset, which contains organ-specific imaging traits, including brain (44,834), heart (23,987), liver (28,722), and kidney (32,155), along with nearly 500k medical history sequences (age range: 25～89 years). Our DiffDT achieves significant improvements over state-of-the-art human disease autoregressive models and imaging trait generative baselines.

Abstract:
The capabilities of large language models (LLMs), particularly large reasoning models (LRMs), are rapidly advancing. This raises concerns about whether LRMs can maintain their safety awareness throughout longform reasoning. Frustratingly, we identify a prevalent safety issue across LLMs and LRMs, where LRMs can reveal dangerous thoughts, leading to harmful knowledge elicitation when confronting sensitive yet benign topics. For example, when explaining the chemical context of Lewisite, a biological weapon, LRMs analyze its synthesis in their reasoning without recognizing the associated risks. We refer to this issue as the unintended elicitation issue. Experiments on our benchmark show that it is a common issue across current LRMs due to their strong multi-step reasoning capabilities. To address this issue, we propose placing LLMs in our synthesized open-ended environments, allowing them to self-search for a safety reasoning pattern to respond responsibly and helpfully. We first design a scalable data synthesis pipeline to generate data that triggers the ``unintended elicitation'' issue. We further propose a safety-first reward model design, which prioritizes safety while also evaluating the helpfulness of responses and the faithfulness of reasoning. Experiments show that our method improves safety, reduces over-refusal, and maintains strong helpfulness, paving the way for safer deployment in high-stakes domains.

Abstract:
Emerging agentic marketplaces provide the economic infrastructure for matching and coordinating the large amounts of AI agents used in agentic swarms. Unlike human workers, AI agents can operate on multiple jobs simultaneously, acquire skills rapidly, and labor without wage floors. These differences introduce a new segment of AI labor markets, where AI agents interact with each other at a much higher frequency than human markets. Yet we lack frameworks to understand how such markets behave in light of economic forces that shape labor markets, such as adverse selection and reputation dynamics. To explore this, we introduce AIWork, a tractable, simulated gig economy where Large Language Model (LLM) agents compete for jobs, develop skills, and adapt their strategies under uncertainty and competitive pressure. Our experiments examine three domains of capabilities that successful agents possess: metacognition (accurate self-assessment of skills), competitive awareness (modeling rivals and market dynamics), and long-horizon strategic planning. Agents with these capabilities consistently achieve higher profits, reputations, and market share than competing agents. Through AI-Work, we hope to provide a foundation to explore the microeconomic properties of AI-only labour markets, and a conceptual framework to study the strategic reasoning capabilities of participating AI agents.

Abstract:
We propose VarianceDrivenExploration(VarDE), a principled approach for pure exploration inhighly stochastic environments, where the exploration process is dominated by stochastic variance. VarDE is built on a fundamental principle:sampling effort should be allocated to minimize the uncertainty of the final decision. We formalize the uncertainty of the final decision through a smooth decision function and derive allocation rules that explicitly capture how stochastic noise in individual components affects the reliability of the final output. We apply this methodology to three core problems of pure exploration Best Arm Identification (BAI), Monte Carlo Tree Search (MCTS), and Best-Policy Identification (BPI) -- with theoretical guarantees on variance decay and simple regret. Empirically, we demonstrate consistent and significant improvements of VarDE over existing methods, with especially strong gains in highly stochastic environments.

Abstract:
Memorybased test-time adaptation (TTA) assigns streaming test samples into class-specific memory slots based on pseudo-labels predicted by models like CLIP, and retrieves them to facilitate subsequent predictions under distribution shift. However, this process introduces two challenges: ❶ Each sample is hard-assigned to a single class based on CLIP's prediction, where inaccurate CLIP prediction leads to memory contamination that biases subsequent prediction. ❷ Samples are evicted under biased selection due to fixed memory capacity, which risks discarding informative samples and undermining the efficacy of the memory. To address these challenges, we propose A^2Memory (Attributive-Associative Memory for Test-time Adaptation). For challenge ❶ , we propose Attribute-centric Memory Construction that builds prior textual representations from class-shared representative and diverse visual attributes, and applies soft assignment to generate surrogate visual representations. For challenge ❷, we design Class-wise Associative Memory that dynamically compresses streaming samples into fixed-capacity memory through gradient-based optimization and data-dependent retention, then retrieves sample-adaptive class prototypes for reliable inference. Extensive experiments demonstrate consistent improvements over state-of-the-art methods across 15 benchmarks.

Abstract:
Engineering problem solving is central to realworld decision-making, requiring mathematical formulations that not only represent complex problems but also produce feasible solutions under data and physical constraints. Unlike mathematical problem solving, which operates on predefined formulations, engineering tasks demand open-ended analysis, feasibility-driven modeling, and iterative refinement. Although large language models (LLMs) have shown strong capabilities in reasoning and code generation, they often fail to ensure feasibility, which limits their applicability to engineering problem solving. To address this challenge, we propose EngiAgent, a multi-agent system with a fully connected coordinator that simulates expert workflows through specialized agents for problem analysis, modeling, verification, solving, and solution evaluation. The fully connected coordinator enables flexible feedback routing, overcoming the rigidity of prior pipeline-based reflection methods and ensuring feasibility at every stage of the process. This design not only improves robustness to diverse failure cases such as data extraction errors, constraint inconsistencies, and solver failures, but also enhances the overall quality of problem solving. Empirical results across four representative domains demonstrate that EngiAgent achieves substantial improvements in feasibility compared to prior approaches, establishing a new paradigm for feasibility-oriented engineering problem solving with LLMs. Our source code and data are available at https://anonymous.4open.science/r/EngiAgent-1C8A.

Abstract:
Structurebased drug design (SBDD) can be effectively realized through an iterative refinement via the Design-Make-Test-Analyze (DMTA) cycle, which is a common workflow used by human experts. However, most LLMs function as one-shot generators that lack feedback mechanisms, leaving the DMTA loop disconnected. In this work, we propose K-BTS, a Knowledge-Driven Bi-level Thompson Sampling framework that formalizes iterative SBDD as a Dynamic Hierarchical Multi-Armed Bandit problem. K-BTS closes the DMTA loop by decoupling decisions into two levels: an upper-level policy that prioritizes high-potential molecular lineages and a lower-level mechanism that retrieves explicit chemical rules to guide LLM generation. By integrating a dual-level Bayesian update, the framework transforms sparse docking scores into reusable experience. On the CrossDocked2020 benchmark, K-BTS achieves a state-of-the-art Top-1 average docking score. The results from diverse dimensions show that K-BTS ensures search determinism through a smooth, monotonic convergence that synchronizes structural drift with affinity improvement.

Abstract:
TeacherStudent Knowledge Transfer (KT) is ubiquitous in modern machine learning, ranging from classical model compression via Knowledge Distillation (KD) to the emergent phenomenon of Weak-to-Strong (W2S) generalization. While existing studies offer isolated insights, a unified theoretical framework explaining the efficacy of KT across these disparate regimes remains lacking. In this work, we establish a unified spectral analysis of SGD dynamics in high-dimensional linear regression, elucidating the efficiency of KT across seemingly disparate regimes. We characterize KT efficiency through two distinct mechanisms: \emphSpectral Horizon Expansion in KD, which enables the capture of statistically inaccessible high-frequency signals, and \emphSpectral Denoising in W2S, where the student acts as a filter for optimization noise. Our framework unifies these phenomena, revealing that the efficacy of transfer is governed by the interplay between implicit regularization and heterogeneous spectral learning speeds over the spectrum.

Abstract:
PhysicsInformed Neural Networks (PINNs) often train slowly or fail to converge on challenging partial differential equations (PDEs), a behavior recently linked to severely ill-conditioned loss landscapes inherited from the underlying differential operator. We propose FK-PINNs, a simple modification of the PINN objective that provably improves this conditioning: at a few points in the domain we compute Feynman--Kac estimates of the solution by Monte Carlo averaging, and add the resulting data-fidelity term to the standard residual and boundary losses. For a broad class of linear second-order PDEs admitting a Feynman--Kac representation, we show that this term acts as an operator-level preconditioner: for suitable weights, our comparison bounds guarantee a substantially smaller condition number than under the standard PINN loss, even for modest Monte Carlo sample budgets. Leveraging learning-theoretic tools, we derive non-asymptotic L^2(\Omega)-error bounds for the FK-PINNs with \tanh activation by decomposing the excess risk into approximation, statistical, and optimization error terms and tightly controlling the Monte Carlo error tails. Along the way, we establish pseudo-dimension bounds for first- and second-order derivatives of \tanh neural networks, which are of independent interest and, to the best of our knowledge, new. Numerical experiments on Poisson, Schrödinger, mean exit time, and committor problems corroborate the theory, and show that FK-PINNs can successfully solve PDEs for which vanilla PINNs exhibit severe failure modes.

Abstract:
Referring Expression Segmentation (RES) aims to generate pixelwise segmentation masks from complex and implicit textual queries. While recent advances in Multimodal Large Language Models (MLLMs) have substantially boosted RES performance, their prohibitive computational overhead remains a critical bottleneck, which, however, is rarely explored. To fill this gap, we first evaluate typical token compression methods on this task and observe a surprising performance degradation. In this paper, we aim to understand this phenomenon for a solution. By extensive experiments, we find that token compression for RES requires preserving the original position embeddings and local neighboring spatial structures, indicating that visual token position information is far more critical than in other tasks. Building on this insight, we ask: Can we design the token compression method purely based on the position information? Therefore, we propose PAYN, a plug-and-play, training-free token compression method that relies solely on position information. PAYN retains tokens that are adequately distributed in every local neighboring region while strictly preserving original positional indices, thereby maintaining spatial relational consistency. Experiments on multiple RES benchmarks demonstrate that our method outperforms existing token compression methods, verifying that position is indeed all you need for token compression in the MLLM-based RES task. Codes will be released.

Abstract:
Loss reweighting is a widely used strategy for longtailed classification, but existing reweighting strategies often rely on heuristics and rarely define a well-specified target. Inspired by Neural Collapse (NC), the ideal simplex Equiangular Tight Frame (ETF) terminal geometry suggests equal per-class average loss as a reasonable target for reweighting. Based on the ideal equal loss objective, we consider loss reweighting as an inverse problem and propose an inverse-view reweighting strategy that infers class weights dynamically to match this ideal objective. Empirically, NC metrics suggest our method can effectively reduce the loss imbalance coefficient and closer alignment with NC geometry while consistently outperforms strong long-tailed baselines on different datasets.

Abstract:
Inferencetime alignment, the approach of adapting pre-trained models to reward feedback during inference, has proven empirically effective at improving language-model performance. Despite its success, theoretical foundations remain underdeveloped, especially in practical settings where neural networks are employed as reward models. In this paper, we explore the advantages of neural networks and how to effectively train them for inference-time alignment. Assuming that the true reward function lies in Besov spaces to capture the non-uniform smoothness, we compare neural networks to linear estimators and show that feature learning capability of neural networks is crucial for improving performance. We further analyze algorithms for training neural-network reward estimators. Specifically, we consider a multi-step algorithm that alternates between sampling from the current policy and refitting the reward estimator, and prove that it improves the regret, especially when the true reward exhibits local structure.

Abstract:
Recursive retraining of generative models poses a critical representation challenge: when synthetic outputs are curated based on a fixed reward signal, the model tends to collapse onto a narrow set of outputs that overoptimize that objective, causing diversity to vanish and failing to represent the full range of preferences. Prior work has suggested that such collapse is unavoidable without adding real data into the mix. In this paper, we revisit that conclusion from an alignment perspective and show that collapse can be mitigated through curation based on multiple reward functions. We formalize the dynamics of recursive training under heterogeneous preferences and prove that, under certain conditions, the model converges to a stable distribution that allocates probability mass across competing high-reward regions. The limiting distribution preserves diversity and provably satisfies a weighted Nash bargaining solution, offering a formal interpretation of value aggregation in synthetic retraining loops.

Abstract:
Spiking neural networks (SNNs) promise energyefficient inference, yet on static vision benchmarks they often trail matched ANNs under short simulation horizons. Under a matched-backbone and matched-budget protocol without extra tricks, we find that this ANN-SNN accuracy gap is consistently smaller on representative 1D signal classification benchmarks than on image benchmarks. We attribute this to a mechanism-level mismatch: leaky integration naturally implements causal evidence accumulation over time for native temporal signals, while static images typically require amplitude-to-spike encoding, whose finite-window estimation error becomes non-negligible at short horizons. Guided by this view, we propose a plug-and-play framework that combines Multi-Scale Temporal Encoding (MTE) and Sparsity-Regularized Transform Encoding (STE). MTE replaces naive repetition with multi-scale streams and allocates scale-aligned multi-bit integer spikes to increase per-step information density, and STE replaces a controllable fraction of LIF units with a transform-encoding neuron trained using auxiliary reconstruction and sparsity regularization, with a synthesis branch used only during training. Across diverse 1D datasets and backbone families, MTE×STE consistently improves the accuracy-efficiency trade-off over standard SNN baselines and matches or occasionally surpasses ANN counterparts.

Abstract:
Graph Masked AutoEncoder (GMAE) has emerged as a prevalent self-supervised paradigm, showing superior performance in graph learning. However, existing methods mainly emphasize reconstruction outcomes and give limited specification to how neighborhood context is used for reconstruction. Our experimental investigation presents that, even when reconstruction outcomes are similar, the ways of using neighborhood context differ substantially, resulting in performance shift. To address this issue, we propose Consistency- and Coverage-aware Graph Masked Auto-Encoder (C2-GMAE), which encourages more consistent use of neighborhood context and promotes broader training coverage in the graph. Specifically, C2-GMAE leverages positional encoding as an observable structural reference, introduces density-partitioned masking to improve coverage across regions, and amplifies heterophilic edges to reduce the attenuation of discriminative relational information during reconstruction. Extensive experiments on multiple benchmarks demonstrate that C2-GMAE improves downstream performance against GMAE baselines.

Abstract:
VisionLanguage Models (VLMs) such as CLIP have revolutionized zero-shot classification and safety-critical tasks, including Out-of-Distribution (OOD) detection. However, their high computational cost hinders efficient real-world deployment. While quantization is a standard solution for efficiency, its broader impact on reliability metrics beyond simple Top-1 accuracy remains critically under-explored. In this study, we conduct a large-scale evaluation of VLM quantization across a comprehensive experimental suite of over 700k evaluation runs with varying configurations. We find that, contrary to the assumption that quantization's noise degrades performance, it can simultaneously improve accuracy, calibration, OOD detection, and robustness to noise, though not to covariate shift or spurious correlations. We leverage these counterintuitive findings to characterize the mechanics of quantization beyond simple regularization: we show that quantization dampens high-rank spectral components, compelling the model to rely more heavily on robust, low-rank features. Ultimately, this spectral filtering effect drives the observed improvements in generalization and noise tolerance, establishing a pathway to deploy faster, more reliable VLMs by utilizing quantization beyond its conventional role.

Abstract:
Diffusionbased generative models increasingly rely on inference-time guidance, adding a drift term or reweighting mixture of experts, to improve sample quality on task-specific objectives. However, most existing techniques require repeated score or gradient evaluations, introducing bias, high computational overhead, or both. We introduce \textttURGE, Unbiased Resampling via Girsanov Estimation, a derivative-free inference-time scaling algorithm that performs path-wise importance reweighting via a Girsanov change of measure. Instead of computing gradient-based particle weights in previous work, \textttURGE attaches a simple multiplicative weight to each simulated trajectory and periodically resamples. No score, no Hessian, and no PDE evaluation is required. We establish an equivalence between path-wise and particle-wise SMC: the Girsanov path weight admits a backward conditional expectation that recovers the previous particle-level weights, guaranteeing that both schemes produce the same unbiased terminal law. Empirically, \textttURGE outperforms existing inference-time guidance baselines on synthetic tests and diffusion-model benchmarks, achieving better generation quality, while being significantly simpler to implement and fully gradient-free.

Abstract:
Despite the fast progress of Neural Combinatorial Optimization (NCO) on graphs, existing solvers mainly learn a narrow task (e.g., uniform TSP) at a time and hardly handle instances over diverse distributions. This paper proposes M^2GenCO, a Multitask learning framework that pioneers the instantiation of the Meta-learning mechanism with diffusion-based Generative solving for CO Problems (COPs) on graphs, first formulating "tasks" in meta-learning as distinct problem types instead of instances of the same problem. With a tailored lightweight graph neural network, our framework performs effective joint pre-training on a variety of problem types and efficient fine-tuning to adapt for out-of-distribution scenarios. Further, we establish a benchmark comprising 5 classic graph COPs with varying scales and multiple distributions, forming 38 distinct test datasets that facilitate standard evaluation of generalizability and adaptability for NCO solvers. Empirically, M^2GenCO with greedy decoder yields an overall 9.16% performance gain with an average 95.6× acceleration for inference, and achieves concrete state-of-the-arts on all test sets with simple local searchers, maintaining superior solving time against previous neural methods. The computational resource and time consumption for training are saved by up to 82% and 91%, respectively.

Abstract:
Multiframe infrared small target detection suffers from extreme semantic paucity of targets and representation collapse due to overwhelming class imbalance, resulting in the persistent inability to accurately distinguish point-like targets from dynamic background clutter. To address these issues, we propose CodeMamba, a collaborative dual-stream framework that reframes this task as the complementary mechanisms of background manifold modeling and motion singularity capturing. The implicit stream emphasizes background regularity and anomaly localization, while the explicit stream focuses on motion consistency and spatiotemporal singularity. Finally, we design a Bayesian uncertainty-weighted fusion module that estimates the reliability of each stream by quantifying its observation noise. Extensive experiments on the IRDST and DAUB benchmarks demonstrate that CodeMamba not only outperforms existing methods but also achieves enhanced sensitivity to point-like targets.

Abstract:
Recent work shows that chain based sampling for power shaped trajectory distributions can deliver large test time gains from a fixed base LLM and can approach RL trained reasoners such as GRPO. Deployment is the bottleneck. Autoregressive Metropolis Hastings is inherently serial, limits GPU utilization, and exhibits extreme tail latency at high budgets, reaching p95 =1318s on MATH500 at 128×. We propose Adaptive Sequential Monte Carlo (ASMC), a parallel particle inference method that targets power shaped trajectory distributions while adapting particle populations to problem hardness. To make resampling practical for Transformers, we introduce cache coherent resampling, which realizes ancestry updates by reordering KV caches and other particle bound tensors, avoiding prefix recomputation. On MATH500 at the same budget, ASMC attains 80.6% exact match accuracy with p95 =73.7s, improving the accuracy to tail latency trade off over both sequential MCMC and best of n. We further analyze particle degeneracy and find that collapse severity, measured by low \mathrmESS_\min/N, strongly predicts failures, while sensitivity to the resampling scheme is limited.

Abstract:
While Large Language Model (LLM) agents demonstrate proficiency in static benchmarks, their deployment in realworld scenarios is hindered by the dynamic nature of user queries, tool sets, and interaction dynamics. To address this generalization gap, we formalizeOpenAgent(Tool-Use Agent in Open-World), a problem setting characterized by distributional shifts across query, action, observation, and domain dimensions. We construct a controlled sandbox environment where we define fine-grained environmental shifts across a four-tier hierarchy:Perception,Interaction,Reasoning, andInternalization. Our exhaustive analysis yields a series of key insights, demonstrating that agents trained via both Supervised Fine-Tuning and Reinforcement Learning suffer from varying degrees of performance degradation when confronting open environmental shifts. Building on these insights, we proposePerturbation-Augmented Fine-Tuning, a disturbance-based intervention strategy for SFT that lays the foundation for enhancing agent robustness and utility in realistic environments.

Abstract:
Adversarial examples pose serious threats to deep neural networks (DNNs), revealing fundamental vulnerabilities in model robustness. However, most existing adversarial attacks directly manipulate densely activated and highly redundant feature representations, which often leads to overfitting on surrogate models and poor blackbox transferability. Recent SVD-based attack attempts to exploit low-rank feature subspaces, yet its reliance on single-layer optimization and single-gradient pathway neglects both structural redundancy in feature representations and hierarchical heterogeneity across network layers. To address these limitations, we propose LRS-Attack, a Low-Rank and Sparse decomposition based adversarial attack that explicitly models robust hierarchical subspaces in latent feature spaces.Specifically, the low-rank component captures dominant semantic directions, while the sparse component models localized and highly sensitive discriminative patterns. To efficiently extract low-rank structure while preserving subspace quality, we develop a Warm-started Alternating Low-rank Approximation (WALA) algorithm. Furthermore, we design a hierarchical mixture of robust experts that models depth-dependent feature characteristics and guides gradient optimization toward more transferable adversarial directions. Extensive experiments on ImageNet demonstrate that the proposed LRS-Attack consistently improves black-box adversarial transferability over state-of-the-art methods across diverse CNN/ViT architectures and defense settings.

Abstract:
Machine Learning Interatomic Potentials (MLIPs) sometimes fail to reproduce the physical smoothness of the quantum potential energy surface (PES), leading to erroneous behavior in downstream simulations that can be missed by standard energy and force regression evaluations. Existing evaluations, such as microcanonical molecular dynamics (MD), are computationally expensive and primarily probe nearequilibrium states. To improve evaluation metrics for MLIPs, we introduce the Bond Smoothness Characterization Test (BSCT). This efficient benchmark probes the PES via controlled bond deformations and detects instabilities, including discontinuities, artificial minima, and spurious forces, both near and far from equilibrium. We show that BSCT correlates strongly with MD stability at a fraction of the cost. To demonstrate how BSCT can guide iterative model design, we use an unconstrained Transformer backbone as a testbed, showing how refinements like differentiable k-nearest neighbors and temperature-controlled attention systematically reduce artifacts identified by the metric, resulting in an MLIP that simultaneously achieves strong accuracy and physical soundness. Our results establish BSCT as an "in-the-loop" proxy that alerts MLIP developers to physical challenges that are not captured by current MLIP evaluations.

Abstract:
Imageto-image relighting requires representations that disentangle scene properties from illumination. Recent methods rely on latent intrinsic representations but remain under-constrained and often fail on challenging materials such as metal and glass. A natural hypothesis is that stronger pretrained visual priors should resolve these failures. We find the opposite: features from top-performing semantic encoders often degrade relighting quality, revealing a fundamental trade-off between semantic abstraction and photometric fidelity. We study this trade-off and introduce Augmented Latent Intrinsics (ALI), which balances semantic context and dense photometric structure by fusing features from a pixel-aligned visual encoder into a latent-intrinsic framework, together with a self-supervised refinement strategy to mitigate the scarcity of paired real-world data. Trained only on unlabeled real-world image pairs and paired with a dense, pixel-aligned visual prior, ALI achieves strong relighting improvements, with the largest gains on complex, specular materials.

Abstract:
To enhance the interpretability of multimodal large language models' outputs, recent efforts explored Grounded Visual Reasoning (GVR), in which the model is trained to select relevant image regions before answering the question. However, the multiround ``ground-then-answer'' and reasoning nature of these methods imposes much more computational costs compared to non-GVR methods. To attain efficient and effective GVR, in this paper, we propose a novel paradigm called Reason with Thumbnails, Answer with Focus (RTAF), which feeds the model with low-resolution images to reason the relevant regions and high-resolution crops to answer the final answer. Our motivation arises from the observation that, in many cases, the key area required to answer questions can be inferred from the low-resolution thumbnails, without the need for a full-resolution image. Additionally, for extreme cases where thumbnails lack sufficient information (leading to undirected region guessing and increased computation), we equip the model with a tool to access higher-resolution images. For training efficiency, we adopt pure reinforcement learning (i.e., GRPO) and design a suite of reward functions to supervise the model's behavior, alongside a resolution-aware training data selection strategy. Finally, our model, based on Qwen2.5-VL, achieves significant improvements across a range of benchmarks with reduced computation, demonstrating the effectiveness and efficiency of our proposed RTAF, e.g., compared to the non-GVR model Qwen2.5-VL, our model achieves a performance gain of 5.8 while using comparable visual tokens (471 vs. 391). Against state-of-the-art GVR methods, RTAF reduces visual token usage by half while delivering superior performance.

Abstract:
Multihop Question Answering over Knowledge Graphs faces a critical challenge: traditional retrieve-then-read pipelines break differentiability, preventing the retriever from learning to bridge the semantic gap where intermediate nodes lack lexical overlap with the query. To address this, we propose RSF-GLLM, a framework decoupling differentiable graph reasoning from answer generation. Our Recurrent Soft-Flow (RSF) module employs a GRU-guided query updater to propagate continuous relevance scores, utilizing a dynamic gating mechanism to traverse semantically dissimilar bridge nodes via structural cues. We introduce flow sparsity regularization to theoretically guarantee convergence from soft probabilities to discrete reasoning paths. These paths are extracted and textualized to fine-tune a Large Language Model (LLM), ensuring generation is grounded in factual topology. Experiments on WebQSP and CWQ demonstrate that RSF-GLLM achieves competitive performance with superior inference efficiency compared to LLM based computationally expensive approaches.

Abstract:
We study decentralized multiagent reinforcement learning (MARL) for networked service systems with affinity in the presence of Byzantine nodes. The way that a server processes a job depends on an affinity state that captures the correlation between the job and the server. Each node learns a local control policy via an actor-critic algorithm with linear function approximation over inherently unbounded space of traffic states, while exchanging parameter information with neighbors through a communication graph. A set of Byzantine agents can exploit the unbounded state space and the resulting stochastic variance to compromise the consensus mechanism, destabilizing both learning and queuing processes. To address this vulnerability, we propose a resilient consensus-based MARL algorithm with momentum-based smoothing, which mitigates adversarial parameter manipulation and guarantees traffic stability under mild assumptions. We prove that the cooperative agents’ policies converge almost surely to a bounded neighborhood of a stationary solution of the global objective. We demonstrate the effectiveness and generality of the proposed framework in several representative service systems, including semantic routing for large language model serving, distributed polling in cloud computing, and smart manufacturing logistics.

Abstract:
We analyze the Bayesian regret of the Gaussian process posterior sampling reinforcement learning (GPPSRL) algorithm. Posterior sampling is an effective heuristic for decision-making under uncertainty that has been used to develop successful algorithms for a variety of continuous control problems. However, theoretical work on GP-PSRL is limited. All known regret bounds either fail to achieve a tight dependence on a kernel-dependent quantity called the maximum information gain or fail to properly account for the fact that the set of possible system states is unbounded. Through a recursive application of the Borell-Tsirelson-Ibragimov-Sudakov inequality, we show that, with high probability, the states actually visited by the algorithm are contained within a ball of near-constant radius. To obtain tight dependence on the maximum information gain, we use the chaining method to control the regret suffered by GP-PSRL. Our main result is a Bayesian regret bound of the order \widetilde\mathcalO(H^3/2\sqrt\gamma_T/H T), where H is the horizon, T is the number of time steps and \gamma_T/H is the maximum information gain. With this result, we resolve the limitations with prior theoretical work on PSRL, and provide the theoretical foundation and tools for analyzing PSRL in complex settings.

Abstract:
The "reversal curse" exposes a critical asymmetry in autoregressive models, where causal masking collapses bidirectional logic into noninvertible latent subspaces. This work characterizes such failure as a structural breaking of chiral symmetry within the representation manifold. We bridge this gap with the Chiral Transformer—a framework that restores bidirectional consistency by enforcing an adjoint mapping operator \mathcalT via contrastive regularization. Unlike standard generative approaches, our architecture utilizes Adjoint-Induced Retrieval (AIR) to perform logical inversion directly in the embedding space, effectively bypassing the contextual biases of the decoder. Empirical validation on synthetic benchmarks confirms this geometric intuition, where AIR elevates zero-shot accuracy from approximately 0% to a robust 65.07%. These findings suggest that logical reversibility is a topological property attainable through explicit algebraic constraints rather than mere scaling of parameters.

Abstract:
Multimodal Large Language Models (MLLMs) have recently emerged as a promising paradigm for vehicleto-vehicle (V2V) cooperative autonomous driving, enabling language-based joint perception, prediction, and decision-making in safety-critical scenarios with severe occlusions. However, existing V2V–MLLM frameworks rely on dense token-level sharing and fusion, leading to high communication and inference costs. Moreover, conventional V2V perception methods are limited to feature-sharing paradigms without language reasoning, and existing generic token pruning strategies fail to consider LiDAR-specific spatial structure and multi-agent fusion. To address these limitations, we propose V2V Communication-Conditioned MLLM Framework (V2V-CCM), a dual-stage communication coop- erative framework that broadcasts request messages to all agents and uses them to identify redundant visual tokens. Specifically, Question Semantic Message (QSM) encodes the global question intent to guide question-relevant token selection, while Question Semantic Message (QSM) summarizes LiDAR features to identify spatially redundant tokens that are already observed and therefore need not be transmitted. By integrating this strategy into dual-stage frameworks, our method substantially reduces communication and inference costs while preserving question-relevant tokens and spatially redundant tokens. Extensive experiments on the V2V-QA and V2V-GoT-QA datasets demonstrate that V2V-CCM consistently outperforms existing pruning methods and achieves state-of-the-art performance.

Abstract:
The integration of audio modality into Large Audio Language Models (LALMs) significantly expands their attack surface. Existing jailbreak paradigms predominantly treat audio as a carrier for malicious payloads, relying on semantic optimization, acoustic parameter control, or additive perturbation to embed harmful content into the audio signal. In this work, we challenge this necessity and propose a new paradigm in which the role of audio shifts from content injection to safety alignment interference. We reveal that LALM safety alignment can be compromised solely by specificAcoustic Latent Semantics (ALS), the underlying paralinguistic features intrinsic to the priors of audio generative models. Distinct from previous works that leverage explicit acoustic parameters to merely style malicious audio, we demonstrate that interference audio, benign in content but infused with specific ALS, can serve as a universal jailbreak trigger. Leveraging this insight, we propose theAcoustic Interference Attack (AIA), which decouples the attack payload from the audio. Specifically, AIA employs a set of universal, instructionneutral interference audio, enabling standard malicious text queries to bypass safety alignment without instance-specific optimization. Extensive experiments on 10 LALMs across five datasets demonstrate that AIA achieves the state-of-the-art attack success rate. Furthermore, our interpretability analysis uncovers the inference path drift induced by AIA and identifies the inherent effective patterns within ALS, revealing the fundamental vulnerability of cross-modal alignment in LALMs.

Abstract:
Recent advances in large language models (LLMs) have substantially expanded the scope of Table Question Answering (TableQA). However, existing benchmarks primarily treat TableQA as a passive, singleturn natural language understanding task, lacking the capacity to evaluate autonomous reasoning and tool-call trajectories in realistic, multi-turn scenarios. To bridge this gap, we introduce TableAgent-Bench, a large-scale bilingual benchmark that reformulates TableQA as proactive, agentic interactions over structurally complex, multi-table environments. With a topology-aware construction strategy, TableAgent-Bench captures dynamic intent evolution through 1,310 multi-turn dialogues grounded in 2,275 industrial tables. Furthermore, we propose the Table-centric Agent Evaluation Framework (TAEF) to assess agent interactions with complex table structures. Specifically, TAEF integrates a specialized agent toolset and 4 metric categories to systematically diagnose intermediate failure modes, assessing performance across table localization, tool-invocation rationality, and trajectory-level pass rate. Extensive experiments with 25 state-of-the-art LLM agents reveal a substantial capability gap, with even the strongest model Gemini-3-Pro-Preview achieving only 53.4% information coverage. We expect TableAgent-Bench to serve as a rigorous testbed for developing and evaluating agents capable of robust table-centric reasoning.

Abstract:
Parametric CAD modeling from human intent remains challenging, particularly during the conceptual design stage, where design goals are expressed through incomplete and unstructured modalities (e.g., handdrawn sketches and textual descriptions). In this work, we rethink the human intent-to-CAD pipeline and propose a unified method that directly maps multi-level human intents to executable codes, without assuming the prior existence of target CAD models. To support our study, we construct HiCAD, the first large-scale dataset aligning hand-drawn sketches, textual descriptions, and parametric CAD codes. Based on this, we introduce HiCAD, a two-stage framework comprising Cooperative Multi-Task Alignment to bridge the representational gap between heterogeneous inputs, and Spatial-Aware Reinforcement Learning to enforce geometric and topological consistency. Extensive experiments demonstrate that our method significantly outperforms existing baselines across multiple tasks, validating its effectiveness and robustness in transforming heterogeneous human intents into high-fidelity parametric CAD models.

Abstract:
Diffusionbased large language models (dLLMs) have emerged as a promising alternative to autoregressive models, leveraging simultaneous denoising to enable global planning and iterative refinement. These properties make dLLMs particularly attractive for long-context generation. However, deploying dLLMs faces a prohibitive memory capacity barrier, as existing inference systems are inefficient for the diffusion paradigm. We observe that current inference systems are misaligned with dLLMs. Unlike autoregressive models, whose memory footprint is dominated by the cumulative KV-Cache, dLLMs are bottlenecked by transient activations rematerialized per step. Moreover, generic memory reuse mechanisms lack the global visibility to handle dynamic memory peaks of dLLMs, which alternate between logits and feed-forward networks. To address these challenges, we present Mosaic, a memory-efficient inference system that shifts dLLM execution from local, static memory management to a global, dynamic paradigm. Mosaic integrates (i) a mask-only logits kernel eliminating redundant activation materialization, (ii) a lazy chunking optimizer using online heuristics to adaptively tame dynamic memory peaks, and (iii) a global memory manager leveraging virtual addressing to mitigate memory fragmentation. Extensive evaluations show that Mosaic reduces the memory peak-to-average ratio by 2.71× on average and increases the maximum supportable inference sequence length on identical hardware by 15.30--32.34×. Crucially, Mosaic is training-free and preserves exact model outputs, while simultaneously reducing end-to-end latency by 2.5%--55.4%.

Abstract:
Are two distributions close to each other with statistical significance? Distribution closeness testing (DCT) formalizes this question by testing whether the distance between a distribution pair is at least \epsilonfar. Existing DCT methods mainly measure discrepancies between a distribution pair defined on discrete spaces (e.g., using total variation), which limits their applications to complex data (e.g., images). To extend DCT to more types of data, a natural idea is to introduce maximum mean discrepancy (MMD), a powerful measurement of the distributional discrepancy between two complex distributions, into DCT scenarios. However, we find that MMD's value can be the same for many pairs of distributions that have different norms in the same reproducing kernel Hilbert space (RKHS), which potentially have different closeness levels, making MMD less informative when assessing the closeness of multiple distribution pairs. To mitigate the issue, we design a new measurement of distributional discrepancy, norm-adaptive MMD (NAMMD), which scales MMD's value using the RKHS norms of distributions. Based on the asymptotic distribution of NAMMD, we finally propose the NAMMD-based DCT to assess the closeness level of a distribution pair. Theoretically, we prove that NAMMD-based DCT has higher test power compared to MMD-based DCT, with bounded type-I error, which is also validated by extensive experiments on many types of data (e.g., synthetic noise, real images).

Abstract:
Unified Foundation Models (UFMs), which support interleaved multimodal generation and understanding, have been proposed as a promising paradigm for reasoning about dynamic world states, yet it remains unclear whether the visual content they generate functions as grounded evidence for subsequent reasoning or merely as auxiliary output. Existing benchmarks largely evaluate generation and understanding as separate capabilities and do not test their functional dependence during reasoning. We introduce UFO, a benchmark designed to evaluate whether UFMs generate and use image and text cues as evidence for compositional multimodal reasoning. UFO spans three cue types, state determination, state reconstruction, and state augmentation, which correspond to progressively smaller transformations of the underlying world state. Our analysis reveals a significant modality gap, as models often achieve high prediction accuracy even when the generated visual cues exert limited influence on their decisions, indicating weakened evidential coupling and a reliance on textual shortcuts rather than robust cross modal grounding.

Abstract:
It is known that RandomShuffle, the without replacement version of Stochastic Gradient Descend (SGD), converges faster than withreplacement SGD. However, RandomShuffle requires to uniformly perform a random permutation of the input sequence, which is known to have an high I/O complexity due to data movements over the memory hierarchy. In this paper, we propose a shuffling algorithm with a linear I/O complexity that generates almost-uniformly random permutations with rigorous mathematical guarantees. Specifically, we show that the shuffling algorithm can generate 2-wise independent permutations. Furthermore, we can extend to k-wise independency with a small error in the probability distribution, if the fast memory has at least k memory blocks. These results allow us to reach the same expected theoretical convergence as RandomShuffle while achieving optimal linear I/O cost.

Abstract:
Diffusionbased Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the Euclidean Fallacy: representing SE(3) poses as flat \mathbbR^12 vectors. This approximation induces (1) manifold drift violating SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce Lie Diffuser Actor (LDA), a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC\rightarrowD, LDA improves average task length from 3.06 to 3.30 (+7.8%). We further validate our method on real robot and the results show that our method outperforms the baseline on majority tasks.

Abstract:
A plethora of recent works has shown that for wide, overparameterized neural networks, training with Stochastic Gradient Descent (SGD) often leads to interpolation of the training data without sacrificing generalization performance. A key parameter that is not only closely connected to generalization properties, but is also closely tied to other desiderata such as robustness and resistance to adversarial perturbations is the Lipschitz constant of the neural network. While empirically, the Lipschitz constant has been shown not to increase with network width, theoretical findings only provide bounds with logarithmic growth in the width and only for the random initialization of ReLUnetworks. In this work, we close this gap for neural networks with smooth activations by showing that, both at random initialization and throughout lazy training, the local Lipschitz constant of deep neural networks does not increase with network width. More precisely, we establish novel non-asymptotic (finite width) upper bounds and corroborate them by numerical experiments.

Abstract:
Pseudolabeling based on Optimal Transport (OT) has become an effective mechanism for enhancing short text clustering. Existing OT methods are short in modeling semantic consistencies between samples, which may assign different pseudo-labels to semantically similar samples. These erroneous pseudo-labels can cause the model to produce inferior clusters. This paper proposes a novel short text clustering framework, which remedies the neglect of semantic consistency in existing OT methods, generating reliable pseudo-labels to facilitate clustering. Specifically, our method first proposes a novel instance-level attention mechanism to capture semantic relationships between samples, which are then integrated into the OT formulation to endow the transport process with neighborhood semantic awareness. By solving the proposed OT formulation, reliable pseudo-labels are obtained that simultaneously account for sample-to-sample semantic consistency and sample-to-cluster global structure information. These reliable pseudo-labels are then used as supervisory signals to guide the model to achieve accurate clustering. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art approaches. The code is available at: https://anonymous.4open.science/r/RPDC-STC-8B53/README.md

Abstract:
Despite a large corpus of recent work on scaling up Gaussian processes, a stubborn tradeoff between computational speed, prediction and uncertainty quantification accuracy, and customizability persists. This is because the vast majority of existing methodologies exploit various levels of approximations that lower accuracy and limit the flexibility of kernel and noise-model designs --- an unacceptable drawback at a time when expressive non-stationary kernels are on the rise in many fields. Here, we propose a methodology we term \emphgp2Scale that scales exact Gaussian processes to more than 10 million data points without relying on inducing points, kernel interpolation, or neighborhood-based approximations, and instead leveraging the existing capabilities of a GP: its kernel design. Highly flexible, compactly supported, and non-stationary kernels lead to the identification of naturally occurring sparse structure in the covariance matrix, which is then exploited for the calculations of the linear system solution and the log-determinant for training. We demonstrate our method's functionality on several real-world datasets and compare it with state-of-the-art approximation algorithms. Although we show superior approximation performance in many cases, the method's real power lies in its agnosticism toward arbitrary GP customizations --- core kernel design, noise, and mean functions --- and the type of input space, making it optimally suited for modern Gaussian process applications.